Wishing you good luck and great success in year 2016.

Here is small puzzle game for you:

PF 2016 Puzzle Game

http://ysofters.com/pf2016

Source code is available at GitHub.

You can play the game on tablet, phone or computer.

Based on: Enigma game, Kiwi.JS, Font Awesome.

Limitation for users with Windows 10 with touch displays: Mouse does not work, you can use only touch. Bug was already reported to authors of Kiwi.JS framework.

You can enjoy also PF games from previous years: PF 2015, PF 2014, PF 2013,PF 2012,PF 2011, PF 2010

Opting for a distributed solution of a problem is mostly motivated by two goals: resilience to failures (via redundancy) and better response time. In this post we will focus on the former. More specifically, we will investigate the conditions necessary to maintain a consistent distributed key-value store.

CAP

From the theoretical side, any possible solution is limited by the so-called CAP theorem. Informally, the theorem states that in an asynchronous network no database can remain

  • Consistent and
  • Available when
  • Partitioning of the network is possible.

A realistic environment of a distributed system must assume that server failures are possible and thus that there is no reasonable fix bound on the time needed to exchange messages between two servers. And while some may find the formulation of the theorem ambiguous, its inevitable corollary is that a distributed database cannot guarantee consistency of replicas and availability of read/write operations at the same time.

Consensus

Leslie Lamport’s Paxos protocol demonstrates the feasibility of maintaining consistency whenever a majority of servers is functioning. The above theorem thus states that any distributed system using the protocol cannot be always available. But there are some highly-available solution apart of Paxos.

Etcd

Etcd is an always-consistent highly-available key-value data store. The consistency is maintained by a majority consensus (quorum). The quorum is established and maintained by the Raft algorithm, i.e. Raft maintains a consistent replicated log among servers, each of which executes the commands stored in the log on the key-value store. As any other consensus protocol, Raft requires that among n servers at most f=⌊(n-1)⁄2⌋ is faulty. If more than f servers become faulty Raft times out all write operations but the read operations may work if consistency is not required.

Maintaining consistency is expensive even if only the possibility of partitioning has to be appreciated. Rather than describing the algorithm itself it might be more illustrative to show performance results of running etcd in real application. The figure below depicts the transaction throughput in a 5-node cluster.

Etcd throughput in a 5-node cluster.

The script that measured the performance increased the load in three steps: first a 2500 transaction was sent every second (for 30 seconds), then 5000 for another 30 seconds and then 10000. The data in the plot then report how many transaction successfully finished each second.

Leader Election

Partitions of the cluster are handled by a coordinated replication, i.e. a leader is elected who is responsible for all communication (both within the cluster and with the outside world). In the case of such a partition that separates the leader from most of the nodes, a new leader is elected. The leader election starts whenever the communication between the leader and a follower is lost for more than a randomly chosen time period called election timeout. After the timeout, the follower begins the election by broadcasting the request for vote. Everyone votes for the server that send the request first and the new leader is the server with a majority of votes. Together with the transaction throughput, the performance of leader election are the most important metrics to measure for a consensus protocol.

Etcd leader election time.

The graphs above show the time to detect and replace a crashed leader. Each line represents 1,000 trials (except for 100 trials for “150–150 ms”) and corresponds to a particular choice of election timeouts; for example, “150–155 ms” means that election timeouts were chosen randomly and uniformly between 150 ms and 155 ms. The steps that appear on the graphs show when split votes occur (the cluster must wait for another election timeout before a leader can be elected).

Etcd leader election -- number of packets.

Finally, above is the distribution of the number of packets needed to elect a leader. The figure  shows the number of packets sent to elect a leader in the 150–200 ms case. The minimum number of packets is six, representing a single node dispatching requests to the other four nodes and receiving votes from the first two. The plot clusters around multiples of eight: traces under eight represent the first node timing out and winning the election, traces between 8 and 16 represent two nodes timing out before a leader is elected.

Problem Definition

Let us assume that our servers need to share certain configuration. The servers can be restarted, thus becoming temporarily unavailable, and the network can have arbitrarily long delays. The stored data should be available whenever at least one server is alive. Some data must be consistent but other data suffice in possible stale version.

Case 1: Consistency of Logs

Some servers store logs data as a single continuous string. It is required that the logging information is extracted from this string, started from the last extraction point up until the last log. Each log must be accepted for exactly once. We can thus generalise the problem to implementing the following function:


string getLastLog( Server s, string rawLogs )

  • the string rawLogs is a sequence of log information, i.e. <(p1t1, ..), .. , (pntn, ..)>
  • the function returns a suffix lastLogs of rawLogs such that lastLogs=<(pi, ti, ..), .. , (pn, tn, ..)>, where ti-1<ts and ti >= ts, and ts is the time-stamp of the last retrieval
  • the function mus be thread-safe in accepting each log at most once regardless the number of concurrent calls of the function

Case 2: Quorum-less access

It is required that the configuration data that need not be consistent can be retrieved and stored even when the quorum cannot be established.

  • The get operation returns a value that was either last written by another node (while under quorum), or by this node (possibly without quorum).
  • The set operation remember the the new value and stores is when the quorum is reestablished.

Candidate: Etcd

The Etcd has limits in solving both the above cases:

  1. The CAS operation is implemented as a set operation which specifies the expected previous value that returns either a Boolean value, an exception, or an error. Yet when exception/error is returned we have no guarantees regarding the value in the data store, e.g. the request may have timed out, but afterwards was accepted by Raft and subsequently reflected in the data store.
  2. Etcd does allow get to operate without quorum but only if the leader is working (any or indeed all other servers may fail). The set operation is not allowed without quorum.

Requirements for Sufficient Solution

Three interface function are required and while the concrete implementation can vary we specify it exactly to avoid confusion.

type Response = { bool valid, bool timeout, string value }

Response get( string key, bool consistent, bool blocking )

  • consistent=true the returned value was stored on a majority of servers at the time when this request was accepted by the consensus protocol
  • consistent=false the returned value is the most recent value stored on the server that first received this request (other servers may have different value)
  • blocking=true the function returns only after the desired data is retrieved
  • blocking=false the function may timeout

void set( string key, string value, bool consistent, bool blocking )

  • consistent=true the new value is accepted by a majority of servers and until another set for the same key is requested, the value will be returned by any consistent get
  • consistent=false the new value is accepted by at least the server that first received this request, until another set is requested at this server, the value will be returned; furthermore once quorum is established the consistent set guarantees will be assumed

Response swap( string key, string new_value, string old_value, bool blocking )

  • consistent by default
  • blocking=true the function returns only after the quorum is established, the function returns true if the old_value was stored under key at the time of accepting this request on a majority AND the new_value will be returned by any consistent get until the value is rewritten; most importantly no other value can be returned by get between old_value and new_value. If old_value was not stored under key, the function does not modify the data store and returns false.
  • blocking=false same guarantees as above but the function may timeout without modifying the data store.

Any distributed data-store conforming to the above requirements would be sufficient to solve the above problem. Yet the cost of consistency is high and only a few data-stores offer adequate protection from partitioning. Etcd seems the closest candidate but its functionality needs to be extended.

While caring about security of our code is arguably important, it is not enough for building a secure product. Vulnerabilities might also arise from a 3rd party component. Handling of those vulnerable libraries is thus another essential aspect of building a secure product.

Database of vulnerabilities

A simple way for managing a large number of 3rd party libraries might be using a vulnerability database. There is well-known National Vulnerability Database that contains information about many vulnerabilities. This is also what OWASP Dependency Check uses. Let me introduce it.

NVD aims to contain all known vulnerabilities of publicly available software. An entry about a vulnerability has some CVE number. (CVE means “Common Vulnerabilities and Exposures”, so CVEs are not limited to vulnerabilities, but it is not so important for now.) You can look at some example of CVE. Various details might be available, but the level of details may depend on multiple factors. For example, a not-yet-publicly-disclosed vulnerability might have a CVE number, but its description will be obviously not very verbose.

CVE numbes are usually assigned to CPEs (“Common Platform Enumeration”). A CPE is an identifier of vulnerable software. For example, the mentioned CVE-2005-1234 contains information that it affects cpe:/a:phpbb_group:phpbb-auction:1.0m and cpe:/a:phpbb_group:phpbb-auction:1.2m.

How OWASP Dependency Check works?

In a nutshell, it scans a project (JARs, POM files, ZIP files, native libraries, .NET assemblies, …) and tries to assign a CPE to each entity. Note that there is much of heuristics involved. Obviously, there is also much what can go wrong.

When CPEs are assigned, it looks for vulnerabilities (CVEs) assigned to those CPEs. A scan is written to a result file. Multiple formats are supported. The two most prominent formats are HTML (for direct reading) and XML (for further processing). We use the XML output in order to process multiple reports of multiple projects and assign them to particular teams. (One team is usually responsible for multiple projects.)

Integration with projects

Maven

OWASP Dependency Check has a mature Maven plugin. It basically works out-of-box and you can adjust probably any parameter supported by OWASP Dependency Check. We didn’t have to modify the project, as it can be run by mvn org.owasp:dependency-check-maven:check and the configuration can be adjusted by passing -Dproperty=value parameters.

Gradle

There is also some plugin for Gradle. Unfortunately, it is not as mature as the Maven plugin. We had to modify it in order to make it working in our environment. I’d like to merge those changes with the original project. Even after the modifications, it is still not ideal. The most prominent issue is that it includes test dependencies. Excluding them in the Groovy plugin is not as simple as with Maven plugin, because Gradle can have many configurations and each of those configurations might have different dependencies. I have no simple clue how to distinguish important configurations from others. However, this was not a really painful issue, as these dependencies are considerably less common and usually don’t have any known vulnerability in NVD, as they usually aren’t touched by untrusted input.

How we scan our projects?

Running those scans manually on multiple sub projects and evaluating all the reports manually would take too much time. So, we have developed a tool for automating some of the manual work.

First, scans are run automatically every day. There is no need to run it manually on every single subproject. Results of these scans are stored in XML format. Some adaptations specific to our environment are applied. For example, scans are able to run without any connection to the Internet. (Maven plugin can be configured without any modification, while Gradle plugin required to be modified for that.) Of course, we have to download all the vulnerability database separately, outside of the scan process.

Second, our tool downloads the results from our Bamboo build server and processes them. There are some sanity checks that should warn us if something breaks. For example, we check freshness of the vulnerability database.

Third, issues related to particular subprojects are automatically assigned to corresponding teams. Issues can be prioritized by severity and number of occurrences.

While ODC is a great tool, there are some issues connected to using it. I’ll discuss some of them there.

Complexity of library identifiers

How many ways of identifying a library do you know? Maybe you will suggest using groupId and artifactId and version (i.e. GAV) by which you can locate the library on a Maven repository. (Or you can pick some similar identifier used on another platform than Java.) This is where we usually start. However, there are multiple other identifiers:

  • GAV identifier: as mentioned above.
  • file hash: ODC uses SHA1 hash for lookups in Maven database. Note that there might be multiple GAV identifiers for one SHA1 hash, so there is 1:1 relation. Moreover, when we consider snapshot, we theoretically get m:n relation.
  • CPE identifier: This one is very important for ODC, as ODC can’t match vulnerabilities without that. Unfortunately, there is no exact algorithm for computation of CPE from GAV or vice versa. Moreover, multiple GAVs might be assigned to one CPE. For example, Apache Tomcat consists of many libraries, but all of them have just one CPE per version. Unfortunately, the ODC heuristic matching algorithm might also assign multiple CPEs to one GAV in some cases.
  • GA identifier: This is just some simplification of GAV identifier, which misses the version number. There is nothing much special about this identifier for ODC, but we have to work with that, too.
  • intuitive sense: For example, when you mention “Hibernate”, you probably mean multiple GAs at the same time.

Note that this complexity is not introduced by ODC. The complexity is introduced by the ecosystem that ODC uses (e.g. CPEs) and by the ecosystem ODC supports (e.g. Maven artifacts).

Bundled libraries

While überJARs might be useful in some cases, they are making inspection of all the transitive dependencies harder. While ODC can detect some bundled libraries (e.g. by package names or by POM manifests if included), this is somehow suboptimal. Moreover, if ODC finds a bundled dependency, it might look like a false positive at first sight.

Libraries without a CPE

Some libraries don’t have a CPE. So, when ODC can’t assign any CPE, it might mean either there is no CPE for the library (and hopefully no publicly known vulnerability) or ODC has failed to assign the CPE. There is no easy way to know that is the case.

False positives

False positives are implied by the heuristics used mainly for detecting CPE identifier. We ca’t get rid of all of them, until a better mechanism (e.g. CPE identifiers in POMs) is developed and widely used. Especially the latter would be rather a long run.

False positives can be suppressed in ODC in two ways. One can suppress assignment of a specific CPE to a specific library (identified by SHA1 hash). If a more fine-grained control is needed, one can suppress assignment of a single vulnerability to a specific library (identified by SHA1 hash). Both of them make sense when handling false positives.

Handling a CPE collision

For example, there is NLog library for logging in .NET. ODC assigns it cpe:/a:nlog:nlog. This might look correct, but this CPE has been used for some Perl project and there is some vulnerability for this CPE that is not related to the logging library.

If one suppressed matching the CPE, one could get some false negatives in future, as vulnerabilities of the NLog library might be published under the same CPE, according to a discussion post by Jeremy Long (the author of the tool).

In this case, we should not suppress those CPEs. We should just suppress CVEs for them.

Obviously mismatched CPEs

CPE might be also mismatched obviously. While the case is similar to the previous one, the CPE name makes it obvious that it does not cover the library. In this case, we can suppress the CPE for the SHA1 hash.

Missing context

Vulnerability description usually contains a score. However, even if the score is 10 out of 10, it does not imply that the whole system is vulnerable. In general, there are multiple other aspects that might change the severity for a particular case:

  • The library might be isolated from untrusted input.
  • The untrusted input might be somehow limited.
  • The library might run in sandboxed environment.
  • Preconditions of the vulnerability are unrealistic for the context.

None of them can be detected by ODC and automatic evaluation of those aspects is very hard, if it is not impossible.

Conclusion

We have integrated OWASP Dependency Check with our environment. We have evaluated the quality of reports. The most fragile part is matching a library to a CPE identifier, which may lead to both false negatives (rarely seen) and false positives (sometimes seen). Despite those issues, we have found ODC to be useful enough and adaptable to our environment.