This post summarizes my master’s thesis at a high level. More info on the thesis is here.
(Note: This post is a work-in-progress. I will update it in April/May 2014, as the thesis work is concluded)
When reasoning about the availability of a service in a microservice architecture, the basis for discussion often are the opinions of the engineers involved. Information used are the assumptions made while building the service, experience from operations (often visualized through graphs) and the occasional diagram (of dataflow or dependencies). This data tends to be (by its nature or the way it is selected) subjective and makes it difficult to get a holistic view.
In my thesis I tried to investigate a more structured approach, that should deliver more objective data.
My starting point was to find out which services exist and how these depend on each other. This can be modeled as a directed graph of dependencies. This graph allows the construction of faul trees for individual service failures. When these are annotated with failure probabilites, it is possible to calculate the probability of failure for the service.
I investigated to which extent it is possible to run this process automatically in an existing microservice architecture. I did this in a case study with a company from Berlin.
I model dependencies between applications (service providers and service consumers) as a directed graph. Nodes are applications and edges are dependencies.
A dependency from
A->B implicates, that for A to work correctly, B has to work correctly. Following is a very simplified example of a web application:
Web -> MySQL Web -> Recommendations Recommendations -> ElasticSearch
So how might this graph be generated? Following are some of the approaches I tried.
Manual annotation The dependencies might be written down by the engineers themselves. Given that each application has a repository, a practical implementation of this approach could be to host a dependency annotation file in each application’s repository. The problem with this method is, that the correctness of the graph depends on people maintainting the annotation files. Thus there might always be a (actually existing or at least feared) gap between reality and the graph. It’s not a technical, but a human problem. In my case study, I did the manual annotation for most applications.
From source code I did not find a fitting method to derive service dependencies from source code. One approach assumed that all service dependencies are encapsulated in shared libraries. That is not the case (think http calls via standard library), but even if it would be true, we’d still need to map the actual service to the library, like “shared library mysql2 encapsulates a service dependency to application MySQL”. Another approach depended on how service discovery happens. Given service discovery happens via static identifiers in the code, we could parse these out. We then know the service dependency, if it is possible to derive the service name from that static identifier. An example is service discovery via DNS, where it is also assumed that the domain names follow a specific schema. For my case, that schema should include the service name, like
<service>.internal.example.com. In my case study, the source code mostly did not include service discovery identifiers, but got them assigned through the environment.
From application deployment environment Given that the application gets its service discovery identifiers from the environment, we can use the same mechanism as in the previous section to detect the service dependency. The environment (also called configuration) is usually compiled during the deploy of the application. Examples for this are via chef (with configuration files read by the application) or via a deployment system (with environment variables read by the application). In my case study, only some services went through service discovery but a majority of services was addressed through “physical addresses” (hostname and port). I attribute the low coverage of service discovery to the fact that it was only recently introduced in the company and sees slow adoption by the engineers.
From network connections Given that the applications are deployed and communicate with each other on the network, we might use that traffic to identify dependencies. For example we might capture network connections via sockstat or netstat on application hosts. A connection might then look like this:
source_pid source_ip:port -> destination_ip:port. To determine the source application, we may use the source process id. In my case study, I was able to derive the application name from the process id through an implementation detail of the deployment system. To determine the destination application, we need inverse service discovery, which turns an
ip:port pair into a semantic service name. As mentioned before, in my case study I found a low adoption of service discovery, therefore this approach yielded sparse resuls as well.
To conclude, only the manual annotation resulted in a usable graph.
More approaches do exist: For example using a traceing system like Google’s Dapper that tracks all network calls and is annotated with application identifiers might allow the extraction of service dependencies.
Before we construct the Fault Tree, we have to set some assumptions: As failure semantics we assume, that there is no fault tolerance. The failure of a service leads to the failure of all applications that depend on it. Also, an application might have two reasons for failure: either because one of the services it depends on fails, or because of “inherent” failure, e.g. a bug in the software.
Let’s construct a fault tree, based on our previous dependency graph. The node, whose failure we are interested in, becomes the top event (in our example
Web). From there, we create the “inherent” basic event for that service and an intermediate event for each dependency. All are connected to the top event via an
OR-gate. We recursively continue this process for all dependencies. Here is the fault tree graphic:
Due to the assumed failure semantics and the resulting algorithm, every application constitutes a single point of failure. Also the fault tree is significantly larger than the dependency graph. Thus, I conclude that such a fault tree alone is not a meaningful tool for modeling availability. But what if the Fault Tree had failure probabilities on it?
Basic events in a fault tree may get failure probabilities assigned. Based on these and the structure of the fault tree, the failure probability of the top event can be calculated.
I investigated two approaches:
Both approaches seem to be able to support monitoring the availability threats an application faces from its dependent applications. This in turn could aid architectural design decisions.
Generating the dependency graph is the core of this modeling process. I showed several approaches that should enable the automated generationg of that graph from an existing microservice architecture. In my case study, these were inhibited by a heterogenous environment, especially in regards to the use of service discovery.
The structural fault tree seems to have less usefulness in practice than the dependency graph. On the other hand, a fault tree with failure rates might be a helpful tool in monitoring changes to the availability of an application.
The thesis operates under strong assumptions regarding application failure propagation. Extending it with fault tolerance mechanisms will be an interesting future work, as well as doing a case study in a more homogenous environment.