i'm johan. i do things with computers. i'm from berlin

Bookmarklet: Better links to GitHub files

Update: GitHub has this built in already. Press y and you'll get the canonical URL. Nice. 30 minutes of my life wasted to get that insight ...

When you send around links to files on GitHub, these links usually include the branch name. If the branch changes, your link might break. If you link to line numbers, the link to that line might become useless when the file changes and lines move around. Here is a bookmarklet that transforms a link to a file on a branch into a link to the same file at the current HEAD commit. This keeps links consistent, even when files change.





Either drag this bookmarklet to your bookmarks: Transform link to GitHub commit

Or create a new bookmark and copy the code below in there:

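    /* find the link to the latest commit of this file */
    var a = $('.commit-title a');
    if (a && a.length > 0) {
        var s = a[0].href.split('/');
        var w = window.location.pathname.split('/');
        /* path is /<user>/<repo>/blob/<branch>/...; swap the branch for the commit */
        w[4] = s[s.length-1];
        window.location.pathname = w.join('/');
    } else {
        alert('Not a Github file site or bookmarklet is broken');
    }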

Determining which AWS CloudFront POP edge location is used

Yesterday I had to debug a problem that seemingly involved a faulty AWS CloudFront POP (point of presence, also called edge location). Requests to some objects timed out, but the problem seemed to be limited to a certain geographic region. To debug the problem further, it was crucial to find out which POP was used. In the following example I assume that the domain is served through CloudFront and that the requests are made from the location of the client experiencing the problems.

$ nslookup

Non-authoritative answer: canonical name =

This gives us the IP address of the CloudFront POP in question. Next we can feed the IP into a reverse dns lookup.

$ dig -x +short

The fra50 part is the interesting one. AWS CloudFront follows a convention of naming POPs after IATA airport codes. You can now either do a full-text search for the airport code on this Wikipedia page or do a Google search for <airport-code> airport code. This reveals that the example request from above was fulfilled by the POP in Frankfurt, Germany.
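If you have to do this lookup more often, the extraction step can be scripted. The following Go sketch is a minimal example and not an official AWS tool: it assumes that the reverse DNS name contains a label made of the three-letter airport code followed by digits (like fra50). The function name airportCode and the sample hostname are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// popLabel matches a DNS label such as "fra50" or "fra50-c1": three letters
// (assumed to be the IATA airport code), digits, and an optional suffix.
var popLabel = regexp.MustCompile(`^([a-z]{3})[0-9]+(-[a-z0-9]+)?$`)

// airportCode returns the upper-cased three-letter code of the first label
// in hostname that looks like a POP name, and false if there is none.
func airportCode(hostname string) (string, bool) {
	for _, label := range strings.Split(strings.ToLower(hostname), ".") {
		if m := popLabel.FindStringSubmatch(label); m != nil {
			return strings.ToUpper(m[1]), true
		}
	}
	return "", false
}

func main() {
	// Hypothetical reverse DNS answer, as printed by dig -x <ip> +short.
	host := "server-192-0-2-10.fra50.r.cloudfront.net."
	if code, ok := airportCode(host); ok {
		fmt.Println("POP airport code:", code)
	} else {
		fmt.Println("no POP label found in", host)
	}
}
```

The code only automates reading the label. Mapping the airport code to a city still needs a lookup table or the search described above.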

How a map is served at Mapbox

Some weeks ago I gave a talk at the Berlin DevOps Meetup titled "Globally serving maps from 8 datacenters"*.

Below are the slides. They were made with the great Deckset app and lots of GIFs and videos (which sadly didn't make it into the PDF export).

Also during the talk I showed a video of our internal mapbox-cli tool which makes deploying with CloudFormation as easy as a git push. Questions in the comments are welcome.

*This has bugged me since I gave the talk: We serve from 8 AWS regions, with each region having 1, 2 or 3 availability zones. Each availability zone is its own datacenter, since they are said to be physically and logistically separate. Thus we serve from 1 + 4*2 + 3*3 = 18 datacenters.

Joining Mapbox

This week I joined the team of Mapbox. Mapbox builds great tools to create fantastic maps of the world, your country or your neighbourhood. The maps are highly customizable, with beautiful styles and your own data sets. Put them into your mobile app or display them on your website. Foursquare, Pinterest and others are already using it. And you should really have a look at the great guides and get started. Below is a satellite map of Berlin-Kreuzberg with my co-working space co.up marked up:

At Mapbox I'll be working with their infrastructure team, making sure that the maps are processed and served reliably and fast. Mapbox runs on AWS and I'm very keen on learning more about the paradigms of infrastructure engineering that entails. I'll stay in Berlin, but you'll likely be able to meet me often in Washington, DC and San Francisco.

Maps have been fascinating to me for a long time. Back in the days when I got my first mobile phone, I was intrigued by the fact that I would have access to a searchable map of our planet at any time. Maps are the user interface to the world around us and a key part of how we discover, plan and move. They are also a prime means of revealing information. Therefore they have an immense impact on our lives. Still, working with maps is cumbersome and this is what Mapbox is changing.

Mapbox as a company was built with two important aspects in mind: Open Source and OpenStreetMap.

The Mapbox GitHub org currently has over 100 public projects (like the core map design tool Mapbox Studio and sqlite3 bindings for Node.js). Combined with that, Mapbox internally organizes mainly around GitHub, working much like an internal Open Source project. All information is in repositories, wikis and issues. And after spending a week with it, I couldn't be happier. The level of transparency is amazing. My GitHub feed truly is the pulse of what everyone in the company is up to.

Mapbox is a huge consumer of and contributor to OpenStreetMap, the Wikipedia for map data. Mapbox has worked on better tooling (like the iD editor) and has its own team contributing data (like buildings in NYC). And as outlined very well here, the world needs OpenStreetMap.

So all of these aspects made it clear to me that Mapbox is a company I want to be a part of. I'm happy I get the chance to do so. And if you have any questions about maps, cloud infrastructure or working at Mapbox, keep them coming. I'm happy to answer.

(Update 13th Oct 2014: The welcome blog post is online here)

Master's Thesis: On Dependability Modeling in a Deployed Microservice Architecture

Master's Thesis Cover

Today I handed in my Master's Thesis for IT-Systems Engineering. Seven months of doubt and hard work, breakthroughs and setbacks finally come to a conclusion. It's a good feeling.

I posted about the core content of the thesis earlier already. The final title is "On Dependability Modeling in a Deployed Microservice Architecture" and here is the final abstract:

The microservice architectural style is a common way of constructing software systems. Such a software system then consists of many applications that communicate with each other over a network. Each application has an individual software development lifecycle. To work correctly, applications depend on each other. In this thesis, we investigate how dependability of such a microservice architecture may be assessed. We specifically investigate how dependencies between applications may be modeled and how these models may be used to improve the dependability of a deployed microservice architecture. We evaluate dependency graphs as well as qualitative and quantitative fault trees as modeling approaches. For this work the author was embedded in the engineering team of “SoundCloud”, a “Software as a Service” company with 250 million monthly users. The modeling approaches were executed and qualitatively evaluated in the context of that real case study. As results we found that dependency graphs are a valuable tool for visualizing dependencies in microservice architectures. We also found quantitative fault trees to deliver promising results.

You can download the whole thesis here (pdf/118 pages/1.5 MB). The LaTeX source is on GitHub. Both the thesis itself and the source are licensed under the Creative Commons BY-NC-ND 4.0 License.

Thanks to Peter Tröger for sending me on the journey for this thesis. Thanks also to all engineers at SoundCloud, especially Peter Bourgon for embedding me in his team and Alexander Grosse for allowing me to do the research with the company. Also thanks to nxtbgthng for letting me use their office and Lea and Hannes for proof-reading.

Modeling Dependencies in a Microservice Architecture

This post summarizes my master's thesis at a high level. More info on the thesis is here.

(Note: This post is a work-in-progress. I will update it in April/May 2014, as the thesis work is concluded)


When reasoning about the availability of a service in a microservice architecture, the basis for discussion is often the opinions of the engineers involved. The information used consists of the assumptions made while building the service, experience from operations (often visualized through graphs) and the occasional diagram (of dataflow or dependencies). This data tends to be subjective (by its nature or by the way it is selected), which makes it difficult to get a holistic view.

In my thesis I tried to investigate a more structured approach that should deliver more objective data.

My starting point was to find out which services exist and how they depend on each other. This can be modeled as a directed graph of dependencies. This graph allows the construction of fault trees for individual service failures. When these are annotated with failure probabilities, it is possible to calculate the probability of failure for the service.

I investigated to which extent it is possible to run this process automatically in an existing microservice architecture. I did this in a case study with a company from Berlin.

Building a Dependency Graph

I model dependencies between applications (service providers and service consumers) as a directed graph. Nodes are applications and edges are dependencies.

A dependency A->B implies that for A to work correctly, B has to work correctly. The following is a very simplified example of a web application:

Web -> MySQL
Web -> Recommendations
Recommendations -> ElasticSearch

Directed Graph
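The example graph can be written down as an adjacency list. The following Go sketch is only an illustration of the data structure, not the tooling used in the thesis; the type Graph and the method TransitiveDeps are names chosen for this example. It collects everything an application depends on, directly or indirectly.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph maps an application to the applications it directly depends on.
type Graph map[string][]string

// TransitiveDeps returns every application that app depends on, directly or
// indirectly, in sorted order.
func (g Graph) TransitiveDeps(app string) []string {
	seen := map[string]bool{}
	var visit func(string)
	visit = func(n string) {
		for _, dep := range g[n] {
			if !seen[dep] {
				seen[dep] = true
				visit(dep)
			}
		}
	}
	visit(app)

	deps := make([]string, 0, len(seen))
	for d := range seen {
		deps = append(deps, d)
	}
	sort.Strings(deps)
	return deps
}

func main() {
	g := Graph{
		"Web":             {"MySQL", "Recommendations"},
		"Recommendations": {"ElasticSearch"},
	}
	fmt.Println(g.TransitiveDeps("Web"))
}
```

For Web this yields ElasticSearch, MySQL and Recommendations: under the dependency semantics above, all three have to work for Web to work.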

So how might this graph be generated? Following are some of the approaches I tried.

Manual annotation: The dependencies might be written down by the engineers themselves. Given that each application has a repository, a practical implementation of this approach could be to host a dependency annotation file in each application's repository. The problem with this method is that the correctness of the graph depends on people maintaining the annotation files. Thus there might always be a gap (actually existing or at least feared) between reality and the graph. It's not a technical problem, but a human one. In my case study, I did the manual annotation for most applications.

From source code: I did not find a fitting method to derive service dependencies from source code. One approach assumed that all service dependencies are encapsulated in shared libraries. That is not the case (think HTTP calls via the standard library), but even if it were true, we'd still need to map the actual service to the library, like "shared library mysql2 encapsulates a service dependency to application MySQL". Another approach depended on how service discovery happens. Given service discovery happens via static identifiers in the code, we could parse these out. We then know the service dependency, if it is possible to derive the service name from that static identifier. An example is service discovery via DNS, where it is also assumed that the domain names follow a specific schema. For my case, that schema should include the service name, like <service>. In my case study, the source code mostly did not include service discovery identifiers, but got them assigned through the environment.

From application deployment environment: Given that the application gets its service discovery identifiers from the environment, we can use the same mechanism as in the previous section to detect the service dependency. The environment (also called configuration) is usually compiled during the deploy of the application. Examples for this are via Chef (with configuration files read by the application) or via a deployment system (with environment variables read by the application). In my case study, only some services went through service discovery, while the majority of services were addressed through "physical addresses" (hostname and port). I attribute the low coverage of service discovery to the fact that it was only recently introduced in the company and sees slow adoption by the engineers.

From network connections: Given that the applications are deployed and communicate with each other on the network, we might use that traffic to identify dependencies. For example, we might capture network connections via sockstat or netstat on application hosts. A connection might then look like this: source_pid source_ip:port -> destination_ip:port. To determine the source application, we may use the source process id. In my case study, I was able to derive the application name from the process id through an implementation detail of the deployment system. To determine the destination application, we need inverse service discovery, which turns an ip:port pair into a semantic service name. As mentioned before, in my case study I found a low adoption of service discovery, therefore this approach yielded sparse results as well.

To conclude, only the manual annotation resulted in a usable graph.

More approaches exist: For example, a tracing system like Google's Dapper, which tracks all network calls and annotates them with application identifiers, might allow the extraction of service dependencies.

Constructing a Fault Tree

Before we construct the fault tree, we have to state some assumptions: As failure semantics, we assume that there is no fault tolerance. The failure of a service leads to the failure of all applications that depend on it. Also, an application might fail for two reasons: either because one of the services it depends on fails, or because of an "inherent" failure, e.g. a bug in the software.

Let's construct a fault tree based on our previous dependency graph. The node whose failure we are interested in becomes the top event (in our example Web). From there, we create the "inherent" basic event for that service and an intermediate event for each dependency. All are connected to the top event via an OR-gate. We recursively continue this process for all dependencies. Here is the fault tree graphic:

Directed Graph

Due to the assumed failure semantics and the resulting algorithm, every application constitutes a single point of failure. Also the fault tree is significantly larger than the dependency graph. Thus, I conclude that such a fault tree alone is not a meaningful tool for modeling availability. But what if the Fault Tree had failure probabilities on it?

Putting numbers on a Fault Tree

Basic events in a fault tree may get failure probabilities assigned. Based on these and the structure of the fault tree, the failure probability of the top event can be calculated.
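Under the failure semantics assumed above (only OR-gates, no fault tolerance), this calculation is short if we additionally assume that the inherent failures are statistically independent: the top event occurs unless every involved application works. The following Go sketch illustrates this. The independence assumption, the probabilities and the name FailureProbability are illustrative assumptions and not results from the case study.

```go
package main

import "fmt"

// Graph maps an application to the applications it directly depends on.
type Graph map[string][]string

// FailureProbability returns the probability of the top event for app,
// assuming every gate is an OR-gate and inherent failures are independent.
// inherent holds the inherent failure probability of each application.
func FailureProbability(g Graph, inherent map[string]float64, app string) float64 {
	// Collect app and everything it depends on. Each application is counted
	// once, even if it is reachable through several paths.
	seen := map[string]bool{}
	var visit func(string)
	visit = func(n string) {
		if seen[n] {
			return
		}
		seen[n] = true
		for _, dep := range g[n] {
			visit(dep)
		}
	}
	visit(app)

	// The top event does not occur only if no involved application fails.
	allWork := 1.0
	for n := range seen {
		allWork *= 1 - inherent[n]
	}
	return 1 - allWork
}

func main() {
	g := Graph{
		"Web":             {"MySQL", "Recommendations"},
		"Recommendations": {"ElasticSearch"},
	}
	// Hypothetical inherent failure probabilities for the mission time.
	inherent := map[string]float64{
		"Web":             0.01,
		"MySQL":           0.02,
		"Recommendations": 0.05,
		"ElasticSearch":   0.03,
	}
	fmt.Printf("P(Web fails) = %.3f\n", FailureProbability(g, inherent, "Web"))
}
```

With these made-up numbers the top event has a probability of about 0.106, noticeably higher than the 0.01 inherent to Web itself. This is the kind of effect such a model is meant to make visible.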

I investigated two approaches:

  • Historical availability data: I wrote about the problem of measuring availability here. In my case study, collecting availability data was difficult, since there is no conclusive measurement regime for internal services in place.
  • Code churn: There is some evidence 1 that code churn might be a reasonable metric to predict defects in code. I'm still evaluating this approach.

Both approaches seem to be able to support monitoring the availability threats an application faces from the applications it depends on. This in turn could aid architectural design decisions.


Generating the dependency graph is the core of this modeling process. I showed several approaches that should enable the automated generation of that graph from an existing microservice architecture. In my case study, these were inhibited by a heterogeneous environment, especially with regard to the use of service discovery.

The structural fault tree seems to be less useful in practice than the dependency graph. On the other hand, a fault tree with failure rates might be a helpful tool for monitoring changes to the availability of an application.

The thesis operates under strong assumptions regarding application failure propagation. Extending it with fault tolerance mechanisms would be interesting future work, as would a case study in a more homogeneous environment.


  • 1 Nachiappan Nagappan, Thomas Ball. Use of relative code churn measures to predict system defect density. 2005. pdf

On Measuring the Availability of Services

This post is part of a series of posts in the context of my master's thesis in computer science. Check this post for an overview.


Assume we have a software system that runs as a microservices architecture (as recently summarized by Lewis & Fowler 1). Similar to their definitions, I define the following:

  • A component is a unit of software that is independently replaceable and upgradeable.
  • A service is a component that provides an interface for other out-of-process components to connect to, via a mechanism such as a web service request or a remote procedure call.
  • At runtime, each service might have many instances. This can range from a single instance to hundreds or thousands.

We assume these instances run in one network on many hosts. Services might depend on each other. For the context of this post we assume all communication happens via HTTP, but all ideas here should be independent of protocol.


When discussing the dependability of a software system, availability is a common aspect to evaluate. Let's look at one availability definition (from 2):

Availability is the readiness for a correct [behavior]. It can be also defined as the
probability that a system is able to deliver correctly its service at any given time.

Behavior here is seen as fulfilling the expectation of the user, which is usually captured in a specification. Given we have a request/response style communication, the specification would include all possible requests and their valid responses. If the service behaves in a way not specified, we speak of a failure of the service. The specification might also include failures (for example an HTTP response with a 500 status code).

When speaking about the availability of a service in practice, we usually would like to reduce that into one number. This comes out of the desire to compare availability, for example how a certain change impacts the availability of a service. In the definition above, availability is defined as a probability. When measuring availability, we usually base it on historical data with this formula:

Availability = Uptime / (Uptime + Downtime)

This will give a number between 0 (always down) and 1 (always up). Interpreted as a percentage, this yields the famous x-nines, like 99.99% ("4 nines") availability. It is important to note that availability is always defined over a period of time (called mission time), for example the last 24 hours or the last calendar month. This implies that we may look at availability only in hindsight, based on historical uptime and downtime data.

Let's look at an example of a day:

Mission Time = 24h
Uptime = 23h 50m = 85800s
Downtime = 0h 10m = 600s

Availability = 85800s / (85800s + 600s) ≈ 0.993055 ≈ 99.3 %
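The formula translates directly into code. The following Go sketch is a minimal illustration; the function name Availability and the 30-day example are assumptions chosen for this post, not measurements.

```go
package main

import (
	"fmt"
	"time"
)

// Availability returns uptime / (uptime + downtime) as a value between 0
// and 1. It returns 0 if the mission time is empty.
func Availability(uptime, downtime time.Duration) float64 {
	total := uptime + downtime
	if total <= 0 {
		return 0
	}
	return float64(uptime) / float64(total)
}

func main() {
	// Hypothetical mission time of 30 days with 43m12s of downtime.
	mission := 30 * 24 * time.Hour
	downtime := 43*time.Minute + 12*time.Second

	a := Availability(mission-downtime, downtime)
	fmt.Printf("availability = %.4f (%.2f%%)\n", a, a*100)
}
```

The example shows how little downtime "3 nines" allows: roughly 43 minutes in a 30-day month.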

One assumption in the above definition is that a service at any given time is either up or down (if we allowed both at the same time, we might get availability numbers over 1). So the next question becomes: how do we practically measure this?

How do we do time?

In the above definition of availability, we used absolute numbers for representing uptime and downtime. But how do we get these numbers?

The usual representation for this is a time series. It assumes a fixed interval of time. To each interval (or rather to its end) we assign the current availability state. For example, the interval could be 1 second. For each second, we would record whether the service was up or down. To calculate the uptime, we sum up all seconds with state up within our mission time.

Here is an example: we have a time interval of 1 second and look at the availability for a mission time of 8 seconds. The time series might then look like this:

Mission time = 8s

Time series (u=up/d=down):
time |1|2|3|4|5|6|7|8|

#up    = 6
uptime = 6s
#down    = 2
downtime = 2s

availability = 6s / (6s + 2s) = 0.75 = 75%
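The same calculation as a Go sketch: one state per interval, and the availability is the share of intervals in state up. The function name and the concrete order of up and down states are assumptions for illustration; only the totals (6 up, 2 down) are taken from the example.

```go
package main

import "fmt"

// AvailabilityFromSeries takes one state per interval (true = up,
// false = down) and returns the share of intervals in which the service
// was up. An empty series yields 0.
func AvailabilityFromSeries(series []bool) float64 {
	if len(series) == 0 {
		return 0
	}
	up := 0
	for _, isUp := range series {
		if isUp {
			up++
		}
	}
	return float64(up) / float64(len(series))
}

func main() {
	// Hypothetical series for a mission time of 8s with 1s intervals.
	series := []bool{true, true, false, true, true, true, false, true}
	fmt.Printf("availability = %.2f\n", AvailabilityFromSeries(series))
}
```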

Next, we will look at the actual acts of measuring.


A Heartbeat is a periodic message, signaling the current state of operation. In our context, it usually involves a client (which gives the heartbeat) and a server (which collects the heartbeat). Heartbeat gives us a classic time series: a server notes the client as up when it sees a valid heartbeat message for a given period and down when none at all or only a failure heartbeat message is seen.

There are two communication patterns for the heartbeat:

  • In a push-based heartbeat, the client reports to the server. Thus, the client has to implement the logic for sending that heartbeat message, based on the heartbeat protocol of the server. An example is an HTTP POST to an endpoint on the server.
  • In a pull-based heartbeat, the server polls the client regularly. The server might either query a dedicated heartbeat endpoint on the client or use an existing endpoint defined in the specification.

A problem with a pull-based heartbeat is that it only assures the correctness of a subset of the client's functionality. If the heartbeat endpoint works, this does not verify that the whole client adheres to the whole specification. For example, each endpoint the client exposes might have different external dependencies: endpoint A might depend on a database and endpoint B on an external API. If the database is unreachable, endpoint A will fail whereas endpoint B will work as expected. Depending on which endpoint is used for the heartbeat, the availability measurement would deliver different results.
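This blind spot is easy to reproduce. The following Go sketch is a toy setup, not a real monitoring system: it starts a local stand-in for the monitored client with two made-up endpoints, /a (simulating an unreachable database) and /b (working), and performs one pull-based heartbeat against each. All names are chosen for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// heartbeatClient is used for all heartbeat requests; a request that takes
// longer than the timeout counts as a missed heartbeat.
var heartbeatClient = &http.Client{Timeout: 2 * time.Second}

// newMonitoredService starts a local stand-in for the monitored client.
// Endpoint /a simulates an unreachable database, endpoint /b works.
func newMonitoredService() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/a", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "database unreachable", http.StatusInternalServerError)
	})
	mux.HandleFunc("/b", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	return httptest.NewServer(mux)
}

// beat performs one pull-based heartbeat against url. The monitored client
// counts as up only if it answers with a 2xx status within the timeout.
func beat(url string) bool {
	res, err := heartbeatClient.Get(url)
	if err != nil {
		return false
	}
	defer res.Body.Close()
	return res.StatusCode >= 200 && res.StatusCode < 300
}

func main() {
	srv := newMonitoredService()
	defer srv.Close()

	for _, endpoint := range []string{"/a", "/b"} {
		fmt.Printf("heartbeat via %s: up=%v\n", endpoint, beat(srv.URL+endpoint))
	}
}
```

The same process is reported as down via /a and as up via /b, so the choice of heartbeat endpoint decides the measured availability.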

Especially for web services, there are a multitude of companies doing heartbeats for you. An older list can be found here. A more sophisticated example is Runscope Radar, which does heartbeats by running a whole test suite against the service, therefore verifying the specification.

How do we do time with events?

Time series are based on regular time intervals. This means each interval may only get one value assigned. This is no good to us if we want to work with event data, which is a common case when request/response communication is involved. There will likely be many clients making requests within each time interval.

To solve this problem, we may aggregate the events for a time period. As an example, let's use HTTP status codes, which have the nice property that they include codes for failure.

time |  1|  1|  3|  4|  4|  6|  7|  7|  9|
code |200|500|404|200|500|200|500|500|500|


period[0-4]status[200] = 2
period[0-4]status[404] = 1
period[0-4]status[500] = 2

period[5-9]status[200] = 1
period[5-9]status[404] = 0
period[5-9]status[500] = 3

In this example, we summed up the status codes for each period as a counter. Each status code represents its own time series over these counts.

To use this for availability purposes, we need to condense all these time series into one number. The actual formula for this depends highly on the use case, especially on which behavior is expected and which is not. For the given example we might say that we consider the service failed in a period if there are more status codes >= 500 than status codes < 500. For the previous example, period[0-4] would be up and period[5-9] would be down.
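The aggregation and the up/down rule from this example can be sketched in Go as follows. The type and function names are made up for illustration, and the rule is the example rule from the text, not a general recommendation. The period length must be positive.

```go
package main

import "fmt"

// Event is one observed response.
type Event struct {
	Time   int // seconds since the start of the measurement
	Status int // HTTP status code
}

// Aggregate counts status codes per period. The result maps the period
// index (Time / periodLen) to a counter per status code.
func Aggregate(events []Event, periodLen int) map[int]map[int]int {
	counts := map[int]map[int]int{}
	for _, e := range events {
		p := e.Time / periodLen
		if counts[p] == nil {
			counts[p] = map[int]int{}
		}
		counts[p][e.Status]++
	}
	return counts
}

// IsUp applies the example rule: a period counts as down if it saw more
// status codes >= 500 than status codes < 500.
func IsUp(statusCounts map[int]int) bool {
	failed, succeeded := 0, 0
	for status, n := range statusCounts {
		if status >= 500 {
			failed += n
		} else {
			succeeded += n
		}
	}
	return failed <= succeeded
}

func main() {
	// The events from the table above.
	events := []Event{
		{1, 200}, {1, 500}, {3, 404}, {4, 200}, {4, 500},
		{6, 200}, {7, 500}, {7, 500}, {9, 500},
	}
	counts := Aggregate(events, 5)
	for p := 0; p < 2; p++ {
		fmt.Printf("period[%d-%d] counts=%v up=%v\n", p*5, p*5+4, counts[p], IsUp(counts[p]))
	}
}
```

Note that under this rule a period without any events counts as up, which may or may not be what you want for your use case.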

As a benefit over heartbeats, this method is based on the actual interaction with the service, therefore providing real-world testing of the specification.

So how do we get hold of these counts?

Count on the service

Responses are captured on the service instances, usually within the instance process. This has the problem that the service might fail in a way that no counts are collected anymore. An example is a kernel panic.

Count on the clients

Responses are captured on the service clients. This might happen either within the instance process or on the network path (for example on a load balancer like HAProxy). This will also detect crash failures of the service.

Both counting methods should gather their data in a central place, given that they will have to run on instances of which we have many. One example for a program doing that aggregation is statsd. It first aggregates counts in each instance process (via a shared library statsd client), then aggregates these aggregates on a statsd server, which eventually writes the time series to a database like graphite.

Inherent problem of measuring

Whenever we measure the availability of a system, we are actually measuring many things at the same time:

  • The availability of the measured system (for example a service)
  • The availability of the communication medium (for example network, with switches on the way)
  • The availability of the measuring system (for example the heartbeat system)

In a perfect world, we assume the measuring system and the communication medium to be perfect and never break. In practice, they do fail and their failures might impact the correctness of the gathered data, especially when they are not detected and thus are assumed to be failures of the measured service.

Other ways of measuring availability

I'm sure there are more commonly used ways of measuring; please add in the comments.


My Master's Thesis

In December 2013 I started working on my master's thesis in IT-Systems Engineering. By April 2014, I started writing. Thoughts have the downside of being hard to verify. So I need more people to read this and give feedback, so I can better fight the never-ending battle against doubt. And the most scalable way seems to be public blog posts.

This post is a Table of Contents for all the posts around the thesis.

content posts

meta posts

  • The Story of the Thesis (to be written)
  • Lessons learned (to be written)

Accessing the GitHub API with Golang

So you want to access the GitHub API with Go? This post should give you some pointers on how to do that. It is deliberately entry-level and aimed at API beginners (mostly because I want to get better at writing posts for novices).

In this post, I'm not using the go-github client library, but instead all interaction is done with Go's standard net/http library. I do that to show the exact interaction happening with the API (and because using SDKs has its own problems).

The first question you have to answer: What do you want to access?

  1. Public data
  2. Private data your Github account has access to (like private repos)
  3. Private data from other people's Github accounts

The third case is a bit more complicated, since it requires OAuth 2 for an authentication flow. I'll cover that in another blog post. So let's focus on the first two cases for now.

Reading the Github Docs

At you'll find the official Github API docs. Go to Documentation or Reference and you'll get infos on all the HTTP endpoints that are available via the API. Find an endpoint you are interested in. Some endpoints may only work when you are logged in. The docs tell you (sometimes) if authentication is needed. Sometimes they don't (note to self: submit a pull request that adds that information to the Github docs). If you are not sure, just see if that information is publicly available on the Github webpage, since the API and the website basically have the same public/private restriction.

So you found an endpoint to query? For example the user endpoint is a nice starter.

When working with remote APIs, a good idea is to query them "by hand". Everyone's favorite tool for that is curl. Use it on your command line like this:

$ curl -i
HTTP/1.1 200 OK
Date: Sat, 29 Mar 2014 22:39:51 GMT
  "login": "freenerd",
  "id": 25713,

The -i flag shows the headers of the Response.

Accessing public data on Github

So let's do the http request:

// imports needed: "io/ioutil", "log", "net/http"

// request http api
res, err := http.Get("")
if err != nil {
  log.Fatal(err)
}
defer res.Body.Close()

// read body
body, err := ioutil.ReadAll(res.Body)
if err != nil {
  log.Fatal(err)
}

if res.StatusCode != 200 {
  log.Fatalf("Unexpected status code: %d", res.StatusCode)
}

log.Printf("Body: %s\n", body)

The current body is just a string, so we still have to parse it in order to access the data within. The Github API returns json, so let's use Go's encoding/json library.

// parse json (needs import "encoding/json")
type jsonUser struct {
  Name string `json:"login"`
  Blog string `json:"blog"`
}
user := jsonUser{}

err = json.Unmarshal(body, &user)
if err != nil {
  log.Fatal(err)
}

log.Printf("Received user %s with blog %s", user.Name, user.Blog)

First we create a new struct that will be filled with the data from the json object. Then we Unmarshal the string into the struct. If the naming of the keys in the json and in the struct does not match, use the json:"<name>" annotation to fix it.

That's it for accessing public data. Check the whole code here.

When playing around with the GitHub API like this, you might all of a sudden be stopped in your work: instead of returning meaningful results, the GitHub API only returns 403 status codes with a body talking about API rate limit exceeded. To prevent abuse of their API, GitHub only allows a certain number (60 at time of writing) of unauthorized API calls per hour from one IP. But thankfully, the docs on that rate limiting also point to the solution: more calls per hour (5000 at time of writing) if you do authorized calls, so let's try that next ...

Authorized calls to private data on Github

Github has several ways of doing authenticated calls. Even though it might seem tempting, please don't use Basic Auth with your username and password, for many reasons, but mostly because your passwords should be secret and using them anywhere in code opens the door for making mistakes which might lead to their exposure.

Instead, we opt for using a personal access token that you can create from within the Github web application here. When generating the token, make sure you give it the appropriate scope for the endpoint you want to query.

Once you have your token (a 40-character string), you can use it with basic auth. For example, I want to look at my ssh keys on Github (which requires the read:public_key scope).

Let's first do the request via curl (where <token> becomes your token and the -u flag sets up basic auth):

$ curl -u <token>:x-oauth-basic
HTTP/1.1 200 OK
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4983
    "id": 483166,

Again, I've removed some of the output. But note the X-RateLimit-Remaining header, which tells us how many calls we still have left for the next hour. Also the username:password for basic auth is <token> as username and the string x-oauth-basic as password.

Let's do all this in Go. To use basic auth, we have to create an http request object, which is then executed via an http client.

// request http api
req, err := http.NewRequest("GET", "", nil)
if err != nil {
  log.Fatal(err)
}
req.SetBasicAuth("<token>", "x-oauth-basic")

client := http.Client{}
res, err := client.Do(req)
if err != nil {
  log.Fatal(err)
}
defer res.Body.Close()

log.Println("StatusCode:", res.StatusCode)

// read body
body, err := ioutil.ReadAll(res.Body)
if err != nil {
  log.Fatal(err)
}

log.Printf("Body: %s\n", body)

I'll leave the JSON parsing to you; it's similar to the way we did it before. The full code can be found here.
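Since the rate limit is easy to hit, it can help to check the X-RateLimit-Remaining header shown in the curl output above. The following is a minimal sketch using only the standard library. The function name remainingCalls is made up for this example, and the header is filled in by hand with the value from the curl output instead of coming from a real response.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// remainingCalls reads the X-RateLimit-Remaining header of a response and
// returns it as a number. It fails if the header is missing or not a number.
func remainingCalls(h http.Header) (int, error) {
	v := h.Get("X-RateLimit-Remaining")
	if v == "" {
		return 0, fmt.Errorf("header X-RateLimit-Remaining not set")
	}
	return strconv.Atoi(v)
}

func main() {
	// Stand-in for res.Header of a real API response.
	h := http.Header{}
	h.Set("X-RateLimit-Remaining", "4983")

	n, err := remainingCalls(h)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("calls left:", n)
}
```

In the request code above you would pass res.Header to the function after client.Do has returned without an error.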

Setting up Jekyll on Github Pages with a custom domain on

Recently I've moved my blog to Jekyll. This included moving my domain, which is now registered with and hosted on github pages. Here is some experience from my move:

  • Switching .de domains is fast The last time I moved a domain, I had to fax (as in fax machine) a signed KK-Antrag and wait for days until the switch happened. Today, one only has to enter an auth code and wait for the DNS TTL to switch to the new registration, boom, done.
  • Think of your mail I'm running my mail through my domain. While switching registration, I also got new mail servers. These have to be configured. Do that before the switch. Also FYI: Gandi allows forwarding email (for incoming mail) while still having a mailbox (for outgoing mail) for the same email address.
  • The CNAME influences all your domain When you change the CNAME file in your Github Pages repo, all other Github Pages from all your other projects will also be redirected to your custom domain. Example: The honeypot repo pages used to live under but now are at This is cool. But remember to not re-use these paths in your blog though.
  • No apex domains with Gandi I couldn't get them to work, since Gandi does not support ALIAS records. So I'm running everything under the www. subdomain now.

GitHub has good documentation on using a custom domain. Still, for completeness, here is the important bit of my Gandi DNS zone file:

@ 10800 IN A
@ 10800 IN A
www 10800 IN CNAME