ASP.NET 5 Docker language stack with Kestrel

This blog post presents a Docker Language Stack for creating and running ASP.NET 5 (née vNext) apps. It’s based on my work last week to run ASP.NET 5 on Google Container Engine.

I the interim, the ASP.NET team has released their own Docker image. It’s not really up to spec for being a Docker language stack though, so I forked it, added what was missing and published it on Docker Hub.

Other people already sent PRs to add onbuild support to the ASP.NET repo, but there’s apparently some uncertainty about how ASP.NET 5 apps are going to get built, so they’re holding off on merging. I hope that eventually the work presented here will get folded into the official repo, just like it happened with the Mono stack I created a month ago. That’s the base for what’s now the official Mono Docker language stack, which, incidentally, is what the ASP.NET docker image derives from!

How to use

Using the onbuild image is pretty simple. To run HelloWeb sample, clone that repo and add this Dockerfile in the HelloWeb dir, next to the project.json:

FROM friism/aspnet:1.0.0-beta1-onbuild
EXPOSE 5004

Now build the image:

docker build -t my-app .

And finally run the app, exposing the site on port 80 on your local machine:

docker run -t -p 80:5004 my-app

Note that the -t option is currently required when invoking docker run. This is because there’s some sort of bug in Kestrel that requires the process to have a functional tty to write to – without a tty, Kestrel hangs on start.

Google Container Engine for Dummies

Last week, Google launched an alpha version of a new product called Google Container Engine (GKE). It’s a service that runs pre-packaged Docker images for you: You tell GKE about images you want to run (typically ones you’ve put in the Docker Registry, although there’s a also a hack to run private images) and how many instances you need. GKE will spin them up and make sure the right number is running at any given time.

The GKE Getting Started guide is long and complicated and has more JSON than you shake a stick at. I suspect that’s because the product is still alpha, and I hope the Google guys will improve both the CLI and web UIs. Anyway, below is a simpler guide showing how to stand up a stateless web site with just one Docker image type. I’m also including some analysis at the end of this post.

I’m using a Mono/ASP.NET vNext Docker image, but all you need to know is that it’s an image that exposes port 5004 and serves HTTP requests on that port. There’s nothing significant about port 5004 – if you want to try with an image that uses a different port, simply substitute as appropriate.

In the interest of brevity, the description below skips over many details. If you want more depth, then remember that GKE is Kubernetes-as-a-Service and check out the Kubernetes documentation and design docs.

Setup

  1. Go to the Google Developer Console and create a new project
  2. For that project, head into the “APIs” panel and make sure you have the “Google Container Engine API” enabled
  3. In the “Compute” menu section, select “Container Engine” and create yourself a new “Cluster”. A cluster size of 1 and a small instance is fine for testing. This guide assumes cluster name “test-cluster” and region “us-central1-a”.
  4. Install the CLI  and run gcloud config set project PROJECT_ID (PROJECT_ID is from step 1)

Running raw Pod

The simplest (and not recommended) way to get something up and running is to start a Pod and connect to it directly with HTTP. This is roughly equivalent to starting an AWS EC2 instance and connecting to its external IP.

First step is to create a JSON-file somewhere on your system, let’s call it pod.json:

{
  "id": "web",
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "desiredState": {
    "manifest": {
      "version": "v1beta2",
      "containers": [
        {
          "name": "web",
          "image": "friism/aspnet-web-sample-web",
          "ports": [
            { "containerPort": 5004, "hostPort": 80 }
          ]
        }
      ]
    }
  },
  "labels": {
    "name": "web"
  }
}

What you should care about is the Docker image/repository getting run (friism/aspnet-web-sample-web) and the port mapping (the equivalent of docker run -p 80:5004). With that, we can tell GKE to start a pod for us:

$ gcloud preview container pods --cluster-name test-cluster --zone us-central1-a \
    create web --config-file=/path/to/pod.json
...
ID                  Image(s)                       Host                Labels              Status
----------          ----------                     ----------          ----------          ----------
web                 friism/aspnet-web-sample-web   <unassigned>        name=web            Waiting

All the stuff before “create” is boilerplate and the rest is saying that we’re requesting a pod named “web” as specified in the JSON file.

Pods take a while to get going, probably because the Docker image has to be downloaded from Docker Hub. While it’s starting (and after), you can SSH into the instance that’s running your pod to see how it’s doing, eg. by running sudo docker ps. This is the SSH incantation:

$ gcloud compute ssh --zone us-central1-a k8s-test-cluster-node-1

The instances are named k8s-<cluster-name>-node-1 and you can see them listed in the Web UI or with gcloud compute instances list. Wait for the pod to change status to “Running”:

$ gcloud preview container pods --cluster-name test-cluster --zone us-central1-a list
ID                  Image(s)                       Host                              Labels              Status
----------          ----------                     ----------                        ----------          ----------
web                 friism/aspnet-web-sample-web   k8s-<..>.internal/146.148.66.67   name=web            Running

The final step is to open up for HTTP traffic to the Pod. This setting is available in the Web UI for the instance (eg. k8s-test-cluster-node-1). Also check that the network settings for the instance allow for TCP traffic on port 80.


And with that, your site should be responding on the external ephemeral IP address of the host running the pod.

As mentioned in the introduction, this is not a production setup. The Kubernetes service running the pod will do process management and restart Docker containers that die for any reason (to test this, try ssh’ing into your instance and docker-kill the container that’s running your site – a new one will quickly pop up). But your site will go down in case there’s a problem with the pod, for example. Read on for details on how to extend the setup to cover that failure mode.

Adding Replication Controller and Service

In this section, we’re going to get rid of the pod-only setup above and replace with a replication controller and a service fronted by a loadbalancer. If you’ve been following along, delete the pod created above to start with a clean slate (you can also start with a fresh cluster).

First step is to create a replication controller. You tell a replication controller what and how many pods you want running, and the controller then tries to make sure the correct formation is running at any given time. Here’s controller.json for our simple use case:

{
  "id": "web",
  "kind": "ReplicationController",
  "apiVersion": "v1beta1",
  "desiredState": {
    "replicas": 1,
    "replicaSelector": {"name": "web"},
    "podTemplate": {
      "desiredState": {
         "manifest": {
           "version": "v1beta1",
           "id": "frontendController",
           "containers": [{
             "name": "web",
             "image": "friism/aspnet-web-sample-mvc",
             "ports": [{"containerPort": 5004, "hostPort": 80 }]
           }]
         }
       },
      "labels": { "name": "web" }
      }},
  "labels": {"name": "web"}
}

Notice how it’s similar to the pod configuration, except we’re specifying how many pod replicas the controller should try to have running. Create the controller:

$ gcloud preview container replicationcontrollers --cluster-name test-cluster \
    create --zone us-central1-a --config-file /path/to/controller.json
...
ID                  Image(s)                       Selector            Replicas
----------          ----------                     ----------          ----------
web                 friism/aspnet-web-sample-mvc   name=web            1

You can now query and see the controller spinning up the pods you requested. As above, this might take a while.

Now, let’s get a GKE service going. While individual pods come and go, services are permanent and define how pods of a specific kind can be accessed. Here’s service.json that’ll define how to access the pods that our controller is running:

{
  "id": "myapp",
  "selector": {
    "app": "web"
  },
  "containerPort": 80,
  "protocol": "TCP",
  "port": 80,
  "createExternalLoadBalancer": true
}

The important parts are selector which specifies that this service is about the pods labelled web above, and createExternalLoadBalancer which gets us a loadbalancer that we can use to access our site (instead of accessing the raw ephemeral node IP). Create the service:

$ gcloud preview container services --cluster-name test-cluster--zone us-central1-a create --config-file=/path/to/service.json
...
ID                  Labels              Selector            Port
----------          ----------          ----------          ----------
myapp                                   app=web             80

At this point, you can go find your loadbalancer IP in the Web UI, it’s under Compute Engine -> Network load balancing. To actually see my site, I still had to tick the “Enable HTTP traffic” boxes for the Compute Engine node running the pod – I’m unsure whether that’s a bug or me being impatient. The loadbalancer IP is permanent and you can safely create DNS records and such pointing to it.

That’s it! Our stateless web app is now running on Google Container Engine. I don’t think the default Bootstrap ASP.NET MVC template has ever been such a welcome sight.

Analysis

Google Container Engine is still in alpha, so one shouldn’t draw any conclusions about the end-product yet (also note that I work for Heroku and we’re in the same space). Below are a few notes though.

Google Container Engine is “Kubernetes-as-a-Service”, and Kubernetes is currently exposed without any filter. Kubernetes is designed based on Google’s experience running containers at scale, and it may be that Kubernetes is (or is going to be) the best way to do that. It also has a huge mental model however – just look at all the stuff we had to do to launch and run a simple stateless web app. And while the abstractions (pods, replication controllers, services) may make sense for the operator of a fleet of containers, I don’t think they map well to the mental model of a developer just wanting to run code or Docker containers.

Also, even with all the work we did above, we’re not actually left with a managed and resilient capital-S Service. What Google did for us when the cluster was created, was simply to spin up a set of machines running Kubernetes. It’s still on you to make sure Kubernetes is running smoothly on those machines. As an example, a GKE cluster currently only has one Master node. This is the Kubernetes control plane node that accepts API input and schedules pods on the GCE instances that are Kubernetes minions. As far as I can determine, if that node dies, then pods will no longer get scheduled and re-started on your cluster. I suspect Google will add options for more fault-tolerant setups in the future, but it’s going to be interesting to see what operator-responsibility the consumer of GKE will have to take on vs. what Google will operate for you as a Service.

Mono Docker language stack

I couple weeks ago, Docker announced official pre-built Docker images for a bunch of popular programming languages. Each stack generally consists of two Dockerfiles: a base Dockerfile that installs system dependencies required for that language to run, and an onbuild Dockerfile that uses ONBUILD instructions to transform app source code into a runnable Docker image. As an example of the latter, the Ruby onbuild Dockerfile runs bundle install to install libraries specified in an app’s Gemfile.

Managing system dependencies and composing apps from source code is very similar to what we do with Stacks and Buildpacks at Heroku. To better understand the Docker approach, I created a language stack for Mono, the open source implementation of Microsoft’s .NET Framework.

UPDATE: There’s now a proper official Docker/Mono language stack, I recommend using that.

How to use

A working Docker installation is required for this section.

To turn a .NET app into a runnable Docker image, first add a Dockerfile to your app source root. The sample below assumes a simple console app with an output executable name of TestingConsoleApp.exe:

FROM friism/mono:3.10.0-onbuild
CMD [ "mono", "./TestingConsoleApp.exe" ]

Now build the image:

docker build -t my-app .

The friism/mono images are available in the public Docker Registry and your Docker client will fetch them from there. Docker will then execute the onbuild instructions to restore NuGet packages required by the app and use xbuild (the Mono equivalent of msbuild) to compile source code into executables and libraries.

The Docker image with your app is now ready to run:

docker run my-app

If you don’t have an app to test with, you can experiment with this console test app.

Notes

The way Docker languages stacks are split into a base image (that declares system dependencies) and an onbuild Dockerfile (that composes the actual app to be run) is perfect. It allows each language to get just the system libraries and dependencies needed. In contrast, Heroku has only one stack image (in several versions, reflecting underlying Linux distribution versions) that all language buildpacks share. That stack is at once both too thick and too thin: It includes a broad assortment of libraries to make supported languages work, but most buildpack maintainers still have to hand-build dependencies and vendor in the binaries when apps are built.

Docker has no notion of a cache for ONBUILD commands whereas the Heroku buildpack API has a cache interface. No caching makes the Docker stack maintainer’s life easier, but it also makes builds much slower than what’s possible on Heroku. For example, Heroku buildpacks can cache the result of running bundle install (in the case of Ruby) or nuget restore (for Mono), greatly speeding up builds after the first one.

Versioning is another interesting difference. Heroku buildpacks bake support for all supported language versions into a single monolithic release. What language version to use is generally specified by the app being built in a requirements.txt (or similar) file and the buildpack uses that to install the correct packages.

Docker language stacks, on the other hand, support versioning with version tags. The app chooses what stack version to use with the FROM instruction in the Dockerfile that’s added to the app. Stack versions map to versions of the underlying language or framework (eg. FROM python:3-onbuild gets you Python 3). This approach lets the Python stack, for example, compile Python 2 and 3 apps in different ways without having a bunch of branching logic in the onbuild Dockerfile. On the other hand, pushing an update to all Python stack versions becomes more work because the tags have to be updated individually. There are tradeoffs in both the Docker and Heroku buildpack approaches, I don’t know which is best.

Docker maintains a free, automated build service that churns out hosted Docker images for everyone to use. For my Mono stack, Docker Hub pulls updates from the GitHub repo with the Dockerfiles and builds the relevant tags into images. This is very convenient for stack maintainers. Heroku has no hosted service for building buildpack binaries, although I have documented a (Docker-based) approach to scripting this work.

(Note that, while Heroku buildpacks are wildly successful, it’s an older standard that predates Docker by many years. If it seems like Docker has gotten more things right, it’s probably because that project was informed by Heroku’s experience and by the passage of time).

Finally, and unrelated to Docker and Heroku, the Mono Project now has an APT package repository. This is pretty awesome, and I sincerely hope that the days of having to compile Mono from source are behind us. I don’t know if the repo is quite stable yet (I had to download a key without using SSL, the mono-devel package is versioned 3.10.0-0xamarin1 and the package fails to declare a dependency on udev), but it made the Mono Docker stack image a lot simpler. Check out the diff going from 3.8.0 (compiled from source) to 3.10.0 (installed from APT repo).

WebP, content negotiation and CloudFront

AWS recently launched improvements to the CloudFront CDN service. The most important change is the option to specify more HTTP headers be part of the cache key for cached content. This lets you run CloudFront in front of apps that do sophisticated Content Negotiation. In this post, we demonstrate how to use this new power to selectively serve WebP or JPEG encoded images through CloudFront depending on client support. It is a follow-up to my previous post on Serving WebP images with ASP.NET. Combining WebP and CloudFront CDN makes your app faster because users with supported browsers get highly compressed WebP images from AWS’ fast and geo-distributed CDN.

The objective of WebP/JPEG content negotiation is to send WebP encoded images to newer browsers that support it, while falling back to JPEG for older browser without WebP support. Browsers communicate WebP support through the Accept HTTP header. If the Accept header value for a request includes image/webp, we can send WebP encoded images in responses.

Prior to the CloudFront improvements, doing WebP/JPEG content negotiation for images fronted with CloudFront CDN was not possible. That’s because CloudFront would only vary response content based on the Accept-Encoding header (and some other unrelated properties). With the new improvements, we can configure arbitrary headers for CloudFront to cache on. For WebP/JPEG content negotiation, we specify Accept in list list of headers to whitelist in the AWS CloudFront console.

Specifying Accept header in AWS console
CloudFront configuration

Once the rest of the origin and behavior configuration is complete, we can use cURL to test that CloudFront correctly caches and serves responses based on the Accept request header:

curl -v http://abcd1234.cloudfront.net/myimageurl > /dev/null
...
< HTTP/1.1 200 OK
< Content-Type: image/jpeg
...

And simulating a client that supports WebP:

curl -v -H "Accept: image/webp" http://abcd1234.cloudfront.net/myimageurl > /dev/null
...
< HTTP/1.1 200 OK
< Content-Type: image/webp
...

Your origin server should set the Vary header to let any caches in the response-path know how they can cache, although I don’t believe it’s required to make CloudFront work. In our case, the correct Vary header value is Accept,Accept-Encoding.

That’s it! Site visitors will now be receiving compact WebP images (if their browsers support them), served by AWS CloudFront CDN.

Building Heroku buildpack binaries with Docker

This post covers how binaries are created for the Heroku Mono buildpack, in particular the Mono runtime and XSP/fastcgi-mono-server components that are vendored into app slugs. Binaries were previously built using a custom buildpack with the build running in a Heroku dyno. That took a while though and was error-prone and hard to debug, so I have migrated the build scripts to run inside containers set up with Docker. Some advantages of this approach are:

  • It’s fast, builds can run on my powerful laptop
  • It’s easy to get clean Ubuntu Lucid (10.04) image for reproducible clean-slate builds
  • Debugging is easy too, using an interactive Docker session

The two projects used to build Mono and XSP are on GitHub. Building is a two-step process: First, we do a one-time Docker build to create an image with the bare-minimum (eg. curl, gcc, make) required to perform Mono/XSP builds. s3gof3r is also added – it’s used a the end of the second step to upload the finished binaries to S3 where buildpacks can access them.

The finished Docker images have build scripts and can now be used to run builds at will. Since changes are not persisted to the Docker images each new build happens from a clean slate. This greatly improves consistency and reproducibility. The general phases in the second build step are:

  1. Get source code to build
  2. Run configure or autogen.sh
  3. Run make and make install
  4. Package up result and upload to S3

Notice that the build scripts specify the /app filesystem location as the installation location. This is the filesystem location where slugs are mounted in Heroku dynos (you can check yourself by running heroku run pwd). This may or not be a problem for your buildpack, but Mono framework builds are not path-relative, and expect to eventually be run out of where they were configured to be installed. So during the build, we have to use a configure setting consistent with how Heroku runs slugs.

The build scripts are parametrized to take AWS credentials (for the S3 upload) and the version to build. Building Mono 3.2.8 is done like this:

$ docker run -v ${PWD}/cache:/var/cache -e AWS_ACCESS_KEY_ID=key -e AWS_SECRET_ACCESS_KEY=secret -e VERSION=3.2.8 friism/mono-builder

It’s now trivial to quickly create and upload to S3 binaries for all the versions of Mono I want to support. It’s very easy to experiment with different settings and get binaries that are small (so slugs end up being smaller), fast and free of bugs.

There’s one trick in the docker run command: -v ${PWD}/cache:/var/cache. This mounts the cache folder from the current host directory in the container. The build scripts use that location to store and cache the different versions of downloaded source code. With the source code cache, turnaround times when tweaking build settings are even shorter because I don’t have to wait for the source to re-download on each run.

Running .NET apps on Docker

This blog post covers running simple .NET apps in Docker lightweight containers using Mono. I run Docker in a Vagrant/VirtualBox VM on Windows. This works great and is fast. Installation instructions are available on the Docker site.

Building the base image

First order of business is to create a Docker image that has Mono installed. We will use this as the base image for containers that actually run apps. To get the most recent Mono version (3.2.6 at the time of writing) I use packages created by Timotheus Pokorra installed on a Ubuntu 12.04 LTS Docker image. Here’s the Dockerfile for that:

FROM ubuntu:12.04
MAINTAINER friism

RUN apt-get -y -q install wget
RUN wget -q http://download.opensuse.org/repositories/home:tpokorra:mono/xUbuntu_12.04/Release.key -O- | apt-key add -
RUN apt-get remove -y --auto-remove wget
RUN sh -c "echo 'deb http://download.opensuse.org/repositories/home:/tpokorra:/mono/xUbuntu_12.04/ /' >> /etc/apt/sources.list.d/mono-opt.list"
RUN apt-get -q update
RUN apt-get -y -q install mono-opt

Here’s what’s going on:

  1. Install wget
  2. Add repository key to apt-get
  3. remove wget
  4. Add openSUSE repository to sources list
  5. Install Mono from there

At first I did all of the above in one command since these steps represent the single logical step of installing Mono and it seems like they should be just one commit. Nobody will be interested in the commit after wget was installed, for example. I ended up splitting it up into separate RUN commands since that’s what other people seem to do.
With that Dockerfile, we can build an image:

$ docker build -t friism/mono .

We can then run a container using the generated image and check our Mono installation:

vagrant@precise64:~/mono$ docker run -i -t friism/mono bash
root@0bdca65e6e8e:/# /opt/mono/bin/mono --version
Mono JIT compiler version 3.2.6 (tarball Sat Jan 18 16:48:05 UTC 2014)

Note that Mono is installed in /opt and that it works!

Running a console app

First, we’ll deploy a very simple console app:

using System;

namespace HelloWorld
{
	public class Program
	{
		static void Main(string[] args)
		{
			Console.WriteLine("Hello World");
		}
	}
}

For the purpose of this example, we’ll pre-build apps using the VS command prompt and msbuild and then add the output to the container:

msbuild /property:OutDir=C:\tmp\helloworld HelloWorld.sln

Alternatively, we could have invoked xbuild or gmcs from within the container.

The Dockerfile for the container to run the app is extremely simple:

FROM friism/mono
MAINTAINER friism

ADD app/ .
CMD /opt/mono/bin/mono `ls *.exe | head -1`

Note that it relies on the friism/mono image created above. It also expects the compiled app to be in the /app folder, so:

$ ls app
HelloWorld.exe

The CMD will simply use mono to run the first executable found in the build output. Let’s build and run it:

$ docker build -t friism/helloworld-container .
...
$ docker run friism/helloworld-container
Hello World

It worked!

Web app

Running a self-hosting OWIN web app is only slightly more work. For this example I used the sample code from the OWIN/Katana-on-Heroku post.

$ ls app/
HelloWorldWeb.exe  Microsoft.Owin.Diagnostics.dll  Microsoft.Owin.dll  Microsoft.Owin.Host.HttpListener.dll  Microsoft.Owin.Hosting.dll  Owin.dll

The Dockerfile for this exposes port 5000 from the container, and CMD is used to start the web app and specify port 5000 to listen on:

FROM friism/mono
MAINTAINER friism

ADD app/ .
EXPOSE 5000
CMD ["/opt/mono/bin/mono", "HelloWorldWeb.exe", "5000"]

We start the container and map port 5000 to port 80 on the machine running Docker:

$ docker run -p 80:5000 -t friism/mono-hello-world-web

And with that we can visit the OWIN sample site on http://localhost/.

If you’re a .NET developer this post will hopefully have helped you place Docker in the context of your everyday work. If you’re interested in more ideas on how to use Docker to deploy apps, check out this Automated deployment with Docker – lessons learnt post.

Serving WebP images with ASP.NET MVC

Speed is a feature and one thing that can slow down web apps is clients waiting to download images. To speed up image downloads and conserve bandwidth, the good folks of Google have come up with a new image format called WebP (“weppy”). WebP images are around 25% smaller in size than equivalent images encoded with JPEG and PNG (WebP supports both lossy and lossless compression) with no worse perceived quality.

This blog post shows how to dynamically serve WebP-encoded images from ASP.NET to clients that support the new format.

One not-so-great way of doing this is to serve different HTML depending on whether clients supports WebP or not, as described in this blog post. As an example, clients supporting WebP would get HTML with <img src="image.webp"/> while other clients would get <img src="image.jpeg"/>. The reason this sucks is that the same HTML cannot be served to all clients, making caching harder. It will also tend to pollute your view code with concerns about what image formats are supported by browser we’re rendering for right now.

Instead, images in our solution will only ever have one url and the content-type of responses depend on the capabilities of the client sending the request: Browsers that support WebP get image/webp and the rest get image/jpeg.

I’ll first go through creating WebP-encoded images in C#, then tackle the challenge of detecting browser image support and round out the post by discussing implications for CDN use.

Serving WebP with ASP.NET MVC

For the purposes of this article, we’ll assume that we want to serve images from an URI like /images/:id where :id is some unique id of the image requested. The id can be used to fetch a canonical encoding of the image, either from a file system, a database or some other backing store. In the code I wrote to use this, the images are stored in a database. Once fetched from the backing store, the image is re-sized as desired, re-encoded and served to the client.

At this point, some readers are probably in uproar: “Doing on-the-fly image fetching and manipulation is wasteful and slow” they scream. That’s not really the case though, and even if it were, the results can be cached on first request and then served quickly.

Assume we have an Image class and method GetImage(int id) to retrieve images:

private class Image
{
	public int Id { get; set; }
	public DateTime UpdateAt { get; set; }
	public byte[] ImageBytes { get; set; }
}

We’ll now use the managed API from ImageResizer to resize the image to the desired size and then re-encode the result to WebP using Noesis.Drawing.Imaging.WebP.dll (no NuGet package, unfortunately).

public ActionResult Show(int imageId)
{
	var image = GetImage(imageId);

	var resizedImageStream = new MemoryStream();
	ImageBuilder.Current.Build(image.ImageBytes, resizedImageStream, new ResizeSettings
	{
		Width = 500,
		Height = 500,
		Mode = FitMode.Crop,
		Anchor = System.Drawing.ContentAlignment.MiddleCenter,
		Scale = ScaleMode.Both,
	});

	var resultStream = new MemoryStream();
	WebPFormat.SaveToStream(resultStream, new SD.Bitmap(resizedImageStream));
	resultStream.Seek(0, SeekOrigin.Begin);

	return new FileContentResult(resultStream.ToArray(), "image/webp");
}

System.Drawing is referenced using using SD = System.Drawing;. The controller action above is fully functional and can serve up sparkling new WebP-formatted images.

Browser support

Next up is figuring out whether the browser requesting an image actually supports WebP, and if it doesn’t, respond with JPEG. Luckily, this doesn’t involve going back to the bad old days of user-agent sniffing. Modern browsers that support WebP (such as Chrome and Opera) send image/webp in the accept header to indicate support. Ironically given that Google came up with WebP, the Chrome developers took a lot of convincing to set that header in requests, fearing request size bloat. Even now, Chrome only advertises webp support for requests that it thinks is for images. In fact, this is another reason the “different-HTML” approach mentioned in the intro won’t work: Chrome doesn’t advertise WebP support for requests for HTML.

To determine what content encoding to use, we inspect Request.AcceptTypes. The resizing code is unchanged, while the response is generated like this:

	var resultStream = new MemoryStream();
	var webPSupported = Request.AcceptTypes.Contains("image/webp");
	if (webPSupported)
	{
		WebPFormat.SaveToStream(resultStream, new SD.Bitmap(resizedImageStream));
	}
	else
	{
		new SD.Bitmap(resizedImageStream).Save(resultStream, ImageFormat.Jpeg);
	}

	resultStream.Seek(0, SeekOrigin.Begin);
	return new FileContentResult(resultStream.ToArray(), webPSupported ? "image/webp" : "image/jpeg");

That’s it! We now have a functional controller action that responds correctly depending on request accept headers. You gotta love HTTP. You can read more about content negotiation and WebP on lya Grigorik’s blog.

Client Caching and CDNs

Since it does take a little while to perform the resizing and encoding, I recommend storing the output of the transformation in HttpRuntime.Cache and fetching from there in subsequent requests. The details are trivial and omitted from this post.

There is also a bunch of ASP.NET cache configuration we should do to let clients cache images locally:

	Response.Cache.SetExpires(DateTime.Now.AddDays(365));
	Response.Cache.SetCacheability(HttpCacheability.Public);
	Response.Cache.SetMaxAge(TimeSpan.FromDays(365));
	Response.Cache.SetSlidingExpiration(true);
	Response.Cache.SetOmitVaryStar(true);
	Response.Headers.Set("Vary",
		string.Join(",", new string[] { "Accept", "Accept-Encoding" } ));
	Response.Cache.SetLastModified(image.UpdatedAt.ToLocalTime());

Notice that we set the Vary header value to include “Accept” (as well as “Accept-Encoding”). This tells CDNs and other intermediary caching proxies that if they try to cache this response, they must vary the cached value based on value of the “Accept” header of the request. This works for “Accept-Encoding”, so that different values can cached based on whether the response is compressed with gzip, deflate or not at all, and all major CDNs support it. Unfortunately, the mainstream CDNs I experimented with (CloudFront and Azure CDN) don’t support other Vary values than “Accept-Encoding”. This is really frustrating, but also somewhat understandable from the standpoint of the CDN folks: If all Vary values are honored, the number of artifacts they have to cache would increase at a combinatorial rate as browsers and servers make use of cleverer caching. Unless you find a CDN that specifically support non-Accept-Encoding Vary values, don’t use a CDN when doing this kind of content negotiation.

That’s it! I hope this post will help you build ASP.NET web apps that serve up WebP images really quickly.

Full 2012 Danish company taxes

Two weeks ago I put out a preliminary release of the 2012 taxes. The full dataset with 245,836 companies is now available in this Google Fusion Table. I haven’t done any of the analysis I did last year. Other than a list of top payers, I’ll leave the rest up to you guys. One area of interest might be to compare the new data with the data from last year: What new companies have showed up that have lots of profit or pay lots of tax? What high-paying companies went away or stopped paying anything? Who’s taxes and profits changed the most? You can find last year’s data via the blog post on the 2011 data.

Here are the top 2012 corporate tax payers:

  1. Novo A/S: kr. 3,747,186,913.00
  2. KIRKBI A/S: kr. 2,044,585,056.00
  3. NORDEA BANK DANMARK A/S: kr. 1,669,087,041.00
  4. TDC A/S: kr. 1,587,748,550.00
  5. Coloplast A/S: kr. 607,238,513.00

Ping me on Twitter if you build something cool with this data. The tax guys were kind enough to say that they’re looking into my finding that the 2011 data had been changed and address my critique of the fact that the 2011 data is no longer available. I’m still hoping for a response.

How to read the 2012 data

Each company record exposes up to 9 values. I’ve given them English column names in the Fusion Table. As an example of what goes where, check out A/S Dansk Shell/Eksportvirksomhed:

  • Profit (“Skattepligtig indkomst for 2012”): kr. 528.440.304
  • Losses (“Underskud, der er trukket fra indkomsten”): kr. 225.655.303
  • Tax (“Selskabsskatten for 2012”): kr. 132.110.075
  • FossilProfit (“Skattepligtig kulbrinteindkomst for 2012”): kr. 9.757.362.931
  • FossilLosses (“Underskud, der er trukket fra kulbrinteindkomsten”): kr. 0
  • FossilTax (“Kulbrinteskatten for 2012”): kr. 5.073.828.724
  • FossilCorporateProfit (“Skattepligtig selskabsindkomst, jf. kulbrinteskatteloven for 2012”): kr. 14.438.589.854
  • FossilCorporateLosses (“Underskud, der er trukket fra kulbrinteindkomsten”): no value
  • FossilCorporateTax (“Selskabsskat 2012 af kulbrinteindkomst”): kr. 3.609.647.450

A better way to release tax data

Computerworld has a interesting article with details from the process of making the 2011 data available. The article is mostly about the fact that various Danish business lobbies managed to convince the tax authority to bundle fossil extraction taxes and taxes on profit into one number. This had the effect of cleverly obfuscating the fact that Maersk doesn’t seem to pay very much tax on profits earned in the parts of their business that aren’t about extracting oil.

More interesting for us, the article also mentions that the tax authority spent about kr. 1 mio to develop the site that makes the data available. That seems generous for a single web page that can look up values in a database, and does so slowly. But it’s probably par for the course in public procurement.

The frustrating part is that the data could be released in a much more simple and useful way: Simply dump it out in an Excel spreadsheet (and a .csv for people that don’t want to use Excel) and make it available for download. I have created these files from the data I’ve scraped, and they’re about 27MB (5MB compressed). It would be trivial and cheap to make the data set available in this format.

Right now, there’s probably a good number of programmers and journalists (me included) that have build scrapers and are hammering the tax website to get all the data. With downloadable spreadsheets, getting all the data would be fast and easy, and it would save a lot of wasted effort. I’m sure employees at the Tax Authority create spreadsheets like this all the time anyway, to do internal analysis and reporting.

Depending on whether data-changes (as seen with last years data) are a fluke or normal, it might be necessary to release new spreadsheets when the data is updated. This would be great too though, because it would make the tracking changes easier and more transparent.

Preliminary 2012 Danish Company Tax Records

The Danish tax authority released the 2012 company tax records yesterday. I’ve scraped a preliminary data set by just getting all the CVR-ids from last year’s scrape. This leaves out any new companies that cropped up since then, but there’d still 228,976 in the set, including subsidiaries.

The rest are being scraped as I write this and should be ready in at most a few days. I’ll publish the full data as soon as I have it, including some commentary on the fact that the tax people spent kr. 1mio [dk] to release the data in the this half assed way.

In the meantime, get the preliminary data in this Google Fusion Table.

Getting ready for 2012 Danish company taxes

This is a follow-up to last year’s “Tax records for Danish companies” post which covered how I screen-scraped and analyzed 2011 tax records for all Danish companies.

I revisited the scraper source code today because the Tax Authority has made it known[dk] that they will be releasing the 2012 data set next week. As I did last year, I want to preface the critique that is going to follow and say that it’s awesome that this information is made public, and that I hope the government will continue to publish it and work on making it more useful.

First some notes regarding the article:

  • It states that, for the first time, it’s possible to determine what part of a company’s taxes are due to profits on oil and gas extraction and what are due to normal profits. That’s strange, since this was also evident from last years data. Maybe they’re trying to say (but the journalist was too dim to understand) that they have solved the problem in the 2011 data that caused oil and gas corporations to be duplicated, as evidenced by the two entries for Maersk: A.P.Møller – Mærsk A/S/ Oil & Gas Activity and A.P. MØLLER – MÆRSK A/S. Note that the two entries have the same CVR identifier.
  • It’s frustrating that announcements like this (that new data is coming next week) are not communicated publicly on Twitter or the web sites of either the Tax Authority or the Ministry of Taxation. Instead, one has to randomly find the news on some random newspaper web site. Maybe it was mentioned in a newsletter I’m not subscribed to – who knows.

Anyway, these are nuisances, now on to the actual problems.

2011 data is going away

The webpage says it beautifully:

De offentliggjorte skatteoplysninger er for indkomståret 2011 og kan ses, indtil oplysningerne for 2012 bliver offentliggjort i slutningen af 2013.

Translated:

The published tax information is for the year 2011 and is available until the 2012 information is published at the end of 2013.

Removing all the 2011 data to make room for the 2012 stuff is very wrong. First off, it’s defective that historical information is not available. Of course, I scraped the information and put it in a Fusion Table for posterity (or at least for as long as Google maintains that product). Even then, it’s wrong of the tax authority to not also publish and maintain historical records.

Second, I suspect that the new 2012 data will be published using the same URI scheme as the 2011 data, i.e.: http://skat.dk/SKAT.aspx?oId=skattelister&x={cvr-id}. So when the new data goes live some time next week, a URI that pointed to the 2011 tax records of the company FORLAGET SOHN ApS will all of a sudden point to the 2012 tax records of that company. That means that all the links I included in last year’s blog post and thought would point to 2011 data in perpetuity now point to 2012 data. This is likely going to be confusing to readers, both of my post, but also for other people following those links from all over the Internet. The semantics of these URIs are arguably coherent if they’re defined to be “the latest tax records for company X”. This is not a very satisfying paradigm though, and it would be much better if /company-tax-records/{year}/{cvr-id} URIs were made available, or if records from all years were available at /SKAT.aspx?oId=skattelister&x={cvr-id} as they became available.

The 2011 data was changed

I discovered this randomly when dusting off last years code. It has a set of integration tests, and the one for Saxo Bank refused to pass. That turns out to be because the numbers reported have changed. When I first scraped the data, Saxo Bank paid kr. 25.426.135 in taxes on profits of kr. 257.969.357. The current numbers are kr. 25.142.333 taxes on kr. 260.131.946 of profits. So it looks like the bank made a cool extra couple millions in 2011 and managed to get their tax bill bumped down a bit.

Some takeaways:

  1. Even though this information is posted almost a full year after the end of 2011, the numbers are not accurate and have to be corrected. This is obviously not only the tax authority’s fault: Companies are given ample time to gather and submit records and even then, they may provide erroneous data.
  2. It’d be interesting to know what caused the correction. Did Saxo Bank not submit everything? Did the tax people miss something? Was Saxo Bank audited?
  3. It’d be nice if these revisions themselves were published in an organised fashion by the tax authorities. Given the ramshackle way they go about publishing the other data, I’m not holding my breath for this to happen.
  4. I have no idea if there are adjustments to other companies and if so, how many. I could try and re-run the scraper on all 243,711 companies to find changes before the 2012 release obliterates the 2011 data but I frankly can’t be bothered. Maybe some journalist can go ask.

That’s it! Provided the tax people don’t change the web interface, the scraper is ready for next week’s 2012 data release. I’ll start running as soon as 2012 numbers show up and publish raw data when I have it.

Older Posts Newer Posts