Compressed string storage with NHibernate

Posted on | | 1 Comment on Compressed string storage with NHibernate

This blog post demonstrates how to use a IUserType to make NHibernate compress strings before storing them. It also shows how to use an AttributeConvention to configure the relevant type mapping.

By compressing strings before storing them you can save storage space and potentially speed up your app because fewer bits are moved on and off physical storage. In this example, compression is done using the extremely fast LZ4 algorithm so as to not slow data storage and retrieval.

The downside to compressing string stored in the database is that running ad-hoc SQL queries (such is mystring like '%foo%') is not possible.

Background

I was building an app that was downloading and storing lots HTML and for convenience I was storing the HTML in a SQL Server database. SQL Server has no good way to compress stored data so the database files grew very quickly. This bugged me because most of the content would compress well. I was using Entity Framework and started throwing around for ways to hook into how EF serializes data or for a way to create a custom string type that could handle the compression. Even with the EF6 pre-releases, I couldn’t find any such hooks.

NHibernate IUserType

So I migrated to NHibernate which lets you define custom datatypes and control how they’re stored in the database by implementing the IUserType interface. The best tutorial I’ve found for implementing IUserType is this one by Jacob Andersen. You can check out my full implementation of a compressed string IUserType on GitHub. The two most interesting methods are NullSafeGet() and NullSafeSet():

	public object NullSafeGet(IDataReader rs, string[] names, object owner)
	{
		var value = rs[names[0]] as byte[];
		if (value != null)
		{
			var deCompressor = LZ4DecompressorFactory.CreateNew();
			return Encoding.UTF8.GetString(deCompressor.Decompress(value));
		}

		return null;
	}

	public void NullSafeSet(IDbCommand cmd, object value, int index)
	{
		var parameter = (DbParameter)cmd.Parameters[index];

		if (value == null)
		{
			parameter.Value = DBNull.Value;
			return;
		}

		var compressor = LZ4CompressorFactory.CreateNew();
		parameter.Value = compressor.Compress(Encoding.UTF8.GetBytes(value as string));
	}

The actual compression is done by LZ4Sharp which is a .NET implementation of the LZ4 compression algorithm. LZ4 is notable, not for compressing data a lot, but for compressing and uncompressing data extremely quickly. A single modern CPU core can LZ4-compress at up to 300 MB/s and uncompress much faster. This should minimize the overhead of compressing and uncompressing data as it enters and leaves the database.

For SqlTypes we use BinarySqlType(int.MaxValue):

	public SqlType[] SqlTypes
	{
		get { return new[] { new BinarySqlType(int.MaxValue) }; }
	}

This causes the type to be mapped to a varbinary(max) column in the database.

Mapping

To facilitate mapping, we’ll use an Attribute:

	[AttributeUsage(AttributeTargets.Property)]
	public class CompressedAttribute : Attribute
	{
	}

And an AttributeConvention for FluentNHibernate to use:

	public class CompressedAttributeConvention : AttributePropertyConvention
	{
		protected override void Apply(CompressedAttribute attribute, IPropertyInstance instance)
		{
			if (instance.Property.PropertyType != typeof(string))
			{
				throw new ArgumentException();
			}

			instance.CustomType(typeof(CompressedString));
		}
	}

Here’s how to use the convention with AutoMap:

	var autoMap = AutoMap.AssemblyOf()
		.Where(x => typeof(Entity).IsAssignableFrom(x))
		.Conventions.Add(new CompressedAttributeConvention());

The full SessionFactory is on GitHub.

With this, we get nice, clean entity classes with strings that are automatically compressed when stored:

	public class Document : Entity
	{
		[Compressed]
		public virtual string Text { get; set; }
	}

Limitations

As mentioned in the introduction you can’t do ad-hoc SQL queries because compressed strings are stored in the database as binary blobs. Querying with NHibernate as also somewhat limited. Doing document.Text == "foo" actually works because NHibernate runs “Foo” through the compression. Queries that involve Contains() will (silently) not work, unfortunately. This is because NHibernate translates this to a like query, which won’t work with the compressed binary blob. I haven’t looked into hooking into the query engine to fix this.