Compressed UTF-8 Strings in SQL Azure

August 31, 2011  |  , ,  |  Share

Quick Summary
Save money by compressing strings in SQL Azure using UTF-8, deflate and the DotNetZip libary.
SQL Azure costs are primarily based on which size database you choose to provision. Given the premium over the cost of blob storage you’ll want to avoid storing large amounts of data that could be kept in files. However, most applications have several medium sized text fields which aren’t used as query modifiers. It might be something like user comments to a blog post or in my case, product descriptions on a shopping site. These fields are ideal candidates for compression to reduce costs. The enterprise edition of SQL Server has both row and page level compression, but unfortunately this feature hasn’t made it to SQL Azure yet.

Wayne Berry has a great post which discusses which columns are good candidates for manual compression and walks through the process of converting a column to use .NET GZipStream compression. In this post, we’ll build on that approach by using UTF-8 encoding, deflate, and the DotNetZip library to improve our compression ratio.

UCS-2 vs UTF-8

Internally SQL Server stores text using the UCS-2 encoding which uses 2 bytes to store each code point. It’s similar to UTF-16 (which supersedes UCS-2) except in the way UTF-16 handles surrogates. UTF-8 on the other hand is a variable width encoding and has the benefit that it can store ASCII characters using a single byte. This means that if your text is from a European language, then it can be stored in much less space. However, text from languages like Chinese or Japanese may require MORE space since many code points will require 3 or more bytes to store. However, unless your application primarily targets one of these languages its worth experimenting with this technique given the huge amount of space you can save. If you are using SQL Server Enterprise edition, it might even be worth combining it with page or row compression.

Gzip vs Deflate

The Gzip format actually uses deflate to compress the payload, but adds a 10 byte header with version info and an 8 byte footer with a checksum. If you are compressing an entire webpage, then the 18 byte overhead isn’t a big deal. However, for shorter strings it can have a big impact on your compression ratio. The .NET Framework includes a DeflateStream class which is easy to swap in place of the GzipStream class.

.NET Framework vs DotNetZip Library

During the compression phase, an implementor of deflate has to choose how much time to spent looking for sequences that can be compressed. Many libraries give the caller the flexibility to control the speed/space tradeoff by choosing a compression level. The .NET DeflateStream only provides a single default level and while it’s very fast, the compression ratio is poor. The DotNetZip library provides a replacement DeflateStream that offers multiple compression levels. While not as fast as the .NET framework implementation, the performance is good and the compression ratios are very good.

Since deflate is a standardized format, its possible to mix use of the two libraries as well. While the DotNetZip encoder offers much better compression at a moderate cost, in my tests the decoder is slower that the .NET framework class regardless of which encoder was used. So I’m using the DotNetZip DeflateStream to compress the data and the .NET framework’s class to decompress.

Compressing Data

To make swapping in DotNetZip easier, the library creators used the same names as the framework for their classes and enumerations. Since we are using both DotNetZip (to compress) and .NET framework (to decompress) I’ve used the fully qualified class names to disambiguate the two.

byte[] DeflateUTF8(string value)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);
    using (MemoryStream memoryStream = new MemoryStream())
    {
        using (Ionic.Zlib.DeflateStream deflateStream =
            new Ionic.Zlib.DeflateStream(
                memoryStream,
                Ionic.Zlib.CompressionMode.Compress,
                Ionic.Zlib.CompressionLevel.BestCompression))
        {
            deflateStream.Write(utf8Bytes, 0, utf8Bytes.Length);
        }
        return memoryStream.ToArray();
    }
}

Decompressing Data

string InflateUTF8(byte[] data)
{
    using (MemoryStream decompressedStream = new MemoryStream())
    {
        using (MemoryStream compressedStream = new MemoryStream(data))
        using (System.IO.Compression.DeflateStream deflateStream =
            new System.IO.Compression.DeflateStream(
                compressedStream,
                System.IO.Compression.CompressionMode.Decompress))
        {
            deflateStream.CopyTo(decompressedStream);
        }
 
        return Encoding.UTF8.GetString(decompressedStream.ToArray());
    }
}

Small String Results

We’ll look at the results for small and large strings separately since they behave differently. I picked the sample strings at random from product descriptions on FilterPlay. First up are small strings with 200 or fewer characters. With less source data they are much more susceptible to any overhead or inefficiency. The compression rate can vary between strings of the same length and this is especially pronounced at shorter lengths. There are several other factors that will also affect compression ratios like language and intra-string repetition so you should experiment with your own data to determine which techniques are worth using.

The data in the charts is color coded to highlight the differences between UTF-16 (green), which is how .NET stores Unicode strings, and UTF-8 (blue). Within the green bars at each length, its easy to see the 18 bytes we save by using deflate instead of gzip. However, the savings in using UTF-8 make this look small by comparison. The difference between .NET and DotNetZip deflate is also quite prominent. The .NET deflate is particularly bad when dealing with shorter string lengths. With short strings that don’t offer any opportunities for compression, the DotNetZip library only penalizes you with the 2 byte overhead in deflate, while the .NET library ends up using far more space than the uncompressed data.

Large Strings Results

As string length grows, the overhead of gzip vs deflate becomes less important so I’ve omitted gzip from the results below. More source data also means more opportunities to find repeating sequences that can be compressed. Compared to the amount of space SQL server would use to store the strings, any of the compression methods provide a pretty big win. However, UTF-8 encoding and DotNetZip deflate compression can save an additional 30-45% of the storage required by a UTF-16 string compressed with the standard .NET DeflateStream. Given the high cost of storage in SQL Azure, for my application the savings were definitely worth the small computational cost incurred during each read/write.



Leave a Reply