Compressed UTF-8 Strings in SQL Azure

SQL Azure costs are based primarily on the size of the database you choose to provision. Given the premium over blob storage, you'll want to avoid storing large amounts of data that could be kept in files. However, most applications have several medium-sized text fields which aren't used as query modifiers. It might be something like user comments on a blog post or, in my case, product descriptions on a shopping site. These fields are ideal candidates for compression to reduce costs. The Enterprise edition of SQL Server has both row- and page-level compression, but unfortunately that feature hasn't made it to SQL Azure yet.

Wayne Berry has a great post that discusses which columns are good candidates for manual compression and walks through the process of converting a column to use .NET GZipStream compression. In this post, we'll build on that approach by using UTF-8 encoding, deflate, and the DotNetZip library to improve our compression ratio.
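
To give a flavor of the approach, here is a minimal sketch that UTF-8 encodes a string and compresses it with the framework's built-in DeflateStream. It's only an illustration: the full post uses the DotNetZip library instead, and the class and method names below are invented for the example.

// Minimal sketch: UTF-8 encode a string and deflate it before storing the bytes
// in a varbinary column. The full post swaps in DotNetZip; DeflateStream is used
// here just to keep the example self-contained.
using System.IO;
using System.IO.Compression;
using System.Text;

public static class TextCompressor
{
    public static byte[] Compress(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        using (MemoryStream output = new MemoryStream())
        {
            using (DeflateStream deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(utf8Bytes, 0, utf8Bytes.Length);
            }
            // the DeflateStream must be closed before the buffer is complete
            return output.ToArray();
        }
    }

    public static string Decompress(byte[] compressedBytes)
    {
        using (MemoryStream input = new MemoryStream(compressedBytes))
        using (DeflateStream deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}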


Performance of jQuery UI and Public CDNs

Google, Microsoft, and Yahoo all have content delivery networks (CDNs) that provide common JavaScript libraries for public use. This is a great service that helps speed up the web by allowing many different sites to link to the same JavaScript library, increasing the chance a user will already have the file in their cache. If you are currently hosting your own copy of a library like jQuery, switching to the Google CDN is usually an easy way to improve the speed of your site.


Years and Months Between Dates

At first glance, calculating the difference between two DateTimes seems easy. Simply subtract the first date from the second and you'll get a TimeSpan, which can be used to get the total number of days elapsed. With the 4th of July right around the corner, let's calculate the age of the United States since the Declaration of Independence:
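
Here is a quick sketch of that first attempt (illustrative code, not taken from the original post):

// Minimal sketch: subtracting two DateTimes yields a TimeSpan, which exposes
// days but has no notion of calendar years or months.
using System;

class AgeOfUnitedStates
{
    static void Main()
    {
        DateTime declaration = new DateTime(1776, 7, 4);
        TimeSpan elapsed = DateTime.Today - declaration;
        Console.WriteLine("Total days elapsed: {0:N0}", elapsed.TotalDays);
        // TimeSpan has no Years or Months property, which is where it gets tricky.
    }
}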



Changing IIS Logging Fields in Windows Azure

Windows Azure makes it pretty easy to enable IIS logging for your web roles, and the logs are periodically transferred to a storage account so you can analyze them. A typical IIS installation logs some essential fields, and administrators can add others if needed. However, Windows Azure enables all of the fields by default. Logging some fields, like the cookie contents, can dramatically increase the size of your logs. For example, the ASP.NET authentication cookie is around 260 bytes and the request verification cookie is over 500. With those you are nearing 1K of log data per request! Even if you aren't using forms authentication, smaller cookies like those from analytics providers can add up quickly. Removing unnecessary fields is a simple way to cut the size of your logs.

As an aside, not only are these cookies expensive to log, they also add overhead to every request your users make. Moving all your static resources to a CDN with a different domain (or sub-domain) helps because the cookies won't be sent with those requests. Plus, your users get faster responses because they are served by the closest edge node instead of your web server.

OK, back to logging. Since Azure logs the cookies by default, how do we go back to the IIS defaults, which don't include cookies? My initial solution was to programmatically change the settings during role startup. Here is the code to do that:

// requires references to Microsoft.Web.Administration and Microsoft.WindowsAzure.ServiceRuntime
using (ServerManager server = new ServerManager())
{
	// get the site's web configuration
	string siteNameFromServiceModel = "Web";
	string siteName = String.Format(
		"{0}_{1}",
		RoleEnvironment.CurrentRoleInstance.Id,
		siteNameFromServiceModel);
	var site = server.Sites[siteName];
	
	// update the logging fields for the site
	site.LogFile.LogExtFileFlags =
		LogExtFileFlags.Date |
		LogExtFileFlags.Time |
		LogExtFileFlags.ClientIP |
		LogExtFileFlags.UserName |
		LogExtFileFlags.ServerIP |
		LogExtFileFlags.Method |
		LogExtFileFlags.UriStem |
		LogExtFileFlags.UriQuery |
		LogExtFileFlags.HttpStatus |
		LogExtFileFlags.Win32Status |
		LogExtFileFlags.ServerPort |
		LogExtFileFlags.UserAgent |
		LogExtFileFlags.HttpSubStatus |
		LogExtFileFlags.TimeTaken;

	server.CommitChanges();
}

The downside of this approach is that you have to run the role with elevated permissions and include a reference to Microsoft.Web.Administration.dll. With the introduction of startup tasks, we can change the IIS logging settings outside the role process. The task still has to run with elevated privileges, but the role itself can run with the standard set of permissions. Read the MSDN docs and Steve Marx's post if you are not already familiar with startup tasks.

In my initial attempt I created a simple task that ran synchronously to change the server-wide logging defaults before the role entry point. However, the Azure site configuration overrides those defaults when it creates the site, so we need to modify the site settings using a background task instead. Background scripts run in parallel with the Azure site configuration, so it's important to loop until the site has been created. The site name Azure uses is a combination of the web project name and the role instance id, so it can't be hardcoded in our script. We'll take advantage of the fact that a standard Azure web role only has a single site and simply modify the settings for all sites.

To change the logging fields we'll use IIS's appcmd tool. Appcmd is similar to PowerShell in that you can pipe the output of one command into another, using the /xml and /in command line switches. For each site that is modified, it outputs a line like the following:

SITE object "SITE_1_NAME" changed
SITE object "SITE_2_NAME" changed
SITE object "SITE_3_NAME" changed

If the site hasn't been created by the Azure config script yet, the output will be blank. So our batch file pipes the output of appcmd to the FIND command, which looks for the "SITE object" string. If it isn't found, we sleep for 10 seconds and retry. Here is the full batch file:

@ECHO OFF

echo Configuring IIS logging

rem The set of fields that will be logged by IIS
rem The logfields variable should be on a single line
SET logfields=Date,Time,ClientIP,UserName,ServerIP,Method,UriStem,UriQuery,
    TimeTaken,HttpStatus,Win32Status,ServerPort,UserAgent,HttpSubStatus

rem The appcmd executable
SET appcmd=%windir%\system32\inetsrv\appcmd

rem Retrieve all the sites, /xml flag allows output to be piped to next command
SET getallsites=%appcmd% list sites /xml

rem Set the logging fields for each site in the xml input (triggered by /in flag)
SET setlogging=%appcmd%  set site /in /logFile.logExtFileFlags:%logfields%

rem Look for the string that indicates logging fields were set
SET checkforsuccess=find "SITE object"

:configlogstart
%getallsites% | %setlogging% | %checkforsuccess%
IF NOT ERRORLEVEL 1 goto configlogdone

echo No site found, waiting 10 secs before retry...
TIMEOUT 10 > nul
goto configlogstart

:configlogdone
echo Done configuring IIS logging

NOTE: I broke the "SET logfields…" line for readability, but it should be a single line in your batch file.

Here is the list of available fields in IIS logging (look at the logExtFileFlags section) if you’d like to deviate from the normal IIS defaults.


Serving GZip Compressed Content from the Azure CDN


GZip Compression

There are two big reasons why you should compress your content: time & money. Less data transferred over the wire means your site will be faster for users and bandwidth costs will be lower for you. Compressing the content will require some CPU, but web servers are often IO-bound and have cycles to spare. Plus, modern web servers do a good job of caching compressed content so even this cost is minimal. It’s almost always worth it to enable compression on your web server.

IIS supports compression of both dynamic (generated on-the-fly) and static content (files on disk). Both types of compression can yield big improvements, but for this post we’ll focus on static content and more specifically style sheets (.css) and JavaScript (.js). Of course there are many other types of static content like images and videos. However, compression is built into many of these file formats so there is little benefit in having the web server try to compress them further.

Content Delivery Networks (CDNs)

Compression reduces the total amount of data your server needs to send to a user, but it doesn't reduce latency. Users who live far from your server will have longer response times because the data has to travel through more network hops. CDNs reduce latency by dispersing copies of the data throughout the world so it's closer to end users. Luckily there is a wide selection of CDNs that are easy to use and don't require any long-term commitment. At FilterPlay we use the Azure CDN since it's integrated with Azure blob storage, which we use extensively.

Compression and the Azure CDN

Unfortunately, many CDNs do not support automatic gzip compression of content. This includes popular CDNs such as Amazon CloudFront as well as Windows Azure. We can work around this limitation by storing a secondary copy of the content that has been gzipped. It's important to also provide the uncompressed version of the file, because a small number of users have an old browser or anti-virus software that doesn't support compression, and others are behind a proxy server that strips the Accept-Encoding header.

Here is the code to create a gzip copy for every css and js file inside an Azure blob container:

/// <summary>
///   Finds all js and css files in a container and creates a gzip compressed
///   copy of the file with ".gzip" appended to the existing blob name
/// </summary>
public static void EnsureGzipFiles(
    CloudBlobContainer container,
    int cacheControlMaxAgeSeconds)
{
    string cacheControlHeader = "public, max-age=" + cacheControlMaxAgeSeconds.ToString();

    var blobInfos = container.ListBlobs(
        new BlobRequestOptions() { UseFlatBlobListing = true });
    Parallel.ForEach(blobInfos, (blobInfo) =>
    {
        string blobUrl = blobInfo.Uri.ToString();
        CloudBlob blob = container.GetBlobReference(blobUrl);

        // only create gzip copies for css and js files
        string extension = Path.GetExtension(blobInfo.Uri.LocalPath);
        if (extension != ".css" && extension != ".js")
            return;

        // see if the gzip version already exists
        string gzipUrl = blobUrl + ".gzip";
        CloudBlob gzipBlob = container.GetBlobReference(gzipUrl);
        if (gzipBlob.Exists())
            return;

        // create a gzip version of the file
        using (MemoryStream memoryStream = new MemoryStream())
        {
            // push the original blob into the gzip stream
            using (GZipStream gzipStream = new GZipStream(memoryStream, CompressionMode.Compress))
            using (BlobStream blobStream = blob.OpenRead())
            {
                blobStream.CopyTo(gzipStream);
            }

            // the gzipStream MUST be closed before it's safe to read from the memory stream
            byte[] compressedBytes = memoryStream.ToArray();

            // upload the compressed bytes to the new blob
            gzipBlob.UploadByteArray(compressedBytes);

            // set the blob headers
            gzipBlob.Properties.CacheControl = cacheControlHeader;
            gzipBlob.Properties.ContentType = GetContentType(extension);
            gzipBlob.Properties.ContentEncoding = "gzip";
            gzipBlob.SetProperties();
        }
    });
}

The full source code for the CloudBlobUtility class, which includes the utility methods referenced in the snippet above (such as Exists and GetContentType), is available for download.

You may be wondering why we didn't use the standard .gz extension for our gzipped copies. We need to use .gzip because Safari doesn't correctly handle files with the .gz extension. Yes, it's strange.

In addition to telling the client how long it should cache the data, the max-age in the Cache-Control header also determines how long a CDN edge node will cache the data. I'd recommend that you use a large expiration and assume that, once released, the data will never change. Include a version in the filename and rev your URLs when you need to make an update.
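
For example, a versioned URL scheme might look like the sketch below. The host name, file name, and version number are purely illustrative.

// Illustrative only: embed a release version in the file name so the URL changes
// whenever the content changes, which lets you use a far-future max-age safely.
public static class CdnUrls
{
    private const string Version = "1.4.2";  // hypothetical release number

    public static string Versioned(string baseName, string extension)
    {
        // hypothetical CDN host name
        return string.Format(
            "http://cdn.example.com/{0}-{1}.{2}",
            baseName,
            Version,
            extension);
    }
}

// usage: CdnUrls.Versioned("css/site", "css") returns http://cdn.example.com/css/site-1.4.2.css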

The max-age must fit in a 32-bit integer because that's the largest value IE supported before version 9. I just wanted to mention that in case you were considering changing the cacheControlMaxAgeSeconds parameter type. I've seen cases in the wild where the max-age was even larger, probably because a developer assumed that bigger was better. Don't worry: the max value of a 32-bit int means your content won't expire for about 68 years.

If you are using jQuery, YUI, or another popular JavaScript library, don’t serve it from your own CDN account. Instead use the copy that Google, Microsoft or Yahoo provides. Your users probably already have a copy in their cache.

Detecting GZip Support

Once you have both uncompressed and gzipped versions of your content in the cloud, you need to modify your pages to vary the URL based on whether the client supports gzip compression. Keep in mind that if you are using output caching, you’ll need to vary the cache by the Accept-Encoding header. The following helper method checks the current request’s headers and appends the .gzip extension if compression is supported:

/// <summary>
///   Appends .gzip to the end of the url if the current request supports gzip
/// </summary>
/// <example>
///   Asp.net Razor syntax: @Cdn.GZipAwareUrl("http://cdn.domain.com/script.js")
/// </example>
public static string GZipAwareUrl(string url)
{
    HttpContext context = HttpContext.Current;
    if (context != null)
    {
        HttpRequest request = context.Request;
        if (request != null)
        {
            string encoding = request.Headers["Accept-Encoding"];
            if (encoding != null && encoding.Contains("gzip"))
            {
                return url + ".gzip";
            }
        }
    }

    return url;
}

Throttling Web API Calls

[Image: Sign outside Wallington, England (CC-BY by anemoneprojectors)]

From Amazon to Zillow, there are thousands of sites that provide access to data via an API. At FilterPlay we use lots of e-commerce APIs to retrieve product data and update the prices used in our comparison engine. Our back-end system updates millions of items every day, and fortunately many of these API calls and updates can be parallelized. Most APIs impose rate limits to help ensure the service remains available and responsive. It's important to throttle your API calls to stay within the limits defined by each service.

Using a Semaphore to Limit Concurrency

Every programming language provides synchronization primitives to control access to a resource shared by multiple threads. For example, the lock keyword in C# restricts execution of a block of code to a single thread at any one moment in time. A semaphore can be used to give multiple threads concurrent access to a resource. However, most web API limits also include a time window. For example, the BestBuy e-commerce API specifies that developers make only 5 calls every second. Since a web request can finish in under a second, it's not enough to limit the number of calls using a semaphore. The following example illustrates the use of a semaphore that is set to allow only 5 concurrent workers. We'll create 6 worker threads which perform 300ms of "work" after entering the semaphore:


static void DoWork(int taskId)
{
    DateTime started = DateTime.Now;
    Thread.Sleep(300);  // simulate work
    Console.WriteLine(
        "Task {0} started {1}, completed {2}",
        taskId,
        started.ToString("ss.fff"),
        DateTime.Now.ToString("ss.fff"));
}

static void StandardSemaphoreTest()
{
    using (SemaphoreSlim pool = new SemaphoreSlim(5))
    {
        for (int i = 1; i <= 6; i++)         
        {
            Thread t = new Thread(new ParameterizedThreadStart((taskId) =>
            {
                pool.Wait();
                DoWork((int)taskId);
                pool.Release();
            }));
            t.Start(i);
        }
        Thread.Sleep(2000); // give all the threads a chance to finish
    }

    // Task 1 started 51.229, completed 51.540
    // Task 2 started 51.229, completed 51.540
    // Task 3 started 51.258, completed 51.558
    // Task 4 started 51.258, completed 51.558
    // Task 5 started 51.260, completed 51.560
    // Task 6 started 51.540, completed 51.840
}

Note that Task 6 starts immediately after Task 1 completes and exits the semaphore. The simulated work only takes 300ms, so all six workers easily finish in under a second, exceeding our limit of 5 per second. One solution would be to sleep for a second after every request. However, blocking a worker after it's done using the resource isn't a good idea. In our simple example that's not obvious because the thread exits after performing its work on the shared resource. In a real scenario, though, you'll call a web API to obtain some data and then process the results. It's important that you don't do the post-processing while holding a lock, and we also shouldn't block that work just to ensure a subsequent caller doesn't exceed our limit. The solution is to couple a semaphore with a time span that must elapse before the caller can acquire a lease on the resource. I created a TimeSpanSemaphore class which internally uses a queue of time stamps to remember when the previous worker finished.
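
To make the idea concrete, here is a minimal sketch of that pattern: a semaphore paired with a queue of completion times. It's an illustration of the concept rather than the actual TimeSpanSemaphore code (which lives in JFLibrary and handles more edge cases), and the class and member names are invented.

// Sketch of a time-window throttle: at most maxCount callers may run concurrently,
// and a new caller must wait until a full window has passed since the oldest
// recorded completion time.
using System;
using System.Collections.Generic;
using System.Threading;

public class TimeWindowThrottle : IDisposable
{
    private readonly SemaphoreSlim pool;
    private readonly TimeSpan resetSpan;
    private readonly Queue<DateTime> releaseTimes;  // completion times, oldest first
    private readonly object queueLock = new object();

    public TimeWindowThrottle(int maxCount, TimeSpan resetSpan)
    {
        this.pool = new SemaphoreSlim(maxCount, maxCount);
        this.resetSpan = resetSpan;
        this.releaseTimes = new Queue<DateTime>(maxCount);

        // seed the queue so the first maxCount callers don't have to wait
        for (int i = 0; i < maxCount; i++)
            this.releaseTimes.Enqueue(DateTime.MinValue);
    }

    public void Run(Action action, CancellationToken cancelToken)
    {
        this.pool.Wait(cancelToken);
        try
        {
            // wait until a full window has elapsed since the oldest completion time
            DateTime oldestRelease;
            lock (this.queueLock)
            {
                oldestRelease = this.releaseTimes.Dequeue();
            }

            TimeSpan waitFor = oldestRelease + this.resetSpan - DateTime.UtcNow;
            if (waitFor > TimeSpan.Zero)
                Thread.Sleep(waitFor);

            action();
        }
        finally
        {
            // record when this worker finished, even if the action threw,
            // then free the slot for the next caller
            lock (this.queueLock)
            {
                this.releaseTimes.Enqueue(DateTime.UtcNow);
            }
            this.pool.Release();
        }
    }

    public void Dispose()
    {
        this.pool.Dispose();
    }
}

The important design choice is that the queue stores the time each worker finished, not the time it started; the next section explains why that matters.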

Don’t Forget the Transit Time

It's important to explain why we track time stamps from the moment each action completes. My initial implementation simply reset a lock pool after each time period had elapsed. That may work perfectly for some throttling scenarios, but for web APIs we have to remember that we're trying to obey a limit that is enforced on a remote server. BestBuy, Twitter, and Amazon don't care how many requests per second you send; they can only observe how many requests per second they receive from your application. The variable time it takes a request to arrive at the remote server can cause you to violate the limits if you only track the time when requests are sent. Here's an example:

Time (sec) Event
0.000 Requests 1-5 are sent to the server
0.700 The requests all arrive at the server
0.800 All requests return with data
1.000 1 second has elapsed so request 6 is sent to the server
1.100 Request 6 arrives much faster than the first 5 requests

The remote server sees that 6 requests arrived between 0.700 and 1.100 when only 400 ms have elapsed, violating the API limits.

Using the TimeSpanSemaphore

Instead of exposing the Wait() and Release() methods publicly, our TimeSpanSemaphore class provides a Run method which accepts an Action delegate and supports cancellation tokens. We also ensure that the lock is released if an exception occurs. Here is our previous example using the TimeSpanSemaphore class instead of the standard semaphore:


using (TimeSpanSemaphore throttle = new TimeSpanSemaphore(5, TimeSpan.FromSeconds(1)))
{
    for (int i = 1; i <= 6; i++)          
    {                    
        Thread t = new Thread(new ParameterizedThreadStart((taskId) =>
        {
            throttle.Run(
                () => DoWork((int)taskId),
                CancellationToken.None);
        }));
        t.Start(i);
    }
    Thread.Sleep(2000); // give all the threads a chance to finish
}

// Task 2 started 53.276, completed 53.576
// Task 1 started 53.276, completed 53.576
// Task 3 started 53.276, completed 53.576
// Task 4 started 53.278, completed 53.579
// Task 5 started 53.279, completed 53.579
// Task 6 started 54.598, completed 54.898

You can see that, as in the first example, the first 5 requests all start and complete around the same time. However, Task 6 waits a full second after the completion of the first task before starting. In theory we wouldn't have to wait a full second if we knew how the request lifetime was spent (travel time to/from the server plus server processing). However, we don't know exactly when the remote server will decide to count the request. It's safer to assume the entire lifetime of the request was spent travelling to the server, and that the next request might arrive instantly. This means you can't use the API at max capacity, but it's better to err on the side of caution than to exceed the limits. The effect is larger when the concurrent worker count or time span is small. If you really need every last API call, you could adjust the count and/or time span to account for this delay.

One tip for those making many concurrent API calls: by default, .NET only allows 2 concurrent requests per hostname. You can set the DefaultConnectionLimit of the ServicePointManager when you initialize your program (it only needs to be set once) or in the .config file.
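
For instance, setting the limit in code might look like the sketch below; the specific value is just an illustration, so tune it for your own workload.

// Raise the per-host connection limit once at startup; the default of 2 will
// bottleneck highly parallel API clients. The value chosen here is illustrative.
using System;
using System.Net;

public static class HttpConfig
{
    public static void Initialize()
    {
        ServicePointManager.DefaultConnectionLimit = 12 * Environment.ProcessorCount;
    }
}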

Show Me the Code

The source code for the TimeSpanSemaphore class is available in my GitHub repository, JFLibrary. Over time I'm planning to add more utility code that I frequently use. I'd love to hear your feedback, bug reports, or a different solution you've used to rate-limit API calls.
