Monday, 30 April 2012

55% of sites don’t use Cache-Control: max-age

Whilst referencing the excellent HTTP archive recently I came across this rather interesting little stat - http://httparchive.org/interesting.php#max-age.

Over 50% of the assets served by the sites that the archive crawls do NOT set a Cache-Control: max-age value.

My initial thought was that this is pretty shocking really.

When I thought about it further, however, I realised that in all of the Health Checks that we run for customers, this is something that comes up time and time again.

One of the most common recommendations that we make to site owners / developers is to set a ‘far future’ expiration date for static content, or content that is unlikely to change for some time.

The benefits are twofold (and well documented) – repeat visitors get a much faster experience and the site saves on bandwidth costs because it’s serving less content.
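As a rough illustration of the check behind that recommendation, here is a minimal Python sketch that pulls the max-age value out of a Cache-Control header and decides whether it counts as ‘far future’. The function names and the 30-day threshold are assumptions made for this example; they are not taken from the HTTP Archive or from our Health Checks.

```python
from typing import Optional


def max_age_seconds(cache_control: Optional[str]) -> Optional[int]:
    """Return the max-age value (in seconds) from a Cache-Control header.

    Returns None when the header is missing or carries no usable max-age.
    """
    if not cache_control:
        return None
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isascii() and value.isdigit():
                return int(value)
    return None


def has_far_future_expiry(cache_control: Optional[str], threshold_days: int = 30) -> bool:
    """True when max-age is at least threshold_days (an assumed cut-off)."""
    age = max_age_seconds(cache_control)
    return age is not None and age >= threshold_days * 86400


if __name__ == "__main__":
    # Hypothetical header values, for illustration only.
    for header in ["public, max-age=31536000", "max-age=3600", "no-cache", None]:
        print(repr(header), max_age_seconds(header), has_far_future_expiry(header))
```

Running something like this over the response headers of a page’s static assets gives a quick list of objects that would benefit from a longer cache lifetime.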

So why is this value so high?  Should we be surprised that it IS so high?

It possibly serves to highlight a lack of knowledge or understanding around cache control in general.  We often see it (it’s fair to say that over 50% of the health checks that we have completed have made this recommendation in some form or another), so perhaps further education is the solution?

Given all of the ‘noise’ generated by WPO in recent months, I was sure that this number would have reduced considerably since Steve first posted this.

It’s reduced by 1%.

Monday, 23 April 2012

Why Histograms are cool!

Quite a while ago we added Histograms into our monitoring portal, and ever since then I’ve always found them to be incredibly useful!

I was recently talking to a prospect (who didn’t have as much of a passion for stats as we do at SC Towers) about the value of Histograms and they were questioning in what situation they should be using them.  We had been talking about longer term analysis of performance data, specifically identifying when and to what degree their page performance varied from the average.

Quite naturally, they had been using the Daily Average report, showing mean, min and max values.  This report is useful but the one thing it doesn’t allow you to investigate is the spread of the test results which had strayed beyond the mean.

Let me give you an example.

Using the Daily Average Speed report to look at performance data over a 7 day period gives you the following:


This gives a good indication of the mean variation throughout the week, and shows us just how bad (by bad I mean how far from the average) some of the results have been, but it doesn’t tell us how many of the results were on or near to the max.

By using a Histogram we can determine exactly what the spread of results is.  The histogram below was generated for the same page monitor over the same time period.

 
Here we can see that although the majority of test results have been placed in the 8-9 seconds ‘silo’, which is in line with the average reported above, almost 40% of the total number of results returned are outside of this average (either because the page returned quicker than expected or because it was much slower to respond with a full page load).

The gamma-like distribution in the histogram below is much more skewed, yet it could quite easily be produced using the same or similar data from the Daily Average Speed report at the top of this post.

 
This shows us that averages can hide a lot of sins!
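To make the point concrete, here is a small Python sketch with two made-up sets of page load times that share exactly the same mean but look nothing alike once they are bucketed. The sample values and the one-second bucket width are assumptions chosen for illustration; they are not taken from the reports above.

```python
from collections import Counter
from statistics import mean


def histogram(samples, bucket_width=1.0):
    """Count samples per bucket, keyed by the bucket's lower bound in seconds."""
    counts = Counter((s // bucket_width) * bucket_width for s in samples)
    return dict(sorted(counts.items()))


# Hypothetical load times in seconds; both lists sum to 51.0.
steady = [8.25, 8.25, 8.5, 8.5, 8.75, 8.75]
spiky = [5.0, 5.5, 6.0, 6.5, 7.0, 21.0]

if __name__ == "__main__":
    for label, samples in (("steady", steady), ("spiky", spiky)):
        print(f"{label}: mean = {mean(samples):.2f}s")
        for lower, count in histogram(samples).items():
            print(f"  {lower:4.0f}-{lower + 1:.0f}s {'#' * count}")
```

Both lists report the same average, but only the histogram shows that one of them contains a 21 second outlier and nothing at all in the 8-9 seconds ‘silo’.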

So, why are histograms cool?  Well, in my mind anything that gives me some extra information is cool.  So the fact that these histograms have shown me the variance in these test results is a good thing.

They’re also cool because they can show things like this.  The Histogram below is definitely what you don’t want to see being produced by your platform, as it is highlighting massive performance variations, in this case at different times of the day.  During the quiet periods the site is relatively stable, but then as things get busier the site starts to slow down resulting in the ‘rock horns’ effect.

Some Rock Horns!
A Histogram!

And anything that makes the rock horns is cool by me!

Monday, 2 April 2012

80/20 Follow up observations

A few weeks ago we looked at the 80/20 argument as presented by Steve Souders, and reasoned that the backend time might actually be greater if you take the response times into account for every object on the page, rather than just the initial request.

Of course, as Pat Meenan rightly pointed out, it should be about understanding the full picture (front and backend) of your own performance.  To my mind that full picture extends to identifying performance bottlenecks across the entire page download.

In last week’s blog post we showed how the backend response times can amount to a significant number, skewing the 80/20 assumption.  This quick calculation was based on those backend times being consistent.

We have observed that those requests are not always consistent, and that variance at the object level will have a knock-on effect on total page downloads and therefore user experience.
 
In the example below, the data start time for a single static asset served on a simple page varies between 0.20 and 0.40 seconds.  This may not look like much, but given that (for example) the average number of images served per page is currently 54 (http://httparchive.org/trends.php#bytesImg&reqImg), this variance can start to mount up.


In this instance we can see that the data start time (and to a lesser extent SSL connection time, although at least this measurement is consistent) is causing the slow object download time. 
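For a feel of how the 0.20 to 0.40 second variance mentioned above could mount up across 54 images, here is a back-of-the-envelope Python sketch. It assumes (and this is only an assumption) that the browser fetches objects in ‘waves’ of six parallel connections and that every wave is held up by the slowest data start time; real browsers, hosts and pages will behave differently.

```python
import math


def added_wait(objects, parallel_connections, extra_per_object):
    """Worst-case extra seconds if each wave of parallel downloads is delayed.

    A deliberately crude model: objects are fetched in waves of
    `parallel_connections`, and each wave pays `extra_per_object` once.
    """
    waves = math.ceil(objects / parallel_connections)
    return waves * extra_per_object


if __name__ == "__main__":
    variance = 0.40 - 0.20  # spread in data start time quoted in the post
    images = 54             # average images per page quoted in the post
    print(f"6 parallel connections: {added_wait(images, 6, variance):.1f}s extra")
    print(f"fully serial:           {added_wait(images, 1, variance):.1f}s extra")
```

Under those assumptions the extra wait is roughly 1.8 seconds with six connections, and it would be far higher if the objects were fetched one at a time.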
 
Sometimes the cause of the variation isn’t quite as clear however.

In the 2 waterfall graphs below, some of the components on the page seem to take a varying amount of time to download for each test.  For example, at times bannerInfo-bg.png takes 1.2 seconds, and yet earlier in the day it only took 0.102 seconds.

Sample Waterfall 1

Sample Waterfall 2



We can also recreate this in the real world using HTTP Watch.

First Test
Second Test
 
By far the largest variation is for the Receive time.  Over 0.647 seconds of difference to download an object that is under 3KB.  What could be causing this variation?

So this has now become a great example of when it would be useful to start cross-referencing internal measurements (e.g. from APM solutions to determine how long it takes the application to start sending the data or from network tools to try and understand why some objects take longer than expected to be transmitted) against external performance monitoring data.

Armed with this end-to-end view we can build up a picture and some understanding of the performance of systems serving nominally static content, and whether it is truly variable in and of itself or whether it appears to be affected by unfavourable external conditions such as the poor performance of intervening networks.

Although the real world test was conducted using a local office network where there could be local contention, it has been repeated countless times in the ‘clean room’ environment of the monitoring service as well, so we should be able to discount local causes as the reason for the variations.

The important takeaway from all of the examples above is that you need to have full visibility of the entire application delivery process, and that without accurate measurements you have no idea of a) the current situation and b) whether optimisation work / bug fixing / general maintenance is having a positive or negative impact.

Whilst there is a danger of getting carried away with just the numbers (one of Deming’s “7 Deadly Diseases of Management” was to run a company on visible figures alone - http://en.wikipedia.org/wiki/W._Edwards_Deming), measurement has to be one of the cornerstones of good web performance management.

Wednesday, 14 March 2012

Querying the 80 / 20 rule

First of all, some required reading.  Go here and read this, then come back.

Essentially my pitch is this.  It is not 20% backend, it is 20% response to initial request time.  That’s not quite as catchy, granted, but it depends on how you look at it.

Whilst I accept that the definition of ‘backend’ time is generally the “time it takes for the server to get the first byte back to the client”, I don’t understand why this terminology should only be applied to the initial request.  In every instance when a page is required to download additional content, the application has to carry out some work, and deliver this content to the client.  This is additional backend time, it’s just not the initial backend time.

Of course, for sites that are dynamically generating content on the fly there will be a more significant overhead – but even sites that serve static content (mostly images, but JS and CSS files aren’t usually dynamically generated per user either) can present an element of back end processing time, and this time should be taken into account when discussing the 80 / 20 rule.  I still agree with Steve’s quote that “… the longest pole in the tent was frontend performance…” but there also needs to be some perspective around this.  If the backend cannot serve the content quickly enough then any optimisation work that has been conducted will be, if not in vain, then at least not as effective as it perhaps could have been.

Furthermore, plenty of systems serve “static” objects (JS, CSS, images etc.) out of a CMS, and it may well take time to either locate the contents or retrieve them from database / remote file systems etc.  All of this will look like backend time to the client, and should be considered so given that it lies within the control of the host.

How best to demonstrate this?  With a picture!

 
This is an excerpt from a waterfall graph for www.amazon.com.  The initial request time includes 0.21 seconds of DNS lookup time, 0.07 seconds of connection time and 0.13 seconds of data start time.  That’s 0.41 seconds of ‘backend’ time before the content starts to load.  Given that the total load time for this page was 4.7 seconds that’s apparently 90/10 in favour of frontend time.

However, almost without exception every other object beneath this initial request also carries elements of connect time and data start time.

Whilst it would be incorrect to stitch all of these separate backend times together (given the parallel download of many of these objects) I can have a go at roughly calculating the processing time by selecting the first individual object from each ‘wave’ of parallel object downloads.  The total recorded backend time is now approximately 3.25 seconds.
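The arithmetic behind these splits is simple enough to sketch in a few lines of Python. The timings are the ones quoted above for the waterfall excerpt; the function name is just a label made up for this example.

```python
def backend_share(backend_seconds, total_seconds):
    """Fraction of the total page load time attributed to the backend."""
    return backend_seconds / total_seconds


TOTAL_LOAD = 4.7  # total page load time in seconds

# Initial request only: DNS lookup + connection + data start.
initial_backend = 0.21 + 0.07 + 0.13

# Rough estimate using the first object from each 'wave' of parallel downloads.
per_wave_backend = 3.25

if __name__ == "__main__":
    for label, backend in (("initial request only", initial_backend),
                           ("first object of each wave", per_wave_backend)):
        share = backend_share(backend, TOTAL_LOAD)
        print(f"{label}: {share:.0%} backend, {1 - share:.0%} frontend")
```

Counting only the initial request, the backend accounts for about 9% of the load time; counting the first object of each wave pushes it to about 69%.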

90/10 has now become more like 70/30 in favour of backend – we now have a shorter pole!

Here at Site Confidence we often debate the true meaning of the 80/20 rule.  If all ‘static’ content served by your website is served how static content was born to be served then brilliant – go sort out your 80!

However, if you’re stuck with a legacy CMS, an unusual CDN implementation or simply have never paid that much attention to your strategy for serving static content you’ve still got 2 poles to shorten.
 
The 80 is very important, and we’d never argue with Steve about this but that ‘20’ might be bigger than you think.

Monday, 12 March 2012

We're hiring!

Site Confidence are always on the lookout for talented people and right now we've got quite a few positions to fill.  

If you are interested in any of these then please get in touch with Steve using the contact details at the bottom.

If you think you know of someone then please pass these details on.

Professional Services:
1 x Senior Technical Consultant: http://www.blueglue.eu/job_detail.php?jid=1557
Join our passionate (and growing) performance team!

Development:
1 x Senior PHP Developer / Architect: http://www.blueglue.eu/job_detail.php?jid=1427

Web Operations:
2 x Linux Systems Administrator: http://www.blueglue.eu/job_detail.php?jid=1535

Sales:
1 x Business Development Executive: http://www.blueglue.eu/job_detail.php?jid=1417
We're looking for focussed New Business Development Managers to help us spread the word. Ideally you'll have sold SaaS with a strong background in solution sales.

The man to contact is Steve James - drop him an email or call him on 01183 094 206.

Friday, 9 March 2012

So what does happen to the Apple store on big announcement days?

Following on from Josh Bixby’s recent observation of the apple.com home page through Webpagetest (http://bit.ly/wLhWFY), I thought it might be handy to take a slightly more detailed look into exactly what happens to the Apple UK Store when they make their big announcements.

The following performance report shows the time taken for a full page download on http://store.apple.com/uk for the period 7th – 8th March.

 
Given that the announcement of the “new iPad” was made at approximately 6 p.m. (UK time), this report shows that something significant was happening to the site as early as 15:00.  The fact that it happened pretty much bang on 15:00 could suggest that this was intentional behaviour – however the actual error displayed would suggest otherwise.  From 15:00 to approximately 19:45 the site was consistently returning an HTTP 503 (Service Unavailable).

Request
GET /uk HTTP/1.1
Host: store.apple.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Accept-Encoding: gzip
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-gb; SiteCon/8.8.14)

Header
HTTP/1.1 503 Not Found
Date: Wed, 07 Mar 2012 19:41:56 GMT
Server: Apache/1.3.41-ps_webdav_01 (Darwin)
location: http://store.apple.com
Last-Modified: Wed, 07 Mar 2012 11:08:01 GMT
ETag: "4d89a-e080-4f574191"
Accept-Ranges: bytes
Content-Length: 57472
Content-Type: text/html
x-frame-options: sameorigin
Pragma: no-cache

That’s it, no helpful error messages or fluffy fail whales here.
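One detail worth noting in the capture above is that the status line pairs a 503 code with a “Not Found” reason phrase, so any automated check should key on the numeric code rather than the text. The Python sketch below shows one way a captured response head could be parsed and classified; the helper names are made up for this example and this is not how our monitoring agent is implemented.

```python
# Response head as captured above (trimmed to a few of the headers).
RAW_HEAD = """\
HTTP/1.1 503 Not Found
Date: Wed, 07 Mar 2012 19:41:56 GMT
Server: Apache/1.3.41-ps_webdav_01 (Darwin)
location: http://store.apple.com
Content-Length: 57472
Content-Type: text/html
Pragma: no-cache
"""


def parse_response_head(raw):
    """Split a raw response head into (status_code, reason, headers)."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    _version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so dates and URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


def is_server_error(status_code):
    """True for any 5xx status, regardless of the reason phrase."""
    return 500 <= status_code <= 599


if __name__ == "__main__":
    status, reason, headers = parse_response_head(RAW_HEAD)
    print(status, repr(reason), "server error:", is_server_error(status))
    print("body size:", headers.get("content-length"), "bytes")
```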

In some of these instances the 503 wasn’t even returned immediately; users were expected to wait for a period of time for the back end to ‘catch up’ before the error was presented to frustrated fanbois.



In the graph above, the light blue element represents Data Start time, the time taken for the first byte of data to be returned to the client.


This wasn’t an isolated spike either. Here is a report showing an accumulated view of the page download presenting DNS, Connect, SSL Connect, Request Sent, Data Start and Content Download as individual metrics.

  

The store page came up very briefly between 20:00 and midnight, but more often than not users were instead presented with the familiar “we’re busy updating the store” page.

 
This wasn’t consistent however. Sometimes the store page was displayed giving users a tease of the new fondle slab but again the back end times may have put off all but the most determined users.

So why is this important?  Well, as far as I’m concerned, given the almost limitless resources available to Apple, I’d have liked to have seen them handle this promotion a little more gracefully.

Their goal should at the very least have been to ensure that the performance graph stayed green, which would have meant not returning HTTP 503 error codes and not allowing poor back end response times to affect the end user’s experience.  After all, Apple have been rather busy of late promoting their cloud services to their customers, so perhaps they should have thought about temporarily pointing their users at some of this elastic capacity in order to retain a seamless experience?
 
In Apple’s case I’m sure that their brand won’t have been damaged by this.  Their users tend to be rather forgiving.


Other sites may not be as lucky.