In an earlier post we talked about using the Yahoo! browser cache usage experiments (http://www.yuiblog.com/blog/2007/01/04/performance-research-part-2/) to determine your optimum JavaScript strategy.
One throwaway line in there is worth further discussion:
“Since the status codes are recorded in the apache access logs, we are able to determine the empty and full cache measurements by analyzing the logs.”
Ah, yes, “analyzing the logs”. They make it sound so easy!
Log analysis is vital to any e-business and is a “must have” capability of any web operations team, but in my experience it can be a right royal PITA.
BTW, speaking of web log analysis, don’t let anyone fob you off with “but we can get all the statistics we need from Webtrends/Omniture/Google Analytics”.
No, you can’t.
Most web crawlers won’t trigger the JavaScript analytics tags, so the vast majority of crawling (or worse, content scraping) activity goes unrecorded. This can blow a major hole in your capacity planning, depending on the level of crawling activity (FYI it can be 20-30% of your traffic, depending on your site!). HTTP 500 errors? In the logs. HTTP 404 “Page not found” errors often go unrecorded by the analytics tags unless you tag your custom 404 page and log the referring page. How many redirects are you serving (HTTP 301/302)? Yup, in the logs. Malicious activity probing for vulnerable URLs? In the logs (especially if you don’t have an IDS/IPS).
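To make that concrete, here is a minimal Python sketch of pulling those numbers straight out of the logs. It assumes the Apache "combined" log format. The `summarise` function, the `BOT_MARKERS` list and the sample lines are all made up for illustration, and the bot detection is deliberately crude (a substring match on the user agent), so treat it as a starting point rather than a real crawler classifier.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: a crude, hypothetical list of substrings that hint at a crawler.
BOT_MARKERS = ("bot", "crawl", "spider", "slurp")


def summarise(lines):
    """Count requests per status code and how many look like bot traffic."""
    statuses = Counter()
    total = 0
    bot_requests = 0
    for line in lines:
        match = COMBINED.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        total += 1
        statuses[match.group("status")] += 1
        agent = match.group("agent").lower()
        if any(marker in agent for marker in BOT_MARKERS):
            bot_requests += 1
    return {"total": total, "statuses": statuses, "bot_requests": bot_requests}


# Invented sample lines, purely for illustration.
SAMPLE = [
    '10.0.0.1 - - [04/Jan/2007:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [04/Jan/2007:10:00:01 +0000] "GET /old-page HTTP/1.1" 301 0 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '10.0.0.3 - - [04/Jan/2007:10:00:02 +0000] "GET /missing HTTP/1.1" 404 312 "http://example.com/links" "Mozilla/5.0"',
    '10.0.0.4 - - [04/Jan/2007:10:00:03 +0000] "GET /admin.php HTTP/1.1" 500 0 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    report = summarise(SAMPLE)
    print(report["total"], dict(report["statuses"]), report["bot_requests"])
```

Note that the 404 line also carries the referring page, which is exactly the detail the JavaScript tags tend to miss.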
So what do you need to analyse the logs?
- A method to collect the logs. Normally you zip them after the nightly log rotation and pull them back to a central location. Personally I like Repliweb for this, since I use it for my web content distribution anyway.
- LOTS of SAN storage to store the logs. We normally keep 13 months’ worth of logs for audit purposes and “year on year” comparisons. Terabytes, basically.
- An analysis and reporting tool and a server to run it on… preferably one with a bit of grunt if you want your reports anytime soon…
- (Optionally) a database into which to import the logs for reporting purposes, depending on your choice of tool above.
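As a rough idea of what the optional database step buys you, here is a small Python sketch that loads already-parsed requests into SQLite and asks a reporting question in SQL. Everything here is an assumption for illustration: the `requests` table layout, the `load_requests` and `redirects_per_day` helpers and the sample records are invented, and a real setup would use whatever database your reporting tool expects.

```python
import sqlite3


def load_requests(conn, records):
    """Create a hypothetical 'requests' table and bulk-insert parsed records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "host TEXT, day TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", records)
    conn.commit()


def redirects_per_day(conn):
    """Return (day, count) rows for HTTP 301/302 responses, oldest day first."""
    cursor = conn.execute(
        "SELECT day, COUNT(*) FROM requests "
        "WHERE status IN (301, 302) GROUP BY day ORDER BY day"
    )
    return cursor.fetchall()


# Invented records: (host, day, path, status)
RECORDS = [
    ("10.0.0.1", "2007-01-04", "/index.html", 200),
    ("10.0.0.2", "2007-01-04", "/old-page", 301),
    ("10.0.0.3", "2007-01-05", "/old-page", 302),
    ("10.0.0.4", "2007-01-05", "/promo", 301),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
load_requests(conn, RECORDS)

if __name__ == "__main__":
    for day, count in redirects_per_day(conn):
        print(day, count)
```

Once the data is in a table like this, the regular reports become a matter of saving the queries and re-running them after each nightly import.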
So… what tools can you use? Lots is probably the short answer, but there are three that I am most familiar with: Microsoft’s Log Parser, Sawmill (“cos a Sawmill processes logs…”) and Webalizer.
Log Parser is great for those quick queries and questions you want answered, e.g. “how many requests came from this IP” or indeed “what is the breakdown between 200 and 304 response codes for browsercachetest.jpg”. Its SQL syntax makes it easy to understand, and there is a very active community out there to help get you started.
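If you don’t have Log Parser to hand, the 200 vs 304 question is simple enough to sketch in Python. To be clear, this is not Log Parser syntax, just a rough equivalent that assumes Apache common or combined log lines. The `status_breakdown` and `empty_cache_share` helpers and the sample lines are invented for illustration. It also assumes, as in the Yahoo! experiment, that a 200 means the browser had an empty cache and a 304 means it already had the image.

```python
from collections import Counter


def status_breakdown(lines, path_suffix):
    """Count response codes for requests whose URL path ends with path_suffix."""
    counts = Counter()
    for line in lines:
        # Splitting on double quotes gives: prefix, request line, ' status size ', ...
        parts = line.split('"')
        if len(parts) < 3:
            continue
        request = parts[1].split()   # e.g. ['GET', '/img/x.jpg', 'HTTP/1.1']
        trailer = parts[2].split()   # e.g. ['200', '1024']
        if len(request) < 2 or not trailer:
            continue
        url_path = request[1].split("?", 1)[0]  # drop any query string
        if url_path.endswith(path_suffix):
            counts[trailer[0]] += 1
    return counts


def empty_cache_share(counts):
    """Fraction of 200s among 200 + 304 responses, or None if there are none."""
    hits = counts["200"] + counts["304"]
    return counts["200"] / hits if hits else None


# Invented sample lines, purely for illustration.
SAMPLE = [
    '10.0.0.1 - - [04/Jan/2007:10:00:00 +0000] "GET /img/browsercachetest.jpg HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [04/Jan/2007:10:00:05 +0000] "GET /img/browsercachetest.jpg HTTP/1.1" 304 - "-" "Mozilla/5.0"',
    '10.0.0.3 - - [04/Jan/2007:10:00:09 +0000] "GET /img/browsercachetest.jpg?v=2 HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '10.0.0.4 - - [04/Jan/2007:10:00:12 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    breakdown = status_breakdown(SAMPLE, "/browsercachetest.jpg")
    print(dict(breakdown), empty_cache_share(breakdown))
```

For a one-off question like this a short script is fine; for anything you run every week, a proper tool or a database import will save you time.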
Sawmill is a commercial product that offers a more complete analytics solution (but to be fair the price is pretty reasonable… less than £1000 will definitely get you on your way). It is great if you have pre-defined reports you want to run regularly, in which case importing and storing the data in a database is a sensible idea. Note that Sawmill processes a LOT of different file formats (800+), so you can use it for a lot more than just web server logs.
Webalizer is an open-source solution that I haven’t used personally, but I have heard good reports about it from some customers. If you have any experience with it (good or bad) please let us know in the comments.
Good luck and enjoy the data your log analysis provides!
-Steve