I'm doing some work to analyse the access logs from a Catalyst web application. The data comes from the load balancers in front of the web farm and totals about 35 GB per day. It's stored in a Hadoop HDFS filesystem and I use MapReduce (via Dumbo, which is great) to crunch the numbers.
The purpose of the analysis is to try to establish a usage profile -- which actions are used most, what the average response time for each action is, whether the response was served from a backend or a cache -- for capacity planning, optimisation, and setting thresholds for monitoring systems. Traditional tools like Analog will give me the most-requested URL or the most-used browser, but none of that is useful to me. I don't need to know that /controller/foo?id=1984 is the most popular URL; I need to know the hit rate and response time for all hits to /controller/foo, so I can see whether there's room for optimisation or caching, and estimate what might happen if hits for this action suddenly double.
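(To be concrete: by "action" I mean the request path with the query string stripped, so all the id=... variants collapse into one bucket -- roughly this, ignoring any messier rewriting the real URLs might need:)

    from urlparse import urlparse  # Python 2

    def action(url):
        # '/controller/foo?id=1984' -> '/controller/foo'
        return urlparse(url).path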
I can easily break the data down into requests per action per period via MapReduce. The problem is displaying it in a digestible form and picking out important trends or anomalies. My output is of the form:
( 2009-12-08T08:30 , /ctrl_a/action_a ) (2440, 895)
( 2009-12-08T08:30 , /ctrl_a/action_b ) (2369, 1549)
( 2009-12-08T08:30 , /ctrl_b/action_a ) (2167, 0)
( 2009-12-08T08:30 , /ctrl_b/action_b ) (1713, 1184)
( 2009-12-08T08:31 , /ctrl_a/action_a ) (2317, 790)
( 2009-12-08T08:31 , /ctrl_a/action_b ) (2254, 1497)
( 2009-12-08T08:31 , /ctrl_b/action_a ) (2112, 0)
( 2009-12-08T08:31 , /ctrl_b/action_b ) (1644, 1089)
i.e., each key is a (time period, action) pair and each value is a (hits, cache hits) tuple. (I don't have to stick with this; it's just what I have so far.)
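The Dumbo job behind this is nothing fancy; roughly the following, assuming a made-up, simplified log line of "timestamp action cached-flag" (the real load-balancer format is messier):

    def mapper(key, value):
        # value: one log line, assumed here to look like
        # "2009-12-08T08:30:14 /ctrl_a/action_a 1"  (timestamp, action, cached?)
        ts, action, cached = value.split()[:3]
        minute = ts[:16]  # truncate timestamp to YYYY-MM-DDTHH:MM
        yield (minute, action), (1, int(cached))

    def reducer(key, values):
        hits, cache_hits = 0, 0
        for h, c in values:
            hits += h
            cache_hits += c
        yield key, (hits, cache_hits)

    if __name__ == '__main__':
        import dumbo
        dumbo.run(mapper, reducer, combiner=reducer)  # reducer doubles as combiner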
There are about 250 actions. They could be combined into a smaller number of groups, but plotting the number of requests (or response time, etc.) for each action over time on the same graph probably won't work. Firstly, it'll be way too noisy; secondly, the absolute numbers don't matter much -- a 100 req/min rise in requests for an often-used, lightweight, cacheable response is far less important than a 100 req/min rise in a seldom-used but expensive (maybe it hits the DB), uncacheable one. On the same graph we'd never see the change in requests for the little-used action.
A static report isn't much good either -- a huge table of numbers is hard to digest, and if I aggregate by the hour I might miss important minute-by-minute changes.
Any suggestions? How are you handling this problem? I guess one way would be to somehow highlight significant changes in the rate of requests or response time per action. A rolling average and standard deviation might show this, but could I do something better?
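Here's the kind of thing I was picturing -- a plain Python sketch that flags points more than a few standard deviations from the rolling mean of a single action's series (the window size and 3-sigma threshold are guesses):

    from collections import deque
    from math import sqrt

    def flag_anomalies(series, window=60, threshold=3.0):
        """Yield (index, value, zscore) where value deviates from the
        rolling mean of the previous `window` points by > threshold sigmas."""
        recent = deque(maxlen=window)
        for i, x in enumerate(series):
            if len(recent) == window:
                mean = sum(recent) / float(window)
                sd = sqrt(sum((v - mean) ** 2 for v in recent) / window)
                if sd > 0 and abs(x - mean) / sd > threshold:
                    yield i, x, (x - mean) / sd
            recent.append(x)

Run per action over the per-minute hit counts (or response times); because the z-score is relative to each action's own history, a jump in the seldom-used expensive action shows up just as loudly as one in the busy cheap action.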
What other metrics or figures could I generate?