Building Up Reporting Tools

While working with Lee Newton over at b5media I was able to watch him build up some server tools over time that were invaluable to diagnosing exactly what was going on on the server.

Now I find I need to make some of my own. Here’s how I am doing it.

For now I am going to concentrate on access logs.

Where these logs are varies server by server, but if you are running a standard cPanel setup, chances are you can find a directory named /usr/local/apache/domlogs with files in it named after your domain name. In this case I picked one of the sites I host: nakedpastor.com

So if I do a:

cd /usr/local/apache/domlogs
tail -10 nakedpastor.com

I will get the last 10 lines of the access log file.

Here’s one example:

24.555.555.27 - - [09/Aug/2010:20:13:45 -0400] "GET /wp-content/uploads/2010/08/IMG_0001.jpg HTTP/1.1" 304 - "http://www.nakedpastor.com/" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_1_3 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7E18 Safari/528.16"

Someone on an iPhone is looking at a picture on the site (which David would get more Google Juice from if he had named it better).

The trick is going to be breaking that line down into the important bits of information that will help me diagnose how my server is being used. For example, I might want to know if one IP address is flooding me. I might want to know if I am getting a HUGE number of requests for one particular file, or if I am serving a large number of errors. If I ran this on a combined log file, I would want to know if one domain was getting all of the traffic. Top referrers might be a fun thing to look at too. There are lots of little bits of info in there that could be helpful.

My tool chest includes:

tail – requests a certain number of lines from the END of the file, as shown above. During testing I will use -15 to get the last 15 lines, but when I make this live, I'll want to look at the last several thousand at least.
head – requests a number of lines from the top of a file. In this case it gives me the most pertinent results.
sort – puts the most important results at the top.
cut – probably not helpful initially, since the fields are not fixed width.
awk – used to parse the lines into chunks so I can see what is important.
grep – used to search for text.
uniq – used with -c, counts the number of occurrences of each distinct line.
| – the piping symbol, used to send the results of one command right into the next.

Let's go after something simple first: the IP address. I want to take the last 10,000 lines of the access log, grab the IP address (which will be the first word in the line), count how many times each IP address appears, sort numerically by count, and return the top 10. In bash, that is pronounced as:

tail -10000 /usr/local/apache/domlogs/nakedpastor.com | awk '{print $1}' | sort | uniq -c | sort -nr | head -10

You can find examples of that line in lots of places out there. In fact, I copied and pasted it from another site (their version was written slightly differently, but it did the same thing). You may want to note that you have to call sort before you call uniq in order for uniq to work right.
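The sort-before-uniq point is easy to see on synthetic data (these IP addresses are made up for illustration):

```shell
# uniq -c only collapses *adjacent* duplicate lines, so unsorted
# input leaves repeats fragmented into separate counts.
printf '1.1.1.1\n2.2.2.2\n1.1.1.1\n' | uniq -c
# the two 1.1.1.1 lines are counted separately (three output lines)

printf '1.1.1.1\n2.2.2.2\n1.1.1.1\n' | sort | uniq -c
# sorted first, 1.1.1.1 collapses to a single entry with count 2
```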

For the rest of the examples I'm going to use the awk command to break each line into fields separated by quotes or spaces. This line prints the number 7, because there are 7 quote-separated fields in each log line:

tail -5 nakedpastor.com | awk -F '"' '{c=NF; print c}'

If I split on quotes and then on spaces, I can get the result code:

tail -15 nakedpastor.com | awk -F '"' '{print $3}' | awk '{print $1}'
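That result-code extraction feeds straight into the same count-and-rank pattern as the IP example. Here is a sketch as a small function (the function name is my own; the field splitting matches the examples in this post):

```shell
# count_status: read access-log lines on stdin, rank result codes
# by how often each one appears.
count_status() {
  awk -F '"' '{print $3}' | awk '{print $1}' | sort | uniq -c | sort -nr
}

# Against a real log, something like:
#   tail -10000 /usr/local/apache/domlogs/nakedpastor.com | count_status
```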

Or the requested page:

tail -15 nakedpastor.com | awk -F '"' '{print $2}' | awk '{print $2}'
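The requested-page field is the one to watch if you suspect a single file is soaking up all your requests. A sketch, again as a function of my own naming:

```shell
# top_requests: read access-log lines on stdin, rank requested paths
# by how many hits each one got.
top_requests() {
  awk -F '"' '{print $2}' | awk '{print $2}' | sort | uniq -c | sort -nr
}

# e.g.  tail -10000 /usr/local/apache/domlogs/nakedpastor.com | top_requests | head -10
```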

Or the referrer:

tail -15 nakedpastor.com | awk -F '"' '{print $4}' | awk '{print $1}'

Or the agent (the trailing awk trims it to the first token, like Mozilla/5.0; drop it to keep the whole string):

tail -15 nakedpastor.com | awk -F '"' '{print $6}' | awk '{print $1}'

So using these examples I can get the 20 most popular agent/OS combos:

tail -1000 /usr/local/apache/domlogs/nakedpastor.com | awk -F '"' '{print $6}' | sort | uniq -c | sort -nr | head -20

So, there you have some tools to use the next time you want to see what is going on on your server. Along with free -m and top you can get some neat info.

Just remember that you are getting snapshots. When I looked a little bit ago, Knowmore.com's bot took up 1390 of the 10000 lines I was looking at. That's nearly 14% of the traffic going to just that bot. HOWEVER, if I looked at 5000 lines, they didn't appear at all. Looking at 100000 lines, they only appeared 1394 times. So don't just use one sample size. It can be misleading.
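One way to guard against that is to run the same count at several sample sizes and compare. A sketch (the function name and the sizes are my own choices):

```shell
# sample_counts LOGFILE PATTERN: count matching lines at several
# tail sizes, so a one-off burst of traffic is easier to spot.
sample_counts() {
  for n in 1000 5000 10000; do
    printf '%6d lines: %d matches\n' "$n" "$(tail -n "$n" "$1" | grep -c "$2")"
  done
}

# e.g.  sample_counts /usr/local/apache/domlogs/nakedpastor.com Knowmore
```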

I will probably take these lines and combine them into some .bashrc functions along the lines of what I'd discussed in my Three helpful additions to your .bashrc post.
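As a teaser, the IP counter might land in .bashrc as something like this (the function name and defaults are my own; adjust the path for your server):

```shell
# topips [logfile] [lines]: show the top 10 client IPs from an
# access log, defaulting to the last 10000 lines of my domlog.
topips() {
  tail -n "${2:-10000}" "${1:-/usr/local/apache/domlogs/nakedpastor.com}" \
    | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
}
```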