Seeing a real-time breakdown of web traffic by vhost
Occasionally our servers are hit by traffic spikes. Since we typically host a number of websites per server, we need a way to quickly determine which site is receiving the bulk of incoming requests. (Then we can improve caching on that site, perhaps.) To see in real time which vhosts are being requested, we use the following awk script:
histo.awk
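# creates a histogram of values in the first column of piped-in data

# returns the largest value in arr (big and k are local variables)
function max(arr,    big, k) {
    big = 0
    for (k in arr) {
        if (arr[k] > big) {
            big = arr[k]
        }
    }
    return big
}

# count each first-column value; remember the first and last timestamp seen
NF > 0 {
    cat[$1]++
    if (!start) {
        start = $6
    }
    end = $6
}

END {
    printf "from %s to %s\n", start, end
    maxm = max(cat)
    for (i in cat) {
        # scale so the most frequent value gets a 60-character bar
        scaled = 60 * cat[i] / maxm
        printf "%-25.25s [%8d]:", i, cat[i]
        # separate counter j, so the for-in variable i is not overwritten
        for (j = 0; j < scaled; j++) {
            printf "#"
        }
        printf "\n"
    }
}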
It can be used like this:
watch 'tail -n 100 /var/log/apache2/access_log | awk -f histo.awk | sort -nrk3'
This gives a histogram of the occurrence of vhosts in the last 100 lines of the Apache log, updating every 2 seconds and sorted with the most frequent vhosts at the top. (Note that this assumes you are using an Apache log format that includes the vhost as the first column.) It looks something like this:
Every 2.0s: tail -n 100 /var/log/apache2/access_log | awk -f histo.awk | sort -nrk3    Thu Oct 1 09:51:41 2009

www.dogwoodinitiative.org [      49]:############################################################
www.wildliferecreation.or [      24]:##############################
www.earthministry.org     [      14]:##################
blogs.onenw.org           [       3]:####
www.tilth.org             [       2]:###
www.oeconline.org         [       2]:###
www.audubonportland.org   [       1]:##
oraction.org              [       1]:##
oeconline.org             [       1]:##
dogwoodinitiative.org     [       1]:##
bandon.onenw.org          [       1]:##
209.40.194.148            [       1]:##
from [01/Oct/2009:09:51:21 to [01/Oct/2009:09:48:40
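Because the histogram keys on the first column, it can be worth checking what that column actually contains before trusting the bars. The sketch below is one way to do that with standard tools; the function name vhost_counts is made up for illustration, and it assumes whitespace-separated log lines. If the output lists hostnames, the log format starts with the vhost. If it lists client IP addresses, it probably does not.

```shell
# Plain counts of first-column values, most frequent first.
# Hypothetical helper name; reads log lines on stdin.
vhost_counts() {
    awk 'NF > 0 { print $1 }' | sort | uniq -c | sort -nr
}

# Example (log path as used in the watch command above):
#   tail -n 100 /var/log/apache2/access_log | vhost_counts
```

The counts should match the bracketed numbers in the histogram, so this also works as a quick cross-check.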
(Another useful variant of this is to produce a histogram of requests by IP address, which can help determine what to block in a DoS attack.)
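A minimal sketch of that per-IP variant follows. It is not the script we run in production. It assumes the client IP is the second column (that is, it directly follows the vhost), and the function name ip_histo is made up for illustration. Adjust the column number to match your log format.

```shell
# Histogram of requests by client IP, most frequent first.
# Assumption: column 1 is the vhost and column 2 is the client IP.
ip_histo() {
    awk 'NF > 1 { count[$2]++ }
         END {
             max = 0
             for (ip in count) {
                 if (count[ip] > max) max = count[ip]
             }
             for (ip in count) {
                 printf "%-25.25s [%8d]:", ip, count[ip]
                 # scale so the busiest IP gets a 60-character bar
                 n = 60 * count[ip] / max
                 for (j = 0; j < n; j++) printf "#"
                 printf "\n"
             }
         }' | sort -nrk3
}

# Example:
#   tail -n 100 /var/log/apache2/access_log | ip_histo
```

Since watch runs its command in a fresh sh, a function defined in your interactive shell is not visible to it. To get the auto-refreshing view, save the awk program to its own file and call it with awk -f, as with histo.awk.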