Access.log statistics in Excel with Python
Wednesday, July 25, 2012Tweet
In this blog post I am going to demonstrate how to pull and parse access log files (such as access.log generated by Apache or Nginx) and present simple statistics (aka "web analytics") in MS Excel (on Windows). I am going to use Python with DataNitro/IronSpread.
Why use Python and Excel instead of more traditional tools, such as shell scripting on unix-like systems or google analytics? There are a couple reasons. First, there is pretty much no decent scripting language available on Windows, so if you do not have immediate access to some unix (or at least cygwin), you're out of luck. In a projects I worked on, the server provided only static (html, css etc.) files without any scripting language, so I could not easily write my own page access' counters/statistics. Additionally, I wanted to figure out the number of times a certain (static) file has been downloaded over a period of time and visualize hourly/daily/weekly statistics on a graph - that is not something you can accurately measure with vanilla web analytics.
Most popular web servers (Apache, nginx etc.) log all file access attempts in a standard file, such as "access.log". It is really simple to generate charts with Excel and most people are already familiar with Excel spreadsheets. Also, I did not want my script to require any technical knowledge, so that non-techies (business people, marketing, sales team etc.) could just press a button and see the results of their work (so, uhm..., no black and white shell terminals).
OK, let's get down to code. My log file comes in the following (default) format:
IP - - [DD/MM/YYYY:H:M:S +Timezone] "HTTP REQUEST" error_codes/browser_info/etc.
To measure hourly, daily or weekly uniques all I need is the IP and the access date/time (super simple to extract with Python and the builtin datetime module). To avoid code duplication I wrote a base class to keep track of uniques within the "time delta" interval, and three derived classes to define different time deltas (hourly/daily/weekly). Each of them defined its own function to display the numbers in a (slightly) different way on a spreadsheet. The fun part is putting all this data to Excel, just write:
Cell(spreadsheet_name, y, x).value = 1234
And DataNitro will take care of converting Python data types into internal Excel data types, without a need to do any COM calls magic voodoo or (God forbid!) writing DLLs for Excel. Everything just works. Here's the entire function for daily statistics:
1 2 3 4 5 6 7 8 9 10 11
def printCurStats(self): avg = 0.0 if len(self.stats) > 0: avg = self.totCount/float(len(self.stats)) #defined by DataNitro when run in Excel Cell("Daily", self.ycoord, 1).value = self.curDate.strftime("%Y/%b/%d") Cell("Daily", self.ycoord, 2).value = len(self.stats) self.sumCount += len(self.stats) Cell("Daily", self.ycoord, 3).value = self.sumCount Cell("Daily", self.ycoord, 4).value = avg self.ycoord += 1
After running this code, my spreadsheet will look like this:
Although the "numbers" are already there, it's desirable to add a couple columns' headers and text formatting to make everything more readable. Lastly, I'd like to automatically adjust the width of all columns, so the excess of white space is eliminated. It's called "autofit" in DataNitro, see code:
1 2 3 4 5 6 7 8 9 10 11
Cell("Daily", 1,1).value = "Date" Cell("Daily", 1,2).value = "#" Cell("Daily", 1,3).value = "sum" Cell("Daily", 1,4).value = "ratio" #sets Cells font to bold for x in range(1, 5): Cell("Daily", 1, x).font.bold = True #adjust width autofit()
Resulting visual effect:
After adding a couple charts to visualize everything, I ended up with the following:
Not bad for a quick hack! One remaining caveat is how to get the access.log file. Right now I just manually copy it from the server using scp in cygwin/git bash, but that doesn't exactly address all my needs (see intro).
In the next iteration I'm going to use paramiko_scp module (or similar) for Python to do this automatically and generate Excel charts in real-time - I'm going to describe this in the next post.
P.S. If you are looking for advanced statistics, you might want to check out segment.io.