I don't have Trackbacks on here. I figured I would code the functionality for it at some point in the future, and I had set aside this weekend to do that, but after reading through the information on the spec and implementations, I decided that I didn't want to deal with spam filtering. I don't have spam filtering set up for the comments2 and I didn't really want to have to set up SpamAssassin just to deal with trackbacks.

So I decided instead that I wanted to use refbacks, which isn't really a protocol; it's essentially just a listing of the top referrers for a particular page. Unfortunately, the referrers are a server-side logging detail, while my applications, being all WSGIey, don't really have any awareness of the webserver.

Logging to a Database

I don't know how I heard of this, but somehow I knew that Apache could log to a database. Since my application data is stored in a database, I could write regular Python code to determine things like referrers and user-agents, as long as I could get that info from the same database.

I wanted three main things, which I wasn't able to find in any existing piece:

  • Support for Postgres. I am not an SQL purist1, but I've been wanting to learn how to use PostgreSQL, and I consider this project part of that effort.

  • Relatively simple install. Ideally, it won't require anything besides Apache configuration changes. I'd be willing to install an Apache module, but not, say, recompile Apache.

  • Reliability. I don't want it to take down my server if I screw up. Chances are that I will be screwing up a lot, so I want some sort of decent fallback.

Modules?

Apache does have a mod_log_sql and a mod_log_mysql. mod_log_mysql, as the name implies, only works with MySQL. mod_log_sql, on the other hand, can work with PostgreSQL, but only if I also wanted to figure out DBI. Which I didn't.

Pipes!

What did look hopeful, however, were a couple of articles that used pipes to send the Apache logs through a Perl script, which inserted them into a Postgres DB. It wasn't exactly what I wanted - the articles are fairly mod_perl-specific, and I don't know enough Perl to adapt them to my situation.

The Script

Using what I could glean from the Perl scripts as a template, I set forth to write a similar one in Python.

Learning to Fork

The only thing I really knew about fork() is that only Unix machines have it. (As such, this will only work for Apache running on Unix-like machines.)

Fortunately, Python's documentation is as good as ever. Looking through the documentation for os.fork() and a couple of good examples, I got a good idea of what I wanted.
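The shape I ended up with is roughly this (a simplified sketch, not the script verbatim - handle() is a placeholder): the parent does nothing but read lines from Apache's pipe and fork, and each child does the slow work, so Apache never waits on the database.

  import os
  import signal
  import sys

  # On Unix, ignoring SIGCHLD makes the kernel reap finished children
  # automatically, so we don't pile up zombie processes.
  signal.signal(signal.SIGCHLD, signal.SIG_IGN)

  def handle(line):
      pass  # placeholder: parse the line and insert it into the database

  for line in sys.stdin:
      pid = os.fork()
      if pid == 0:
          # Child: do the slow work, then exit immediately, skipping any
          # cleanup handlers inherited from the parent.
          try:
              handle(line)
          finally:
              os._exit(0)
      # Parent: fall through and go right back to reading the pipe.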

Parsing the Logs

The next major piece I needed was a function that would parse the Apache logfile and give me Python objects. Now, Python is nice and fast for an interpreted language, but Apache is faster; I don't want any badly-written regex (which I am likely to be guilty of writing) to be a bottleneck. Luckily, the log_reader module can do this for me, and in C++.

So I downloaded that and ran "python setup.py install". I had it parse a test line, and the result came back as proper Python data types - DateTime objects and everything. Excellent. The more code others write for me, the better!
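For the curious, hand-parsing a "combined"-format line looks roughly like the sketch below. To be clear, this is my own illustrative regex, not log_reader's API - and it's exactly the sort of regex I'd rather keep out of the hot path:

  import re
  from datetime import datetime

  COMBINED = re.compile(
      r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
      r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
      r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
  )

  def parse_line(line):
      m = COMBINED.match(line)
      if m is None:
          return None  # malformed line; skip it rather than crash
      hit = m.groupdict()
      hit['status'] = int(hit['status'])
      hit['size'] = 0 if hit['size'] == '-' else int(hit['size'])
      # e.g. "10/Oct/2000:13:55:36 -0700" - drop the timezone offset here
      hit['datetime'] = datetime.strptime(hit['datetime'].split()[0],
                                          '%d/%b/%Y:%H:%M:%S')
      return hit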

Getting it into Postgres

Well, so far there is still no database table. Open up a psql connection to the right database and issue:

  CREATE TABLE "access_log" (
      ip varchar(40), -- IPv6 addys can be 40 chars. Let's be forward-looking.
      method varchar(6), -- length "DELETE" == 6
      path text, -- Todo: find out what the maximum path length can be in HTTP.
      protocol text, -- Todo: Same.
      referer text, -- Todo: Ditto.
      size integer,
      status integer,
      datetime timestamp,
      useragent text
  );

You'll notice that I left out ident, timezone, and username columns. I don't care about them, but if you do, of course you'd want columns for those.

Finally, I combined the functions I had so far into this file. I wrote one final function to do the actual insertion of data into the database, and uploaded it to my server.

That's the file you'll want if you want to do the same thing on your server. Keep in mind that you'll need to modify the psycopg2 DSN on the line that starts "dbconn = psycopg2.connect(..." to include your connection information.
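The core of that final function is more or less the following (a sketch, not the file verbatim; the DSN is a placeholder):

  import psycopg2

  # Placeholder DSN - put your own connection information here.
  dbconn = psycopg2.connect("dbname=yourdb user=youruser")

  def insert_hit(hit):
      # 'hit' is a dict keyed by the access_log column names.
      cur = dbconn.cursor()
      cur.execute(
          """INSERT INTO access_log
             (ip, method, path, protocol, referer,
              size, status, datetime, useragent)
             VALUES (%(ip)s, %(method)s, %(path)s, %(protocol)s, %(referer)s,
                     %(size)s, %(status)s, %(datetime)s, %(useragent)s)""",
          hit)
      dbconn.commit()
      cur.close()

One caveat: if you combine this with the fork-per-line approach above, open the connection in the child, after the fork - sharing a single connection across processes will end in tears.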

Configuring the Server

At this point, Apache still isn't doing anything special. It's still writing to the default access.log. To change this, figure out where in your Apache config you do the logging (I do mine in the VirtualHost configuration, but it might be in httpd.conf or apache2.conf) and find this line:

CustomLog /var/log/apache2/access.log combined

Change it to be like so:

CustomLog "|python /path/to/apachelogs2postgres.py" combined

You can find more information on piping logs in the Apache docs, but for us that's all we need.

Now issue "/etc/init.d/apache2 reload", and enjoy!

What Next?

You'll notice I don't have refbacks on this blog. This blog is still on shared hosting, and I'm weighing whether or not to move it onto my VPS. The above all happened on my VPS, so you won't see that deployed here for some time.

The next step, I think, is to build some SQLAlchemy queries (not necessarily models!) that do things like get the top referrers and user agents for a particular path.
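Something like this, say - a sketch using SQLAlchemy's expression language against the access_log table (this assumes a recent SQLAlchemy, and the DSN is again a placeholder):

  from sqlalchemy import MetaData, Table, create_engine, func, select

  engine = create_engine("postgresql+psycopg2://user:pass@localhost/yourdb")
  metadata = MetaData()
  # Reflect the existing table instead of declaring a model for it.
  access_log = Table("access_log", metadata, autoload_with=engine)

  def top_referers(path, limit=10):
      stmt = (
          select(access_log.c.referer, func.count().label("hits"))
          .where(access_log.c.path == path)
          .where(access_log.c.referer != "-")  # "-" means no referrer
          .group_by(access_log.c.referer)
          .order_by(func.count().desc())
          .limit(limit)
      )
      with engine.connect() as conn:
          return conn.execute(stmt).all()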

The apachelogs2postgres.py script is commented up with things that I'm still unsure about, and hopefully reasonably easy to follow. Advice and/or patches are, of course, welcome.


  1. My first experience with Postgres was pretty negative as I didn't know what I was doing and tried to authenticate against local user accounts. I messed that up, like I do for most things the first time around. Once bitten, twice shy; I'm running PostgreSQL once again, but using "regular" authentication. So far, so good. 

  2. Since I don't have a CAPTCHA, you might think I get a lot of comment spam. I get a fair amount; the DB indicates about 120 over the last couple of weeks. However, I have some client-side JavaScript that modifies the POST location, as well as some server-side heuristics to guess whether the POST request is coming from a zombie machine. For example, many of them send a user-agent string of IE4. IE4? Come on, I can't be expected to believe that. Using the POST location and a few other pieces of information, a comment is either published automatically or simply recorded to the DB for me to publish manually if it was a false positive. So although there's no CAPTCHA, unobtrusively preventing spam is possible - but not necessarily always effective.