Sunday, March 7, 2010

Querying the Apache log using RawDev

Introduction


The RawDev (data) model is designed to query any relational data in a uniform way. In this example I will show how that works by querying the Apache log.

Synopsis

<?php
RResource::construct('ApacheLog', './access_log')->setFilter("request_method=''")->copy()->sort('remote_host')->display(); 
?>
Read an Apache log, copy the filtered results (invalid requests) to memory, sort them by IP address, and display the results.


The results

Three Paragraph RawDev Model Theory


The Apache log corresponds to a Resource in the RawDev model. A resource is a set of records that can reside anywhere; the Apache log, for example, resides as a file on a file system.

The different natures of a resource in RawDev.

The nature of a resource is either: (1) Traversable, (2) Array Accessible, or (3) both.
  • Traversable: A first record, next record, and last record are defined. Examples are: Apache log, csv file, and MySql table. Note t
  • Array Accessible: There are one or more primary/unique keys through which a specific record can be retrieved (*). Examples are: Wikipedia words, Alexa domain statistics, and an Oracle table.
  • Both: Any relational database table, such as one in PostgreSQL, SQLite, or MS SQL Server, is both Traversable and Array Accessible.

(*) Note that traversable resources such as the Apache log could be made "array accessible" by line number, but in this case that is not useful.

In our example the Apache log resource is "Traversable" but not "Array Accessible". We can easily go to the first record (open file), read the next record (read), and detect the last record (end of file). However, we cannot naturally find a specific record based on a unique key. First, the log does not have a unique key. Second, even if it had one, we could not find a record by that key efficiently (that is, without doing a full resource scan).
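To make the two natures concrete, here is a minimal sketch in plain PHP (8.0 or later is assumed). It does not use RawDev: `logLines` and `KeyedRecords` are hypothetical names of mine, and the sample values are made up. A generator stands in for a traversable resource, and PHP's built-in `ArrayAccess` interface stands in for an array accessible one.

```php
<?php
// Hypothetical illustration; these names are not part of RawDev.

// Traversable only: records can be read from first to last, one at a time,
// but there is no way to jump straight to a record by key.
function logLines(array $lines): Generator {
  foreach ($lines as $line) {
    yield ['line' => $line];
  }
}

// Array accessible: a record can be fetched directly through a unique key.
class KeyedRecords implements ArrayAccess {
  private array $records = [];

  public function offsetExists(mixed $key): bool {
    return isset($this->records[$key]);
  }

  public function offsetGet(mixed $key): mixed {
    return $this->records[$key] ?? null;
  }

  public function offsetSet(mixed $key, mixed $value): void {
    $this->records[$key] = $value;
  }

  public function offsetUnset(mixed $key): void {
    unset($this->records[$key]);
  }
}

// Traversing: the only way to reach the second record is to pass the first.
$seen = [];
foreach (logLines(['first line', 'second line']) as $record) {
  $seen[] = $record['line'];
}

// Array access: go straight to one record through its key.
$domains = new KeyedRecords();
$domains['example.com'] = ['rank' => 1];
$rank = $domains['example.com']['rank'];
```

A relational table would offer both styles at once: a full scan for traversal and a primary key lookup for direct access.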

The nature of a resource immediately impacts the kinds of actions you can do on the resource as displayed in the table below:

Traversable    Array Accessible    Both
Filter         Get                 Sort
Page           Update?
Select/Copy    Delete?
Insert?        Insert?

A resource that is both traversable and array accessible inherits all the actions and can also be sorted. The question mark indicates that it's possible if the resource supports it.

Note that a traversable resource such as a csv file cannot be sorted, updated, or deleted. An easy way to get around this is to copy a filtered selection or all records to a resource (such as memory) that does support these actions and then copy the results back (if needed).
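The workaround can be sketched in plain PHP (7.4 or later is assumed). This mimics the idea only, not RawDev's actual API: `traverse` and `copyFiltered` are hypothetical helpers, and the records are made-up sample data.

```php
<?php
// Hypothetical helper names; this mimics the idea, not RawDev's actual API.

// A traversable source: yields one record at a time, like reading a file.
function traverse(array $rows): Generator {
  foreach ($rows as $row) {
    yield $row;
  }
}

// Copy the records that pass the filter into a plain array (memory).
function copyFiltered(iterable $source, callable $filter): array {
  $memory = [];
  foreach ($source as $record) {
    if ($filter($record)) {
      $memory[] = $record;
    }
  }
  return $memory;
}

// Made-up sample records.
$source = traverse([
  ['remote_host' => '10.0.0.9', 'request_method' => ''],
  ['remote_host' => '10.0.0.2', 'request_method' => 'GET'],
  ['remote_host' => '10.0.0.1', 'request_method' => ''],
]);

// Keep only the records with an empty request method.
$memory = copyFiltered($source, fn($r) => $r['request_method'] === '');

// The in-memory copy can be sorted, which the one-pass source could not.
// Note: this is a plain string comparison, not a numeric IP comparison.
usort($memory, fn($a, $b) => strcmp($a['remote_host'], $b['remote_host']));
```

Every record is still scanned once during the copy; only the filtered subset has to fit in memory.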

The Apache Log plugin

The Apache Log is a traversable resource and therefore needs to implement 4 methods:
(1) The constructor, which takes the properties specific to the resource

(2) The open method that opens the resource

(3) The fetch method that reads and returns the next record. A record is a hash with the column name as the key. Returns null at the end of the resource.

(4) The close method that closes the resource.

class RApacheLog extends RQuerySupportResource {

  var $fileName;
  var $format;

  var $titles;
  var $handle;

  function __construct($fileName, $format="%h %l %u %t \"%r\" %>s %b", $timeZone='America/New_York') {
    $this->fileName = $fileName;
    $this->format = $format;
    $this->titles= RApacheLogParser::getTitles($format);
    date_default_timezone_set($timeZone);
  }

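  function fetch() {
    # fgets() returns false at end of file; blank lines are skipped rather than treated as the end
    while (($line = fgets($this->handle)) !== false) {
      $line = trim($line);
      if ($line !== '') return RApacheLogParser::parseLine($this->titles, $line);
    }
    return null; # end of resource
  }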

  function open() {
    if (!($this->handle = fopen($this->fileName, "r"))) throw new RException('model_csv_cannot_open', 'Cannot open file [%s].');
  }

  function close() {
    fclose($this->handle);
  }

}

Above you can see the implementation for the ApacheLog resource. The constructor accepts the file path, the log format string (e.g. "%h %l %u %t \"%r\" %>s %b"), as well as the timezone as arguments (it will convert the web access date time to the timezone you specify).

Note that the plugin relies on an ApacheLog parser that (a) converts the format string into titles and (b) converts a log line (string) into a hash with the titles as keys and the line values as values.
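The parser itself is not shown in this post, so here is a rough sketch of what steps (a) and (b) could look like in plain PHP (8.0 or later is assumed). The function names `sketchTitles` and `sketchParseLine` are hypothetical, it only handles the default format string, and apart from remote_host and request_method the column titles are my assumptions. RawDev's RApacheLogParser may work quite differently.

```php
<?php
// Hypothetical sketch; RawDev's RApacheLogParser may work differently.

// (a) Convert the format string into column titles. Only 'remote_host' and
// 'request_method' appear in the article; the other titles are assumed.
function sketchTitles(string $format): array {
  $map = [
    '%h'   => ['remote_host'],
    '%l'   => ['remote_logname'],
    '%u'   => ['remote_user'],
    '%t'   => ['time'],
    '"%r"' => ['request_method', 'request_url', 'request_protocol'],
    '%>s'  => ['status'],
    '%b'   => ['bytes'],
  ];
  $titles = [];
  foreach (explode(' ', $format) as $directive) {
    foreach ($map[$directive] ?? [] as $title) {
      $titles[] = $title;
    }
  }
  return $titles;
}

// (b) Convert one log line into a hash keyed by the titles.
// Assumes the default format; returns null if the line does not match.
function sketchParseLine(array $titles, string $line): ?array {
  $pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$/';
  if (!preg_match($pattern, $line, $m)) {
    return null;
  }
  // Split the request line ("GET /index.html HTTP/1.1") into three parts,
  // padding with empty strings when parts are missing.
  $request = array_pad(explode(' ', $m[5], 3), 3, '');
  $values = array_merge([$m[1], $m[2], $m[3], $m[4]], $request, [$m[6], $m[7]]);
  return array_combine($titles, $values);
}

$titles = sketchTitles('%h %l %u %t "%r" %>s %b');
$record = sketchParseLine(
  $titles,
  '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
);
```

The sample line is the one used in the Apache documentation for the Common Log Format. A request line with no method ends up with an empty request_method here, which is the kind of record the filter in the example below selects.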

The Example


<?php
require_once('rawdev/RawDev.php');
require_once(RAWDEV_LIB.'/Model/Model.php');
require_once(RAWDEV_LIB.'/Model/plugins/apache/ApacheLog.php'); 

$apache = new RApacheLog(dirname(__FILE__).'/access_log'); # creates the apache resource
$apache->setFilter("request_method=''"); # sets the filter to suspicious requests

$memory = $apache->copy(); # copies the filtered resource to memory (default resource)
$memory->sort('remote_host'); # sorts the results by IP address
$memory->display(); # displays the results

?>

This example shows (a) how the Apache resource is created and how the filter is set (think of a SQL WHERE clause), and (b) how the filtered results are copied into memory, sorted, and displayed. The results are displayed below. Note that the copy command accepts any resource that supports it. By default the Memory Resource is chosen.

The example was executed on a log file just shy of 2 MB. Note that during the copy process all records are actually scanned and filtered; this took about 3 seconds on a MacBook.



Conclusions

If you do Apache log processing, this can be a useful utility to (a) quickly scan a file or (b) store the results in a persistent database for ad-hoc queries. Of course, this can all be done using RawDev. The results of these ad-hoc queries can easily be exported to a database table, a CSV file, or an Excel file using one line of code.
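For comparison, here is roughly what a CSV export of a few records involves in plain PHP without RawDev. This is only a sketch: `exportCsv` is a hypothetical helper of mine, not RawDev's export API, and the record is made-up sample data.

```php
<?php
// Hypothetical helper, not RawDev's export API: writes records to a CSV stream.
function exportCsv(array $records, $stream): void {
  if (!$records) {
    return;
  }
  // Header row from the first record's keys, then one line per record.
  fputcsv($stream, array_keys($records[0]), ',', '"', '\\');
  foreach ($records as $record) {
    fputcsv($stream, array_values($record), ',', '"', '\\');
  }
}

// Write to an in-memory stream so the sketch does not touch the file system.
$stream = fopen('php://memory', 'w+');
exportCsv([['remote_host' => '10.0.0.1', 'status' => '400']], $stream);
rewind($stream);
$csv = stream_get_contents($stream);
fclose($stream);
```

Replacing the in-memory stream with a file handle from fopen() would write the same rows to disk.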
