Throughput Monitor




Audience

This document and utility is mostly interest of system administrators, who need to keep eye on their system. One object of interest of system administrators is system's resource usage or usage frequency. Too high frequency usually means moment for intervention. Because this can mean abuse or attack.



Introduction

Throughput monitor is a log analyzer. In general notation it is a event counter per timeframe - in short frquency monitor. If it detects too high event rate, utility notifies about it. If rate drops below predefined value, also notify with statistics gathered meanwhile. It can analyze past logs or realtime logs. Events are distinguished per observation object.



Logs and input format

Throughput monitor can analyze every single-line log (1 event per line). Good examples are of this kind of log producers are syslog and apache. Log lines are filtered through regex. Log can contain random crap, except on line we are interested in. Usually most of log is useless. The interesting lines must be somewhat consistent and regex must be able to detect:

  • timestamp in format: Day Mon dd hh:mm:ss yyyy
  • observerion object like username or hostname, etc


Usage example

postmaster@host:~ >throughput_monitor2 -hwm 20 -lwm 10 -tf 1800 -i logs/mail.log -m hist -ok conf/relay_ok.cf
Sun Sep  8 03:10:03 2002 hwm unknown[63.240.11.107]
Sun Sep  8 03:40:32 2002 lwm unknown[63.240.11.107] max=42 count=25 duration=1829/00:30:29 interval=00:01:13
Sun Sep  8 04:05:14 2002 hwm unknown[64.106.148.10]
Sun Sep  8 04:09:57 2002 lwm unknown[64.106.148.10] max=26 count=7 duration=283/00:04:43 interval=00:00:40
Sun Sep  8 06:01:40 2002 hwm unknown[207.90.33.18]
Sun Sep  8 06:31:22 2002 lwm unknown[207.90.33.18] max=26 count=8 duration=1782/00:29:42 interval=00:03:42
Sun Sep  8 06:51:40 2002 hwm unknown[64.106.148.20]
Sun Sep  8 07:21:25 2002 lwm unknown[64.106.148.20] max=25 count=8 duration=1785/00:29:45 interval=00:03:43
Sun Sep  8 07:26:07 2002 hwm tomts6.bellnexxia.net[209.226.175.26]
Sun Sep  8 07:54:56 2002 lwm tomts6.bellnexxia.net[209.226.175.26] max=21 count=2 duration=1729/00:28:49 interval=00:14:24
Sun Sep  8 10:00:49 2002 hwm unknown[63.240.11.107]
Sun Sep  8 10:31:15 2002 lwm unknown[63.240.11.107] max=46 count=27 duration=1826/00:30:26 interval=00:01:07
Sun Sep  8 11:28:41 2002 hwm unknown[63.240.11.107]
Sun Sep  8 11:59:46 2002 lwm unknown[63.240.11.107] max=51 count=33 duration=1865/00:31:05 interval=00:00:56
Sun Sep  8 12:20:47 2002 hwm lsmail6.oin60.com[66.236.248.115]
Sun Sep  8 12:48:34 2002 hwm unknown[64.106.148.10]
Sun Sep  8 12:50:47 2002 lwm lsmail6.oin60.com[66.236.248.115] max=28 count=9 duration=1800/00:30:00 interval=00:03:20
Sun Sep  8 13:25:23 2002 lwm unknown[64.106.148.10] max=33 count=17 duration=2209/00:36:49 interval=00:02:09
Sun Sep  8 14:25:25 2002 hwm unknown[207.90.33.18]
Sun Sep  8 14:54:35 2002 lwm unknown[207.90.33.18] max=26 count=7 duration=1750/00:29:10 interval=00:04:10
Sun Sep  8 17:07:29 2002 hwm unknown[207.90.33.18]
Sun Sep  8 17:36:49 2002 lwm unknown[207.90.33.18] max=26 count=7 duration=1760/00:29:20 interval=00:04:11
postmaster@host:~ >

This example scans through mail log file and detects mail senders and keeps count of mails relayed. If one passes more than 30 mails, output of hwm is generated. Same for lwm, except we see additional statistics. Time frame is here 1800 seconds, which is 30 minutes. Trusted sites are not counted and they are listed in conf/relay_ok.cf



Output

Output can be only hwm or lwm line. If amount of events per timeframe rises above hwm, then hwm line is generated. The format of hwm line is following:

Day Mon dd hh:mm:ss yyyy hwm obj
  • Day Mon dd hh:mm:ss yyyy - timestamp
  • hwm - fixed string
  • obj - can be username or hostname or whatever user's regex matches as observation object

Similarly lwm line is generated when rate drops below lwm. Also additional statistics is included. The format of lwm line is following:

Day Mon dd hh:mm:ss yyyy lwm obj max=zz count=xx duration=sec/hh:mm:ss interval=hh:mm:ss
  • Day Mon dd hh:mm:ss yyyy - timestamp
  • lwm - fixed string
  • obj - can be username or hostname or whatever user's regex matches as observation object
  • max=zz - the peak amount (zz) of event in timeframe
  • count=xx - count of events while rate is over hwm. May be smaller than max. Because when rate crosses hwm line, count is initalized to 1, but max at the same moment is hwm.
  • duration=sec/hh:mm:ss - how long object were in hwm state. Duration is given in both forms: seconds for automation and hhmmss for humans.
  • interval=hh:mm:ss - average delay between events. Equals to duration/count.

One may ask, why do i need lwm output when detecting spam? Because statistics included gives feedback about parameters to tune and also we get quick info about incident.

Know that the utility may complain about timetravel when log is not linear. In the other words, every next matched log entry must have timestamp, which is greater or equal then previous one of the same object. Unfortunately this is too common problem with logs in general.

Time Frame

... is amount of time. For example 1800 seconds (=30min). It means the utility remember events from now to 1800 seconds backwards. What happened before that is not simply interst. The interest is what is happening now? How can we achieve this? One way to do it is to remember certain amount of events and see how frequent they are. If there were more events than hwm during from now to timeframe seconds backwards we raise eyebrow and become concerned. Beaware, the shorter timeframe is the more sensitive observer becomes, the longer the timeframe is the more smoother observer becomes.



Watermarks

hwm and lwm denote High WaterMark and LowWaterMark. Those are limits which represent our tolerance towards certain activity. Hwm is used when rate increases and lwm is used when rate decreases. Why do we need two watermarks and not one? Because sometimes rate oscillates around some level and one watermark means too many output (noise). Two watermarks gives us a way to eliminate oscillation effect. Lwm must be less or equal than hwm.



Observation Objects

The configurable log matching regex gives us name of an object. It can be username, hostname, ip address or whatever thing. So one can observe many objects in one log by running instance of analyzer. If there is only one object to observe - regex must return constant.



Commandline Arguments

Following table gives us short overview of commandline arguments of throughput monitor. Every flag takes exactly one parameter. All columns are self-explaining, except maybe first. First column denotes how many times can or must be a flag specified (see under the table).


?-d"match"debug event matching
*-eemailadmin Email
1-hwmnumberHigh WaterMark (events per time frame)
1-ifilenameInput file name, or 'stdin'.
1-lwmnumberLow WaterMark (events per time frame)
1-m{"now"|"hist"}Realtime monitoring (tail, if -i file) or history analyzer mode.
*-okfilenameobj OK file
1-tfnumberTime Frame (seconds)
  • ? - zero or one, in the other words optional argument
  • 1 - exactly one, in the other words mandatory argument
  • * - zero or more

Full description follows:

-d "match"
One can turn on debugging of regex matching by '-d match'. Regex tuning could be much simpler with this.
-e email
When specified, hwm and lwm messages are sent to email. If multiple emails are given with multiple -e arguments then multiple emails recieve notifications. -e option is not compatible with '-m hist'.
-hwm number
High watermark. Must be positive number. When rate rises above hwm, notify.
-i filename
Input file specification. Can be 'stdin' for stdin.
-lwm number
Low watermark. Must be positive number. When rate drops below lwm, notify. Lwm must be smaller or equal than hwm.
-m {"now"|"hist"}
'-m now' means the input file is tail'ed with system tail. '-m now' and '-i stdin' are not compatible options. '-m now' means realtime analysis, any out of frame events are immediately droppped. '-m hist' means analyzing old logs. Also note that when '-m hist' finishes, then unreported lwm events are reported with possibly future timestamps.
-ok filename
Reads in file of "ok" objects. One object per line. These objects are not checked - their activity is trusted and not object of system administrators concern. This option can be specified multiple times - all objects are used.
-tf number
Specified time frame size in seconds. Must be positive integer.


Requirements

  • Perl 5.005_03
  • 'mail' command or its equivalent
  • 'tail' command or its equivalent
  • knowledge about regex-s

Configuration

Where do i put my regex? In the beginning of the perl script there is a section 'user conf'. Adjust parameters there. Also, depending of location of perl you may need to edit first line of the script - the path of the perl.

$mail = "/usr/bin/mail";
Where your 'mail' program lives?
$tail = "/usr/bin/tail";
Where your 'tail' program lives?
$mail_subj_pfx = "$ENV{HOST}: $ENV{USER}";
Mail subject prefix. Modify this, when you want to something else than 'host: username' in the beginning of subject of mails. Alos, if you encounter 'Use of uninitialized value at ./throughput_monitor2 line 48.' at the beginning of the script output, then modify this line. It means your environment variables have a bit different names like 'HOSTNAME' instead of 'HOST' or isnt defined at all. If you dont need any prefix put leave the string empty - "".
$regex_obj = "...";
Describes object, like username, ip, etc. Its a regex. It is used for matching -ok files.
$regex_event = "...";
Describes event line in log. Be sure to include $regex_obj variable in this regex - its for your own comfort.


Regex examples

There is a predefined veriable $regex_timestamp. be sure not to modify it. Also it gives more readeble event-matchig regexes.

Postfix, spam. Following shows how to detect spam from postfix logs. Although postfix has many tools to restrict spam, it cannot detect and avoid all of it. Here are log entry to matched and regex-s to match it.

Sep  8 00:00:42 host postfix/smtpd[36125]: 4951361063: client=hostname[123.123.123.123]
my $regex_obj = "[a-zA-Z0-9._-]+\[[0-9.]+\]";
my $regex_event = "^($regex_timestamp) [a-z0-9.-]+ postfix/smtpd\[[0-9]+\]: [A-F0-9]+: client=($regex_obj)";

Cucipop, CPU overhead. Next example is to detect cucipop entries to detect frequent mailbox-checkers - mailbox scan takes too much cpu! Log entry and regex-s follow:

Sep  2 08:13:07 host cucipop[22733]: username 123.123.123.123 3, 0 (0), 60 (5196472)
my $regex_obj = "[a-z0-9.-]+ [0-9.]+";
my $regex_event = "^($regex_timestamp) [a-zA-Z0-9.-]+ cucipop\\[[0-9]+\\]: ($regex_obj) [0-9]+, [0-9]+ \\([0-9]+\\), [0-9]+ \\([0-9]+\\)\$";

UW Imap, CPU overhead. Almost same as above. Syslog entry and regex-s follow:

Sep  9 00:00:27 host imapd[17842]: Login user=username host=hostname [123.123.123.123]
my $regex_obj = "user=[^ ]+ host=[^ ]+ \[[0-9.]+\]";
my $regex_event = "^($regex_timestamp) [a-zA-Z0-9.-]+ imapd\[[0-9]+\]: Login ($regex_obj)";


Licence

Free.



Code

  • Clean
  • Mature
  • Commented


Download

throughput_monitor2


Feedback

I'm glad to hear about your success, failures, ideas, etc.



ragnar@uninet.ee