Andy R. Terrel edited this page Jan 6, 2012 · 3 revisions

The code has been developed in two distinct areas: monitoring and analysis. The data produced by the monitoring tools is consumed in an off-line mode by the analysis tools. Below we outline the structure and implementation of these tools.

Monitor

The Ranger system has 3,936 nodes, and a job never shares a node with another job. This allows a set of monitor collectors to run on each host and give a consistent view of what a job is doing.

Data is collected on each host at one of three times:

  • job start,
  • job end, or
  • when a cron script runs (currently every 10 minutes).

This gives us a 10-minute window for each data point and results in at least 144 collections per day per host (6 per hour × 24 hours).

The collectors write data onto a local ramdisk (/var/log/tacc_stats/{current,<time_since_epoch_at_start_of_day>}). Each collection writes a block of measurements of the form:

time jobid
monitor_type device measurements
...

Each collection file begins with a set of schemas describing the measurements of each particular monitor_type.
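
The exact on-disk format is defined by tacc_stats itself; as a rough sketch, assuming each record block starts with a `time jobid` header line followed by `monitor_type device measurements` lines, a parser might look like:

```python
# Hypothetical sketch of parsing a collection file; the real tacc_stats
# format, schema handling, and field names may differ.

def parse_collection(lines):
    """Group raw collection lines into per-timestamp records."""
    records = []
    current = None
    for line in lines:
        fields = line.strip().split()
        if not fields:
            continue
        if len(fields) == 2 and fields[0].replace('.', '', 1).isdigit():
            # A "time jobid" header starts a new measurement block.
            current = {'time': float(fields[0]),
                       'jobid': fields[1],
                       'data': []}
            records.append(current)
        elif current is not None:
            # A "monitor_type device measurements..." line.
            current['data'].append({'type': fields[0],
                                    'device': fields[1],
                                    'values': fields[2:]})
    return records
```

A collector that skips the schema header could feed the remaining lines of a file straight into this function.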

At midnight these files are rotated. Between 2 and 4 am they are archived to /scratch/projects/tacc_stats/.../host/<time_since_epoch>.tar.gz.

While these files make collection simple, which we want in order to impact the system as little as possible, they require some manipulation to extract the per-job data needed for viewing. To capture all this information, a Python script reads the monitor data and produces a nested set of dictionaries containing the data. Summary statistics are uploaded to a SQL database, and each job object is pickled and stored in the filesystem.

Analysis

Currently there are a few routines for viewing data but nothing for deeper analytics. Once the data views are finished and internal users can see the system, we will return to adding more analytics.

Data Views

The current data views are all based on Django, which provides views and models of the data. There are two types of pages in the current focus:

  • Bulk data pages: These pages provide a quick summary of data statistics. For example, a homepage for the HPC analyst can show the number of jobs run, the distribution of memory use, and so forth.
  • Drill-down data pages: These pages provide finer-grained metrics that allow a user to see individual contributions. For example, a list of jobs, with each job showing some graphics on its particular run.

The bulk data pages query the SQL database for their views. The drill-down pages may grab a collection of statistics from the SQL database or from the individual job files produced by the system monitors.
