Writing An Analytic Module

Dana Silver edited this page Jun 7, 2016 · 11 revisions

Analytic modules are one of two ways to extend MiddGuard for your investigation. They transform input data and save the changes to a database. The other type of extension is a visualization module, which creates a view of some data in the browser. As a general rule, if you're not creating a visualization, you want to write an analytic module.

Analytic Nodes

Analytic modules are the backing code for analytic nodes. In MiddGuard's front-end environment, we create nodes from modules and connect nodes to one another. Multiple nodes can be based on the same module. A node uses a combination of its module and its connections to other nodes to perform a transformation on some input and save it as output. That output becomes the input for other nodes. Each node has a one-to-one relationship to a table in MiddGuard's database.

Structure

An analytic module needs only one JavaScript file, but it can be as complex as necessary. It isn't required, but placing that file in a directory named after the module lets the module grow to multiple files and stay organized. That directory lives alongside the main app.js.

.
├── app.js
└── tweet-timestamp-difference
    └── index.js

Contents

Our module tweet-timestamp-difference will take in information from two Twitter users' timelines and calculate, for each day of the week and hour of the day, the difference between the number of tweets the two people sent.

We can assume we have already written modules to retrieve the tweets and aggregate them by day of the week and hour.

Inputs

inputs declares what the module takes in. It is an array in which each element, an input group, refers to the output from one analytic node. Input groups are named so we can refer to them later on. Each input group also enumerates the attributes it uses. Each attribute is a column in the corresponding output from the connected node.

We have two input groups, tweets1 and tweets2. Each has attributes day, hour, and count: the day of week, hour of day, and number of tweets a person sent at that day and hour.

exports.inputs = [
  {name: 'tweets1', inputs: ['day', 'hour', 'count']},
  {name: 'tweets2', inputs: ['day', 'hour', 'count']}
];

Here's an example of what the incoming data for tweets1 might look like in tabular form:

day  hour  count
0    0     56
0    1     34
0    2     78
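
As a rough illustration, the same rows can be pictured as plain objects, one object per row and one key per column, which is the shape a Knex select normally resolves with. The sketch below is not part of MiddGuard: missingAttributes is a hypothetical helper that reports which attributes declared by an input group are absent from a row.

```javascript
// The input declaration from this guide.
var inputs = [
  {name: 'tweets1', inputs: ['day', 'hour', 'count']},
  {name: 'tweets2', inputs: ['day', 'hour', 'count']}
];

// The sample table above, written as row objects.
var tweets1Rows = [
  {day: 0, hour: 0, count: 56},
  {day: 0, hour: 1, count: 34},
  {day: 0, hour: 2, count: 78}
];

// Hypothetical helper (not a MiddGuard API): list the declared
// attributes that a row does not have.
function missingAttributes(inputGroup, row) {
  return inputGroup.inputs.filter(function(attribute) {
    return !Object.prototype.hasOwnProperty.call(row, attribute);
  });
}

console.log(missingAttributes(inputs[0], tweets1Rows[0]));    // []
console.log(missingAttributes(inputs[0], {day: 0, hour: 0})); // [ 'count' ]
```

A check like this can be handy while developing, since a connected node whose table lacks a declared column would otherwise only surface as undefined values later on.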

Outputs

outputs enumerates the attributes for each element that this module outputs. Each element is a row inserted directly into a node's database table. Each node only has one output (its table), so there is no need here for multiple "output groups". In the front-end interface we line up input attributes (like day, hour, and count in the inputs) and output attributes like the ones below.

Our output attributes are a day of the week, an hour of the day, the number of tweets sent by two different people at that day and hour (count1 and count2), and the difference between count1 and count2.

exports.outputs = [
  'day',
  'hour',
  'count1',
  'count2',
  'difference'
];

An example of the output data for our module in tabular form is:

day  hour  count1  count2  difference
0    0     56      23      33
0    1     34      20      14
0    2     78      54      24

Display Name

displayName is a string containing a prettier version of the module's name in the file system. This is how the module and its nodes will be identified throughout the front-end interface.

Good display names should be short and descriptive. We'll name ours by replacing hyphens with spaces and capitalizing the words.

exports.displayName = 'Tweet Timestamp Difference';
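
The naming rule above (replace hyphens with spaces, capitalize each word) can be sketched as a small function. toDisplayName is a hypothetical helper for illustration only; MiddGuard simply reads whatever string the module assigns to exports.displayName.

```javascript
// Hypothetical helper: turn a module directory name into a display name
// by splitting on hyphens and capitalizing each word.
function toDisplayName(directoryName) {
  return directoryName
    .split('-')
    .filter(function(word) { return word.length > 0; })
    .map(function(word) {
      return word.charAt(0).toUpperCase() + word.slice(1);
    })
    .join(' ');
}

console.log(toDisplayName('tweet-timestamp-difference'));
// Tweet Timestamp Difference
```

Writing the string by hand, as we did above, is just as good; the function only makes the convention explicit.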

Create Table

createTable is a function used to create the tables for nodes based on this module. The function is always passed two arguments, tableName and knex.

tableName is the name MiddGuard has assigned the node's table at runtime. Modifying it here before creating the table will make MiddGuard unable to find the table later.

knex is an instance of Knex, a SQL generator. With Knex, we can use many relational databases and write the same code to perform SQL statements. This instance of Knex is already connected to the MiddGuard database.

Its schema-building functions, like knex.schema.createTable, return Promises. Creating a table is an asynchronous operation, so createTable must return that Promise to let MiddGuard know when Knex has finished creating the table.

The createTable function should closely resemble the module's outputs, since each output needs a column in the table.

exports.createTable = function(tableName, knex) {
  return knex.schema.createTable(tableName, function(table) {
    table.integer('day');
    table.integer('hour');
    table.integer('count1');
    table.integer('count2');
    table.integer('difference');
  });
};
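
Because every column here is an integer, one way to keep createTable in step with outputs is to generate the columns from the outputs array. The following is a minimal sketch under that assumption. It calls only knex.schema.createTable and table.integer, as in the function above. To make it runnable without a database, it uses makeFakeKnex, a stand-in written for this example that records the requested columns. The table name example_table is likewise made up.

```javascript
var outputs = ['day', 'hour', 'count1', 'count2', 'difference'];

// Same shape as the createTable above, but with one integer column
// generated per output attribute.
function createTable(tableName, knex) {
  return knex.schema.createTable(tableName, function(table) {
    outputs.forEach(function(attribute) {
      table.integer(attribute);
    });
  });
}

// Stand-in for Knex (an assumption for this demo only): it records the
// requested columns and returns a resolved Promise.
function makeFakeKnex(created) {
  return {
    schema: {
      createTable: function(tableName, build) {
        var columns = [];
        build({integer: function(name) { columns.push(name); }});
        created[tableName] = columns;
        return Promise.resolve();
      }
    }
  };
}

var created = {};
var done = createTable('example_table', makeFakeKnex(created));

done.then(function() {
  console.log(created.example_table);
});
```

If the outputs ever need different column types, writing the columns out by hand, as in the original function, is the clearer choice.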

Handle

handle is the function called to transform input data into output data. It is the core of our module.

Like createTable, handle is passed a variable defined by MiddGuard, context. context contains all the information about how a node based on this module is connected to other nodes in the same graph. See the context guide for more details and examples.

The important parts of context for our function are context.inputs and context.table. context.inputs has one key per input group our module declares, named after the group: here, context.inputs.tweets1 and context.inputs.tweets2. Each carries a Knex connection (its knex property) already bound to the table where that input's rows are stored.

context.table is the module's output. Like each of the inputs in context.inputs, it has a Knex connection, used to insert rows into the node's table (or run any other query on the table).
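
To make that shape concrete, here is a hand-built object with the parts of context described above. It is only a sketch for experimenting outside MiddGuard: fakeTable is a hypothetical in-memory stand-in, and it supports just the two calls this guide uses, select('*') and insert with an array of rows. The real context is created by MiddGuard and contains more than this.

```javascript
// Hypothetical in-memory stand-in for a table-bound Knex connection.
// It only supports select('*') and insert(arrayOfRows).
function fakeTable(rows) {
  return {
    rows: rows,
    knex: {
      select: function() {
        // Resolve with a copy so callers cannot mutate the "table".
        return Promise.resolve(rows.slice());
      },
      insert: function(newRows) {
        Array.prototype.push.apply(rows, newRows);
        return Promise.resolve();
      }
    }
  };
}

// The parts of `context` described above, built by hand.
var context = {
  inputs: {
    tweets1: fakeTable([{day: 0, hour: 0, count: 56}]),
    tweets2: fakeTable([{day: 0, hour: 0, count: 23}])
  },
  table: fakeTable([])
};

context.inputs.tweets1.knex.select('*').then(function(rows) {
  console.log(rows); // [ { day: 0, hour: 0, count: 56 } ]
});
```

An object like this can be passed to a handle function in a unit test, so the transformation can be checked without a running database.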

It's useful to see our inputs and outputs alongside the handle function as a reference for the context.

exports.inputs = [
  {name: 'tweets1', inputs: ['day', 'hour', 'count']},
  {name: 'tweets2', inputs: ['day', 'hour', 'count']}
];

exports.outputs = ['day', 'hour', 'count1', 'count2', 'difference'];

And here is the handle function, annotated:

exports.handle = function(context) {
  // We'll insert objects with each of the output attributes into
  // the `week` array, then insert `week` into `context.table`.
  var tweets1 = context.inputs.tweets1,
      tweets2 = context.inputs.tweets2,
      week = [];

  // Select everything from each of the inputs (tweets1 and tweets2).
  return Promise.join(tweets1.knex.select('*'), tweets2.knex.select('*'),
  function(rows1, rows2) {

    // Iterate through the hours of the day and days of the week
    _.range(24).forEach(function(hour) {
      _.range(7).forEach(function(day) {

        // Get the count of tweets from each input at that hour and day.
        // A combination with no row counts as zero tweets, so a gap in
        // the input doesn't throw a TypeError.
        var row1 = _.find(rows1, {hour: hour, day: day});
        var row2 = _.find(rows2, {hour: hour, day: day});
        var count1 = row1 ? row1.count : 0;
        var count2 = row2 ? row2.count : 0;
        week.push({
          day: day,
          hour: hour,
          count1: count1,
          count2: count2,
          difference: Math.abs(count1 - count2)
        });
      });
    });

    // Insert everything into the table.
    return context.table.knex.insert(week);
  });
};
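
The transformation at the heart of handle can also be written as a plain function over arrays of rows, which makes it easy to try without Knex, lodash, or bluebird. This is a sketch rather than part of the module: buildWeek is a hypothetical name, and treating a day/hour combination with no row as zero tweets is an assumption of this example.

```javascript
// Hypothetical helper: compute the output rows from two arrays of input rows.
// A missing day/hour combination is treated as zero tweets (an assumption).
function buildWeek(rows1, rows2) {
  function countAt(rows, day, hour) {
    for (var i = 0; i < rows.length; i++) {
      if (rows[i].day === day && rows[i].hour === hour) {
        return rows[i].count;
      }
    }
    return 0;
  }

  var week = [];
  for (var hour = 0; hour < 24; hour++) {
    for (var day = 0; day < 7; day++) {
      var count1 = countAt(rows1, day, hour);
      var count2 = countAt(rows2, day, hour);
      week.push({
        day: day,
        hour: hour,
        count1: count1,
        count2: count2,
        difference: Math.abs(count1 - count2)
      });
    }
  }
  return week;
}

var week = buildWeek(
  [{day: 0, hour: 0, count: 56}, {day: 0, hour: 1, count: 34}],
  [{day: 0, hour: 0, count: 23}, {day: 0, hour: 1, count: 20}]
);

console.log(week.length); // 168
console.log(week[0]);     // { day: 0, hour: 0, count1: 56, count2: 23, difference: 33 }
```

The result always has 24 × 7 = 168 rows, one per hour and day, in the same order the nested loops in handle produce them.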

Putting it all together

Here are the complete contents of index.js in tweet-timestamp-difference.

.
├── app.js
└── tweet-timestamp-difference
    └── index.js

Note that we use two external dependencies, lodash and bluebird. We can require these in the module just like in any other Node.js module.

var _ = require('lodash');
var Promise = require('bluebird');

exports.inputs = [
  {name: 'tweets1', inputs: ['day', 'hour', 'count']},
  {name: 'tweets2', inputs: ['day', 'hour', 'count']}
];

exports.outputs = [
  'day',
  'hour',
  'count1',
  'count2',
  'difference'
];

exports.displayName = 'Tweet Timestamp Difference';

exports.createTable = function(tableName, knex) {
  return knex.schema.createTable(tableName, function(table) {
    table.integer('day');
    table.integer('hour');
    table.integer('count1');
    table.integer('count2');
    table.integer('difference');
  });
};

exports.handle = function(context) {
  var tweets1 = context.inputs.tweets1,
      tweets2 = context.inputs.tweets2,
      week = [];

  return Promise.join(tweets1.knex.select('*'), tweets2.knex.select('*'),
  function(rows1, rows2) {
    _.range(24).forEach(function(hour) {
      _.range(7).forEach(function(day) {
        // A day/hour combination with no row counts as zero tweets.
        var row1 = _.find(rows1, {hour: hour, day: day});
        var row2 = _.find(rows2, {hour: hour, day: day});
        var count1 = row1 ? row1.count : 0;
        var count2 = row2 ? row2.count : 0;
        week.push({
          day: day,
          hour: hour,
          count1: count1,
          count2: count2,
          difference: Math.abs(count1 - count2)
        });
      });
    });

    return context.table.knex.insert(week);
  });
};
