
Introduction

Predictor Factory (PF) is an application that transforms relational data into a single table suitable for data modeling (classification, regression). This process is generally known as "propositionalization" and is necessary whenever we want to process relational data, which are spread across multiple tables, with algorithms that require a single table as input.

Connection to the database

The database has to support JDBC connections. Currently supported databases are:

  1. Microsoft SQL Server
  2. MySQL
  3. Oracle
  4. PostgreSQL

Just note that if you are connecting to Microsoft SQL Server, you have to permit logins with passwords (by default, only Windows authentication is permitted). Also, the database may be behind a firewall. Hence, if you are having difficulties logging in to the database with Predictor Factory, try to log in first with some other SQL client to rule out a misconfiguration.
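
If password logins are disabled on your SQL Server, a database administrator has to switch the server to mixed-mode authentication (in SQL Server Management Studio: Server Properties -> Security) and create a login with a password. A minimal T-SQL sketch, with purely illustrative login and database names:

    -- Create a SQL Server login with a password (names are examples only)
    CREATE LOGIN predictor_factory WITH PASSWORD = 'ChangeMe123!';

    -- Map the login to a user in the database Predictor Factory should read
    USE financial;
    CREATE USER predictor_factory FOR LOGIN predictor_factory;

    -- Grant read access to the data
    ALTER ROLE db_datareader ADD MEMBER predictor_factory;

Since Predictor Factory writes its results into the output schema, the login also needs the right to create tables there.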

Demo data

An example database with data from a bank is freely available at https://relational.fit.cvut.cz. To try Predictor Factory, you don't have to download the data and set up a database - you can connect Predictor Factory directly to the example database.

Own data

There are a few requirements a database must meet to be processable with Predictor Factory:

  1. The tables must be linked together with foreign key constraints (see the sketch after this list)
  2. The database must contain a so-called "target table" with the content described in the following section
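
For illustration, this is how a missing link could be declared in the example bank database (the table and column names follow the demo schema; adapt them to your own data):

    -- Declare that loan.account_id references account.account_id
    ALTER TABLE loan
      ADD CONSTRAINT fk_loan_account
      FOREIGN KEY (account_id) REFERENCES account (account_id);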

Target identification

If you want to make a prediction based on the data in your database, you have to clarify three things for yourself:

  1. Target attribute: which variable do you want to predict
  2. Target entity: for which entity
  3. Target timestamp: and when do you want to have the prediction

If we want to predict propensity to default (the probability that a customer is not going to pay off his/her loan) in the example database, the target attribute is "status" in the "loan" table.

Once we have identified the target attribute in the database, we can move on to the next question: whom or what do we want to classify? In our bank example, we can make the prediction at two different levels:

  1. "loan_id" (an identifier of the application for a loan)
  2. "account_id" (there could be multiple loan applications from a single account)

If we do not know for which entity to perform the prediction, it is best to make the prediction for the smallest entity we have, because we can always aggregate predictions for smaller entities into predictions for bigger entities. Hence, we go with "loan_id".
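
Should account-level predictions be needed later, the loan-level scores can still be rolled up. A sketch, assuming a hypothetical table loan_score(loan_id, account_id, probability_of_default) holding the model's output:

    -- Aggregate loan-level scores into one score per account:
    -- an account is as risky as its riskiest loan application
    SELECT account_id,
           MAX(probability_of_default) AS probability_of_default
      FROM loan_score
     GROUP BY account_id;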

Finally, we may optionally define the time at which we want to have the prediction. In our example, we want to have the prediction at the time of the loan application, so we set the target timestamp to the "date" column. Nevertheless, if the prediction doesn't depend on history (as in predicting the toxicity of a molecule), the target timestamp can be left unfilled.

Predictor Factory requires that all three of these columns are defined in a single table, called the target table. If such a table doesn't exist in the database, it has to be created.
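
In the example bank database, the "loan" table already holds all three columns, so it can serve as the target table directly. If your own data keep these columns elsewhere, a target table can be built with a query along these lines (a sketch with illustrative names; on Microsoft SQL Server, use SELECT ... INTO instead of CREATE TABLE ... AS):

    -- Collect target id, target attribute and target timestamp into one table
    CREATE TABLE target_loan AS
    SELECT loan_id,  -- target entity
           status,   -- target attribute (what we predict)
           date      -- target timestamp (as of when we predict)
      FROM loan;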

Selection of the tables and columns

Since we may have a hunch about which tables and columns are useful for the prediction, and which are not, we can greatly accelerate the propositionalization by ignoring irrelevant tables and columns.

Patterns

Patterns define how to transform relational data into predictors (also known as "features"). Generally, the more patterns you try, the better features you get. But there is a tradeoff - more patterns mean a longer run time of Predictor Factory.

If you want to quickly get some results, use just "Aggregate" and "Direct field" patterns.
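
To get a feeling for what a pattern produces, here is roughly the kind of SQL an "Aggregate" pattern generates in the bank example (a sketch only; the actual queries are composed by Predictor Factory and recorded in the "journal" table):

    -- One candidate predictor: average transaction amount per loan
    SELECT loan.loan_id,
           AVG(trans.amount) AS trans_amount_avg
      FROM loan
      JOIN trans ON trans.account_id = loan.account_id
     GROUP BY loan.loan_id;

A "Direct field" pattern, in contrast, essentially copies an existing column, such as "loan.amount", into the output table.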

Settings

If the target timestamp is defined, it is necessary to revise two parameters:

  1. Window size: how much history to use to make the prediction (called "lag" in the GUI)
  2. Blackout: how much of the most recent history to ignore (called "lead" in the GUI)

Sometimes it is not practical to use all the available history to make predictions. For example, it can be better to use only the records from the last year because all the older history is stored on slow tapes. The length of the history to use is defined by the window size.

Furthermore, it commonly takes some time until all the data gets into the data warehouse. If the data in the data warehouse lags one day behind wall-clock time and we want a prediction "now", the data from the last day are going to be missing. To simulate this lag, the blackout can be set. Note that the blackout is just a convenience - the same result can be obtained with an appropriately shifted target timestamp.

Both window size and blackout can be set in units such as seconds, days, months, and years.
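
In terms of the generated SQL, window size and blackout roughly translate into a date filter relative to the target timestamp. A sketch for the bank example with a one-year window and a one-day blackout (date arithmetic syntax varies between databases; this variant is PostgreSQL-flavored):

    -- Use only transactions between (target date - 1 year) and (target date - 1 day)
    SELECT loan.loan_id,
           AVG(trans.amount) AS trans_amount_avg
      FROM loan
      JOIN trans ON trans.account_id = loan.account_id
     WHERE trans.date >  loan.date - INTERVAL '1 year'  -- window size (lag)
       AND trans.date <= loan.date - INTERVAL '1 day'   -- blackout (lead)
     GROUP BY loan.loan_id;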

Result

Predictor Factory produces the following tables in the output schema (example queries for inspecting them follow the list):

  1. mainSample: contains the top features as measured with penalized Chi2
  2. journal: contains metadata about the features, such as the SQL query used or the run time
  3. predictor*: a temporary table with a computed feature
  4. propagated_*: a temporary table produced by joining one of the original tables with the target table
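
After a run finishes, the results can be inspected directly in the output schema, for example (depending on the database, the mixed-case table name may need quoting, and the exact columns of "journal" can differ between versions):

    -- The single table ready for modeling
    SELECT * FROM mainSample;

    -- Metadata about the generated features
    SELECT * FROM journal;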

Penalized Chi2

The best features are selected with a univariate feature selection method based on Chi2. The disadvantage of Chi2 is that it favors nominal attributes with high cardinality (attributes with many unique values). This bias is undesirable because it may lead to the selection of id-like attributes that have no generalization ability. Hence, penalized Chi2 is used (the difference between Chi2 and penalized Chi2 is similar to the difference between Information Gain and Information Gain Ratio).
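
For intuition only, such a penalization can be written as the Chi2 statistic divided by a term that grows with the attribute's cardinality; the exact penalty used by Predictor Factory may differ from this illustrative form:

    % One possible penalization (illustrative, not necessarily PF's exact formula)
    \chi^2_{\text{penalized}}(a) = \frac{\chi^2(a)}{\log_2\bigl(\text{cardinality}(a) + 1\bigr)}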
