Getting Started
Follow the steps below to learn how to use PyCMTensor to estimate a discrete choice model. In this tutorial, we will use the London Passenger Mode Choice (LPMC) dataset. Download the dataset and place it in the working directory.
Import the PyCMTensor package and read the data using pandas:
import pycmtensor
import pandas as pd
data = pd.read_csv("lpmc.dat", sep='\t')

Optional: using pandas, you can modify and subset the dataset with the indexing operators [] and the . accessor. Here, we select only the observations where "travel_year" is 2015:

lpmc = data[data["travel_year"] == 2015]

From the pycmtensor package, import the Dataset object, which stores and manages the tensors and arrays of the data variables. Denote the column name with the choice variable in the argument choice=:
from pycmtensor.dataset import Dataset
ds = Dataset(df=lpmc, choice="travel_mode")

The Dataset object takes the following arguments:
- df: The pandas.DataFrame object
- choice: str. The name of the choice variable found in the heading of the dataframe
Optional arguments are:
- shuffle: [True|False]. Whether the dataset should be shuffled or not
Note: The range of alternatives should begin with 0; if not, an error might be raised during model training.
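For example, if the choice column were coded 1-4 rather than 0-3, you could shift it to a zero-based index with pandas before constructing the Dataset (a minimal sketch; whether this step is needed depends on how your data is coded):

lpmc["travel_mode"] = lpmc["travel_mode"] - 1  # recode 1..4 to 0..3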
Before proceeding with model training, it might be beneficial to scale your variables, especially if they have different ranges. This can help the model converge faster and might result in better performance.
For example, to scale the dur_walking variable down by a factor of 100 using pandas:
lpmc["dur_walking"] = lpmc["dur_walking"].div(100)
ds = Dataset(df=lpmc, choice="travel_mode")

Note: Scaling is optional and might not always result in better performance. It's recommended to experiment with and without scaling to see what works best for your specific use case.
Next, we'll divide the dataset into training and validation subsets. The frac= parameter determines the proportion of data allocated to the training set, with the remainder going to the validation set.
ds.split(frac=0.8)  # allocates 80% of the data to the training set; the remaining 20% is used for validation

Alternatively, you can specify the number of samples for the training set using the count parameter:

ds.split(count=3986)  # sets 3986 samples for training

Upon execution, you should see an output indicating the number of samples in the training and validation sets.
Output:
[INFO] n_train_samples:3986 n_valid_samples:997

Note: Splitting the dataset is optional. If frac= is not given as an argument, both the training and validation datasets will use the same samples.
Define the taste parameters using the Beta object from the pycmtensor.expressions module:
from pycmtensor.expressions import Beta
asc_walk = Beta("asc_walk", value=0.0, lb=None, ub=None, status=1)  # status=1: asc_walk is fixed and not estimated
asc_cycle = Beta("asc_cycle", 0.0)
asc_pt = Beta("asc_pt", 0.0)
asc_drive = Beta("asc_drive", 0.0)
b_cost = Beta("b_cost", 0.0)
b_time = Beta("b_time", 0.0)
b_purpose = Beta("b_purpose", 0.0)
b_licence = Beta("b_licence", 0.0)

The Beta(name=, value=, lb=, ub=, status=) object takes the following arguments:
- name: Name of the taste parameter (required)
- value: The initial starting value. Defaults to 0.
- lb and ub: Lower and upper bounds of the parameter. Default to None.
- status: [0|1]. Set to 1 if the parameter should not be estimated. Defaults to 0.
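For instance, a cost coefficient constrained to be non-positive could be declared by setting an upper bound (a separate, illustrative declaration, not part of the model above; the name b_cost_bounded is hypothetical):

b_cost_bounded = Beta("b_cost_bounded", value=0.0, lb=None, ub=0.0, status=0)  # constrain the estimate to <= 0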
Note: If a Beta variable is not used in the model, a warning will be shown in stdout. E.g.

[WARNING] b_purpose not in any utility functions

Next, specify the utility function for each alternative:

U_walk = asc_walk + b_time * ds["dur_walking"]
U_cycle = asc_cycle + b_time * ds["dur_cycling"]
U_pt = asc_pt + b_time * (ds["dur_pt_rail"] + ds["dur_pt_bus"] + \
ds["dur_pt_int"]) + b_cost * ds["cost_transit"]
U_drive = asc_drive + b_time * ds["dur_driving"] + b_licence * ds["driving_license"] + \
b_cost * (ds["cost_driving_fuel"] + ds["cost_driving_ccharge"])
# vectorize the utility function
U = [U_walk, U_cycle, U_pt, U_drive]

Data variables are defined as items from the Dataset object. For example, the "dur_walking" variable from the LPMC dataset is represented as ds["dur_walking"]. Additionally, you can specify composite variables or interactions using standard mathematical operators. An example of this would be the addition of "dur_pt_rail" and "dur_pt_bus", represented as ds["dur_pt_rail"] + ds["dur_pt_bus"].
Utility functions are then vectorized by placing them into a list. A utility's position in the list corresponds to the choice variable's (zero-based) index.
For advanced users, an optional method is available where utility functions can be defined as a 2-D TensorVariable object instead of a list (TODO).
To define the model, we instantiate a model object using a model from the pycmtensor.models module.
from pycmtensor.models import MNL
mymodel = MNL(ds=ds, variables=locals(), utility=U, av=None)

The MNL object takes the following arguments:
- ds: The dataset object
- variables: A list (or dictionary) of the declared parameter objects
- utility: A list of the utilities to be estimated
- av: The availability conditions, listed in the same order as utility. For an example of how to specify availability conditions, refer to TODO (see also the sketch after this list). If not specified, defaults to None.
- **kwargs: Optional keyword arguments used to modify the model configuration settings
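For illustration only: if your dataframe contained 0/1 availability columns, the availability conditions would be passed in the same order as the utilities. The column names walk_av, cycle_av, pt_av, and drive_av below are hypothetical, and the list-of-dataset-items form is an assumption based on the argument description above:

av = [ds["walk_av"], ds["cycle_av"], ds["pt_av"], ds["drive_av"]]  # hypothetical availability columns
mymodel = MNL(ds=ds, variables=locals(), utility=U, av=av)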
Tip: We use locals() as a shortcut to gather and filter the Beta objects from the Python local environment for the variables= argument.
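Equivalently, you can collect the declared Beta objects into an explicit dictionary and pass that as variables= (a sketch, relying on the dictionary option noted in the argument list above):

betas = {
    "asc_walk": asc_walk, "asc_cycle": asc_cycle,
    "asc_pt": asc_pt, "asc_drive": asc_drive,
    "b_cost": b_cost, "b_time": b_time,
    "b_purpose": b_purpose, "b_licence": b_licence,
}
mymodel = MNL(ds=ds, variables=betas, utility=U, av=None)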
Example output:
[INFO] choice: travel_mode
[INFO] inputs in MNL: [driving_license, dur_walking, dur_cycling, dur_pt_rail, dur_pt_bus, dur_pt_int, dur_driving, cost_transit, cost_driving_fuel, cost_driving_ccharge]
[INFO] Build time = 00:00:02

A basic outline of the steps involved in estimating a model:
from pycmtensor import train
from pycmtensor.optimizers import Adam
from pycmtensor.scheduler import ConstantLR
train(
model=mymodel,
ds=ds,
optimizer=Adam, # optional
batch_size=0, # optional
base_learning_rate=1.0, # optional
convergence_threshold=0.0001, # optional
max_epochs=200, # optional
lr_scheduler=ConstantLR, # optional
)

The train() function estimates the model until a specified convergence is achieved. Convergence is determined by the gradient norm between two full iterations over the entire training dataset. The best results found during training are tracked and reported at the end of the run.
The train() function takes the following arguments:

- model: The model object (MNL in the example above)
- ds: The dataset object
- **kwargs: Optional keyword arguments used to modify the model configuration settings. For details on the available options, refer to the configuration section in the user guide.
These optional keyword arguments are the so-called hyperparameters of the model: they alter the training procedure and can be set either when calling the train() function or during model specification.
Note: An epoch in max_epochs refers to a complete traversal of the training dataset. An iteration is a single model update operation, typically performed on every mini-batch (when batch_size != 0).
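For instance, with the 3986 training samples above and a hypothetical batch_size of 50, one epoch corresponds to roughly 80 iterations (how a partial final batch is handled depends on the implementation):

import math
n_train, batch_size = 3986, 50
print(math.ceil(n_train / batch_size))  # ~80 model updates per epoch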
Tip: The hyperparameters can also be set with the pycmtensor.config module before the training function is called. For example, to set the training batch_size to 50 and base_learning_rate to 0.1:

pycmtensor.config.batch_size = 50
pycmtensor.config.base_learning_rate = 0.1

Example output:
[INFO] Start (n=3986, Step=0, LL=-5525.77, Error=80.34%)
[INFO] Train (Step=0, LL=-9008.61, Error=80.34%, gnorm=2.44949e+00, 0/2000)
[INFO] Train (Step=16, LL=-3798.26, Error=39.12%, gnorm=4.46640e-01, 16/2000)
[INFO] Train (Step=54, LL=-3487.97, Error=35.21%, gnorm=6.80979e-02, 54/2000)
...
[INFO] Train (Step=189, LL=-3470.28, Error=34.70%, gnorm=8.74120e-05, 189/2000)
[INFO] Model converged (t=0.492)
[INFO] Best results obtained at Step 185: LL=-3470.28, Error=34.70%, gnorm=2.16078e-04

The results are stored in the Results class object of the MNL model. The following function calls display the statistical results of the model estimation:
print(mymodel.results.beta_statistics())
print(mymodel.results.model_statistics())
print(mymodel.results.benchmark())

beta_statistics() shows the estimated values of the model coefficients, standard errors, t-tests, and p-values, including the robust measures.
The standard errors are calculated as the square roots of the diagonal of the variance-covariance matrix (the inverse of the negative Hessian matrix of the log likelihood):

$$\mathrm{s.e.}(\hat{\beta}) = \sqrt{\operatorname{diag}\left[(-H)^{-1}\right]}$$

The robust standard errors are calculated using the 'sandwich method', where the robust variance-covariance matrix is as follows:

$$\mathrm{cov}_{rob}(\hat{\beta}) = (-H)^{-1} B (-H)^{-1}$$

where $B = \sum_n g_n g_n^\top$ is the sum of the outer products of the per-observation score vectors $g_n$.
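As a minimal numpy sketch of these two formulas (assuming you already have the Hessian H of the log likelihood and an n x k matrix G of per-observation score vectors; neither is produced by the commands shown here):

import numpy as np

def beta_standard_errors(H, G):
    varcov = np.linalg.inv(-H)       # variance-covariance matrix
    se = np.sqrt(np.diag(varcov))    # classical standard errors
    B = G.T @ G                      # sum of outer products of the scores (BHHH)
    robust = varcov @ B @ varcov     # 'sandwich' variance-covariance matrix
    rob_se = np.sqrt(np.diag(robust))
    return se, rob_se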
The rest of the results are self-explanatory.
Command:
mymodel.results.beta_statistics()

Output:
value std err t-test p-value rob. std err rob. t-test rob. p-value
asc_cycle -3.956 0.082 -48.336 0.0 0.082 -48.495 0.0
asc_drive -2.956 0.092 -32.048 0.0 0.094 -31.382 0.0
asc_pt -1.379 0.047 -29.597 0.0 0.049 -28.017 0.0
asc_walk 0.0 - - - - - -
b_cost -0.141 0.013 -10.945 0.0 0.012 -11.474 0.0
b_licence 1.464 0.081 18.156 0.0 0.086 17.09 0.0
b_purpose 0.291 0.032 9.15 0.0 0.032 8.976 0.0
b_time -4.971 0.182 -27.333 0.0 0.195 -25.517 0.0

Calling the .results.model_statistics() method generates the model's performance metrics, which include the log likelihood, training and validation accuracy, rho square, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC).
Command:
mymodel.results.model_statistics()

Output:
value
Number of training samples used 3986
Number of validation samples used 997
Number of estimated parameters in the model 8
Null. log likelihood -5525.77
Final log likelihood -3414.87
Validation Accuracy 64.69%
Training Accuracy 64.40%
Likelihood ratio test 4221.80
Rho square 0.382
Rho square bar 0.381
Akaike Information Criterion 6845.74
Bayesian Information Criterion 6896.07
Final gradient norm 9.700e-04
Maximum epochs reached No
Best result at epoch 243

Command:
mymodel.results.benchmark()

Output:
value
Platform Windows 10.0.22631
Processor Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
RAM 15.85 GB
Python version 3.11.7
PyCMTensor version 1.10.0
Seed 100
Model build time 00:00:01
Model train time 00:00:01
epochs per sec 207.84 epoch/s

PyCMTensor provides two built-in methods to compute the correlation matrix and the robust correlation matrix: results.model_correlation_matrix() and results.model_robust_correlation_matrix(), respectively.
Both the correlation matrix and the robust correlation matrix can provide valuable insights into the relationships between the estimated parameters.
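For instance, to flag pairs of parameters with high correlation (a sketch, assuming the matrix is returned as a pandas.DataFrame; the 0.8 threshold is illustrative):

import numpy as np
corr = mymodel.results.model_correlation_matrix()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle, excluding the diagonal
pairs = corr.where(mask).stack()                      # one entry per parameter pair
print(pairs[pairs.abs() > 0.8])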
Command:
print(mymodel.results.model_correlation_matrix())
print(mymodel.results.model_robust_correlation_matrix())

Output:
asc_walk asc_cycle asc_pt asc_drive b_cost b_time b_purpose b_licence
asc_walk 1.000 -0.280 0.135 -0.553 0.049 -0.644 0.433 0.307
asc_cycle -0.280 1.000 -0.352 -0.505 0.128 0.035 0.254 0.132
asc_pt 0.135 -0.352 1.000 -0.292 -0.186 0.245 0.457 0.265
asc_drive -0.553 -0.505 -0.292 1.000 -0.055 0.317 -0.773 -0.476
b_cost 0.049 0.128 -0.186 -0.055 1.000 -0.018 -0.025 -0.056
b_time -0.644 0.035 0.245 0.317 -0.018 1.000 -0.091 -0.143
b_purpose 0.433 0.254 0.457 -0.773 -0.025 -0.091 1.000 0.026
b_licence 0.307 0.132 0.265 -0.476 -0.056 -0.143 0.026 1.000

When predicting choices, PyCMTensor produces a probability vector for each entry in the (validation) dataset, followed by the predicted choice and the ground truth. To display the estimated prediction probabilities, use the following function:
Command:
prob = mymodel.predict(ds)
print(pd.DataFrame(prob))

Output:
0 1 2 3 pred_travel_mode true_travel_mode
0 1.823828e-02 0.026140 0.114631 0.840990 3 3
1 3.076091e-02 0.015439 0.128448 0.825352 3 2
2 6.249183e-01 0.019045 0.264769 0.091267 0 2
3 3.877266e-08 0.002577 0.073290 0.924132 3 3
4 5.058005e-05 0.025173 0.827738 0.147039 2 3
.. ... ... ... ... ... ...
992 3.490210e-01 0.013751 0.159579 0.477649 3 0
993 7.535218e-03 0.028429 0.107126 0.856910 3 3
994 2.813242e-01 0.014489 0.175488 0.528699 3 3
995 4.005819e-05 0.005061 0.011871 0.983029 3 3
996 1.740788e-02 0.030565 0.183020 0.769007 3 3
[997 rows x 6 columns]

Disaggregated elasticities are generated from the model by specifying the dataset and the reference choice as wrt_choice. For instance, to compute the elasticities of the driving mode (wrt_choice=3) over all the input variables:
Command:
elas = mymodel.elasticities(ds=ds, wrt_choice=3)
print(pd.DataFrame(elas).round(3))

Output:
cost_driving_ccharge cost_driving_fuel cost_transit driving_license dur_cycling \
0 -0.0 -0.033 0.063 0.228 0.022
1 -0.0 -0.074 0.159 0.716 0.022
2 -0.0 -0.110 0.189 0.000 0.029
3 -0.0 -0.032 0.000 0.793 0.015
4 -0.0 -0.089 0.114 0.000 0.031
... ... ... ... ... ...
3981 -0.0 -0.023 0.000 0.000 0.008
3982 -0.0 -0.131 0.000 0.798 0.038
3983 -0.0 -0.018 0.050 0.565 0.016
3984 -0.0 -0.012 0.040 0.162 0.019
3985 -0.0 -0.042 0.114 0.000 0.026
dur_driving dur_pt_bus dur_pt_int dur_pt_rail dur_walking purpose
0 -0.193 0.287 0.099 0.000 0.003 0.045
1 -1.033 0.000 0.000 0.312 0.063 0.712
2 -1.159 0.000 0.000 0.459 0.062 0.710
3 -0.482 0.000 0.000 0.107 0.297 0.473
4 -0.561 0.062 0.249 0.560 0.000 0.224
... ... ... ... ... ... ...
3981 -0.199 0.081 0.000 0.000 0.556 1.073
3982 -1.604 0.324 0.221 0.663 0.009 0.476
3983 -0.263 0.168 0.000 0.000 0.376 0.337
3984 -0.117 0.039 0.070 0.047 0.033 0.161
3985 -0.483 0.227 0.000 0.000 0.724 0.247
[3986 rows x 11 columns]

The aggregated elasticities can then be obtained by taking the mean over the rows:
Command:
import numpy as np
np.mean(pd.DataFrame(elas), axis=0)

Aggregated elasticities (w.r.t. the driving mode):
Output:
cost_driving_ccharge -0.158813
cost_driving_fuel -0.063062
cost_transit 0.088155
driving_license 0.407574
dur_cycling 0.042154
dur_driving -0.794599
dur_pt_bus 0.225808
dur_pt_int 0.079546
dur_pt_rail 0.245628
dur_walking 0.333738
purpose 0.441163