Set up a virtual environment:
$ python -m venv .venv
$ . .venv/bin/activate
Install requirements (Python 3.8):
$ pip install -r requirements.txt
Install requirements (Python 3.6):
$ pip install -r requirements-36.txt
$ cd code/
$ python model_main.py [-d <data-dir>] [-m <mode>] [--te-batch-size=<te_batch_size>] [--pooler-batch-size=<pooler_batch_size>] [--checkpoint-dir=<checkpoint_dir>] [--test-dirs=<test_dirs>] [--cv-dir=<cv_dir>] [-c <config_overwrites>] [-g <grid_params>] [--grid-dir=<grid_dir>]
- data-dir specifies the base directory of the misinformation project, default /data/misinformation-domains/
- mode differentiates between "train", "val", "test", "grid", and "CV". If "val" or "test", test_dirs must be specified. If "grid", grid_params must be specified. If "CV", cv_dir must be specified. Default "train"
- te_batch_size specifies the number of documents that are encoded in parallel, default all documents
- pooler_batch_size specifies the number of sentences that are encoded in parallel, i.e. whose CLS token encodings are fed into an additional (pooler) layer, default all sentences
- checkpoint_dir enables the model to continue from a checkpoint stored in this directory
- test_dirs is a comma-separated list of directory names, which are subdirectories of <data_dir>/train_checkpoints
- cv_dir specifies the name of the subdirectory where the model states of each fold are stored
- config_overwrites is a string of comma-separated key-value pairs that specify config settings to overwrite, e.g. "key1=value1,key2=value2". To overwrite a nested value, e.g. "lr" in {"optimizer": {"lr": 0.001}}, separate the keys with ".", e.g. "optimizer.lr=0.099,key2=value2" (see the sketch after this list)
- grid_params specifies the path to a JSON file with key-value combinations for a grid search. It is only considered if mode is "grid". The JSON has to be flat, the keys must be present in the config as well (or can be nested by separating with ".", see above), and the values must be lists of valid values for the respective keys
- grid_dir specifies the name of the subdirectory where the model states of each training run are stored
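For illustration, here is a minimal sketch of how such dotted-key overrides can be applied to a nested config dict. It only demonstrates the override syntax; the function name and details are hypothetical and not taken from model_main.py.

```python
# Sketch: apply "key1=value1,optimizer.lr=0.099"-style overrides to a nested config.
# Illustration of the override syntax only; names here are hypothetical.
import json

def apply_overwrites(config: dict, config_overwrites: str) -> dict:
    for pair in config_overwrites.split(","):
        key, value = pair.split("=", 1)
        *parents, leaf = key.split(".")      # "optimizer.lr" -> ["optimizer"], "lr"
        node = config
        for parent in parents:
            node = node[parent]              # descend into the nested dict
        try:
            node[leaf] = json.loads(value)   # parse numbers/booleans/null if possible
        except json.JSONDecodeError:
            node[leaf] = value               # otherwise keep the raw string
    return config

config = {"optimizer": {"name": "SGD", "lr": 0.001, "momentum": 0.9}}
print(apply_overwrites(config, "optimizer.lr=0.099"))
# {'optimizer': {'name': 'SGD', 'lr': 0.099, 'momentum': 0.9}}
```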
- seed: The seed for initializing model weights, splitting data, and other non-deterministic operations
- label: The label to predict. One of "accuracy", "transparency", or "type"
- n_epochs: Number of epochs (total iterations over the whole training data)
- batch_size: Batch size (number of news websites) after which model weights are adjusted
- loss_fn: Class name of the loss function, e.g. "CrossEntropyLoss"
- optimizer: Optimizer properties
  - name: Name of the optimizer, one of "SGD" or "Adam"
  - lr: Learning rate of the optimizer
  - momentum: Momentum of the optimizer
- psm: Properties of the post sequence model
  - type: The type of the recurrent post sequence model (how the sequence of posts should be modeled). One of "RNN", "LSTM", or "GRU"
  - hidden_size: Size of the hidden state in the recurrent model
  - output: Hidden state aggregation function, one of "last_state", "mean", "max", "mean+max"
- te: Properties of the text encoder
  - type: Function to aggregate the CLS token encodings, one of "mean", "LSTM"
  - embedding_size: Size of the output embedding. If null, take the output size of the pooler
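For orientation, a config with the settings above might look like the following Python dict, mirroring the JSON structure. All values are illustrative examples, not the repository's defaults.

```python
# Hypothetical example of the config structure described above; values are
# illustrative only and not taken from the repository's actual config file.
example_config = {
    "seed": 42,
    "label": "accuracy",
    "n_epochs": 10,
    "batch_size": 4,
    "loss_fn": "CrossEntropyLoss",
    "optimizer": {"name": "SGD", "lr": 0.001, "momentum": 0.9},
    "psm": {"type": "GRU", "hidden_size": 128, "output": "last_state"},
    "te": {"type": "mean", "embedding_size": None},  # None corresponds to JSON null
}
```

A grid_params JSON uses the same (optionally dotted) keys, but maps each one to a list of candidate values, e.g. {"optimizer.lr": [0.001, 0.01, 0.1]}.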
API access requires a bearer token that is stored in a separate file, not part of this repository.
Search query configs are kept as separate .json files in the code/configs/ folder. Please try to give them descriptive names.
Utility functions that will potentially be used in more than one application are collected in the twitter_functions.py module.
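As an illustration of how these pieces fit together, the sketch below reads a bearer token from a local file and runs one query against the Twitter API v2 recent search endpoint. The file names, the query config layout, and the choice of endpoint are assumptions made for this example and are not prescribed by the repository; see twitter_functions.py for the project's own helpers.

```python
# Sketch only: token filename, config filename, and query keys are assumptions.
import json
import requests

with open("bearer_token.txt") as f:                   # token file is NOT part of the repo
    bearer_token = f.read().strip()

with open("code/configs/example_query.json") as f:    # a descriptively named query config
    query_config = json.load(f)

response = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {bearer_token}"},
    params={"query": query_config["query"]},
)
response.raise_for_status()
print(response.json())
```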
resources/domains/raw
Contains a number of domain lists that have been classified into various disinformation-related categories and were initially compiled for the Galotti et al. (2020) paper. We collected almost all of these lists (with the notable exception of the Décodex list), some of them with updates, and compiled them into a new list, data/domains/clean/domain_list_clean.csv. The cleaning is done in the notebook clean_misinformation_domains.ipynb.
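To take a quick look at the cleaned list, something like the following works; the column names depend on the notebook's output and are therefore not hard-coded here.

```python
# Minimal sketch for inspecting the cleaned domain list; column names are
# whatever clean_misinformation_domains.ipynb produced.
import pandas as pd

domains = pd.read_csv("data/domains/clean/domain_list_clean.csv")
print(domains.columns.tolist())   # inspect which columns the notebook produced
print(domains.head())
```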
resources/search_terms/
Contains a list of search terms related to the COVID-19 pandemic and misinformation.