- Add a `data` folder to the root and put `gender.csv` into it.
- Install Poetry.
- Run `poetry install`.
- Run `poetry run python src/main.py`, or click run from your IDE.
- `src/main.py` is the entry point of the program; it imports all other modules and defines the `Pipeline` class.
- `/data` holds all the data files.
- `/notebooks` is for quick and dirty exploration and testing.
- `Pipeline.run` (called in `main.py`) takes one of the following as its `start_from` parameter:
  - 'raw'
  - 'preprocessed'
  - 'classifier_tokens'
- Depending on which value is passed, the pipeline runs from the corresponding step, loading the data from `/data`.
- The pipeline also saves the data at each step to `/data`, so the data is always up to date.
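
The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `Pipeline` class: the step functions (`preprocess`, `tokenize`, `count_tokens`) are hypothetical placeholders, and an in-memory dict stands in for the files under `/data`. Only the three `start_from` values come from this README.

```python
from typing import Any, Dict, List

# Hypothetical placeholder steps; the real project defines its own.
def preprocess(posts: List[str]) -> List[str]:
    return [p.strip().lower() for p in posts]

def tokenize(posts: List[str]) -> List[List[str]]:
    return [p.split() for p in posts]

def count_tokens(tokens: List[List[str]]) -> int:
    return sum(len(t) for t in tokens)

# (input stage, step function, output stage or None if nothing is saved)
STEPS = [
    ("raw", preprocess, "preprocessed"),
    ("preprocessed", tokenize, "classifier_tokens"),
    ("classifier_tokens", count_tokens, None),
]

class Pipeline:
    def __init__(self, store: Dict[str, Any]) -> None:
        # Assumption: a dict replaces the files saved under /data.
        self.store = store
        self.executed: List[str] = []

    def run(self, start_from: str = "raw") -> Any:
        names = [name for name, _, _ in STEPS]
        if start_from not in names:
            raise ValueError(f"start_from must be one of {names}, got {start_from!r}")
        # Load the saved data for the requested stage, then run every later step.
        data = self.store[start_from]
        for name, step, output in STEPS[names.index(start_from):]:
            print(f"Running {step.__name__} on '{name}' data")
            data = step(data)
            self.executed.append(step.__name__)
            if output is not None:
                self.store[output] = data  # save the intermediate result
        return data
```

With this layout, `Pipeline(store).run("classifier_tokens")` skips the first two steps and reuses whatever was saved by an earlier full run.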
- Put all global config constants into `src/config/config.py` for easy readability and modification.
- Better to add too many print statements than too few; that way it's easier to keep track of what's happening when running the code.
- Run `poetry add [package]` when adding a new package to the project.
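
As a rough idea of what a constants-only config module might look like, here is a small sketch. Only the name `raw_data_path` appears elsewhere in this README. `DATA_DIR` and the default location `data/raw/gender.csv` are assumptions for illustration and may differ from the real `config.py`.

```python
from pathlib import Path

# Assumption: paths are relative to the project root.
DATA_DIR = Path("data")

# Location of the raw input csv; change this to point at your own file.
raw_data_path = DATA_DIR / "raw" / "gender.csv"

# Print liberally so it is obvious which settings a run is using.
print(f"Using raw data from: {raw_data_path}")
```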
- Ensure that your data has the same format as the data used throughout the code:
  - A .csv file with the columns auhtor_ID (str), post (str), female (int64)
  - If your data has a different format, convert it to the format specified above
- Link the data with the project:
  - Option 1: Add the data to the `/data/raw` folder under the name "gender"
  - Option 2: Open `config.py` and set the `raw_data_path` variable to the location and file name of your csv
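
Before linking a new file, it can help to check that it matches the expected format. The sketch below is one possible dependency-free check using Python's standard `csv` module. The function name `validate_gender_csv` is hypothetical and not part of the project. The column names are taken from the list above, including the spelling `auhtor_ID`.

```python
import csv
import io
from typing import List, TextIO

# Column names as listed in this README (spelling kept as-is).
EXPECTED_COLUMNS = ["auhtor_ID", "post", "female"]

def validate_gender_csv(handle: TextIO) -> List[str]:
    """Return a list of problems; an empty list means the format looks right."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for row_no, row in enumerate(reader, start=1):
        # 'female' must be parseable as an integer to load as int64.
        try:
            int(row["female"])
        except (TypeError, ValueError):
            problems.append(f"row {row_no}: 'female' is not an integer: {row['female']!r}")
    return problems

# Usage with an in-memory sample; for a real file, pass open(path, newline="").
sample = io.StringIO(
    "auhtor_ID,post,female\n"
    "t2_abc,hello there,1\n"
    "t2_def,another post,maybe\n"
)
print(validate_gender_csv(sample))
```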
- Add more complex models such as RNNs and LSTMs, which can capture the sequential nature of text data. Testing these might give different results than our approach.
- Build more robust embeddings with more capable LLMs that can capture the semantics of the text better (e.g. ChatGPT, Claude).