GitHub - philippmerz/data-pollution-mitigation

Data Pollution Mitigation

Setup

Add a data folder to the project root and put gender.csv into it.
Install Poetry.
Run poetry install to install the dependencies.
Run poetry run python src/main.py, or click Run in your IDE.

Code Structure

src/main.py is the entry point of the program.
- It imports all other modules and defines the Pipeline class.
/data holds all the data files.
/notebooks holds quick and dirty exploration and testing.

Usage

Pipeline.run (called in main.py) takes one of the following values as its start_from parameter:
- 'raw'
- 'preprocessed'
- 'classifier_tokens'
The pipeline starts at the step that corresponds to the value passed and loads the data for that step from /data.
It also saves the data to /data at each step, so the stored data is always up to date.
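
The resume behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual Pipeline class: the three stage names come from the list above, while the step functions, the dictionary standing in for /data, and the constructor arguments are assumptions made for the example.

```python
from typing import Any, Callable, Dict

# Stage names from the README, in pipeline order.
STAGES = ["raw", "preprocessed", "classifier_tokens"]


def preprocess(posts):
    """Hypothetical step: normalise whitespace and case."""
    return [p.strip().lower() for p in posts]


def tokenize(posts):
    """Hypothetical step: split each post into whitespace tokens."""
    return [p.split() for p in posts]


class Pipeline:
    """Sketch only: each step turns one stage's data into the next stage's data."""

    def __init__(self, steps: Dict[str, Callable[[Any], Any]], store: Dict[str, Any]):
        self.steps = steps  # maps a stage name to the function that produces it
        self.store = store  # stands in for the files kept under /data

    def run(self, start_from: str = "raw"):
        if start_from not in STAGES:
            raise ValueError(f"start_from must be one of {STAGES}, got {start_from!r}")
        # Load the data saved for the requested stage.
        data = self.store[start_from]
        print(f"loaded stage '{start_from}'")
        # Run every later stage and save its output as we go.
        for name in STAGES[STAGES.index(start_from) + 1:]:
            data = self.steps[name](data)
            self.store[name] = data
            print(f"saved stage '{name}'")
        return data


steps = {"preprocessed": preprocess, "classifier_tokens": tokenize}
```

With this sketch, Pipeline(steps, store).run('preprocessed') skips the preprocessing step and only recomputes the tokens, which is the point of saving every intermediate result.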

Guidelines

Put all global config constants into src/config/config.py so they are easy to read and modify.
Prefer too many print statements over too few; they make it easier to keep track of what is happening while the code runs.
Run poetry add [package] when adding a new package to the project.

Running it on Different Data

Make sure your data has the same format as the data used throughout the code:
- A .csv file with the columns auhtor_ID (str), post (str), female (int64)
If your data has a different format, convert it to the format specified above.
Link the data to the project in one of two ways:
- Option 1: Add the file to the /data/raw folder and name it "gender".
- Option 2: Open config.py and set the 'raw_data_path' variable to the location and file name of your csv.
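
Before linking a new file, it can help to check that it matches the expected format. The following is a rough sketch using only the standard library; the function name validate_rows is hypothetical and not part of the project, and the column names are copied exactly as spelled above (including auhtor_ID).

```python
import csv
import io

# Column names exactly as listed in the README.
EXPECTED_COLUMNS = ["auhtor_ID", "post", "female"]


def validate_rows(handle) -> int:
    """Check the header and the integer 'female' column; return the row count."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    count = 0
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            int(row["female"])
        except (TypeError, ValueError):
            raise ValueError(f"line {line_no}: 'female' is not an integer") from None
        count += 1
    return count


# Small in-memory sample; for a real file, pass
# open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO("auhtor_ID,post,female\nu1,hello there,1\nu2,good morning,0\n")
print(validate_rows(sample), "rows look valid")
```

The check only confirms that the three columns exist and that female parses as an integer; it does not verify anything else the code may assume about the data.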

Suggestions for Extending

Add more complex models such as RNNs and LSTMs, which can capture the sequential nature of text data. Testing these might give different results from our approach.
Build more robust embeddings with more complex LLMs that can capture the semantics of the text better (e.g. ChatGPT, Claude).

