This project automates data ingestion, processing, and visualization using Apache Airflow, AWS S3, and Streamlit. It extracts Reddit posts, performs sentiment analysis, and visualizes insights on a real-time dashboard.
👉 Technologies Used:
- ETL & Orchestration: Apache Airflow
- Data Processing: Pandas, NumPy, PyArrow, TextBlob
- Storage: AWS S3 (Parquet format)
- Dashboard: Streamlit & Plotly
- Containerization: Docker & Kubernetes
- CI/CD: GitHub Actions & Docker Hub
- Cloud Deployment: AWS EC2
🏠 Apache Airflow ➔ 📦 AWS S3 (Data Storage) ➔ 📈 Streamlit Dashboard
- Airflow DAGs fetch Reddit data, analyze sentiment, and store results in AWS S3.
- Streamlit Dashboard dynamically pulls data from S3 and visualizes trends.
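For orientation, here is a minimal sketch of what the daily DAG could look like. The DAG id, the subreddit, and the selected fields are illustrative rather than the repo's actual code, and it assumes Airflow 2.x with `praw`, `textblob`, `pandas`, `pyarrow`, and `s3fs` installed:

```python
import os
from datetime import datetime

import pandas as pd
import praw
from airflow import DAG
from airflow.operators.python import PythonOperator
from textblob import TextBlob


def fetch_and_score(**_):
    # Authenticate against the Reddit API with the credentials from .env.
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ["REDDIT_USER_AGENT"],
    )
    rows = []
    for post in reddit.subreddit("technology").hot(limit=100):
        # TextBlob polarity ranges from -1 (negative) to +1 (positive).
        rows.append({
            "title": post.title,
            "score": post.score,
            "created_utc": post.created_utc,
            "polarity": TextBlob(post.title).sentiment.polarity,
        })
    # Write one Parquet file per day straight to S3 (needs pyarrow + s3fs).
    pd.DataFrame(rows).to_parquet(
        f"s3://{os.environ['S3_BUCKET_NAME']}/reddit/{datetime.utcnow():%Y-%m-%d}.parquet",
        index=False,
        storage_options={
            "key": os.environ["AWS_ACCESS_KEY"],
            "secret": os.environ["AWS_SECRET_KEY"],
        },
    )


with DAG(
    dag_id="reddit_sentiment_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_and_score", python_callable=fetch_and_score)
```

Writing Parquet straight to `s3://` keeps the two halves decoupled: the dashboard only needs read access to the bucket, never to Airflow itself.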
```bash
git clone https://github.com/yourusername/Reddit-Sentiment-Analysis-Data-Pipeline.git
cd Reddit-Sentiment-Analysis-Data-Pipeline
```

Create a `.env` file in the project root and add:

```ini
AWS_ACCESS_KEY=your_aws_access_key
AWS_SECRET_KEY=your_aws_secret_key
S3_BUCKET_NAME=your_s3_bucket_name
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_client_secret
REDDIT_USER_AGENT=your_reddit_user_agent
```

Then start the stack:

```bash
docker-compose up --build -d
```

- Airflow UI → http://localhost:8080
- Streamlit Dashboard → http://localhost:8501
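On the dashboard side, a hedged sketch of how Streamlit could pull those Parquet files back out of S3. The `polarity` column name carries over from the DAG sketch above and is an assumption, not the repo's guaranteed schema:

```python
import os

import pandas as pd
import plotly.express as px
import streamlit as st


@st.cache_data(ttl=3600)  # re-read from S3 at most once per hour
def load_data() -> pd.DataFrame:
    # Reads every daily Parquet file the DAG wrote (needs pyarrow + s3fs).
    return pd.read_parquet(
        f"s3://{os.environ['S3_BUCKET_NAME']}/reddit/",
        storage_options={
            "key": os.environ["AWS_ACCESS_KEY"],
            "secret": os.environ["AWS_SECRET_KEY"],
        },
    )


st.title("Reddit Sentiment Dashboard")
df = load_data()
st.plotly_chart(px.histogram(df, x="polarity", title="Sentiment distribution"))
```

Caching the S3 read means the dashboard stays responsive even though the underlying data only changes once per DAG run.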
Before pushing code, add these GitHub Secrets under Settings → Secrets and variables → Actions:
| Secret Name | Description |
|---|---|
| `AWS_ACCESS_KEY` | AWS S3 Access Key |
| `AWS_SECRET_KEY` | AWS S3 Secret Key |
| `S3_BUCKET_NAME` | S3 Bucket Name |
| `REDDIT_CLIENT_ID` | Reddit API Client ID |
| `REDDIT_CLIENT_SECRET` | Reddit API Secret |
| `REDDIT_USER_AGENT` | Reddit API User Agent |
| `DOCKER_HUB_USERNAME` | Docker Hub Username |
| `DOCKER_HUB_ACCESS_TOKEN` | Docker Hub Token |
| `SERVER_IP` | AWS EC2 Public IP |
| `SERVER_USER` | SSH Username (`ubuntu` for AWS) |
| `SSH_PRIVATE_KEY` | Your `.pem` SSH Key |
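Since the containers read the same variable names at runtime, one way to catch a misconfigured secret early is to fail fast on startup. A small sketch, assuming `python-dotenv` (not necessarily what the repo uses):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Locally, docker-compose injects these from .env; in CI, GitHub Actions
# exposes the repository secrets under the same names.
load_dotenv()

REQUIRED = [
    "AWS_ACCESS_KEY", "AWS_SECRET_KEY", "S3_BUCKET_NAME",
    "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT",
]
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {missing}")
```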
- Push to the `main` branch → GitHub Actions builds & pushes Docker images.
- Deploys to AWS EC2 via SSH.
- Pulls the latest images & restarts containers automatically.
```bash
ssh -i your-key.pem ubuntu@your-server-ip
docker pull your-dockerhub-username/airflow:latest
docker pull your-dockerhub-username/streamlit-dashboard:latest
docker-compose down
docker-compose up -d
```

- Streamlit Dashboard - EC2 Deployed Dashboard
- Airflow UI - EC2 Deployed Airflow UI

Note: I had to stop the EC2 instance because the project required a t2.medium, which isn't covered by the free tier, and I couldn't afford to keep it running. The screenshots below show the application working.
- Airflow UI (DAGs running daily)
- Calendar view
- Streamlit Dashboard
- Streamlit Chart 1
- Streamlit Chart 2
- Streamlit Chart 3
✅ Automated Data Pipeline – Extract, transform, store & visualize Reddit data.
✅ Parallel Processing – Uses Airflow CeleryExecutor with Redis for scalability.
✅ Secure Deployment – Manages secrets via GitHub Secrets & AWS Secrets Manager.
✅ CI/CD Pipeline – Automates Docker builds & deployment via GitHub Actions.
✅ Scalable Infrastructure – Runs as Docker containers, deployable to AWS, GCP, or Kubernetes.
👨‍💻 Author: Madhur Dixit
🤝 Contributions: PRs are welcome! Open an issue to discuss improvements.
🌟 Star this repo if you found it useful!
This project is licensed under the MIT License. Feel free to modify and use it.