This project is a Python-based web scraping and data processing pipeline for collecting and cleaning real estate listing data from Zillow-style property sites. It handles session management, parses structured listing data, and produces analysis-ready CSV files for downstream modeling or analytics.
It is released here as a standalone, reusable data collection tool.
- Scrapes housing listings from Zillow-style endpoints
- Handles session setup and request flow
- Parses and normalizes listing data
- Outputs clean CSV files for analysis
- Includes backup + recovery JSON snapshots
- Modular architecture for reuse in other projects
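As a rough illustration of what "parses and normalizes listing data" can mean in practice, here is a minimal sketch. The field names (`address`, `price`, `beds`) and both helper functions are hypothetical and are not taken from `src/parser.py`; the real parser may use different fields and rules.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[int]:
    """Extract whole dollars from a display price such as '$350,000'.

    Returns None when the text contains no digits (e.g. 'Contact agent').
    """
    match = re.search(r"\d[\d,]*", text or "")
    if not match:
        return None
    return int(match.group(0).replace(",", ""))


def normalize_listing(raw: dict) -> dict:
    """Map a raw listing dict onto a flat, consistently typed record.

    The keys used here are assumptions made for this example only.
    """
    beds = raw.get("beds")
    return {
        # Collapse repeated whitespace and trim the ends.
        "address": " ".join(str(raw.get("address", "")).split()),
        "price": parse_price(str(raw.get("price", ""))),
        "beds": int(beds) if beds not in (None, "") else None,
    }


example = normalize_listing(
    {"address": "  123  Main St ", "price": "$350,000", "beds": "3"}
)
print(example)
```

Keeping normalization in small pure functions like these makes it easy to check each rule in isolation, for example from a notebook.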
.
├── data/
│   ├── raw/                  # Raw scraped CSV output
│   ├── processed/            # Cleaned / analysis-ready CSVs
│   └── tmp/                  # Backup JSON + intermediate files
├── notebooks/
│   └── sanity_check.ipynb    # Quick inspection & validation
├── src/
│   ├── scraper.py            # Core scraping logic
│   ├── session.py            # Session & request handling
│   ├── parser.py             # Parsing + field normalization
│   ├── build.py              # Data pipeline orchestration
│   └── main.py               # Entry point / CLI-style runner
├── .env                      # Local environment variables
├── .gitignore
└── README.md
- Create a virtual environment
    python -m venv .venv
    source .venv/bin/activate
- Install dependencies
    pip install -r requirements.txt
Run the main pipeline:
python -m src.main
This will:
• Initialize a session
• Scrape housing data
• Parse data
• Clean data
• Save results to data/raw/ and data/processed/
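The steps above can be sketched end to end as follows. This is an illustration only: every function name, the payload shape, and the output file names are assumptions made for the example, and the scraping step returns canned data instead of making requests. It writes into a temporary directory so it does not touch the real `data/` folder.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["address", "price"]  # hypothetical column set


def init_session():
    """Placeholder for session setup (headers, cookies); no network access."""
    return {"User-Agent": "example-agent"}


def scrape(session):
    """Stand-in for scraping: returns canned payloads, not HTTP responses."""
    return [
        {"info": {"address": "123 Main St", "price": "350000"}},
        {"info": {"address": "123 Main St", "price": "350000"}},  # duplicate
        {"info": {"address": "9 Oak Ave", "price": ""}},          # no price
    ]


def parse(payloads):
    """Flatten each payload into a row with the columns in FIELDS."""
    return [{name: p["info"].get(name, "") for name in FIELDS} for p in payloads]


def clean(rows):
    """Drop rows without a price and exact duplicates (first one is kept)."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[name] for name in FIELDS)
        if row["price"] and key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def write_csv(path, rows):
    """Write rows to path as CSV, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def run_pipeline(data_dir):
    """Run all steps and return (raw row count, cleaned row count)."""
    session = init_session()
    raw_rows = parse(scrape(session))
    write_csv(data_dir / "raw" / "listings.csv", raw_rows)
    clean_rows = clean(raw_rows)
    write_csv(data_dir / "processed" / "listings_clean.csv", clean_rows)
    return len(raw_rows), len(clean_rows)


with tempfile.TemporaryDirectory() as tmp:
    counts = run_pipeline(Path(tmp))
print(counts)
```

Writing the raw rows before cleaning means the cleaning rules can be changed and re-run later without scraping again.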
data/raw/irving_tx_housing.csv → raw scraped data
data/processed/irving_tx_housing_clean.csv → cleaned dataset
data/tmp/*.json → backup snapshots for recovery / debugging
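One way backup snapshots like these can support recovery is sketched below. The helper names and the zero-padded `page_0001`-style file names are hypothetical; the project's actual snapshot format and naming may differ. The example uses a temporary directory in place of `data/tmp/`.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(rows, tmp_dir, name):
    """Write rows to <tmp_dir>/<name>.json and return the file path."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    path = tmp_dir / f"{name}.json"
    path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return path


def load_last_snapshot(tmp_dir):
    """Return rows from the snapshot whose file name sorts last, or [] if none.

    Sorting by name only matches creation order if names are zero-padded
    or timestamped, which this example assumes.
    """
    snapshots = sorted(tmp_dir.glob("*.json"))
    if not snapshots:
        return []
    return json.loads(snapshots[-1].read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as d:
    tmp_dir = Path(d) / "tmp"
    save_snapshot([{"address": "123 Main St"}], tmp_dir, "page_0001")
    save_snapshot(
        [{"address": "123 Main St"}, {"address": "9 Oak Ave"}],
        tmp_dir,
        "page_0002",
    )
    recovered = load_last_snapshot(tmp_dir)
print(len(recovered))
```

If a run is interrupted, a loader of this kind lets the pipeline resume from the last saved state or lets you inspect exactly what had been collected.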
This project is structured to be:
• Modular
• Reusable
• Analysis-friendly
• Easy to extend for other cities or data sources
It separates concerns between:
- Session handling
- Scraping
- Parsing
- Pipeline orchestration
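To show why this separation is useful, the sketch below passes the fetching and parsing steps into an orchestration function as plain callables, so either can be swapped without touching the other. All names and the pipe-delimited response format are invented for the example and do not reflect the actual interfaces in `src/`.

```python
from typing import Callable, Iterable, List


def build_dataset(
    fetch: Callable[[str], str],
    parse: Callable[[str], Iterable[dict]],
    urls: Iterable[str],
) -> List[dict]:
    """Orchestration only: fetch each URL, parse the body, collect the rows."""
    rows: List[dict] = []
    for url in urls:
        rows.extend(parse(fetch(url)))
    return rows


def fake_fetch(url: str) -> str:
    """Test double for the session/scraping layer; returns a canned body."""
    return "123 Main St|350000\n9 Oak Ave|410000"


def pipe_parser(body: str) -> Iterable[dict]:
    """Parser for the made-up 'address|price' format used by fake_fetch."""
    for line in body.splitlines():
        address, price = line.split("|")
        yield {"address": address, "price": int(price)}


rows = build_dataset(fake_fetch, pipe_parser, ["https://example.com/page/1"])
print(rows)
```

Because the orchestration function never imports a concrete fetcher or parser, supporting another data source would mean writing a new pair of callables and leaving the rest unchanged.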
This project is for educational and research purposes only. Users are responsible for complying with the terms of service of any website they access and for respecting robots.txt, rate limits, and local laws.
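For readers wondering what respecting robots.txt and rate limits might look like in code, here is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt content, the user agent string, and the `polite_urls` helper are made up for the example, and nothing here is part of the project's own code. The rules are parsed from a string so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real run would load the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "example-research-bot"  # hypothetical user agent

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

allowed = robots.can_fetch(AGENT, "https://example.com/listings/1")
blocked = robots.can_fetch(AGENT, "https://example.com/private/1")
# Fall back to a one-second pause if the file declares no crawl delay.
delay = robots.crawl_delay(AGENT) or 1.0


def polite_urls(urls, delay_seconds, sleep=time.sleep):
    """Yield only URLs robots.txt allows, pausing between consecutive ones."""
    first = True
    for url in urls:
        if not robots.can_fetch(AGENT, url):
            continue
        if not first:
            sleep(delay_seconds)
        first = False
        yield url


print(allowed, blocked, delay)
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would load the real file instead of the inline string. The `sleep` parameter exists so the pause can be replaced with a no-op in tests.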
CLI arguments for city, state, and filters
Support for additional listing platforms
Database export (Postgres / DuckDB)
Geospatial enrichment (school zones, crime, transit, etc.)
Built by Bilal Haroon
Data Science · ML · Systems · Open Source