A modular Python pipeline for scraping, cleaning, and exporting real estate listing data for analysis and modeling.

🏠 Zillow Housing Data Scraper & Cleaner

This project is a Python-based web scraping and data processing pipeline for collecting and cleaning real estate listing data from Zillow-style property sites. It handles session management, parses structured listing data, and produces analysis-ready CSV files for downstream modeling or analytics.

It is released here as a standalone, reusable data collection tool.

🚀 Features

  • Scrapes housing listings from Zillow-style endpoints

  • Handles session setup and request flow

  • Parses and normalizes listing data

  • Outputs clean CSV files for analysis

  • Includes backup + recovery JSON snapshots

  • Modular architecture for reuse in other projects

πŸ“ Project Structure

.
├── data/
│   ├── raw/               # Raw scraped CSV output
│   ├── processed/         # Cleaned / analysis-ready CSVs
│   └── tmp/               # Backup JSON + intermediate files
├── notebooks/
│   └── sanity_check.ipynb # Quick inspection & validation
├── src/
│   ├── scraper.py         # Core scraping logic
│   ├── session.py         # Session & request handling
│   ├── parser.py          # Parsing + field normalization
│   ├── build.py           # Data pipeline orchestration
│   └── main.py            # Entry point / CLI-style runner
├── .env                   # Local environment variables
├── .gitignore
└── README.md

βš™οΈ Setup

  1. Create a virtual environment

    python -m venv .venv

    source .venv/bin/activate

  2. Install dependencies

    pip install -r requirements.txt

▶️ Usage

Run the main pipeline:

python -m src.main

This will:

  • Initialize a session
  • Scrape housing data
  • Parse data
  • Clean data
  • Save results to data/raw/ and data/processed/
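The cleaning step typically means coercing text fields to numbers and dropping unusable or duplicate rows. A hedged illustration (field names like `price` and `address` are hypothetical, not necessarily the pipeline's actual schema):

```python
def clean_rows(rows: list[dict]) -> list[dict]:
    """Keep rows with a parseable price, coerce it to int, dedupe by address."""
    seen: set = set()
    cleaned = []
    for row in rows:
        price = str(row.get("price", "")).replace("$", "").replace(",", "")
        if not price.isdigit():
            continue  # drop listings without a usable price
        addr = row.get("address")
        if addr in seen:
            continue  # drop duplicate listings for the same address
        seen.add(addr)
        cleaned.append({**row, "price": int(price)})
    return cleaned
```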

📊 Output

data/raw/irving_tx_housing.csv → raw scraped data

data/processed/irving_tx_housing_clean.csv → cleaned dataset

data/tmp/*.json → backup snapshots for recovery / debugging
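Downstream analysis can start straight from the processed CSV. For example (column names here are illustrative; check the actual header of the cleaned file):

```python
import csv
import io

# In practice you would open("data/processed/irving_tx_housing_clean.csv");
# a small in-memory sample is used here so the snippet is self-contained.
sample = io.StringIO(
    "address,price,sqft\n"
    "123 Main St,350000,1800\n"
    "456 Oak Ave,425000,2400\n"
)
rows = list(csv.DictReader(sample))
avg_price = sum(int(r["price"]) for r in rows) / len(rows)
```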

🧠 Design Philosophy

This project is structured to be:

  • Modular
  • Reusable
  • Analysis-friendly
  • Easy to extend for other cities or data sources

It separates concerns between:

  • Session handling

  • Scraping

  • Parsing

  • Pipeline orchestration

⚠️ Legal & Ethical Note

This project is for educational and research purposes only. Users are responsible for complying with the terms of service of any website they access and for respecting robots.txt, rate limits, and local laws.

🔧 Future Ideas

  • CLI arguments for city, state, and filters
  • Support for additional listing platforms
  • Database export (Postgres / DuckDB)
  • Geospatial enrichment (school zones, crime, transit, etc.)
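For the database-export idea, the shape would be something like the following sketch. It uses stdlib `sqlite3` as a dependency-free stand-in for Postgres/DuckDB, and the `listings` table and its columns are hypothetical:

```python
import sqlite3


def export_listings(rows: list[dict], db_path: str = ":memory:") -> int:
    """Insert cleaned listing rows into a `listings` table; returns row count."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (address TEXT, price INTEGER, sqft INTEGER)"
    )
    # Named placeholders let us insert the cleaned dicts directly.
    conn.executemany(
        "INSERT INTO listings VALUES (:address, :price, :sqft)", rows
    )
    conn.commit()
    (count,) = conn.execute("SELECT COUNT(*) FROM listings").fetchone()
    conn.close()
    return count
```

Swapping in Postgres or DuckDB would mostly mean changing the connection call, since both accept similar parameterized inserts through their Python drivers.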

👤 Author

Built by Bilal Haroon

Data Science · ML · Systems · Open Source
