✨ INTeLS: Indonesian Territorial List Scraper

High-performance territorial data collector for Indonesia's administrative divisions.

🎯 Motivation

Accurate and up-to-date administrative boundary data for Indonesia (provinces, cities/regencies, subdistricts, and urban villages) is essential for a wide range of applications, including GIS, demographic analysis, public services, and logistics. Manually compiling this information is tedious and error-prone. INTeLS aims to provide a reliable, automated solution to extract this multi-level hierarchical data from a public web source, making it easily accessible for developers and researchers.

🚀 Core Features

  • Asynchronous Scraping Architecture: Built on asyncio for efficient, non-blocking I/O operations, ensuring high performance.

  • Multi-level Hierarchy Extraction: Automatically navigates and extracts data from provinces down to urban villages, capturing the full administrative structure.

  • CSV Export: Exports collected data into a clean, structured CSV format, ready for immediate use.

  • Efficient Handling of Large Datasets: Optimized to process and store a high volume of records (more than 80,000 urban villages plus their parent hierarchies) without running into memory issues.

  • Robust Browser Automation: Utilizes Playwright with stealth techniques to mimic real browser behavior and handle dynamic content.

🛠️ Technical Highlights

  • Python 3.12+: Targets Python 3.12 or later for its performance and asyncio improvements.

  • Playwright: Powers the headless browser automation, enabling the scraper to interact with JavaScript-rendered pages and bypass common anti-bot mechanisms.

  • BeautifulSoup4: Used for efficient and flexible parsing of HTML content.

  • Asyncio: The foundation for the concurrent and high-performance scraping operations.

  • Semaphore-based Rate Limiting: Implements controlled concurrency to prevent overwhelming the target server and avoid IP bans.
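The semaphore-based rate limiting described above can be sketched roughly as follows. This is a minimal illustration, not the INTeLS implementation: `MAX_CONCURRENCY`, `fetch_page`, `fetch_limited`, and `scrape_all` are hypothetical names, and `fetch_page` only sleeps instead of driving a real browser.

```python
import asyncio

MAX_CONCURRENCY = 5  # hypothetical limit, not taken from the INTeLS source


async def fetch_page(url: str) -> str:
    """Stand-in for a real page fetch; sleeps instead of hitting the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_limited(url: str, semaphore: asyncio.Semaphore) -> str:
    # At most MAX_CONCURRENCY coroutines hold the semaphore at the same time;
    # the rest wait here until a slot is released.
    async with semaphore:
        return await fetch_page(url)


async def scrape_all(urls: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [fetch_limited(url, semaphore) for url in urls]
    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    urls = [f"https://example.invalid/page/{i}" for i in range(20)]
    pages = asyncio.run(scrape_all(urls))
    print(len(pages))
```

All tasks are created up front, but the semaphore caps how many requests are in flight, which is what keeps the load on the target server bounded.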

🚨 Data Disclaimer

The data collected by INTeLS is sourced from m.nomor.net. This project is solely a tool to facilitate the collection of publicly available information. We do not claim ownership of the data itself, nor are we responsible for its accuracy or completeness. Users are encouraged to verify the data independently and adhere to the terms of service of the original data source.

📦 Installation

To get started with INTeLS, follow these steps:

1. Clone the repository:

git clone https://github.com/natanhp/INTeLS.git
cd INTeLS

2. Create and activate a virtual environment (recommended):

This example uses pyenv with the pyenv-virtualenv plugin:

pyenv virtualenv 3.12.9 intels
pyenv activate intels

3. Install Python dependencies:

pip install -r requirements.txt

4. Install Playwright browser binaries:

INTeLS uses Playwright, which requires specific browser binaries. Playwright can install these for you.

For Fedora 41

On Fedora 41, playwright install reports a missing-dependencies error. Install the required packages with the command below; you can then ignore the error.

sudo dnf install -y \
    libicu \
    libjpeg-turbo \
    libwebp \
    flite \
    pcre \
    libffi

Source

Then, instruct Playwright to install its browsers:

playwright install

⚡ Usage

To run the scraper, run the package as a module from your project's root directory:

python -m intels

The scraper will then start collecting data. Upon completion, CSV files will be generated in your project root directory.

📊 Output File Structure

Province (provinces.csv)

id,name
11,Aceh (NAD)

City (cities.csv)

id,province_id,name
11.05,11,Kab. Aceh Barat

Subdistrict (subdistricts.csv)

id,city_id,name
11.05.07,11.05,Arongan Lambalek

Urban Village (urban_villages.csv)

id,subdistrict_id,name
11.05.07.2002,11.05.07,Alue Bagok
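Because each file references its parent level by ID, the four CSVs can be joined back into a full hierarchy. The sketch below uses only the sample rows shown above, held in memory; `index_by_id` and `resolve_path` are hypothetical helper names, and a real run would open the generated files instead. IDs are kept as strings, since a value such as 11.05 would lose its meaning if parsed as a number.

```python
import csv
import io

# Sample rows copied from the examples above (assumption: real files share these headers).
PROVINCES = "id,name\n11,Aceh (NAD)\n"
CITIES = "id,province_id,name\n11.05,11,Kab. Aceh Barat\n"
SUBDISTRICTS = "id,city_id,name\n11.05.07,11.05,Arongan Lambalek\n"
URBAN_VILLAGES = "id,subdistrict_id,name\n11.05.07.2002,11.05.07,Alue Bagok\n"


def index_by_id(csv_text: str) -> dict[str, dict[str, str]]:
    """Parse CSV text and key each row by its 'id' column (kept as a string)."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(csv_text))}


provinces = index_by_id(PROVINCES)
cities = index_by_id(CITIES)
subdistricts = index_by_id(SUBDISTRICTS)
villages = index_by_id(URBAN_VILLAGES)


def resolve_path(village_id: str) -> list[str]:
    """Walk the parent IDs upward and return names from province down to village."""
    village = villages[village_id]
    subdistrict = subdistricts[village["subdistrict_id"]]
    city = cities[subdistrict["city_id"]]
    province = provinces[city["province_id"]]
    return [province["name"], city["name"], subdistrict["name"], village["name"]]


if __name__ == "__main__":
    print(" > ".join(resolve_path("11.05.07.2002")))
```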

💡 Upcoming Features

We're continuously working to enhance INTeLS. Here's a glimpse of what's planned for future releases:

  1. Enhanced Progress Bars: More detailed and interactive progress indicators for each scraping level.

  2. Database Export Options: Support for directly exporting collected data to various database systems (e.g., PostgreSQL, SQLite, MongoDB) for easier integration with applications.

  3. Circuit Breaker Pattern Implementation: To improve resilience against temporary network issues or server unresponsiveness, preventing cascading failures during long-running scrapes.
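For readers unfamiliar with the circuit breaker pattern mentioned in item 3, one possible shape is sketched below. It is an illustration only and not a planned INTeLS API: the `CircuitBreaker` class, its method names, and the default thresholds are all assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops issuing requests after repeated failures, then retries after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,      # hypothetical default
        reset_after: float = 30.0,       # hypothetical cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        # Closed circuit: requests flow normally.
        if self._opened_at is None:
            return True
        # Open circuit: block until the cool-down has elapsed, then allow a trial request.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        # A successful request closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self._opened_at = self._clock()
```

A scraper loop would call `allow_request()` before each fetch and report the outcome with `record_success()` or `record_failure()`, so a struggling server gets a pause instead of a stream of retries.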

🤝 Contributing

We welcome contributions to INTeLS! If you'd like to improve the scraper, add new features, or fix bugs, please follow these steps:

  1. Fork the repository.

  2. Create a new branch (git checkout -b feature/your-feature-name).

  3. Make your changes and write tests.

  4. Run tox to ensure all tests pass and code style is consistent.

  5. Commit your changes (git commit -m 'feat: Add new feature').

  6. Push to your fork (git push origin feature/your-feature-name).

  7. Open a Pull Request to the main branch of this repository.

  8. Please ensure your code adheres to PEP 8 standards and includes clear documentation.
