High-performance territorial data collector for Indonesia's administrative divisions.
Accurate and up-to-date administrative boundary data for Indonesia (provinces, cities/regencies, subdistricts, and urban villages) is essential for a wide range of applications, including GIS, demographic analysis, public services, and logistics. Manually compiling this information is tedious and error-prone. INTeLS aims to provide a reliable, automated solution to extract this multi-level hierarchical data from a public web source, making it easily accessible for developers and researchers.
- Asynchronous Scraping Architecture: Built on asyncio for efficient, non-blocking I/O operations, ensuring high performance.
- Multi-level Hierarchy Extraction: Automatically navigates and extracts data from provinces down to urban villages, capturing the full administrative structure.
- CSV Export: Exports collected data into a clean, structured CSV format, ready for immediate use.
- Efficient Handling of Large Datasets: Optimized to process and store a high volume of records (~80,000+ urban villages and their parent hierarchies) without memory issues.
- Robust Browser Automation: Utilizes Playwright with stealth techniques to mimic real browser behavior and handle dynamic content.
- Python 3.12+: Modern Python versions for optimal performance and asyncio features.
- Playwright: Powers the headless browser automation, enabling the scraper to interact with JavaScript-rendered pages and bypass common anti-bot mechanisms.
- BeautifulSoup4: Used for efficient and flexible parsing of HTML content.
- Asyncio: The foundation for the concurrent and high-performance scraping operations.
- Semaphore-based Rate Limiting: Implements controlled concurrency to prevent overwhelming the target server and avoid IP bans.
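As a rough illustration of the semaphore-based concurrency control listed above, the sketch below caps the number of in-flight fetches with `asyncio.Semaphore`. The names `fetch_page`, `fetch_all`, and `max_concurrency` are hypothetical and are not taken from the INTeLS codebase; `fetch_page` only simulates I/O rather than driving Playwright.

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Stand-in for a real page fetch (hypothetical; no network access)."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"<html>{url}</html>"


async def fetch_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    """Fetch every URL while keeping at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        # Tasks wait here once `max_concurrency` fetches are already running.
        async with semaphore:
            return await fetch_page(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


# Example run with placeholder URLs (assumed, not the real data source).
pages = asyncio.run(fetch_all([f"https://example.invalid/p/{i}" for i in range(20)]))
print(len(pages))
```

A semaphore limits how many requests run at once, not how many are sent per second. If a strict request rate is needed, a delay between requests would have to be added on top.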
The data collected by INTeLS is sourced from m.nomor.net. This project is solely a tool to facilitate the collection of publicly available information. We do not claim ownership of the data itself, nor are we responsible for its accuracy or completeness. Users are encouraged to verify the data independently and adhere to the terms of service of the original data source.
To get started with INTeLS, follow these steps:
git clone https://github.com/natanhp/INTeLS.git
cd INTeLS
In this case, I use pyenv:
pyenv virtualenv 3.12.9 intels
pyenv activate intels
pip install -r requirements.txt
INTeLS uses Playwright, which requires specific browser binaries. Playwright can install these for you.
Running playwright install will report a missing-dependencies error. The only thing you need to do is install the dependencies with the command below (it uses dnf, so it applies to Fedora-like systems) and ignore the error.
sudo dnf install -y \
libicu \
libjpeg-turbo \
libwebp \
flite \
pcre \
libffi
playwright install
To run the scraper, execute the main script from your project's root directory:
python -m intels
The scraper will then start collecting data. Upon completion, CSV files will be generated in your project root directory.
id,name
11,Aceh (NAD)
id,province_id,name
11.05,11,Kab. Aceh Barat
id,city_id,name
11.05.07,11.05,Arongan Lambalek
id,subdistrict_id,name
11.05.07.2002,11.05.07,Alue Bagok
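The sample rows above suggest that each level points to its parent through a foreign-key column (`province_id`, `city_id`, `subdistrict_id`). As a minimal sketch of how the four tables could be joined back into a full hierarchy, the example below parses those sample rows from inline strings. The output file names are not stated here, so none are assumed; the helper names `index_by_id` and `full_path` are hypothetical.

```python
import csv
import io

# Sample rows copied from the README. A real run would read the generated
# CSV files instead of these inline strings.
PROVINCES = "id,name\n11,Aceh (NAD)\n"
CITIES = "id,province_id,name\n11.05,11,Kab. Aceh Barat\n"
SUBDISTRICTS = "id,city_id,name\n11.05.07,11.05,Arongan Lambalek\n"
VILLAGES = "id,subdistrict_id,name\n11.05.07.2002,11.05.07,Alue Bagok\n"


def index_by_id(csv_text: str) -> dict[str, dict[str, str]]:
    """Parse CSV text and index its rows by the `id` column."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def full_path(village_id, provinces, cities, subdistricts, villages) -> list[str]:
    """Return [province, city, subdistrict, village] names for a village ID."""
    village = villages[village_id]
    subdistrict = subdistricts[village["subdistrict_id"]]
    city = cities[subdistrict["city_id"]]
    province = provinces[city["province_id"]]
    return [province["name"], city["name"], subdistrict["name"], village["name"]]


tables = [index_by_id(t) for t in (PROVINCES, CITIES, SUBDISTRICTS, VILLAGES)]
print(" > ".join(full_path("11.05.07.2002", *tables)))
# prints: Aceh (NAD) > Kab. Aceh Barat > Arongan Lambalek > Alue Bagok
```

The IDs are kept as strings on purpose. Parsing a value such as `11.05` as a number would drop leading or trailing zeros and break the parent lookups.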
We're continuously working to enhance INTeLS. Here's a glimpse of what's planned for future releases:
- Enhanced Progress Bars: More detailed and interactive progress indicators for each scraping level.
- Database Export Options: Support for directly exporting collected data to various database systems (e.g., PostgreSQL, SQLite, MongoDB) for easier integration with applications.
- Circuit Breaker Pattern Implementation: To improve resilience against temporary network issues or server unresponsiveness, preventing cascading failures during long-running scrapes.
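For readers unfamiliar with the circuit breaker pattern mentioned above, here is a generic, minimal sketch of the idea. It is not the planned INTeLS design: the class, its parameters (`failure_threshold`, `reset_timeout`, `clock`), and the synchronous interface are all assumptions made for illustration. An asyncio-based scraper would more likely wrap awaited calls.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (not part of INTeLS)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast without touching the remote server.
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

After `failure_threshold` consecutive failures the breaker rejects calls immediately until `reset_timeout` seconds have passed, then allows a single trial call. A success closes the circuit again, while a failure re-opens it.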
We welcome contributions to INTeLS! If you'd like to improve the scraper, add new features, or fix bugs, please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature-name).
- Make your changes and write tests.
- Run tox to ensure all tests pass and code style is consistent.
- Commit your changes (git commit -m 'feat: Add new feature').
- Push to your fork (git push origin feature/your-feature-name).
- Open a Pull Request to the main branch of this repository.
- Please ensure your code adheres to PEP 8 standards and includes clear documentation.