Open, Reproducible OpenStreetMap Data Pipeline
Transforming OSM data into analysis-ready GeoParquet format
Access ready-to-use OSM data with a simple DuckDB query:
INSTALL httpfs;
LOAD httpfs;
INSTALL spatial;
LOAD spatial;
SET s3_endpoint='api.cadencemaps.infs.ch';
SET s3_url_style='path';
SELECT names.primary
FROM read_parquet(
's3://cadencemaps/release/2025-05-13/theme=places/type=place/country=CH/*',
filename=true,
hive_partitioning=1
)
WHERE categories.primary = 'restaurant';

A modular pipeline for converting OpenStreetMap data into cloud-native GeoParquet files using DuckDB, PostGIS, and Docker. Built as part of a thesis project, it addresses the challenge of making OSM's rich dataset (5+ million POIs for the D-A-CH-LI region) accessible for modern cloud-native analytics.
- Automated Pipeline: Weekly or daily updates using CI/CD and DataOps principles
- Analysis-Ready: Direct SQL querying without server infrastructure
- Standardized Schema: Overture Maps-compatible format for interoperability
- High Performance: Optimized spatial partitioning
- Quality Assured: Automated validation and vandalism detection
- Scalable Architecture: Country-scale processing in under 24 hours
CadenceMaps employs a fully automated CI/CD pipeline that orchestrates the conversion of OpenStreetMap data into analysis-ready GeoParquet files. The pipeline is triggered weekly and processes the entire D-A-CH-LI region in parallel across multiple countries.
The GitLab CI/CD pipeline consists of seven distinct stages that ensure reliable, reproducible data processing:
- Publish Stage: Builds and publishes Docker images for all services (converter, importer, database, validator, web, reverse-proxy, cert-extractor) to the GitLab container registry.
- Setup Stage: Initializes the production environment by deploying core infrastructure services including MinIO S3 storage, PostgreSQL database, reverse proxy, and the web interface.
- Deploy Stage: Executes parallel data processing for each country in the D-A-CH-LI region:
  - Downloads latest OSM PBF files from Geofabrik (with fallback to OSM France)
  - Imports data into PostgreSQL using osm2pgsql with custom Lua transformations
  - Converts PostgreSQL data to GeoParquet format using DuckDB
  - Uploads results to staging area in MinIO
- Cleanup Stage: Performs system maintenance by removing unused Docker images and volumes to free up server resources.
- Validate Stage: Runs vandalism detection algorithms.
- Approve Stage: Manual approval gate where validated data is promoted from staging to production release bucket, making it publicly accessible.
- Stop Proxy Stage: Optional manual stage to stop the reverse proxy for maintenance purposes.
- Scheduled Execution: Pipeline runs weekly or daily automatically, ensuring fresh data availability
- Fault Tolerance: Automatic fallback to alternative OSM data sources (OSM France) if Geofabrik fails
- Parallel Processing: All four countries (DE, AT, CH, LI) process simultaneously for optimal performance
- Resource Management: Aggressive cleanup of Docker images and volumes to maintain server health
- Monitoring: Artifact collection of validation reports
- Quality Gates: Manual approval step prevents bad data from reaching production
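The fault-tolerance idea (try Geofabrik first, then OSM France) can be sketched in a few lines of Python. This is only an illustration, not the project's downloader: the function names, the injectable `fetch` callable, and the two mirror URLs are assumptions made for the example.

```python
from __future__ import annotations

import urllib.request
from typing import Callable, Iterable


def download_with_fallback(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
) -> tuple[str, bytes]:
    """Try each mirror in order; return (url, payload) from the first that works."""
    errors: list[str] = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))


def http_fetch(url: str, timeout: float = 60.0) -> bytes:
    """Plain HTTP GET that reads the whole body into memory (fine for a sketch only)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Assumed mirror URLs for the Liechtenstein extract; verify them before relying on them.
MIRRORS = [
    "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf",
    "https://download.openstreetmap.fr/extracts/europe/liechtenstein-latest.osm.pbf",
]
```

Calling `download_with_fallback(MIRRORS, http_fetch)` performs the real download. Passing a different `fetch` callable makes the fallback order easy to exercise without network access.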
The entire pipeline typically completes in under 24 hours for the full D-A-CH-LI region, processing 5+ million points of interest.
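The parallel per-country processing can be pictured with a small in-process analogue. How the project actually fans out the work (for example as parallel CI jobs) is not shown here; `process_country` is a hypothetical placeholder for the download, import, convert, and upload steps.

```python
from concurrent.futures import ThreadPoolExecutor

COUNTRIES = ["DE", "AT", "CH", "LI"]


def process_country(country: str) -> str:
    """Hypothetical placeholder for download -> import -> convert -> upload."""
    return f"{country}: done"


def run_all(countries, worker=process_country):
    """Run one worker per country concurrently; return results keyed by country code."""
    with ThreadPoolExecutor(max_workers=len(countries)) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(countries, pool.map(worker, countries)))


if __name__ == "__main__":
    for code, status in run_all(COUNTRIES).items():
        print(code, status)
```

Threads suit this sketch because the real steps would mostly wait on network, database, and subprocess I/O rather than on Python code.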
| Component | Technology | Purpose |
|---|---|---|
| Data Ingestion | osm2pgsql | OSM to PostgreSQL conversion |
| Spatial Processing | PostgreSQL/PostGIS | Geographic data preprocessing |
| Format Conversion | Python + DuckDB | Tabular data transformation |
| Storage Format | GeoParquet | Cloud-optimized geospatial data |
| Hosting | MinIO (S3-compatible) | Scalable object storage |
| Orchestration | GitLab CI/CD + Docker | Automated pipeline execution |
cadencemaps/
├── .docker/                  # Docker container configurations for production
│   ├── cert-extractor/       # SSL certificate extraction service for minio
│   ├── converter/            # Converter service Docker setup
│   ├── importer/             # Importer service Docker setup
│   ├── reverse-proxy/        # Caddy reverse proxy configuration
│   ├── spatial-database/     # PostGIS database Docker setup
│   ├── validator/            # Validator service Docker setup
│   ├── web/                  # Web service Docker setup
│   └── docker-compose.yml    # Production Docker compose configuration
├── converter/                # Data conversion logic (PostgreSQL to GeoParquet)
├── importer/                 # osm2pgsql data import tools and mapping scripts
│   └── mappers/              # Lua scripts for data transformation
│       ├── divisions/        # Administrative boundary mapping
│       ├── places/           # Points of interest mapping
│       └── utils/            # Shared utility functions
├── validator/                # Data quality validation and testing
├── web/                      # Public website and documentation
│   └── release-page/         # Next.js frontend application
└── docker-compose.yml        # Local development environment setup
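To make the role of the Lua mappers more concrete, the sketch below assembles an osm2pgsql command line in Python. It assumes the Lua scripts are written for osm2pgsql's flex output; the script path, database name, and PBF file name are hypothetical, and the importer's real invocation may differ.

```python
import shlex


def osm2pgsql_command(pbf_path: str, style_path: str, database: str) -> list[str]:
    """Build (but do not run) an osm2pgsql call that uses a Lua flex style."""
    return [
        "osm2pgsql",
        "--create",                # start from a fresh import
        "--output=flex",           # assumption: the mappers target the flex output
        f"--style={style_path}",   # Lua script that defines tables and tag mapping
        f"--database={database}",
        pbf_path,
    ]


if __name__ == "__main__":
    cmd = osm2pgsql_command(
        "liechtenstein-latest.osm.pbf",   # hypothetical input file
        "importer/mappers/places.lua",    # hypothetical script name
        "osm",                            # hypothetical database name
    )
    print(shlex.join(cmd))
```

Returning the argument list instead of a shell string means it can be passed straight to `subprocess.run(cmd, check=True)` without quoting issues.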
- Germany (Deutschland)
- Austria (Österreich)
- Switzerland (Schweiz)
- Liechtenstein
- places:
  - place (POIs like restaurants, parks, etc. as points)
- divisions:
  - division (logical definition of administrative boundaries as points)
  - division_area (definition of administrative boundaries as polygons or multipolygons)
  - division_boundary (definition of administrative boundaries as linestrings or multilinestrings)
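The themes and types above appear as hive-style partition folders in the release paths used by the example queries. A small helper can build those paths and catch typos early. This is a sketch: the function name is made up, and it assumes the divisions theme follows the same folder layout that the queries show for places.

```python
from __future__ import annotations

from datetime import date

BUCKET = "cadencemaps"
THEMES = {
    "places": ("place",),
    "divisions": ("division", "division_area", "division_boundary"),
}
COUNTRIES = ("DE", "AT", "CH", "LI")


def release_glob(release: date, theme: str, type_: str, country: str | None = None) -> str:
    """Return the S3 glob for one theme/type, optionally limited to one country."""
    if type_ not in THEMES.get(theme, ()):
        raise ValueError(f"unknown theme/type combination: {theme}/{type_}")
    if country is not None and country not in COUNTRIES:
        raise ValueError(f"unknown country code: {country}")
    # Without a country, match every country=XX partition folder.
    country_part = f"country={country}" if country else "*"
    return (
        f"s3://{BUCKET}/release/{release.isoformat()}"
        f"/theme={theme}/type={type_}/{country_part}/*"
    )


if __name__ == "__main__":
    print(release_glob(date(2025, 5, 13), "places", "place", "CH"))
```

The resulting string can be passed as the first argument of `read_parquet(...)` in DuckDB, together with `hive_partitioning=1`.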
- Clone the repository
  git clone <repository-url>
  cd cadencemaps
- Start Docker services
  docker-compose up -d

Access examples can be found on our website in the Usage section.
- Processing Time: <24 hours for full update
- Data Volume: ~5M POIs processed
In our tests, a typical client query such as "all restaurants in the D-A-CH-LI region" takes around 20 seconds to execute:
SELECT *
FROM read_parquet(
    's3://cadencemaps/release/2025-06-04/theme=places/type=place/*/*',
    hive_partitioning=1
)
WHERE country IN ('DE', 'CH', 'LI', 'AT')
  AND (categories.primary = 'restaurant'
       OR 'restaurant' IN categories.alternate);

Query Performance
| Run | Execution Time |
|---|---|
| Run 1 | 20.17 seconds |
| Run 2 | 19.17 seconds |
| Run 3 | 19.40 seconds |
| Run 4 | 19.68 seconds |
| Run 5 | 19.05 seconds |
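The five measurements can be condensed into a mean, median, and spread with Python's standard library. The numbers are copied from the table; the helper name is chosen just for this sketch.

```python
import statistics

# Execution times in seconds, copied from the table above.
RUNS = [20.17, 19.17, 19.40, 19.68, 19.05]


def summarize(runs):
    """Return mean, median, and max-min spread of the timings in seconds."""
    return {
        "mean": statistics.mean(runs),
        "median": statistics.median(runs),
        "spread": max(runs) - min(runs),
    }


if __name__ == "__main__":
    for name, value in summarize(RUNS).items():
        print(f"{name}: {value:.2f} s")
```

The mean comes out at about 19.49 seconds with a spread of roughly 1.12 seconds, which is consistent with the "around 20 seconds" figure quoted for this query.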
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Geographic Expansion: Add new countries/regions
- Performance: Optimize processing algorithms
- Analytics: New schema mappings and transformations
- Documentation: Improve guides and examples
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenStreetMap Community - For the incredible open geospatial data
- Geofabrik - For reliable OSM extracts
- Overture Maps Foundation - For schema standardization efforts
Built by Matthias Hersche, Fadil Smajilbasic and Nils Robin-Grob as part of the FS25 Bachelor Thesis at the OST
