2 changes: 1 addition & 1 deletion .github/workflows/pre-commit.yml
@@ -15,6 +15,6 @@ jobs:
with:
version: "0.9.28"
- name: Install dependencies (including dev)
run: uv sync --group dev
run: uv sync --group dev --extra sparql
- name: Run pre-commit
run: uv run pre-commit run --all-files
17 changes: 16 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,21 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
## [1.6.0] - 2026-02-17

### Added
- **SPARQL / RDF resource support**: Ingest data from SPARQL endpoints (e.g. Apache Fuseki) and local RDF files (`.ttl`, `.rdf`, `.n3`, `.jsonld`) into property graphs
- New `SparqlPattern` for mapping `rdf:Class` instances to resources, alongside existing `FilePattern` and `TablePattern`
- New `RdfDataSource` abstract parent with shared RDF-to-dict conversion logic; concrete subclasses `RdfFileDataSource` (local files via rdflib) and `SparqlEndpointDataSource` (remote endpoints via SPARQLWrapper)
- New `SparqlEndpointConfig` (extends `DBConfig`) with `from_docker_env()` for Fuseki containers
- New `RdfInferenceManager` auto-infers graflo `Schema` from OWL/RDFS ontologies: `owl:Class` to vertices, `owl:DatatypeProperty` to fields, `owl:ObjectProperty` to edges
- `GraphEngine.infer_schema_from_rdf()` and `GraphEngine.create_patterns_from_rdf()` for the RDF inference workflow
- `Patterns` class extended with `sparql_patterns` and `sparql_configs` dicts
- `RegistryBuilder` handles `ResourceType.SPARQL` to create the appropriate data sources
- `ResourceType.SPARQL`, `DataSourceType.SPARQL`, `DBType.SPARQL` enum values
- `rdflib` and `SPARQLWrapper` available as the `sparql` optional extra (`pip install graflo[sparql]`)
- Docker scripts (`start-all.sh`, `stop-all.sh`, `cleanup-all.sh`) updated to include Fuseki
- Test suite with 22 tests: RDF file parsing, ontology inference, and live Fuseki integration

### Changed
- **Top-level imports optimized**: Key classes are now importable directly from `graflo`:
@@ -17,6 +31,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **`graflo.filter` package exports**: `FilterExpression`, `ComparisonOperator`, and `LogicalOperator` are now re-exported from `graflo.filter.__init__` (previously only available via `graflo.filter.onto`)

### Documentation
- Added data-flow diagram (Pattern -> DataSource -> Resource -> GraphContainer -> Target DB) to Concepts page
- Added **Mermaid class diagrams** to Concepts page showing:
- `GraphEngine` orchestration: how `GraphEngine` delegates to `InferenceManager`, `ResourceMapper`, `Caster`, and `ConnectionManager`
- `Schema` architecture: the full hierarchy from `Schema` through `VertexConfig`/`EdgeConfig`, `Resource`, `Actor` subtypes, `Field`, and `FilterExpression`
107 changes: 62 additions & 45 deletions README.md
@@ -1,8 +1,8 @@
# GraFlo <img src="https://raw.githubusercontent.com/growgraph/graflo/main/docs/assets/favicon.ico" alt="graflo logo" style="height: 32px; width:32px;"/>

A framework for transforming **tabular** (CSV, SQL) and **hierarchical** data (JSON, XML) into property graphs and ingesting them into graph databases (ArangoDB, Neo4j, **TigerGraph**, **FalkorDB**, **Memgraph**).
A framework for transforming **tabular** (CSV, SQL), **hierarchical** (JSON, XML), and **RDF/SPARQL** data into property graphs and ingesting them into graph databases (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph).

> **⚠️ Package Renamed**: This package was formerly known as `graphcast`.
> **Package Renamed**: This package was formerly known as `graphcast`.

![Python](https://img.shields.io/badge/python-3.11%2B-blue.svg)
[![PyPI version](https://badge.fury.io/py/graflo.svg)](https://badge.fury.io/py/graflo)
@@ -11,56 +11,35 @@ A framework for transforming **tabular** (CSV, SQL) and **hierarchical** data (J
[![pre-commit](https://github.com/growgraph/graflo/actions/workflows/pre-commit.yml/badge.svg)](https://github.com/growgraph/graflo/actions/workflows/pre-commit.yml)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15446131.svg)]( https://doi.org/10.5281/zenodo.15446131)

## Core Concepts
## Overview

### Property Graphs
graflo works with property graphs, which consist of:
graflo reads data from multiple source types, transforms it according to a declarative schema, and writes property-graph vertices and edges to a target graph database. The pipeline is:

- **Vertices**: Nodes with properties and optional unique identifiers
- **Edges**: Relationships between vertices with their own properties
- **Properties**: Both vertices and edges may have properties
**Pattern** (where data lives) --> **DataSource** (how to read it) --> **Resource** (what to extract) --> **GraphContainer** --> **Target DB**

### Schema
The Schema defines how your data should be transformed into a graph and contains:
### Supported sources

- **Vertex Definitions**: Specify vertex types, their properties, and unique identifiers
- Fields can be specified as strings (backward compatible) or typed `Field` objects with types (INT, FLOAT, STRING, DATETIME, BOOL)
- Type information enables better validation and database-specific optimizations
- **Edge Definitions**: Define relationships between vertices and their properties
- Weight fields support typed definitions for better type safety
- **Resource Mapping**: describe how data sources map to vertices and edges
- **Transforms**: Modify data during the casting process
- **Automatic Schema Inference**: Generate schemas automatically from PostgreSQL 3NF databases
| Source type | Pattern | DataSource | Schema inference |
|---|---|---|---|
| CSV / JSON / JSONL / Parquet files | `FilePattern` | `FileDataSource` | manual |
| PostgreSQL tables | `TablePattern` | `SQLDataSource` | automatic (3NF with PK/FK) |
| RDF files (`.ttl`, `.rdf`, `.n3`) | `SparqlPattern` | `RdfFileDataSource` | automatic (OWL/RDFS ontology) |
| SPARQL endpoints (Fuseki, ...) | `SparqlPattern` | `SparqlEndpointDataSource` | automatic (OWL/RDFS ontology) |
| REST APIs | -- | `APIDataSource` | manual |
| In-memory (list / DataFrame) | -- | `InMemoryDataSource` | manual |

### Resources
Resources are your data sources that can be:
### Supported targets

- **Table-like**: CSV files, database tables
- **JSON-like**: JSON files, nested data structures
ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph -- same API for all.

## Features

- **Graph Transformation Meta-language**: A powerful declarative language to describe how your data becomes a property graph:
- Define vertex and edge structures with typed fields
- Set compound indexes for vertices and edges
- Use blank vertices for complex relationships
- Specify edge constraints and properties with typed weight fields
- Apply advanced filtering and transformations
- **Typed Schema Definitions**: Enhanced type support throughout the schema system
- Vertex fields support types (INT, FLOAT, STRING, DATETIME, BOOL) for better validation
- Edge weight fields can specify types for improved type safety
- Backward compatible: fields without types default to None (suitable for databases like ArangoDB)
- **🚀 PostgreSQL Schema Inference**: **Automatically generate schemas from PostgreSQL 3NF databases** - No manual schema definition needed!
- Introspect PostgreSQL schemas to identify vertex-like and edge-like tables
- Automatically map PostgreSQL data types to graflo Field types (INT, FLOAT, STRING, DATETIME, BOOL)
- Infer vertex configurations from table structures with proper indexes
- Infer edge configurations from foreign key relationships
- Create Resource mappings from PostgreSQL tables automatically
- Direct database access - ingest data without exporting to files first
- **Async ingestion**: Efficient async/await-based ingestion pipeline for better performance
- **Parallel processing**: Use as many cores as you have
- **Database support**: Ingest into ArangoDB, Neo4j, **TigerGraph**, **FalkorDB**, and **Memgraph** using the same API (database agnostic). Source data from PostgreSQL and other SQL databases.
- **Server-side filtering**: Efficient querying with server-side filtering support (TigerGraph REST++ API)
- **Declarative graph transformation**: Define vertex/edge structures, indexes, weights, and transforms in YAML or Python dicts. Resources describe how each data source maps to vertices and edges.
- **Schema inference**: Automatically generate schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies (class/property introspection).
- **RDF / SPARQL ingestion**: Read `.ttl` files via rdflib or query SPARQL endpoints (e.g. Apache Fuseki). `owl:Class` maps to vertices, `owl:ObjectProperty` to edges, `owl:DatatypeProperty` to vertex fields.
- **Typed fields**: Vertex fields and edge weights support types (`INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`) for validation and database-specific optimization.
- **Parallel batch processing**: Configurable batch sizes and multi-core execution.
- **Database-agnostic**: Single API targeting ArangoDB, Neo4j, TigerGraph, FalkorDB, and Memgraph. Source data from PostgreSQL, SPARQL endpoints, files, APIs, or in-memory objects.

## Documentation
Full documentation is available at: [growgraph.github.io/graflo](https://growgraph.github.io/graflo)
@@ -69,6 +48,9 @@ Full documentation is available at: [growgraph.github.io/graflo](https://growgra

```bash
pip install graflo

# With RDF / SPARQL support (adds rdflib + SPARQLWrapper)
pip install graflo[sparql]
```

## Usage Examples
@@ -187,6 +169,34 @@ caster = Caster(schema)
# ... continue with ingestion
```

### RDF / SPARQL Ingestion

```python
from pathlib import Path
from graflo.hq import GraphEngine
from graflo.db.connection.onto import ArangoConfig

engine = GraphEngine()

# Infer schema from an OWL/RDFS ontology file
ontology = Path("ontology.ttl")
schema = engine.infer_schema_from_rdf(source=ontology)

# Create data-source patterns (reads a local .ttl file per rdf:Class)
patterns = engine.create_patterns_from_rdf(source=ontology)

# Or point at a SPARQL endpoint instead:
# from graflo.db.connection.onto import SparqlEndpointConfig
# sparql_cfg = SparqlEndpointConfig(uri="http://localhost:3030", dataset="mydata")
# patterns = engine.create_patterns_from_rdf(
# source=ontology,
# endpoint_url=sparql_cfg.query_endpoint,
# )

target = ArangoConfig.from_docker_env()
engine.define_and_ingest(schema=schema, target_db_config=target, patterns=patterns)
```

## Development

To install requirements
@@ -235,25 +245,32 @@ FalkorDB from [falkordb docker folder](./docker/falkordb) by
docker-compose --env-file .env up falkordb
```

and Memgraph from [memgraph docker folder](./docker/memgraph) by
Memgraph from [memgraph docker folder](./docker/memgraph) by

```shell
docker-compose --env-file .env up memgraph
```

and Apache Fuseki from [fuseki docker folder](./docker/fuseki) by

```shell
docker-compose --env-file .env up fuseki
```

To run unit tests

```shell
pytest test
```

> **Note**: Tests require external database containers (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph) to be running. CI builds intentionally skip test execution. Tests must be run locally with the required database images started (see [Test databases](#test-databases) section above).
> **Note**: Tests require external database containers (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, Fuseki) to be running. CI builds intentionally skip test execution. Tests must be run locally with the required database images started (see [Test databases](#test-databases) section above).

## Requirements

- Python 3.11+ (Python 3.11 and 3.12 are officially supported)
- python-arango
- sqlalchemy>=2.0.0 (for PostgreSQL and SQL data sources)
- rdflib>=7.0.0 + SPARQLWrapper>=2.0.0 (optional, install with `pip install graflo[sparql]`)

## Contributing

2 changes: 1 addition & 1 deletion docker/cleanup-all.sh
@@ -9,7 +9,7 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"

# Database directories
DATABASES=("arango" "neo4j" "postgres" "falkordb" "memgraph" "nebula" "tigergraph")
DATABASES=("arango" "neo4j" "postgres" "falkordb" "memgraph" "nebula" "tigergraph" "fuseki")

# Colors for output
GREEN='\033[0;32m'
7 changes: 7 additions & 0 deletions docker/fuseki/.env
@@ -0,0 +1,7 @@
IMAGE_VERSION=secoresearch/fuseki:5.1.0
SPEC=graflo
CONTAINER_NAME="${SPEC}.fuseki"
TS_PORT=3032
TS_PASSWORD="abc123-qwe"
TS_USERNAME="admin"
TS_DATASET="test"
16 changes: 16 additions & 0 deletions docker/fuseki/docker-compose.yml
@@ -0,0 +1,16 @@
services:
fuseki:
image: ${IMAGE_VERSION}
user: "${UID}:${GID}"
restart: "no"
profiles: ["${CONTAINER_NAME}"]
ports:
- "${TS_PORT}:3030"
container_name: ${CONTAINER_NAME}
volumes:
- fuseki_data:/fuseki
environment:
- ADMIN_PASSWORD=${TS_PASSWORD}
volumes:
fuseki_data:
driver: local
2 changes: 1 addition & 1 deletion docker/start-all.sh
@@ -8,7 +8,7 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"

# Database directories
DATABASES=("arango" "neo4j" "postgres" "falkordb" "memgraph" "nebula" "tigergraph")
DATABASES=("arango" "neo4j" "postgres" "falkordb" "memgraph" "nebula" "tigergraph" "fuseki")

# Colors for output
GREEN='\033[0;32m'
2 changes: 1 addition & 1 deletion docker/stop-all.sh
@@ -8,7 +8,7 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"

# Database directories
DATABASES=("arango" "neo4j" "postgres" "falkordb" "memgraph" "nebula" "tigergraph")
DATABASES=("arango" "neo4j" "postgres" "falkordb" "memgraph" "nebula" "tigergraph" "fuseki")

# Colors for output
GREEN='\033[0;32m'
62 changes: 57 additions & 5 deletions docs/concepts/index.md
@@ -10,6 +10,51 @@ graflo transforms data sources into property graphs through a pipeline of compon

Each component plays a specific role in this transformation process.

### Data flow: Pattern → DataSource → Resource → GraphContainer → Target DB

The diagram below shows how different data sources (files, SQL tables, RDF/SPARQL)
flow through the unified ingestion pipeline.

```mermaid
flowchart LR
subgraph sources [Data Sources]
TTL["*.ttl / *.rdf files"]
Fuseki["SPARQL Endpoint\n(Fuseki)"]
Files["CSV / JSON files"]
PG["PostgreSQL"]
end
subgraph patterns [Patterns]
FP[FilePattern]
TP[TablePattern]
SP[SparqlPattern]
end
subgraph datasources [DataSource Layer]
subgraph rdfFamily ["RdfDataSource (abstract)"]
RdfDS[RdfFileDataSource]
SparqlDS[SparqlEndpointDataSource]
end
FileDS[FileDataSource]
SQLDS[SQLDataSource]
end
subgraph pipeline [Shared Pipeline]
Res[Resource Pipeline]
GC[GraphContainer]
DBW[DBWriter]
end

TTL --> SP --> RdfDS --> Res
Fuseki --> SP --> SparqlDS --> Res
Files --> FP --> FileDS --> Res
PG --> TP --> SQLDS --> Res
Res --> GC --> DBW
```

- **Patterns** describe *where* data comes from (file paths, SQL tables, SPARQL endpoints).
- **DataSources** handle *how* to read data in batches from each source type.
- **Resources** define *what* to extract from each document (vertices, edges, transforms).
- **GraphContainer** collects the resulting vertices and edges.
- **DBWriter** pushes the graph data into the target database (ArangoDB, Neo4j, TigerGraph, etc.).
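
For an RDF source, a single run touches every stage of this pipeline. The sketch below reuses the engine-level calls shown in the README and annotates which stage each one drives; it is a hedged illustration, and exact signatures may vary between versions:

```python
from pathlib import Path

from graflo.hq import GraphEngine
from graflo.db.connection.onto import ArangoConfig

engine = GraphEngine()
ontology = Path("ontology.ttl")  # assumed local OWL/RDFS ontology

# Pattern: one SparqlPattern per rdf:Class, describing *where* the data lives
patterns = engine.create_patterns_from_rdf(source=ontology)

# Resource mappings come from the inferred Schema (*what* to extract)
schema = engine.infer_schema_from_rdf(source=ontology)

# DataSource reads, GraphContainer collects, DBWriter pushes -- all handled by the engine
target = ArangoConfig.from_docker_env()
engine.define_and_ingest(schema=schema, target_db_config=target, patterns=patterns)
```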

## Class Diagrams

### GraphEngine orchestration
@@ -28,6 +73,8 @@ classDiagram
+introspect(postgres_config) SchemaIntrospectionResult
+infer_schema(postgres_config) Schema
+create_patterns(postgres_config) Patterns
+infer_schema_from_rdf(source) Schema
+create_patterns_from_rdf(source) Patterns
+define_schema(schema, target_db_config)
+define_and_ingest(schema, target_db_config, ...)
+ingest(schema, target_db_config, ...)
@@ -63,6 +110,7 @@ classDiagram
class Patterns {
+file_patterns: list~FilePattern~
+table_patterns: list~TablePattern~
+sparql_patterns: list~SparqlPattern~
}

class DBConfig {
@@ -293,7 +341,7 @@ The `Schema` is the central configuration that defines how data sources are tran
- Resource mappings
- Data transformations
- Index configurations
- Automatic schema inference from normalized PostgreSQL databases (3NF) with proper primary keys (PK) and foreign keys (FK) using intelligent heuristics
- Automatic schema inference from normalized PostgreSQL databases (3NF with PK/FK) or from OWL/RDFS ontologies

### Vertex
A `Vertex` describes vertices and their database indexes. It supports:
@@ -386,11 +434,13 @@ Edges in graflo support a rich set of attributes that enable flexible relationsh
A `DataSource` defines where data comes from and how it's retrieved. graflo supports multiple data source types:

- **File Data Sources**: JSON, JSONL, CSV/TSV files
- **RDF File Data Sources**: Turtle (`.ttl`), RDF/XML (`.rdf`), N3 (`.n3`), JSON-LD files -- parsed via `rdflib`, triples grouped by subject into flat dictionaries
- **SPARQL Data Sources**: Remote SPARQL endpoints (e.g. Apache Fuseki) queried via `SPARQLWrapper` with pagination
- **API Data Sources**: REST API endpoints with pagination, authentication, and retry logic
- **SQL Data Sources**: SQL databases via SQLAlchemy with parameterized queries
- **In-Memory Data Sources**: Python objects (lists, DataFrames) already in memory

Data sources are separate from Resources - they handle data retrieval, while Resources handle data transformation. Many data sources can map to the same Resource, allowing data to be ingested from multiple sources.
Data sources are separate from Resources -- they handle data retrieval, while Resources handle data transformation. Many data sources can map to the same Resource, allowing data to be ingested from multiple sources.
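
The "triples grouped by subject into flat dictionaries" behaviour of the RDF file source can be pictured with plain `rdflib`. This is a rough sketch of the idea (the file name is hypothetical), not graflo's internal code:

```python
from collections import defaultdict
from rdflib import Graph

g = Graph()
g.parse("data.ttl", format="turtle")  # hypothetical input file

docs = defaultdict(dict)
for s, p, o in g:  # every triple: (subject, predicate, object)
    field = str(p).rsplit("#", 1)[-1].rsplit("/", 1)[-1]  # local name of the predicate
    docs[str(s)][field] = o.toPython() if hasattr(o, "toPython") else str(o)

# each subject becomes one flat dictionary, ready for the Resource pipeline
rows = [{"_subject": subject, **fields} for subject, fields in docs.items()]
```

The SPARQL endpoint source retrieves the same kind of rows remotely. A paginated query via `SPARQLWrapper` might look like the following sketch -- the endpoint URL and page size are assumptions:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/test/sparql")  # hypothetical Fuseki dataset
sparql.setReturnFormat(JSON)

page_size, offset, bindings = 1000, 0, []
while True:
    sparql.setQuery(
        f"SELECT ?s ?p ?o WHERE {{ ?s ?p ?o }} ORDER BY ?s LIMIT {page_size} OFFSET {offset}"
    )
    page = sparql.query().convert()["results"]["bindings"]
    if not page:
        break
    bindings.extend(page)  # each row: {"s": {...}, "p": {...}, "o": {...}}
    offset += page_size
```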

### Resource
A `Resource` is a set of mappings and transformations that define how data becomes a graph, defined as a hierarchical structure of `Actors`. Resources are part of the Schema and define:
@@ -431,7 +481,8 @@ A `Transform` defines data transforms, from renaming and type-casting to arbitra
- **Edge Constraints**: Ensure edge uniqueness based on source, target, and weight
- **Reusable Transforms**: Define and reference transformations by name
- **Vertex Filtering**: Filter vertices based on custom conditions
- **PostgreSQL Schema Inference**: Automatically infer schemas from normalized PostgreSQL databases (3NF) with proper primary keys (PK) and foreign keys (FK) decorated, using intelligent heuristics to detect vertices and edges
- **PostgreSQL Schema Inference**: Infer schemas from normalized PostgreSQL databases (3NF) with PK/FK constraints
- **RDF / OWL Schema Inference**: Infer schemas from OWL/RDFS ontologies -- `owl:Class` becomes vertices, `owl:ObjectProperty` becomes edges, `owl:DatatypeProperty` becomes vertex fields
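
The OWL/RDFS inference boils down to a handful of ontology lookups. A minimal sketch of the idea with `rdflib` is shown below; graflo's `RdfInferenceManager` wraps this into a `Schema`, and its actual implementation may differ:

```python
from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

g = Graph()
g.parse("ontology.ttl", format="turtle")  # hypothetical ontology file

# owl:Class -> candidate vertex collections
vertex_classes = set(g.subjects(RDF.type, OWL.Class))

# owl:DatatypeProperty -> vertex fields, grouped by their rdfs:domain class
vertex_fields = {}
for prop in g.subjects(RDF.type, OWL.DatatypeProperty):
    domain = g.value(prop, RDFS.domain)
    vertex_fields.setdefault(domain, []).append(prop)

# owl:ObjectProperty -> edges from the rdfs:domain class to the rdfs:range class
edges = [
    (g.value(prop, RDFS.domain), prop, g.value(prop, RDFS.range))
    for prop in g.subjects(RDF.type, OWL.ObjectProperty)
]
```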

### Performance Optimization
- **Batch Processing**: Process large datasets in configurable batches (`batch_size` parameter of `Caster`)
@@ -453,6 +504,7 @@ A `Transform` defines data transforms, from renaming and type-casting to arbitra
- Specify types for weight fields when using databases that require type information (e.g., TigerGraph)
- Use typed `Field` objects or dicts with `type` key for better validation
8. Leverage key matching (`match_source`, `match_target`) for complex matching scenarios
9. Use PostgreSQL schema inference for automatic schema generation from normalized databases (3NF) with proper PK/FK constraints - the heuristics work best when primary keys and foreign keys are properly decorated
10. Specify field types for better validation and database-specific optimizations, especially when targeting TigerGraph
9. Use PostgreSQL schema inference for automatic schema generation from normalized databases (3NF) with proper PK/FK constraints
10. Use RDF/OWL schema inference (`infer_schema_from_rdf`) when ingesting data from SPARQL endpoints or `.ttl` files with a well-defined ontology
11. Specify field types for better validation and database-specific optimizations, especially when targeting TigerGraph
