Parquet Triple Store Implementation

A high-performance RDF triple store built on RDFLib and the Apache Parquet format.

Features

  • Storage: Store RDF graphs as Parquet files for efficient disk usage
  • Loading: Load RDF graphs from Parquet files
  • Batch Operations: Store and load multiple graphs
  • Querying: Basic SPARQL query support
  • Indexing: Optional indexed storage for faster subject/predicate queries
  • Merging: Merge multiple RDF graphs into a single dataset
  • Export: Export triples to Turtle format

Installation

cd parquad
pip install -r requirements.txt

Required packages:

  • rdflib >= 6.4.0
  • pyarrow >= 14.0.0
  • pandas >= 2.0.0
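These version pins correspond to a requirements.txt along the lines of:

rdflib>=6.4.0
pyarrow>=14.0.0
pandas>=2.0.0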

Basic Usage

Creating a Triple Store

from parquet_triple_store import ParquetTripleStore

# Initialize the store
store = ParquetTripleStore(storage_path="my_triple_store")

Storing RDF Graphs

from rdflib import Graph, URIRef, Literal, RDF
from parquet_triple_store import ParquetTripleStore

graph = Graph()

# Add some triples
graph.add((URIRef("http://example.org/person1"), RDF.type, URIRef("http://xmlns.com/foaf/0.1/Person")))
graph.add((URIRef("http://example.org/person1"), 
            URIRef("http://xmlns.com/foaf/0.1/name"), 
            Literal("Alice")))

# Store the graph
filepath = store.store_graph(graph, "my_data")

Loading RDF Graphs

# Load a specific graph
loaded_graph = store.load_graph("my_data")

# Load all graphs
all_triples = store.load_all_graphs()
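Per the API reference below, load_all_graphs() returns a pandas DataFrame, so the combined triples can be filtered with ordinary pandas operations. The subject/predicate/object column names used here are an assumption about the store's schema, not confirmed by the docs:

# Sketch: filter the combined DataFrame with pandas
# (assumes subject/predicate/object string columns)
names = all_triples[all_triples["predicate"] == "http://xmlns.com/foaf/0.1/name"]
print(names[["subject", "object"]])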

Querying

# Get statistics
stats = store.get_statistics()
print(f"Total triples: {stats['total_triples']}")

# Export to Turtle
store.export_to_turtle("output.ttl")
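The store advertises basic SPARQL support; a minimal sketch that works regardless of any store-specific API is to run SPARQL against a loaded graph with rdflib's built-in query engine:

# SPARQL over a loaded graph using rdflib's built-in engine
loaded_graph = store.load_graph("my_data")
results = loaded_graph.query("""
    SELECT ?name WHERE {
        ?person <http://xmlns.com/foaf/0.1/name> ?name .
    }
""")
for row in results:
    print(row.name)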

Using Indexed Store

For faster queries by subject or predicate:

from parquet_triple_store import ParquetTripleStoreWithIndex

# Initialize indexed store
indexed_store = ParquetTripleStoreWithIndex()

# Store and load graphs
indexed_store.store_graph(graph, "indexed_data")
indexed_store.load_all_graphs()

# Query by subject
results = indexed_store.find_by_subject("http://example.org/person1")

# Find triples with criteria
results = indexed_store.find_triples(
    subject="http://example.org/person1",
    predicate="http://xmlns.com/foaf/0.1/name"
)

Batch Operations

from rdflib import Graph
from parquet_triple_store import ParquetTripleStore

store = ParquetTripleStore(storage_path="my_triple_store")

# Store multiple graphs in one call; graph1, graph2, and graph3
# are rdflib Graphs built as shown above
graphs_to_store = [
    ("dataset1", graph1),
    ("dataset2", graph2),
    ("dataset3", graph3)
]

filenames = store.batch_store(graphs_to_store)

Merging Graphs

# Merge two graphs
merged_file = store.merge_graphs("dataset1", "dataset2")

API Reference

ParquetTripleStore

__init__(storage_path: str)

Initialize the triple store with a storage directory.

store_graph(graph: Graph, filename: str = None) -> str

Store an RDF graph as a Parquet file.

load_graph(filename: str) -> Graph

Load an RDF graph from a Parquet file.

batch_store(graphs: List[Tuple[str, Graph]]) -> List[str]

Store multiple graphs at once.

load_all_graphs() -> pd.DataFrame

Load all stored Parquet files and return their triples as a single DataFrame.

get_statistics() -> dict

Get statistics about stored triples.

export_to_turtle(filename: str = "output.ttl") -> str

Export triples to Turtle format.

merge_graphs(filename1: str, filename2: str) -> str

Merge two graphs and store as new file.

delete_file(filename: str) -> bool

Delete a specific Parquet file.

ParquetTripleStoreWithIndex

Extends ParquetTripleStore with additional indexing capabilities.

find_by_subject(subject_uri: str) -> pd.DataFrame

Find all triples with a specific subject.

find_by_predicate(predicate_uri: str) -> pd.DataFrame

Find all triples with a specific predicate.

find_triples(subject: str = None, predicate: str = None, object: str = None) -> pd.DataFrame

Find triples matching given criteria.
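For example, combining predicate and object criteria (illustrative values; whether literals match by plain string is not confirmed here):

# All three criteria are optional keyword arguments
results = indexed_store.find_triples(
    predicate="http://xmlns.com/foaf/0.1/name",
    object="Alice"
)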

Performance Considerations

  • The Parquet format provides excellent compression and fast columnar reads
  • The indexed store is recommended for frequent subject/predicate queries
  • For large datasets, consider loading only the data you need (see the sketch after this list)
  • SPARQL queries require the sparqlwrapper package
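
As a sketch of the "load only necessary data" point: Parquet files can be read directly with pyarrow, projecting columns and pushing filters down to the file scan. The file path and the subject/predicate/object column names below are assumptions about the store's on-disk layout, not a documented API:

import pyarrow.parquet as pq

# Read only two columns and push the predicate filter down to the scan
# (assumes the store writes subject/predicate/object string columns;
#  the file path is hypothetical)
table = pq.read_table(
    "my_triple_store/my_data.parquet",
    columns=["subject", "object"],
    filters=[("predicate", "==", "http://xmlns.com/foaf/0.1/name")],
)
df = table.to_pandas()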

Example Usage

See usage_example.py for comprehensive examples.

License

MIT
