74 changes: 72 additions & 2 deletions README.md
@@ -53,12 +53,29 @@ LDP gives you a realistic local environment to develop and test before deploying
- **Apache Iceberg** - Modern table format with ACID transactions
- **Jupyter** - Interactive development environment

## 📚 Getting Started Tutorial

**New to LDP?** Start with our comprehensive tutorial:

👉 **[Getting Started Tutorial](docs/getting-started-tutorial.md)** - Complete hands-on guide with tested examples

The tutorial covers:
- ✅ Platform setup for Windows, Linux, and macOS
- ✅ Working with MinIO (S3-compatible storage)
- ✅ Processing data with Spark
- ✅ Managing Iceberg tables (ACID transactions, time travel)
- ✅ Orchestrating workflows with Airflow
- ✅ Building your own data pipelines
- ✅ Production-ready examples and best practices

**All tutorial code is tested and ready to use!**
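
As a quick taste of the Iceberg features the tutorial covers (not code taken from the tutorial itself), here is an illustrative PySpark sketch of time travel. The catalog, namespace, table, snapshot id, and timestamp are all made up, and it assumes a `spark` session that already has an Iceberg catalog configured, for example in the platform's Jupyter environment:

```python
# Assumes an existing PySpark session (`spark`) with an Iceberg catalog named "local";
# all table names and values below are hypothetical.

# Current state of the table
spark.sql("SELECT COUNT(*) AS row_count FROM local.demo.events").show()

# Time travel by timestamp
spark.sql(
    "SELECT COUNT(*) AS row_count FROM local.demo.events "
    "TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Time travel by snapshot id through the DataFrame reader
old_df = spark.read.option("snapshot-id", 1234567890123456789).table("local.demo.events")
old_df.show()
```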

## Quick Start

LDP works on **macOS**, **Windows**, and **Linux**. Choose your platform:

- **[Windows](docs/platform-guides/windows.md)** - Use PowerShell scripts and Chocolatey/winget
- **[macOS](docs/platform-guides/macos.md)** - Use Homebrew and native tools
- **[Linux](docs/setup-guide.md#linux-setup)** - Standard package managers

### Prerequisites
@@ -376,7 +393,34 @@ This copies all examples to their respective directories for testing and learning

## Documentation

### Getting Started
- **[📚 Getting Started Tutorial](docs/getting-started-tutorial.md)** - **START HERE!** Complete hands-on guide
- [Setup Guide](docs/setup-guide.md) - Detailed installation instructions
- [Writing Code Guide](docs/writing-code.md) - Best practices for developing pipelines
- [Platform Guides](docs/platform-guides/) - Windows, macOS, Linux specific guides

### Understanding LDP
- [Project Structure](docs/project-structure.md) - Directory layout and organization
- [Hive vs Iceberg](docs/hive-vs-iceberg.md) - Why we use Iceberg
- [Iceberg Catalog](docs/iceberg-hadoop-catalog.md) - HadoopCatalog explained

### Operations & Deployment
- [Production Guide](docs/production-guide.md) - Deploying to production
- [CI/CD Testing](docs/ci-testing.md) - Automated testing documentation
- [Troubleshooting](docs/troubleshooting.md) - Common issues and solutions

### Directory READMEs
Each major directory has its own README explaining its purpose:
- [airflow/](airflow/README.md) - Airflow DAG development
- [spark/](spark/README.md) - Spark job development
- [examples/](examples/README.md) - Example code library
- [docker/](docker/README.md) - Custom Docker images
- [config/](config/README.md) - Configuration files
- [terraform/](terraform/README.md) - Infrastructure as Code
- [scripts/](scripts/README.md) - Utility scripts
- [tests/](tests/README.md) - Testing strategies

See the **[Documentation Index](docs/)** for the complete list.

## Contributing

@@ -389,6 +433,32 @@

MIT License

## Recent Updates

### December 2024

**🎉 Major Documentation Update**
- Added comprehensive [Getting Started Tutorial](docs/getting-started-tutorial.md) with tested examples
- Added README files to all major directories explaining their purpose
- Cross-platform support documentation (Windows PowerShell + Linux/macOS Bash)
- Examples directory is now clearly optional and can be deleted if desired

**🔧 Dependency Updates**
- Fixed: Pinned s3fs==2024.12.0 and fsspec==2024.12.0 to avoid yanked PyPI versions
- Updated: Python 3.13, Airflow 3.1.5, PySpark 4.0.1
- Updated: NumPy 2.3.5, Pandas 2.3.3, PyArrow 22.0.0
- See [UPGRADE-PLAN-2025](docs/UPGRADE-PLAN-2025.md) for migration details

**📝 Documentation Improvements**
- Clarified that LDP uses Minikube + Terraform (not docker-compose)
- Added Windows-first documentation with PowerShell scripts
- Tutorial uses actual scripts instead of make commands for clarity
- Added examples of Iceberg CRUD, MinIO operations, and Airflow DAGs

**🗑️ Cleanup**
- Removed Hive configuration (LDP uses Iceberg only)
- Clarified examples/ directory is optional reference material

## Support

For issues and questions, please open an issue in the repository.
115 changes: 115 additions & 0 deletions airflow/README.md
@@ -0,0 +1,115 @@
# Airflow Directory

This directory contains Apache Airflow configuration and DAG files for workflow orchestration.

## Structure

```
airflow/
├── dags/        # Your Airflow DAG files go here
├── logs/        # Airflow execution logs (auto-generated)
├── plugins/     # Custom Airflow plugins
└── README.md    # This file
```

## What is Airflow?

Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows, which are defined as DAGs (Directed Acyclic Graphs).

## Adding DAGs

### Option 1: Copy from Examples

Copy tested DAG examples to this directory:

```bash
# Simple example
cp examples/simple_dag.py airflow/dags/

# Production examples
cp examples/dags/data_ingestion/ingest_daily.py airflow/dags/
cp examples/dags/data_transformation/transform_pipeline.py airflow/dags/
```

### Option 2: Create Your Own

Create a new DAG file in `airflow/dags/`:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    'my_pipeline',
    default_args={
        'owner': 'ldp',
        'start_date': datetime(2024, 1, 1),
        'retries': 1,
    },
    description='My data pipeline',
    schedule='@daily',
    catchup=False,
) as dag:

    task = BashOperator(
        task_id='my_task',
        bash_command='echo "Hello LDP!"',
    )
```

## DAG Best Practices

1. **Use `catchup=False`** - Don't backfill historical runs automatically
2. **Set proper retries** - Allow tasks to retry on transient failures
3. **Tag your DAGs** - Use tags for organization: `tags=['ingestion', 'daily']`
4. **Use `logical_date`** - Use it instead of the deprecated `execution_date` (Airflow 3.0+); see the sketch after this list
5. **Make tasks idempotent** - Tasks should be safe to re-run
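
A minimal sketch tying these practices together (the DAG id, tags, and bash command are hypothetical; the logical date reaches the task through Airflow's built-in `{{ ds }}` template):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "daily_sales_ingest",              # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,                     # practice 1: no automatic backfill
    default_args={
        "owner": "ldp",
        "retries": 2,                  # practice 2: retry transient failures
        "retry_delay": timedelta(minutes=5),
    },
    tags=["ingestion", "daily"],       # practice 3: tag for organization
) as dag:
    # Practices 4 and 5: the logical date arrives via the {{ ds }} template, and
    # the command only touches that date's partition, so re-runs are idempotent.
    ingest = BashOperator(
        task_id="ingest_partition",
        bash_command="echo 'ingesting partition for {{ ds }}'",
    )
```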

## Useful Commands

```bash
# Trigger a DAG
make airflow-trigger DAG=my_pipeline

# List all DAGs
make airflow-dags

# Check DAG for errors
make airflow-check

# View logs
make airflow-logs
```

## Accessing Airflow UI

- **URL**: http://localhost:8080
- **Username**: admin
- **Password**: admin

## Example DAGs

See the `examples/` directory for tested, production-ready DAG examples:

- `examples/simple_dag.py` - Basic DAG structure
- `examples/dags/data_ingestion/ingest_daily.py` - Daily data ingestion
- `examples/dags/data_transformation/transform_pipeline.py` - Spark transformation pipeline

## Common Issues

### DAG not appearing in UI

1. Check for Python syntax errors: `python airflow/dags/your_dag.py`
2. Wait 1-2 minutes for Airflow to scan for new DAGs
3. Check Airflow logs: `make airflow-logs`

### Import errors

Ensure all required packages are in `docker/airflow/requirements.txt`

## Learn More

- [Airflow Documentation](https://airflow.apache.org/docs/)
- Tutorial: `docs/getting-started-tutorial.md`
- Production Guide: `docs/production-guide.md`
114 changes: 114 additions & 0 deletions config/README.md
@@ -0,0 +1,114 @@
# Config Directory

This directory contains configuration files for all LDP services.

## Structure

```
config/
├── iceberg/
│   └── catalog.properties   # Iceberg catalog configuration
└── README.md                # This file
```

## Configuration Files

### Iceberg Configuration

**File**: `iceberg/catalog.properties`

Configures the Apache Iceberg table format:

```properties
# Catalog type - using HadoopCatalog (file-based)
catalog-impl=org.apache.iceberg.hadoop.HadoopCatalog

# Warehouse location - where Iceberg tables are stored
warehouse=s3a://warehouse/

# S3/MinIO Configuration
s3.endpoint=http://minio:9000
s3.access-key-id=admin
s3.secret-access-key=minioadmin
s3.path-style-access=true

# File format settings
write.format.default=parquet
write.parquet.compression-codec=snappy

# Metadata settings
commit.retry.num-retries=3
commit.retry.min-wait-ms=100
```

**Key Settings:**

- **`catalog-impl`**: Uses HadoopCatalog (file-based, no external metastore)
- **`warehouse`**: All Iceberg tables stored in MinIO's `warehouse` bucket
- **`s3.endpoint`**: Points to MinIO service
- **`s3.path-style-access=true`**: Required for MinIO compatibility
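
As an illustration of how these properties typically surface in code, here is a hedged PySpark sketch that registers a Hadoop-catalog Iceberg catalog pointed at MinIO. The catalog name `local` is arbitrary, and it assumes the Iceberg Spark runtime and `hadoop-aws` jars are already on the classpath (as in the platform's Spark image); for the tested session setup, see `examples/iceberg_crud.py`.

```python
from pyspark.sql import SparkSession

# Sketch only: catalog name "local" is an assumption, not a project convention.
spark = (
    SparkSession.builder.appName("iceberg-local")
    # Iceberg catalog backed by HadoopCatalog, mirroring catalog.properties
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://warehouse/")
    # s3a connector settings for MinIO, mirroring the s3.* properties above
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS local.demo")
spark.sql(
    "CREATE TABLE IF NOT EXISTS local.demo.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
```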

## Understanding Iceberg Catalog

The HadoopCatalog is a simple, file-based catalog suitable for development and small deployments.

**Pros:**
- No external metastore needed
- Simple setup
- Works well with object storage (MinIO/S3)

**Cons:**
- Limited concurrency
- Not recommended for production multi-user environments

**Learn more:**
- `docs/iceberg-hadoop-catalog.md` - Detailed explanation
- `docs/hive-vs-iceberg.md` - Why we use Iceberg

## Environment-Specific Configuration

For different environments (dev, staging, prod), you can:

1. Create environment-specific config files:
```
config/iceberg/
├── catalog.dev.properties
├── catalog.staging.properties
└── catalog.prod.properties
```

2. Use an environment variable in your Spark configuration to switch between them (see the sketch below)
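
One hedged way to wire this up (the `LDP_ENV` variable and per-environment file names are assumptions, not an existing convention in this repo): pick the properties file from an environment variable, parse its `key=value` lines, and feed the relevant values into the Spark session builder.

```python
import os
from pathlib import Path

from pyspark.sql import SparkSession

# Hypothetical convention: LDP_ENV selects config/iceberg/catalog.<env>.properties
env = os.environ.get("LDP_ENV", "dev")
props_path = Path("config/iceberg") / f"catalog.{env}.properties"

# .properties files are plain "key=value" lines; parse them by hand
props = {}
for raw in props_path.read_text().splitlines():
    line = raw.strip()
    if line and not line.startswith("#") and "=" in line:
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()

spark = (
    SparkSession.builder.appName(f"ldp-{env}")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", props["warehouse"])
    .config("spark.hadoop.fs.s3a.endpoint", props["s3.endpoint"])
    .config("spark.hadoop.fs.s3a.path.style.access", props.get("s3.path-style-access", "true"))
    .getOrCreate()
)
```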

## Adding New Configurations

When adding new services or configuration files:

1. Create a subdirectory: `config/service_name/`
2. Add configuration files
3. Update this README
4. Document in service README

## Service Configuration Locations

Some services have configuration embedded in other locations:

- **Airflow**: Environment variables in `docker-compose.yml`
- **Spark**: Configuration in Spark session creation (see `examples/iceberg_crud.py`)
- **MinIO**: Environment variables in `docker-compose.yml`
- **PostgreSQL**: Environment variables in `docker-compose.yml`

## Security Note

**⚠️ Important**: The current configuration uses default credentials suitable for local development only.

**For production:**
- Use strong passwords
- Store credentials in secrets management (AWS Secrets Manager, HashiCorp Vault)
- Never commit production credentials to git
- Use environment variables or secret files (see the sketch below)
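
For example, a hedged sketch of pulling the MinIO/S3 credentials from environment variables instead of hard-coding them (the variable names `LDP_S3_ACCESS_KEY` and `LDP_S3_SECRET_KEY` are hypothetical):

```python
import os

from pyspark.sql import SparkSession

# Hypothetical variable names; fail fast if they are missing rather than
# falling back to the development defaults.
access_key = os.environ["LDP_S3_ACCESS_KEY"]
secret_key = os.environ["LDP_S3_SECRET_KEY"]

spark = (
    SparkSession.builder.appName("ldp-prod")
    .config("spark.hadoop.fs.s3a.access.key", access_key)
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)
    .getOrCreate()
)
```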

## Learn More

- Iceberg Configuration: https://iceberg.apache.org/docs/latest/configuration/
- Getting Started: `docs/getting-started-tutorial.md`
- Production Setup: `docs/production-guide.md`