74 changes: 72 additions & 2 deletions README.md
@@ -53,12 +53,29 @@ LDP gives you a realistic local environment to develop and test before deploying
- **Apache Iceberg** - Modern table format with ACID transactions
- **Jupyter** - Interactive development environment

## 📚 Getting Started Tutorial

**New to LDP?** Start with our comprehensive tutorial:

👉 **[Getting Started Tutorial](docs/getting-started-tutorial.md)** - Complete hands-on guide with tested examples

The tutorial covers:
- ✅ Platform setup for Windows, Linux, and macOS
- ✅ Working with MinIO (S3-compatible storage)
- ✅ Processing data with Spark
- ✅ Managing Iceberg tables (ACID transactions, time travel)
- ✅ Orchestrating workflows with Airflow
- ✅ Building your own data pipelines
- ✅ Production-ready examples and best practices

**All tutorial code is tested and ready to use!**
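
As a quick taste of the Iceberg features the tutorial covers (not code taken from the tutorial itself), here is an illustrative PySpark sketch of time travel. The catalog, namespace, table, snapshot id, and timestamp are all made up, and it assumes a `spark` session that already has an Iceberg catalog configured, for example in the platform's Jupyter environment:

```python
# Assumes an existing PySpark session (`spark`) with an Iceberg catalog named "local";
# all table names and values below are hypothetical.

# Current state of the table
spark.sql("SELECT COUNT(*) AS row_count FROM local.demo.events").show()

# Time travel by timestamp
spark.sql(
    "SELECT COUNT(*) AS row_count FROM local.demo.events "
    "TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Time travel by snapshot id through the DataFrame reader
old_df = spark.read.option("snapshot-id", 1234567890123456789).table("local.demo.events")
old_df.show()
```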

## Quick Start

LDP works on **macOS**, **Windows**, and **Linux**. Choose your platform:

- **[Windows](docs/platform-guides/windows.md)** - Use PowerShell scripts and Chocolatey/winget
- **[macOS](docs/platform-guides/macos.md)** - Use Homebrew and native tools
- **[Linux](docs/setup-guide.md#linux-setup)** - Standard package managers

### Prerequisites
@@ -376,7 +393,34 @@ This copies all examples to their respective directories for testing and learning

## Documentation

### Getting Started
- **[📚 Getting Started Tutorial](docs/getting-started-tutorial.md)** - **START HERE!** Complete hands-on guide
- [Setup Guide](docs/setup-guide.md) - Detailed installation instructions
- [Writing Code Guide](docs/writing-code.md) - Best practices for developing pipelines
- [Platform Guides](docs/platform-guides/) - Windows, macOS, Linux specific guides

### Understanding LDP
- [Project Structure](docs/project-structure.md) - Directory layout and organization
- [Hive vs Iceberg](docs/hive-vs-iceberg.md) - Why we use Iceberg
- [Iceberg Catalog](docs/iceberg-hadoop-catalog.md) - HadoopCatalog explained

### Operations & Deployment
- [Production Guide](docs/production-guide.md) - Deploying to production
- [CI/CD Testing](docs/ci-testing.md) - Automated testing documentation
- [Troubleshooting](docs/troubleshooting.md) - Common issues and solutions

### Directory READMEs
Each major directory has its own README explaining its purpose:
- [airflow/](airflow/README.md) - Airflow DAG development
- [spark/](spark/README.md) - Spark job development
- [examples/](examples/README.md) - Example code library
- [docker/](docker/README.md) - Custom Docker images
- [config/](config/README.md) - Configuration files
- [terraform/](terraform/README.md) - Infrastructure as Code
- [scripts/](scripts/README.md) - Utility scripts
- [tests/](tests/README.md) - Testing strategies

See the **[Documentation Index](docs/)** for the complete list.

## Contributing

@@ -389,6 +433,32 @@

MIT License

## Recent Updates

### December 2024

**🎉 Major Documentation Update**
- Added comprehensive [Getting Started Tutorial](docs/getting-started-tutorial.md) with tested examples
- Added README files to all major directories explaining their purpose
- Cross-platform support documentation (Windows PowerShell + Linux/macOS Bash)
- Examples directory is now clearly optional and can be deleted if desired

**🔧 Dependency Updates**
- Fixed: Pinned s3fs==2024.12.0 and fsspec==2024.12.0 to avoid yanked PyPI versions
- Updated: Python 3.13, Airflow 3.1.5, PySpark 4.0.1
- Updated: NumPy 2.3.5, Pandas 2.3.3, PyArrow 22.0.0
- See [UPGRADE-PLAN-2025](docs/UPGRADE-PLAN-2025.md) for migration details

**📝 Documentation Improvements**
- Clarified that LDP uses Minikube + Terraform (not docker-compose)
- Added Windows-first documentation with PowerShell scripts
- Tutorial uses actual scripts instead of make commands for clarity
- Added examples of Iceberg CRUD, MinIO operations, and Airflow DAGs

**🗑️ Cleanup**
- Removed Hive configuration (LDP uses Iceberg only)
- Clarified examples/ directory is optional reference material

## Support

For issues and questions, please open an issue in the repository.
115 changes: 115 additions & 0 deletions airflow/README.md
@@ -0,0 +1,115 @@
# Airflow Directory

This directory contains Apache Airflow configuration and DAG files for workflow orchestration.

## Structure

```
airflow/
├── dags/        # Your Airflow DAG files go here
├── logs/        # Airflow execution logs (auto-generated)
├── plugins/     # Custom Airflow plugins
└── README.md    # This file
```

## What is Airflow?

Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows, which are defined as DAGs (Directed Acyclic Graphs).

## Adding DAGs

### Option 1: Copy from Examples

Copy tested DAG examples to this directory:

```bash
# Simple example
cp examples/simple_dag.py airflow/dags/

# Production examples
cp examples/dags/data_ingestion/ingest_daily.py airflow/dags/
cp examples/dags/data_transformation/transform_pipeline.py airflow/dags/
```

### Option 2: Create Your Own

Create a new DAG file in `airflow/dags/`:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    'my_pipeline',
    default_args={
        'owner': 'ldp',
        'start_date': datetime(2024, 1, 1),
        'retries': 1,
    },
    description='My data pipeline',
    schedule='@daily',
    catchup=False,
) as dag:

    task = BashOperator(
        task_id='my_task',
        bash_command='echo "Hello LDP!"',
    )
```

## DAG Best Practices

1. **Use `catchup=False`** - Don't backfill historical runs automatically
2. **Set proper retries** - Allow tasks to retry on transient failures
3. **Tag your DAGs** - Use tags for organization: `tags=['ingestion', 'daily']`
4. **Use `logical_date`** - Use it instead of the deprecated `execution_date` (Airflow 3.0+); see the sketch after this list
5. **Make tasks idempotent** - Tasks should be safe to re-run
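
A minimal sketch tying these practices together (the DAG id, tags, and bash command are hypothetical; the logical date reaches the task through Airflow's built-in `{{ ds }}` template):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "daily_sales_ingest",              # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,                     # practice 1: no automatic backfill
    default_args={
        "owner": "ldp",
        "retries": 2,                  # practice 2: retry transient failures
        "retry_delay": timedelta(minutes=5),
    },
    tags=["ingestion", "daily"],       # practice 3: tag for organization
) as dag:
    # Practices 4 and 5: the logical date arrives via the {{ ds }} template, and
    # the command only touches that date's partition, so re-runs are idempotent.
    ingest = BashOperator(
        task_id="ingest_partition",
        bash_command="echo 'ingesting partition for {{ ds }}'",
    )
```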

## Useful Commands

```bash
# Trigger a DAG
make airflow-trigger DAG=my_pipeline

# List all DAGs
make airflow-dags

# Check DAG for errors
make airflow-check

# View logs
make airflow-logs
```

## Accessing Airflow UI

- **URL**: http://localhost:8080
- **Username**: admin
- **Password**: admin

## Example DAGs

See the `examples/` directory for tested, production-ready DAG examples:

- `examples/simple_dag.py` - Basic DAG structure
- `examples/dags/data_ingestion/ingest_daily.py` - Daily data ingestion
- `examples/dags/data_transformation/transform_pipeline.py` - Spark transformation pipeline

## Common Issues

### DAG not appearing in UI

1. Check for Python syntax errors: `python airflow/dags/your_dag.py`
2. Wait 1-2 minutes for Airflow to scan for new DAGs
3. Check Airflow logs: `make airflow-logs`

### Import errors

Ensure all required packages are in `docker/airflow/requirements.txt`

## Learn More

- [Airflow Documentation](https://airflow.apache.org/docs/)
- Tutorial: `docs/getting-started-tutorial.md`
- Production Guide: `docs/production-guide.md`
114 changes: 114 additions & 0 deletions config/README.md
@@ -0,0 +1,114 @@
# Config Directory

This directory contains configuration files for all LDP services.

## Structure

```
config/
├── iceberg/
│   └── catalog.properties   # Iceberg catalog configuration
└── README.md                # This file
```

## Configuration Files

### Iceberg Configuration

**File**: `iceberg/catalog.properties`

Configures the Apache Iceberg table format:

```properties
# Catalog type - using HadoopCatalog (file-based)
catalog-impl=org.apache.iceberg.hadoop.HadoopCatalog

# Warehouse location - where Iceberg tables are stored
warehouse=s3a://warehouse/

# S3/MinIO Configuration
s3.endpoint=http://minio:9000
s3.access-key-id=admin
s3.secret-access-key=minioadmin
s3.path-style-access=true

# File format settings
write.format.default=parquet
write.parquet.compression-codec=snappy

# Metadata settings
commit.retry.num-retries=3
commit.retry.min-wait-ms=100
```

**Key Settings:**

- **`catalog-impl`**: Uses HadoopCatalog (file-based, no external metastore)
- **`warehouse`**: All Iceberg tables stored in MinIO's `warehouse` bucket
- **`s3.endpoint`**: Points to MinIO service
- **`s3.path-style-access=true`**: Required for MinIO compatibility
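
As an illustration of how these properties typically surface in code, here is a hedged PySpark sketch that registers a Hadoop-catalog Iceberg catalog pointed at MinIO. The catalog name `local` is arbitrary, and it assumes the Iceberg Spark runtime and `hadoop-aws` jars are already on the classpath (as in the platform's Spark image); for the tested session setup, see `examples/iceberg_crud.py`.

```python
from pyspark.sql import SparkSession

# Sketch only: catalog name "local" is an assumption, not a project convention.
spark = (
    SparkSession.builder.appName("iceberg-local")
    # Iceberg catalog backed by HadoopCatalog, mirroring catalog.properties
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://warehouse/")
    # s3a connector settings for MinIO, mirroring the s3.* properties above
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS local.demo")
spark.sql(
    "CREATE TABLE IF NOT EXISTS local.demo.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
```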

## Understanding Iceberg Catalog

The HadoopCatalog is a simple, file-based catalog suitable for development and small deployments.

**Pros:**
- No external metastore needed
- Simple setup
- Works well with object storage (MinIO/S3)

**Cons:**
- Limited concurrency
- Not recommended for production multi-user environments

**Learn more:**
- `docs/iceberg-hadoop-catalog.md` - Detailed explanation
- `docs/hive-vs-iceberg.md` - Why we use Iceberg

## Environment-Specific Configuration

For different environments (dev, staging, prod), you can:

1. Create environment-specific config files:
```
config/iceberg/
├── catalog.dev.properties
├── catalog.staging.properties
└── catalog.prod.properties
```

2. Use an environment variable in your Spark configuration to switch between them (see the sketch below)
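
One hedged way to wire this up (the `LDP_ENV` variable and per-environment file names are assumptions, not an existing convention in this repo): pick the properties file from an environment variable, parse its `key=value` lines, and feed the relevant values into the Spark session builder.

```python
import os
from pathlib import Path

from pyspark.sql import SparkSession

# Hypothetical convention: LDP_ENV selects config/iceberg/catalog.<env>.properties
env = os.environ.get("LDP_ENV", "dev")
props_path = Path("config/iceberg") / f"catalog.{env}.properties"

# .properties files are plain "key=value" lines; parse them by hand
props = {}
for raw in props_path.read_text().splitlines():
    line = raw.strip()
    if line and not line.startswith("#") and "=" in line:
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()

spark = (
    SparkSession.builder.appName(f"ldp-{env}")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", props["warehouse"])
    .config("spark.hadoop.fs.s3a.endpoint", props["s3.endpoint"])
    .config("spark.hadoop.fs.s3a.path.style.access", props.get("s3.path-style-access", "true"))
    .getOrCreate()
)
```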

## Adding New Configurations

When adding new services or configuration files:

1. Create a subdirectory: `config/service_name/`
2. Add configuration files
3. Update this README
4. Document in service README

## Service Configuration Locations

Some services have configuration embedded in other locations:

- **Airflow**: Environment variables in `docker-compose.yml`
- **Spark**: Configuration in Spark session creation (see `examples/iceberg_crud.py`)
- **MinIO**: Environment variables in `docker-compose.yml`
- **PostgreSQL**: Environment variables in `docker-compose.yml`

## Security Note

**⚠️ Important**: The current configuration uses default credentials suitable for local development only.

**For production:**
- Use strong passwords
- Store credentials in secrets management (AWS Secrets Manager, HashiCorp Vault)
- Never commit production credentials to git
- Use environment variables or secret files (see the sketch below)
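
For example, a hedged sketch of pulling the MinIO/S3 credentials from environment variables instead of hard-coding them (the variable names `LDP_S3_ACCESS_KEY` and `LDP_S3_SECRET_KEY` are hypothetical):

```python
import os

from pyspark.sql import SparkSession

# Hypothetical variable names; fail fast if they are missing rather than
# falling back to the development defaults.
access_key = os.environ["LDP_S3_ACCESS_KEY"]
secret_key = os.environ["LDP_S3_SECRET_KEY"]

spark = (
    SparkSession.builder.appName("ldp-prod")
    .config("spark.hadoop.fs.s3a.access.key", access_key)
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)
    .getOrCreate()
)
```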

## Learn More

- Iceberg Configuration: https://iceberg.apache.org/docs/latest/configuration/
- Getting Started: `docs/getting-started-tutorial.md`
- Production Setup: `docs/production-guide.md`