data-transformed

This repository contains the automated workflow for transforming raw cinema data into a standardized format for the Clusterflick project.

Purpose

The transformation workflow processes raw cinema event data retrieved from various cinema websites and converts it into a unified, standardized cinema data format. This includes:

Data normalization: Converting venue-specific data formats into a consistent schema
Movie matching: Attempting to match each event to a movie in The Movie Database (TMDB)
Categorization: When movie matching fails, events are categorized appropriately
Quality assurance: Ensuring data integrity and consistency across all venues

How It Works

Each job downloads the latest raw data from the clusterflick/data-retrieved repository's most recent release, then executes the transform command for each venue:

npx clusterflick/scripts transform <venue-identifier>

This command:

Reads the raw data for the specified venue
Normalizes the data structure
Attempts to match events to movies in TMDB using the Movie Database API
Categorizes events that cannot be matched to movies
Generates standardized output files

Schedule

The workflow is automatically triggered when the data retrieval workflow completes successfully. It can also be triggered manually via workflow dispatch if needed.

Workflow Structure

Transform commands are organized into parallel job groups to:

Minimize total wall clock time
Prevent cascading failures (one command failure won't stop others)
Enable individual job restarts when issues occur

Maintenance

Adding New Venues

When a new venue is added to the system, this workflow must be updated to include the transformation step. To add a venue:

Identify the appropriate job group (or create a new one if needed)

Add a new step to run the transform command for the venue:

- name: venue-identifier
  run: npx clusterflick/scripts transform venue-identifier

Ensure the job downloads the correct raw data files from the retrieval repository

Dependencies

The workflow requires several API keys configured as GitHub secrets:

MOVIEDB_API_KEY - For matching events to movies in The Movie Database
GEMINI_API_KEY - For AI-powered categorization when movie matching fails
PAT - Personal Access Token for triggering downstream workflows

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.github/workflows		.github/workflows
scripts		scripts
.gitignore		.gitignore
.node-version		.node-version
LICENSE		LICENSE
README.md		README.md
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

data-transformed

Purpose

How It Works

Schedule

Workflow Structure

Maintenance

Adding New Venues

Dependencies

About

Uh oh!

Releases 458

Uh oh!

Languages

License

clusterflick/data-transformed

Folders and files

Latest commit

History

Repository files navigation

data-transformed

Purpose

How It Works

Schedule

Workflow Structure

Maintenance

Adding New Venues

Dependencies

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 458

Uh oh!

Languages