PHI Anonymization and Retention Pipeline

Project Name: multi-split-mapred-job

1. Project Overview 🏥

This project implements a two-stage MapReduce pipeline that addresses a healthcare data retention problem: the de-identification (anonymization) of Protected Health Information (PHI) before archival storage.

By anonymizing raw PHI before it is archived, the pipeline reduces regulatory risk (e.g., HIPAA violations) and allows cheaper, safer long-term storage of the non-PHI clinical metadata needed for research and regulatory compliance.

Core Technologies

Big Data System: MapR-DB (HBase API) and MapR-FS (HDFS API)
Processing Framework: Hadoop MapReduce
Language: Java 11
Build Tool: Apache Maven

2. Pipeline Stages (Multi-Split Job)

The solution is orchestrated by the AnonymizationPipelineDriver and runs in two sequential stages:

Job 1: PHI Identification and Anonymization

Input: Raw PHI data from MapR-DB (/user/mapr/phi_raw_data)
Mapper (PHIAnonymizationMapper): Reads the raw records, hashes key PHI fields (e.g., Patient Name → Anonymized ID), and retains non-PHI metadata.
Output: Key-Value text files to MapR-FS (/tmp/phi_intermediate_data)
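
The hashing step in Job 1 can be illustrated with a minimal sketch. It assumes a salted SHA-256 digest rendered as hex; the README does not state which hash function or salting scheme PHIAnonymizationMapper actually uses, and the PhiHasher class and its salt parameter are hypothetical names introduced only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not taken from the project's source. */
final class PhiHasher {
    private PhiHasher() {}

    /**
     * Returns a 64-character lowercase hex SHA-256 digest of salt + phiValue.
     * The same (salt, value) pair always yields the same anonymized ID.
     */
    static String anonymize(String salt, String phiValue) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(salt.getBytes(StandardCharsets.UTF_8));
            byte[] hash = digest.digest(phiValue.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to an unsigned value so every byte prints as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the digest is deterministic, records for the same patient map to the same anonymized ID across runs as long as the salt is unchanged. An unsalted hash of a low-entropy field such as a name can be reversed by dictionary lookup, which is why the sketch takes a salt that would need to be kept secret.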

Job 2: Archival Storage Preparation and Tagging

Input: Anonymized data from MapR-FS (/tmp/phi_intermediate_data)
Reducer (ArchivalTagReducer): Aggregates the records (if needed) and applies a Retention Tag (RET_GROUP_A_LONG_TERM) based on metadata.
Output: Final, anonymized records written back to an Archival MapR-DB table (/user/mapr/phi_archive_data)
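
As a rough illustration of metadata-based tagging in Job 2, the sketch below picks a retention tag from a record's metadata. Only the tag RET_GROUP_A_LONG_TERM comes from this README. The RetentionTagger class, the record_type key, the CLINICAL value, and the RET_UNCLASSIFIED fallback are all hypothetical; the real rule in ArchivalTagReducer may differ.

```java
import java.util.Map;

/** Hypothetical tagging rule, not taken from the project's source. */
final class RetentionTagger {
    /** Tag named in the README. */
    static final String LONG_TERM = "RET_GROUP_A_LONG_TERM";
    /** Hypothetical fallback for records the rule does not recognize. */
    static final String UNCLASSIFIED = "RET_UNCLASSIFIED";

    private RetentionTagger() {}

    /** Chooses a tag from the non-PHI metadata carried through Job 1. */
    static String tagFor(Map<String, String> metadata) {
        // "record_type" is an assumed metadata key used only for this sketch.
        String type = metadata.get("record_type");
        return "CLINICAL".equalsIgnoreCase(type) ? LONG_TERM : UNCLASSIFIED;
    }
}
```

Keeping the rule in a small pure function like this makes it testable without a cluster; a reducer would call it once per aggregated record before writing to the archival table.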

3. Setup and Execution

Prerequisites

MapR Client setup with access to a cluster.
Java 11 installed.
Apache Maven installed.

Build

mvn clean package -DskipTests

This command generates the runnable fat JAR: target/phi-anonymization-pipeline-1.0.0-SNAPSHOT-jar-with-dependencies.jar.

Run

The driver class automatically performs the following steps, in order:
Creates the input and output MapR-DB tables (SchemaUtility).
Cleans up any previous intermediate data on MapR-FS.
Runs Job 1 (Anonymization).
Runs Job 2 (Archival Tagging).
Cleans up intermediate data upon success.

# Example command to run the pipeline on the MapR/Hadoop cluster
hadoop jar target/phi-anonymization-pipeline-1.0.0-SNAPSHOT-jar-with-dependencies.jar \
com.mapr.health.driver.AnonymizationPipelineDriver
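
The driver's control flow described in this section (run the stages in order, stop at the first failure, remove intermediate data only on success) can be sketched in plain Java. This is not the code of AnonymizationPipelineDriver: PipelineSketch is a hypothetical name, and the stages are modeled as BooleanSupplier values so the example runs without a cluster.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of sequential, fail-fast stage orchestration. */
final class PipelineSketch {
    private PipelineSketch() {}

    /**
     * Runs each stage in order and stops at the first one that reports failure.
     * The cleanup action runs only when every stage succeeded, so intermediate
     * data is left in place for inspection after a failed run.
     */
    static boolean run(List<BooleanSupplier> stages, Runnable cleanupOnSuccess) {
        for (BooleanSupplier stage : stages) {
            if (!stage.getAsBoolean()) {
                return false;
            }
        }
        cleanupOnSuccess.run();
        return true;
    }
}
```

In a Hadoop driver, each stage would typically wrap a call such as job.waitForCompletion(true), which returns true when the job succeeds, and the process exit code would be derived from the value returned by run.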

