Project Name: multi-split-mapred-job
This project implements a multi-stage MapReduce pipeline designed to solve a critical Healthcare Data Retention problem: the de-identification (anonymization) of Protected Health Information (PHI) before archival storage.
By de-identifying raw PHI data before it is archived, the pipeline reduces regulatory risk (e.g., HIPAA violations) and allows cheaper, safer long-term storage of the essential, non-PHI clinical metadata needed for research and regulatory compliance.
- Big Data System: MapR-DB (HBase API) and MapR-FS (HDFS API)
- Processing Framework: Hadoop MapReduce
- Language: Java 11
- Build Tool: Apache Maven
The solution is orchestrated by the AnonymizationPipelineDriver and runs in two sequential stages:
Job 1 (Anonymization):
- Input: Raw PHI data from MapR-DB (/user/mapr/phi_raw_data)
- Mapper (PHIAnonymizationMapper): Reads the raw records, hashes key PHI fields (e.g., Patient Name $\rightarrow$ Anonymized ID), and retains non-PHI metadata.
- Output: Key-value text files to MapR-FS (/tmp/phi_intermediate_data)
Job 2 (Archival Tagging):
- Input: Anonymized data from MapR-FS (/tmp/phi_intermediate_data)
- Reducer (ArchivalTagReducer): Aggregates the records (if needed) and applies a Retention Tag (RET_GROUP_A_LONG_TERM) based on metadata.
- Output: Final, anonymized records written back to an archival MapR-DB table (/user/mapr/phi_archive_data)
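The field-hashing step in the mapper can be illustrated with a small, dependency-free sketch. It is not the project's PHIAnonymizationMapper code: the class name `PhiFieldHasher`, the method `anonymizedId`, and the choice of a keyed hash (HMAC-SHA256 from the JDK) instead of a plain digest are all assumptions made for illustration. A keyed hash is used here because names are easy to guess, and an unkeyed hash of a name can be reversed with a dictionary.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, not part of the project: derives a stable
// anonymized ID from a PHI field value using HMAC-SHA256.
final class PhiFieldHasher {
    private static final String ALGORITHM = "HmacSHA256";

    // Same value + same key always yields the same 64-character hex ID,
    // so records for one patient still group together after anonymization.
    static String anonymizedId(String phiValue, byte[] secretKey) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(secretKey, ALGORITHM));
            byte[] digest = mac.doFinal(phiValue.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key rejected", e);
        }
    }

    public static void main(String[] args) {
        // Demo key only; a real deployment would load the key from a secret store.
        byte[] key = "demo-key-not-for-production".getBytes(StandardCharsets.UTF_8);
        System.out.println(anonymizedId("Jane Doe", key));
    }
}
```

Inside a mapper, a call such as `PhiFieldHasher.anonymizedId(patientName, key)` would produce the output key, with the retained non-PHI metadata as the value.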
- MapR client installed and configured with access to a cluster.
- Java 11 installed.
- Apache Maven installed.
mvn clean package -DskipTests
This command generates the runnable fat JAR: target/phi-anonymization-pipeline-1.0.0-SNAPSHOT-jar-with-dependencies.jar.
The driver class automatically performs the following:
- Creates the input and output MapR-DB tables (SchemaUtility).
- Cleans up any previous intermediate data on MapR-FS.
- Runs Job 1 (Anonymization).
- Runs Job 2 (Archival Tagging).
- Cleans up intermediate data upon success.
# Example command to run the pipeline on the MapR/Hadoop cluster
hadoop jar target/phi-anonymization-pipeline-1.0.0-SNAPSHOT-jar-with-dependencies.jar \
com.mapr.health.driver.AnonymizationPipelineDriver
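The sequence the driver performs can be sketched in plain Java as control flow only. This is an assumption-laden illustration, not the AnonymizationPipelineDriver source: the class `SequentialPipelineSketch` and its parameters are hypothetical, and each job is modeled as a `BooleanSupplier`. In a real Hadoop driver, each supplier would typically wrap `Job.waitForCompletion(true)`, which returns `true` on success.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the driver's ordering: prepare tables, clean stale
// intermediate data, run Job 1, run Job 2, clean up only on success.
final class SequentialPipelineSketch {

    // Returns a process-style exit code: 0 on success, 1 on the first failure.
    static int run(Runnable prepareTables,
                   Runnable cleanIntermediate,
                   BooleanSupplier job1,
                   BooleanSupplier job2) {
        prepareTables.run();      // create input/output tables
        cleanIntermediate.run();  // remove leftovers from a previous run
        if (!job1.getAsBoolean()) {
            return 1;             // Job 2 never starts without Job 1's output
        }
        if (!job2.getAsBoolean()) {
            return 1;             // intermediate data is left in place for inspection
        }
        cleanIntermediate.run();  // clean up only after both jobs succeed
        return 0;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        int exit = run(
                () -> log.add("tables"),
                () -> log.add("clean"),
                () -> { log.add("job1"); return true; },
                () -> { log.add("job2"); return true; });
        System.out.println(log + " exit=" + exit);
        // prints: [tables, clean, job1, job2, clean] exit=0
    }
}
```

Returning early on failure keeps the intermediate files available for debugging, which matches the source's statement that cleanup happens upon success.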