PySpark application for distributed random data generation. The types of random data and their configuration are specified in "datagen_schema_config.json"; please refer to the doc (link above) for more details on the config format.
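Purely as an illustration of how the config is consumed, the snippet below loads and prints it; the key names shown in the comment ("num_rows", "columns", etc.) are assumptions, not the real schema, which is defined in the doc linked above.

    # Illustrative only: load and inspect the schema config.
    # The key names in the commented example below are assumptions;
    # the actual format is defined in the linked doc.
    import json

    with open("datagen_schema_config.json") as f:
        config = json.load(f)

    # e.g. a config of this kind might look like:
    # {"num_rows": 1000000,
    #  "columns": [{"name": "user_id", "type": "int", "min": 1, "max": 999999},
    #              {"name": "code", "type": "regex", "pattern": "[A-Z]{2}-\\d{4}"}]}
    print(json.dumps(config, indent=2))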
Dependencies :-
- HDP (expected to be pre-installed in the cluster)
- PySpark v2.3 (expected to be pre-installed in the cluster)
- Python 2.7.5 (expected to be pre-installed in the cluster)
- exrex (must be installed manually across all nodes; see the sketch after this list)
- NumPy (expected to be pre-installed in the cluster)
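exrex generates random strings that match a regular expression, which is presumably how regex-based column types are populated. A quick smoke test to verify it is installed and importable on a node:

    # Quick check that the manual exrex install succeeded (run on each node).
    import exrex

    # getone() returns a single random string matching the pattern
    print(exrex.getone(r"[A-Z]{2}-\d{4}"))            # e.g. "QK-8274"
    print(exrex.getone(r"(alpha|beta|gamma)_\d{2}"))  # e.g. "beta_07"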
Bootstrap and Onboarding :-
- After cloning the repo onto the resource manager (YARN) node of your cluster, install the dependencies listed above manually on every node of your cluster
- If it is a kerberized cluster, run kinit and ensure a valid Kerberos ticket has been obtained
- Ensure the appropriate PySpark version is selected: export SPARK_MAJOR_VERSION=2
- Run the PySpark job "datagen_job.py":

    spark-submit --master yarn --files datagen_schema_config.json datagen_job.py
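For orientation only, below is a minimal sketch of the pattern a job like this typically follows. This is not the contents of datagen_job.py; the config key "num_rows", the generated columns, and the output table name are all assumptions.

    # Sketch of the distributed-generation pattern; the actual logic
    # lives in datagen_job.py. Key and table names here are hypothetical.
    import json
    import numpy as np
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("datagen-sketch")
             .enableHiveSupport()
             .getOrCreate())

    with open("datagen_schema_config.json") as f:  # shipped via --files
        config = json.load(f)

    num_rows = int(config.get("num_rows", 1000))  # assumed key name
    num_parts = 8                                 # generation parallelism

    def gen_partition(_):
        # Each task generates its share of rows independently on its executor,
        # so no data is shuffled between nodes during generation.
        for _i in range(num_rows // num_parts):
            yield (int(np.random.randint(0, 10**6)), float(np.random.random()))

    rdd = (spark.sparkContext
           .parallelize(range(num_parts), num_parts)
           .flatMap(gen_partition))

    # Persist as a Hive table so it is visible from beeline (name hypothetical)
    spark.createDataFrame(rdd, ["id", "value"]) \
         .write.mode("overwrite").saveAsTable("random_data")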
Accessing the location of generated data :-
- Launch the Hive CLI: > beeline
- Connect to Hive: > !connect <HIVE_JDBC_URL>
- List what was created: show databases; / show tables;
- Show the table schema and storage location: describe formatted <HIVE_TABLE_NAME>; (the Location field points to the generated data)
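The same check can be done from a PySpark shell instead of beeline. A minimal sketch, where <HIVE_TABLE_NAME> is the same placeholder used in the steps above:

    from pyspark.sql import SparkSession

    # Hive support lets Spark see the tables created by the datagen job
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("show databases").show()
    # Substitute the table name reported by 'show tables' above
    spark.sql("describe formatted <HIVE_TABLE_NAME>").show(truncate=False)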