PySpark application for distributed random data generation. The types of random data and their configuration are specified in "datagen_schema_config.json"; please refer to the doc (link above) for more details on the config format.
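Purely as an illustration of how the config is consumed, the snippet below loads and prints it; the key names shown in the comment ("num_rows", "columns", etc.) are assumptions, not the real schema, which is defined in the doc linked above.

    # Illustrative only: load and inspect the schema config.
    # The key names in the commented example below are assumptions;
    # the actual format is defined in the linked doc.
    import json

    with open("datagen_schema_config.json") as f:
        config = json.load(f)

    # e.g. a config of this kind might look like:
    # {"num_rows": 1000000,
    #  "columns": [{"name": "user_id", "type": "int", "min": 1, "max": 999999},
    #              {"name": "code", "type": "regex", "pattern": "[A-Z]{2}-\\d{4}"}]}
    print(json.dumps(config, indent=2))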
Dependencies :-
- HDP (expected to be pre-installed in the cluster)
- PySpark v2.3 (expected to be pre-installed in the cluster)
- Python 2.7.5 (expected to be pre-installed in the cluster)
- exrex (must be installed manually across all nodes; see the sketch after this list)
- NumPy (expected to be pre-installed in the cluster)
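exrex generates random strings that match a regular expression, which is presumably how regex-based column types are populated. A quick smoke test to verify it is installed and importable on a node:

    # Quick check that the manual exrex install succeeded (run on each node).
    import exrex

    # getone() returns a single random string matching the pattern
    print(exrex.getone(r"[A-Z]{2}-\d{4}"))            # e.g. "QK-8274"
    print(exrex.getone(r"(alpha|beta|gamma)_\d{2}"))  # e.g. "beta_07"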
Bootstrap and Onboarding :-
- After cloning the repo onto the resource manager (YARN) node of your cluster, install the dependencies listed above manually on every node of your cluster
- If it is a kerberized cluster, run kinit and ensure a valid Kerberos ticket has been obtained
- Ensure the appropriate PySpark version is selected: export SPARK_MAJOR_VERSION=2
- Run the PySpark job "datagen_job.py":

    spark-submit --master yarn --files datagen_schema_config.json datagen_job.py
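For orientation only, below is a minimal sketch of the pattern a job like this typically follows. This is not the contents of datagen_job.py; the config key "num_rows", the generated columns, and the output table name are all assumptions.

    # Sketch of the distributed-generation pattern; the actual logic
    # lives in datagen_job.py. Key and table names here are hypothetical.
    import json
    import numpy as np
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("datagen-sketch")
             .enableHiveSupport()
             .getOrCreate())

    with open("datagen_schema_config.json") as f:  # shipped via --files
        config = json.load(f)

    num_rows = int(config.get("num_rows", 1000))  # assumed key name
    num_parts = 8                                 # generation parallelism

    def gen_partition(_):
        # Each task generates its share of rows independently on its executor,
        # so no data is shuffled between nodes during generation.
        for _i in range(num_rows // num_parts):
            yield (int(np.random.randint(0, 10**6)), float(np.random.random()))

    rdd = (spark.sparkContext
           .parallelize(range(num_parts), num_parts)
           .flatMap(gen_partition))

    # Persist as a Hive table so it is visible from beeline (name hypothetical)
    spark.createDataFrame(rdd, ["id", "value"]) \
         .write.mode("overwrite").saveAsTable("random_data")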
Accessing the location of generated data :-
- Launch the Hive CLI: > beeline
- Connect to Hive: > !connect <HIVE_JDBC_URL>
- List what was created: show databases; / show tables;
- Show the table schema and storage location: describe formatted <HIVE_TABLE_NAME>; (the Location field points to the generated data)
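The same check can be done from a PySpark shell instead of beeline. A minimal sketch, where <HIVE_TABLE_NAME> is the same placeholder used in the steps above:

    from pyspark.sql import SparkSession

    # Hive support lets Spark see the tables created by the datagen job
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("show databases").show()
    # Substitute the table name reported by 'show tables' above
    spark.sql("describe formatted <HIVE_TABLE_NAME>").show(truncate=False)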