Instructions for setting up a Spark Cluster for the TextReuse ETL pipeline on CSC Rahti
- Create a project on CSC Rahti
- Add a `spark-credentials` secret with:
    - `username` for Spark
    - `password` for Spark and the Jupyter Lab login
    - `nbpassword` for Jupyter internally
- Install the OpenShift CLI and Helm on your local machine
- Create a `values.yaml` following `values-template.yaml`
- Log into the OpenShift project by getting the login command from Rahti
- Run `helm install spark-cluster all-spark`
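The steps above might look like the following sketch. The API URL, project name, and credential values are placeholders, and passing `values.yaml` with `-f` is an assumption based on the chart template mentioned above; use the exact login command that the Rahti web console gives you.

```shell
# Log in with the command copied from the Rahti web console ("Copy login command");
# the URL below is an assumption -- use the one Rahti shows you
oc login https://api.2.rahti.csc.fi:6443 --token=<token>
oc project <your-project>

# Create the spark-credentials secret with the three keys the chart expects
oc create secret generic spark-credentials \
  --from-literal=username=<spark-username> \
  --from-literal=password=<spark-and-jupyter-password> \
  --from-literal=nbpassword=<internal-jupyter-password>

# Deploy the chart, supplying the values.yaml created from values-template.yaml
helm install spark-cluster all-spark -f values.yaml
```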
Create an SSH key for GitHub in the persistent volume of the spark-notebook service. Then, in `values-template.yaml`, add the location of this SSH key so that it is added to the SSH configmap defined in `configmap.yaml`.
When the notebook pod starts up, run `mkdir ~/.ssh && cp /etc/ssh-config/config ~/.ssh/config` to copy the SSH configmap file to the correct location.
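A sketch of the key setup, run inside the spark-notebook pod. The key path under the persistent volume (`~/work/.ssh-keys/github`) is an assumption for illustration; use whatever path you reference in `values-template.yaml`.

```shell
# Generate a GitHub SSH key on the persistent volume so it survives pod restarts
# (the directory and filename are assumptions -- match them to values-template.yaml)
mkdir -p ~/work/.ssh-keys
ssh-keygen -t ed25519 -f ~/work/.ssh-keys/github -N "" -C "spark-notebook"

# Add the printed public key to GitHub (Settings -> SSH and GPG keys)
cat ~/work/.ssh-keys/github.pub

# After each pod start, copy the SSH config from the mounted configmap into place
mkdir -p ~/.ssh && cp /etc/ssh-config/config ~/.ssh/config
chmod 600 ~/.ssh/config
```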