Install spark cluster by using yarn as resource manager¶

In this tutorial, we set up an Apache Spark cluster with Yarn as cluster resource manager. You can also try to run the Spark application in cluster mode

1. Get the spark bin¶

You can get the latest stable version of spark from this site : https://spark.apache.org/downloads.html

In this tutorial, we choose version 3.4.1.

wget http://repolin.casd.fr/extra/spark/spark-x.x.x-bin-hadoop3.tgz/

tar -xzvf spark-x.x.x-bin-hadoop3 spark-x.x.x

nano /etc/profile.d/spark.sh

cd spark-x.x.x/conf
cp spark-defaults.conf.template spark-defaults.conf

nano spark-defaults.conf

hdfs_optimisation.md dfs -mkdir /spark-logs
hdfs_optimisation.md dfs -chmod 777 /spark-logs

./sbin/start-history-server.sh


# Now unzip the spark bin
tar -xzvf spark-3.4.1-bin-hadoop3.tgz

# create a folder in opt to store the spark bin
sudo mkdir -p /opt/spark
sudo mv spark-3.4.1-bin-hadoop3 /opt/spark/
sudo chown -R hadoop:root /opt/spark/
sudo mv spark-3.4.1-bin-hadoop3 spark-3.4.1

2. Configure the spark¶

2.1 Configure spark env var¶

As always, we recommend you to set up the env var of spark in /etc/profile.d/spark.sh. So all users can benefit this configuration

Below is a minimum example

# create a path init file
sudo vim /etc/profile.d/spark.sh

# put the following lines
# hadoop conf
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
# spark conf
export SPARK_HOME=/opt/spark/spark-3.4.1
export PATH=$PATH:$SPARK_HOME/bin

2.2 Configure spark default¶

The spark default config specifies all default parameter of a spark session. If user adds config parameter during the creation of spark session, the default configuration value will be overwritten.

This file is located at $SPARK_HOME/conf/spark-defaults.conf.

Below is an example of spark default for using yarn as cluster resource manager

spark.master yarn
spark.submit.deployMode client
spark.driver.memory 512m
spark.executor.memory 512m
spark.yarn.am.memory  1G
spark.eventLog.enabled true
spark.eventLog.dir hdfs_optimisation.md://10.50.5.67:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory  hdfs_optimisation.md://10.50.5.67:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080

3. Test the cluster¶

3.1 Test the cluster within the same host¶

You can use the below command to test the cluster. Here we suppose you run the below command in the same server where you run the spark and hdfs/yarn master.

# submit a job  
spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.4.1.jar 10

4. Install a spark/yarn client to connect to remote spark¶

To run spark job on a remote cluster, you need to have spark bin, conf and hadoop conf to do so. Below is the minimum config.

4.1 Set the hadoop client conf¶

If the user does not require to use hdfs client, we don't need to download the hadoop bin. We just create a folder to host the necessary conf for spark client to know how to connect to the remote spark/yarn cluster.

# create a folder as HADOOP_HOME
mkdir -p /opt/hadoop

# create a conf folder for hadoop
mkdir -p /opt/hadoop/etc/hadoop

# you need to put at least two conf file core-site.xml and yarn-site.xml
vim core-site.xml

Put the below content to the core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.50.5.67:9000</value>
    </property>
</configuration>

Put the below content to the yarn-site.xml

<?xml version="1.0"?>

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
       <name>yarn.resourcemanager.hostname</name>
       <value>10.50.5.67</value>
    </property>

</configuration>

Set up the env var for hadoop in /etc/profile.d/hadoop.sh

export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

4.2 Spark client¶

As you will submit jobs to spark/yarn cluster, you need the spark bin and proper config to do so.

Follow the below steps: - Download the spark bin. (follow the normal installation steps) - Reconfigure the spark-default

Below is an example of the spark-default.conf

spark.master yarn
spark.submit.deployMode client
spark.driver.memory 512m
spark.executor.memory 512m
spark.yarn.am.memory  1G
spark.eventLog.enabled true
spark.eventLog.dir hdfs_optimisation.md://10.50.5.67:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory  hdfs_optimisation.md://10.50.5.67:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080

Note we use hdfs to stores the spark application logs, and we can use the spark history server to show these logs.

# Go to the name node, and activate the spark history server
$SPARK_HOME/sbin/start-history-server.sh

# if you run the history server on windows
$SPARK_HOME/bin/spark-class.cmd org.apache.spark.deploy.history.HistoryServer

# stop the history server
$SPARK_HOME/sbin/stop-history-server.sh

# then use can view the history server web page
http://<ip-spark-master>:18080

Test your client

spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.4.1.jar 10

# if we use a particular queue, we need to add an option --queue
spark-submit --deploy-mode client --queue prod --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.4.1.jar 10

# we can add the 

spark-submit --deploy-mode client --queue prod --num-executors 2 --executor-memory 3g --executor-cores 2 --driver-memory 2g --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.4.1.jar 10

4.3 Test your client with your pyspark and jupyter¶

# 1. create a virtual env and install pyspark and jupyter notebook on it

# 2. open a jupyter notebook

# 3. copy the following code into the jupyter notebook
from pyspark.sql import SparkSession

spark=SparkSession.builder.master("yarn") \
                  .appName("spark_eda").getOrCreate()

df = spark.read.csv("hdfs://10.50.5.67:9000/user/rstudio/flights/airports.csv",header=True)

df.show(5)