Install hdfs(hadoop) on debian 11¶
The config of the hdfs cluster in this tutorial is the minimum config to run the cluster. It should be never used in a production environment.
0. Prepare the cluster.¶
In this tutorial, we will use a cluster of 3 server: - 10.50.6.103: spark-m01 - 10.50.6.104: spark-w01 - 10.50.6.105: spark-w02
0.1 Set server hostname¶
To set the hostname on each server, you can use below command
# the general form is
hostnamectl set-hostname <new-hostname>
# in our case
hostnamectl set-hostname spark-m01
# set the value for /etc/hosts
vim /etc/hosts
# remove the default host config if there are any, it may cause the datanode unable to connect to the namenode
127.0.0.1 spark-m01.casd.local spark-m01
# add the below lines
10.50.6.103 spark-m01.casd.local spark-m01
10.50.6.104 spark-w01.casd.local spark-w01
10.50.6.105 spark-w02.casd.local spark-w02
# test the hostname, run below command on each server
hostname
# test the connectivity, run below command on each server
ping spark-m01.casd.local
0.2 Create an account for hadoop related service on all three servers¶
# create a group called hadoop with gid 2001
sudo groupadd -g 2001 hadoop
# create a user account hadoop, the home folder is /opt/hadoop, the default shell is bash
sudo useradd -g hadoop -u 998 -r hadoop -m -d /opt/hadoop --shell /bin/bash
# login as hadoop
sudo su -l hadoop
0.3 Synchronize server time.¶
If you already have a ntp server in your organization, use it. Don't install your own, because if your cluster must
receive queries from other server which use another ntp, the time cap will create many problems.
For example, spark submit on my cluster always receive a timeout, because of this.
You can follow this tutorial Sync_time_in_debian.md to sync time.
1. Prepare ssh connexions¶
For the name node, generate an ssh key
# generate a ssh key pair
ssh-keygen -t rsa
# add the public key as the authorized key
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
For the data node, copy the ssh public key as the authorized key
cd
mkdir .ssh
vim ~/.ssh/authorized_keys
Test the ssh access
# test locally on name node
ssh localhost
# test on the two data node
ssh hadoop@spark-w01
ssh hadoop@spark-w02
2. Install and setup java¶
You need to install the jre and jdk version which the hadoop requires.
In this tutorial, we choose hadoop 3.3.6
Install the package¶
sudo apt update
sudo apt upgrade
# install jre
sudo apt install default-jre
# install jdk
sudo apt install default-jdk
# test jre
java -version
# test jdk
javac -version
Configure java¶
If you have multiple java version, you need to set up the currently used java version as the hadoop version that requires.
# choose the right jre version
sudo update-alternatives --config java
# choose the right jdk version
sudo update-alternatives --config javac
Set JAVA_HOME path to the currently used jdk path. There are many ways to set the JAVA_HOME: - In your own ~/.bashrc : the java config only works for you - In /etc/profile.d/java.sh: The java config works for all users
We choose the second solution
# create a file to the export
sudo vim /etc/profile.d/java.sh
# add below content, you may need to change the jdk path
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
# this is optional, because the install or update-alternatives already add the java bin path to the PATH
export PATH=$PATH:$JAVA_HOME/bin
# add the env var
source /etc/profile.d/java.sh
# test the value
echo $JAVA_HOME
3. Download the hadoop binary and config¶
You can get the latest hadoop release from this page. https://hadoop.apache.org/releases.html
In our case, we choose version 3.3.6.
# Unzip the tar ball to `/opt/hadoop`
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /opt/hadoop/ --strip-components=1
# make sure the owner of the unzip files are hadoop:hadoop
sudo chown -R hadoop:hadoop /opt/hadoop/
Configure the HADOOP_HOME:
Run the below command on the three servers
# create a file to the export
sudo vim /etc/profile.d/hadoop.sh
# add below lines
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# Reload the environment variables
source /etc/profile.d/hadoop.sh
4. Configure the hadoop cluster¶
As we know, the hadoop cluster has three parts: - hdfs - yarn - mapreduce
Hadoop can be configured to run in a single node or multi-node cluster. In this tutorial, we only shows the multi
node cluster.
All the configuration files which we edit below are already existed. We only need to add few lines.
All Hadoop processes run on a Java Virtual Machine (JVM).
4.1 Configure hdfs¶
There is a doc on how to configure hdfs written by cloudera https://docs.cloudera.com/runtime/7.2.0/hdfs-overview/topics/hdfs-sizing-namenode-heap-memory.html
hadoop-env.sh¶
This file is used to set various environment variables and options for Hadoop's execution.
Some important conf options: - JAVA_HOME: You can specify the path to the Java home directory if you want to use a specific Java installation.
-
HADOOP_OPTS: Additional Java options to be applied to all Hadoop commands.
-
HADOOP_NAMENODE_OPTS, HADOOP_DATANODE_OPTS, and similar variables: These allow you to set JVM options specific to different Hadoop daemon processes like the NameNode, DataNode, etc.
-
HADOOP_HEAPSIZE_MAX/MIN: Sets the maximum/minimum heap size for the JVM, usually in MB.
-
HADOOP_LOG_DIR and HADOOP_MAPRED_LOG_DIR: Define the directories where log files are stored.
-
HADOOP_CLASSPATH: Allows you to specify additional classpath elements if needed.
-
PATH: It adds Hadoop's bin/ directory to the system PATH.
vim etc/hadoop/hadoop-env.sh
# In our case, we only add below lines
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HEAPSIZE=4096 # Set heap size to 4 GB for both NameNode and DataNode
# Set JVM options for NameNode
export HADOOP_NAMENODE_OPTS="-Xmx2048m -Xms2048m" # Set initial and maximum heap size for NameNode
# Set JVM options for DataNode
export HADOOP_DATANODE_OPTS="-Xmx1024m -Xms1024m" # Set initial and maximum heap size for DataNode
Xmx is the max heap memory, Xms is the initial head size
4.2 core-site.xml¶
# edit conf file
vim etc/hadoop/core-site.xml
# add the following lines
<property>
<name>fs.defaultFS</name>
<value>hdfs://10.50.5.67:9000</value>
</property>
4.3 hdfs-site.xml¶
-
dfs.datanode.data.dir/dfs.data.dir(old version) : Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
# edit the conf file
vim etc/hadoop/hdfs-site.xml.template
# add the follwoing lines
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hadoop_tmp/hdfs/data</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/hadoop_tmp/hdfs/data</value>
</property>
# create the data folder if they do not exit yet
/opt/hadoop/hadoop_tmp/hdfs/data
4.4 yarn-site.xml¶
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>10.50.5.67</value>
</property>
</configuration>
4.5 mapred-site.xml¶
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>10.50.5.67:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
5. Create master and worker files¶
The master file is used by startup scripts of the hadoop cluster to identify the name node address. It's located
at ~/hadoop/etc/hadoop/masters
# we put the name node ip address
10.50.5.67
The workers file is used by startup scripts of the hadoop cluster to identify the data node address. It's located
at ~/hadoop/etc/hadoop/workers
# we put the data node ip address
10.50.5.68
10.50.5.69
6. start/stop the cluster¶
The hadoop provides a list of script which allows us to start and stop the cluster easily. You can find these scripts
in /opt/hadoop/sbin
# start the cluster
./sbin/start-dfs.sh
# you should see below outputs
Starting namenodes on [k8s-master]
Starting datanodes
Starting secondary namenodes [k8s-master]
# you can check the running java process
jps
# in master node, you should see
138517 NameNode
138757 SecondaryNameNode
143035 Jps
# in data node, you should see
88904 DataNode
91423 Jps
7. test the cluster¶
You can open the web UI via http://10.50.5.67:9870/
# some usefull command
hdfs dfs -mkdir -p /dataset
# copy files
hdfs dfs -put localfile /dataset/remotefile
# list files
hdfs dfs -ls /dataset
# list files recursively
hdfs dfs -ls -r /dataset
8. Optimize the yarn configuration¶
With the above config, we can run the hdfs. To run mapreduce with yarn, we need to add extra lines to two
configuration files: yarn-site.xml and mapred-site.xml
8.1 Enriched yarn-site.xml¶
You can find the official doc on yarn config here https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>10.50.5.67</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
</configuration>
The following settings are set:
- yarn.resourcemanager.hostname: The hostname for the ResourceManager
- yarn.nodemanager.resource.cpu-vcores: This conf used by the ResourceManager to limit the number of vcores that can allocated for containers. Does not limit the number of CPUs used by Yarn containers
- yarn.nodemanager.aux-services: a list of services to use. In this case, use the MapReduce shuffle service
- yarn.nodemanager.resource.memory-mb: amount of physical memory in MB that can be allocated for containers
- yarn.scheduler.maximum-allocation-mb: The maximum allocation for every container request at the ResourceManager in MB
- yarn.scheduler.minimum-allocation-mb: The minium allocation for every container request at the ResourceManager in MB
- yarn.log-aggregation-enable: log aggregation collects the containers logs and allows us to inspect them while developing/debugging.
- yarn.nodemanager.vmem-check-enabled: verify if the worker has the physical memory capacity to allow the conf works
- yarn.resourcemanager.scheduler.class: Use the capacity scheduler to replace the default scheduler. We will give more details on how to use the capacity scheduler to define queues and capacity of each queue(e.g. CPU, MEM, etc.)
This conf needs to be set on each worker node, the conf such as yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores configure only the available memory and vcpu on the worker node, if we change it only on the master node, the worker node will have 0 effect.
8.2 Enriched mapred-site.xml¶
Normally the value of HADOOP_MAPRED_HOME is the same as HADOOP_HOME
Here 10.50.5.67 is the ip address of the name node
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>10.50.5.67:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
</property>
</configuration>
8.3 Capacity Scheduler¶
Yarn provides a pluggable scheduler for Hadoop which allows for multiple-tenants to securely share a large cluster such that their applications are allocated resources in a timely manner under constraints of allocated capacities.
You can find the full doc here. https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues
Below is a minimum configuration example, which only has 1 queue called default. The default queue can use 100 percent of the resource of the cluster.
<configuration>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
</configuration>
For a more complex config, below is an example. It has two queues: prod(80%) and dev(20%). Each queue has its own job acl and admin acl
<configuration>
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
<description>
Maximum number of applications that can be pending and running.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
<description>
The ResourceCalculator implementation to be used to compare
Resources in the scheduler.
The default i.e. DefaultResourceCalculator only uses Memory while
DominantResourceCalculator uses dominant-resource to compare
multi-dimensional resources such as Memory, CPU etc.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>prod,dev</value>
<description>
The queues at the this level (root is the root queue).
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.capacity</name>
<value>80</value>
<description>prod queue target capacity.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.capacity</name>
<value>20</value>
<description>dev queue target capacity.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.user-limit-factor</name>
<value>1</value>
<description>
Prod queue user limit a percentage from 0.0 to 1.0.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.maximum-capacity</name>
<value>80</value>
<description>
The maximum capacity of the prod queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.user-limit-factor</name>
<value>1</value>
<description>
Dev queue user limit a percentage from 0.0 to 1.0.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
<value>20</value>
<description>
The maximum capacity of the dev queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.state</name>
<value>RUNNING</value>
<description>
The state of the prod queue. State can be one of RUNNING or STOPPED.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.state</name>
<value>RUNNING</value>
<description>
The state of the dev queue. State can be one of RUNNING or STOPPED.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.acl_submit_applications</name>
<value>*</value>
<description>
The ACL of who can submit jobs to the prod queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.acl_administer_queue</name>
<value>*</value>
<description>
The ACL of who can administer jobs on the prod queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.acl_application_max_priority</name>
<value>*</value>
<description>
The ACL of who can submit applications with configured priority.
For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.maximum-application-lifetime
</name>
<value>-1</value>
<description>
Maximum lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
This will be a hard time limit for all applications in this
queue. If positive value is configured then any application submitted
to this queue will be killed after exceeds the configured lifetime.
User can also specify lifetime per application basis in
application submission context. But user lifetime will be
overridden if it exceeds queue maximum lifetime. It is point-in-time
configuration.
Note : Configuring too low value will result in killing application
sooner. This feature is applicable only for leaf queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.prod.default-application-lifetime
</name>
<value>-1</value>
<description>
Default lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
If the user has not submitted application with lifetime value then this
value will be taken. It is point-in-time configuration.
Note : Default lifetime can't exceed maximum lifetime. This feature is
applicable only for leaf queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.acl_submit_applications</name>
<value>*</value>
<description>
The ACL of who can submit jobs to the dev queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.acl_administer_queue</name>
<value>*</value>
<description>
The ACL of who can administer jobs on the dev queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.acl_application_max_priority</name>
<value>*</value>
<description>
The ACL of who can submit applications with configured priority.
For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.maximum-application-lifetime
</name>
<value>-1</value>
<description>
Maximum lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
This will be a hard time limit for all applications in this
queue. If positive value is configured then any application submitted
to this queue will be killed after exceeds the configured lifetime.
User can also specify lifetime per application basis in
application submission context. But user lifetime will be
overridden if it exceeds queue maximum lifetime. It is point-in-time
configuration.
Note : Configuring too low value will result in killing application
sooner. This feature is applicable only for leaf queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.dev.default-application-lifetime
</name>
<value>-1</value>
<description>
Default lifetime of an application which is submitted to a queue
in seconds. Any value less than or equal to zero will be considered as
disabled.
If the user has not submitted application with lifetime value then this
value will be taken. It is point-in-time configuration.
Note : Default lifetime can't exceed maximum lifetime. This feature is
applicable only for leaf queue.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
<description>
Number of missed scheduling opportunities after which the CapacityScheduler
attempts to schedule rack-local containers.
When setting this parameter, the size of the cluster should be taken into account.
We use 40 as the default value, which is approximately the number of nodes in one rack.
Note, if this value is -1, the locality constraint in the container request
will be ignored, which disables the delay scheduling.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
<value>-1</value>
<description>
Number of additional missed scheduling opportunities over the node-locality-delay
ones, after which the CapacityScheduler attempts to schedule off-switch containers,
instead of rack-local ones.
Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
after 40+20=60 missed opportunities.
When setting this parameter, the size of the cluster should be taken into account.
We use -1 as the default value, which disables this feature. In this case, the number
of missed opportunities for assigning off-switch containers is calculated based on
the number of containers and unique locations specified in the resource request,
as well as the size of the cluster.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value></value>
<description>
A list of mappings that will be used to assign jobs to queues
The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
Typically this list will be used to map users to queues,
for example, u:%user:%user maps all users to queues with the same name
as the user.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
<description>
If a queue mapping is present, will it override the value specified
by the user? This can be used by administrators to place jobs in queues
that are different than the one specified by the user.
The default is false.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
<value>1</value>
<description>
Controls the number of OFF_SWITCH assignments allowed
during a node's heartbeat. Increasing this value can improve
scheduling rate for OFF_SWITCH containers. Lower values reduce
"clumping" of applications on particular nodes. The default is 1.
Legal values are 1-MAX_INT. This config is refreshable.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.application.fail-fast</name>
<value>false</value>
<description>
Whether RM should fail during recovery if previous applications'
queue is no longer valid.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.workflow-priority-mappings</name>
<value></value>
<description>
A list of mappings that will be used to override application priority.
The syntax for this list is
[workflowId]:[full_queue_name]:[priority][,next mapping]*
where an application submitted (or mapped to) queue "full_queue_name"
and workflowId "workflowId" (as specified in application submission
context) will be given priority "priority".
</description>
</property>
<property>
<name>yarn.scheduler.capacity.workflow-priority-mappings-override.enable</name>
<value>false</value>
<description>
If a priority mapping is present, will it override the value specified
by the user? This can be used by administrators to give applications a
priority that is different than the one specified by the user.
The default is false.
</description>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>1</value>
<description>
This option configures the Configured Max Application Master Limit and all the related Max Application master
resources Per user.
</description>
</property>
</configuration>
# to check existing queue
hadoop queue -list
8.4 Start the yarn service¶
./sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
# you should see below process which runs in your name node
jps
24737 SecondaryNameNode
24486 NameNode
25318 ResourceManager
26806 Jps
You can access the web interface via http://<name-node>:8088/cluster
In our case, it will be /cluster
9. Useful url¶
- hdfs UI: (http://10.50.5.67:9870): monitor hdfs cluster status, storage usage and data lists
- Yarn UI: (http://10.50.5.67:8088): monitor cluster total and used resource for running jobs, job list and different job queues
- spark UI (http://10.50.5.67:8088/proxy/
): monitor current running spark jobs progress and resource utilisation - spark history server : (http://10.50.5.67:18080): Monitor the stats of terminated spark jobs