Install HDFS (Hadoop) on Debian 11¶

The hdfs cluster config in this tutorial is the minimum needed to run the cluster. It should never be used in a production environment.

0. Prepare the cluster.¶

In this tutorial, we use a cluster of 3 servers:
- 10.50.6.103: spark-m01
- 10.50.6.104: spark-w01
- 10.50.6.105: spark-w02

0.1 Set server hostname¶

To set the hostname on each server, use the commands below.

# the general form is 
hostnamectl set-hostname <new-hostname>

# in our case
hostnamectl set-hostname spark-m01

# edit /etc/hosts
vim /etc/hosts
# remove the default entry that maps the hostname to 127.0.0.1, if there is one; it can prevent the datanodes from connecting to the namenode
# for example, remove a line such as: 127.0.0.1 spark-m01.casd.local  spark-m01

# add the below lines
10.50.6.103     spark-m01.casd.local    spark-m01
10.50.6.104     spark-w01.casd.local    spark-w01
10.50.6.105     spark-w02.casd.local    spark-w02

# test the hostname, run below command on each server
hostname 

# test the connectivity, run below command on each server
ping spark-m01.casd.local
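
The loopback entry mentioned in the comments above is easy to overlook. As a minimal sketch, the helper below lists the lines of a hosts file that map a given hostname to a loopback address. The function name `find_loopback_mapping` is hypothetical and not part of any tool.

```shell
# Print every line of a hosts file that maps the given hostname
# (short name or fully qualified name) to a loopback address.
# Usage: find_loopback_mapping <hosts-file> <hostname>
find_loopback_mapping() {
  hosts_file="$1"
  name="$2"
  awk -v name="$name" '
    /^[[:space:]]*#/ { next }                 # skip comment lines
    $1 ~ /^127\./ || $1 == "::1" {            # loopback addresses only
      for (i = 2; i <= NF; i++) {
        # match "spark-m01" itself or any "spark-m01.<domain>" alias
        if ($i == name || index($i, name ".") == 1) { print; break }
      }
    }
  ' "$hosts_file"
}
```

Running `find_loopback_mapping /etc/hosts spark-m01` on the master should print nothing once the file has been cleaned up.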

0.2 Create the hadoop user and group¶

# create a group called hadoop with gid 2001
sudo groupadd -g 2001 hadoop

# create a user account hadoop, the home folder is /opt/hadoop, the default shell is bash
sudo useradd -r -u 998 -g hadoop -m -d /opt/hadoop --shell /bin/bash hadoop

# login as hadoop
sudo su -l hadoop

0.3 Synchronize server time.¶

If your organization already has an NTP server, use it instead of installing your own. If the cluster receives requests from servers that synchronize with a different NTP source, the time gap between them causes many problems. For example, spark submit on my cluster kept timing out for this reason.

You can follow this tutorial Sync_time_in_debian.md to sync time.

1. Prepare ssh connections¶

For the name node, generate an ssh key

# generate an ssh key pair
ssh-keygen -t rsa

# append the public key to the authorized keys (">>" keeps existing entries)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

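For the data nodes, add the name node's ssh public key to the authorized keys

mkdir -p ~/.ssh
chmod 700 ~/.ssh
# paste the content of the name node's ~/.ssh/id_rsa.pub into this file
vim ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys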

Test the ssh access

# test locally on name node
ssh localhost

# test on the two data nodes
ssh hadoop@spark-w01
ssh hadoop@spark-w02
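
The manual checks above can also be scripted. The sketch below tries a non-interactive login on each host and reports the ones that fail. The function name `check_ssh_access` and the `SSH` variable are hypothetical helpers. Only the `BatchMode` and `ConnectTimeout` options come from ssh itself.

```shell
# Try a non-interactive ssh login as the hadoop user on every host given as argument.
# SSH is a hypothetical hook to swap the ssh binary (it defaults to "ssh").
check_ssh_access() {
  failed=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password
    if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "hadoop@${host}" true; then
      echo "ok: ${host}"
    else
      echo "FAILED: ${host}"
      failed=1
    fi
  done
  return "$failed"
}
```

Run it on the name node as `check_ssh_access spark-m01 spark-w01 spark-w02`, after the manual test so that the host keys are already accepted. In batch mode ssh refuses hosts whose key is still unknown.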

2. Install and setup java¶

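Install the JRE and JDK version that your hadoop release requires.

In this tutorial, we choose hadoop 3.3.6, which runs on Java 8 or Java 11. On Debian 11, the default-jre and default-jdk packages provide OpenJDK 11.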

Install the package¶

sudo apt update
sudo apt upgrade

# install jre
sudo apt install default-jre

# install jdk
sudo apt install default-jdk

# test jre
java -version

# test jdk
javac -version

Configure java¶

If several java versions are installed, select the one that hadoop requires as the active version.

# choose the right jre version
sudo update-alternatives --config java

# choose the right jdk version
sudo update-alternatives --config javac

Set JAVA_HOME to the path of the active jdk. There are several ways to set JAVA_HOME:
- In your own ~/.bashrc: the java config only works for you
- In /etc/profile.d/java.sh: the java config works for all users

We choose the second solution

# create a file for the exports
sudo vim /etc/profile.d/java.sh

# add the content below, you may need to change the jdk path
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
# this is optional, because the install or update-alternatives already puts the java binaries on the PATH
export PATH=$PATH:$JAVA_HOME/bin

# load the env var into the current shell
source /etc/profile.d/java.sh

# test the value
echo $JAVA_HOME
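
If you are not sure which jdk path to export, it can be derived from the active java binary. This is a minimal sketch that assumes the usual layout where the binary lives in `<jdk>/bin/java`. The function name `java_home_from` is hypothetical.

```shell
# Resolve the symlink chain behind a java binary and strip the trailing /bin/java.
# Usage: java_home_from <path-to-java-binary>
java_home_from() {
  real_bin="$(readlink -f "$1")"
  dirname "$(dirname "$real_bin")"
}

# Example: java_home_from "$(command -v java)"
```

On Debian, `/usr/bin/java` is a symlink managed by update-alternatives, so the result follows whatever version was selected with `update-alternatives --config java`.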

3. Download the hadoop binary and config¶

You can get the latest hadoop release from this page: https://hadoop.apache.org/releases.html

In our case, we choose version 3.3.6.
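
The download itself is not shown in this tutorial. As a sketch, assuming you fetch the tarball from the Apache archive (the URL in the comment is my assumption of a suitable mirror), you can compare its SHA-512 digest with the value published next to the release before extracting it. The function name `verify_sha512` is hypothetical.

```shell
# Assumed download location (the Apache archive keeps past releases):
# wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

# Return 0 when the SHA-512 digest of a file equals the expected hex string.
# Usage: verify_sha512 <file> <expected-hex-digest>
verify_sha512() {
  actual="$(sha512sum "$1" | awk '{ print $1 }')"
  # published digests are sometimes upper case, so normalise before comparing
  expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
  [ "$actual" = "$expected" ]
}
```

A typical call is `verify_sha512 hadoop-3.3.6.tar.gz <digest-from-the-release-page> && echo "checksum ok"`.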

# extract the tarball into `/opt/hadoop`
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /opt/hadoop/ --strip-components=1

# make sure the extracted files are owned by hadoop:hadoop
sudo chown -R hadoop:hadoop /opt/hadoop/

Configure the HADOOP_HOME:

Run the commands below on all three servers

# create a file for the exports
sudo vim /etc/profile.d/hadoop.sh

# add the lines below
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

# Reload the environment variables
source /etc/profile.d/hadoop.sh

4. Configure the hadoop cluster¶

A hadoop cluster has three parts:
- hdfs
- yarn
- mapreduce

Hadoop can be configured to run as a single-node or a multi-node cluster. In this tutorial, we only show the multi-node cluster.

All the configuration files that we edit below already exist. We only need to add a few lines.

All Hadoop processes run on a Java Virtual Machine (JVM).

4.1 Configure hdfs¶

There is a doc on how to configure hdfs written by cloudera https://docs.cloudera.com/runtime/7.2.0/hdfs-overview/topics/hdfs-sizing-namenode-heap-memory.html

hadoop-env.sh¶

This file sets the environment variables and options used when Hadoop runs.

Some important conf options:
- JAVA_HOME: the path to the Java home directory, if you want to use a specific Java installation.
- HADOOP_OPTS: additional Java options applied to all Hadoop commands.
- HDFS_NAMENODE_OPTS, HDFS_DATANODE_OPTS, and similar variables (named HADOOP_NAMENODE_OPTS, HADOOP_DATANODE_OPTS, etc. before Hadoop 3): JVM options specific to individual Hadoop daemon processes such as the NameNode and the DataNode.
- HADOOP_HEAPSIZE_MAX/MIN: the maximum/minimum heap size for the JVM, usually in MB.
- HADOOP_LOG_DIR and HADOOP_MAPRED_LOG_DIR: the directories where log files are stored.
- HADOOP_CLASSPATH: additional classpath elements, if needed.
- PATH: adds Hadoop's bin/ directory to the system PATH.

vim etc/hadoop/hadoop-env.sh

# In our case, we only add the lines below
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# default maximum heap size (4 GB, value in MB) for the Hadoop JVMs
# (HADOOP_HEAPSIZE is the deprecated Hadoop 2 name of this variable)
export HADOOP_HEAPSIZE_MAX=4096

# JVM options for the NameNode: initial and maximum heap size of 2 GB
# (HDFS_NAMENODE_OPTS replaces the deprecated HADOOP_NAMENODE_OPTS in Hadoop 3)
export HDFS_NAMENODE_OPTS="-Xmx2048m -Xms2048m"

# JVM options for the DataNode: initial and maximum heap size of 1 GB
export HDFS_DATANODE_OPTS="-Xmx1024m -Xms1024m"

-Xmx sets the maximum heap size, -Xms sets the initial heap size.

4.2 core-site.xml¶

# edit the conf file
vim etc/hadoop/core-site.xml

# add the following lines inside the <configuration> element
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.50.5.67:9000</value>
    </property>

4.3 hdfs-site.xml¶

dfs.datanode.data.dir (dfs.data.dir in older versions): determines where on the local filesystem a DFS data node stores its blocks. If this is a comma-delimited list of directories, data is stored in all named directories, typically on different devices. Directories that do not exist are ignored.

# edit the conf file
vim etc/hadoop/hdfs-site.xml

# add the following lines inside the <configuration> element
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hadoop/hadoop_tmp/hdfs/data</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hadoop/hadoop_tmp/hdfs/data</value>
    </property>

# create the data folder if it does not exist yet
mkdir -p /opt/hadoop/hadoop_tmp/hdfs/data

4.4 yarn-site.xml¶

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
       <name>yarn.resourcemanager.hostname</name>
       <value>10.50.5.67</value>
    </property>
</configuration>

4.5 mapred-site.xml¶

<configuration>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>10.50.5.67:54311</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

5. Create master and worker files¶

The masters file is used by the startup scripts of the hadoop cluster to identify the name node address. It is located at /opt/hadoop/etc/hadoop/masters

# we put the name node ip address
10.50.5.67

The workers file is used by the startup scripts of the hadoop cluster to identify the data node addresses. It is located at /opt/hadoop/etc/hadoop/workers

# we put the data node ip address
10.50.5.68
10.50.5.69

6. Start/stop the cluster¶

Hadoop provides a set of scripts to start and stop the cluster easily. You can find them in /opt/hadoop/sbin: start-dfs.sh starts hdfs and stop-dfs.sh stops it.

# start the cluster
./sbin/start-dfs.sh

# you should see the output below
Starting namenodes on [k8s-master]
Starting datanodes
Starting secondary namenodes [k8s-master]

# you can check the running java process
jps

# on the master node, you should see
138517 NameNode
138757 SecondaryNameNode
143035 Jps

# on a data node, you should see
88904 DataNode
91423 Jps
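
The `jps` check above can be automated. The sketch below takes the text printed by `jps` and reports which of the expected daemons are absent. The function name `missing_daemons` is hypothetical.

```shell
# Print the name of every expected daemon that does not appear in a jps listing.
# Usage: missing_daemons "<jps output>" <daemon-name>...
missing_daemons() {
  jps_output="$1"
  shift
  for daemon in "$@"; do
    # jps prints "<pid> <class name>", so compare the second column exactly
    printf '%s\n' "$jps_output" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}

# Example on the master node: missing_daemons "$(jps)" NameNode SecondaryNameNode
# Example on a data node:     missing_daemons "$(jps)" DataNode
```

An empty result means that all the expected processes are running.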

7. Test the cluster¶

You can open the web UI via http://10.50.5.67:9870/

# some useful commands: create a directory
hdfs dfs -mkdir -p /dataset

# copy files
hdfs dfs -put localfile /dataset/remotefile

# list files
hdfs dfs -ls /dataset

# list files recursively (-R is recursive, lowercase -r only reverses the sort order)
hdfs dfs -ls -R /dataset

8. Optimize the yarn configuration¶

With the above config, we can run the hdfs. To run mapreduce with yarn, we need to add extra lines to two configuration files: yarn-site.xml and mapred-site.xml

8.1 Enriched yarn-site.xml¶

You can find the official doc on yarn config here https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>10.50.5.67</value>
    </property>

    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>2</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>

    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>true</value>
    </property>

    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
</configuration>

The following settings are set:

- yarn.resourcemanager.hostname: the hostname of the ResourceManager
- yarn.nodemanager.resource.cpu-vcores: the number of vcores that can be allocated for containers. The ResourceManager uses it when allocating resources. It does not limit the number of CPUs used by Yarn containers
- yarn.nodemanager.aux-services: a list of services to use. In this case, the MapReduce shuffle service
- yarn.nodemanager.resource.memory-mb: the amount of physical memory in MB that can be allocated for containers
- yarn.scheduler.maximum-allocation-mb: the maximum allocation for every container request at the ResourceManager in MB
- yarn.scheduler.minimum-allocation-mb: the minimum allocation for every container request at the ResourceManager in MB
- yarn.log-aggregation-enable: log aggregation collects the container logs and allows us to inspect them while developing/debugging
- yarn.nodemanager.vmem-check-enabled: whether virtual memory limits are enforced for containers
- yarn.resourcemanager.scheduler.class: the scheduler implementation. We use the capacity scheduler, which is also the default in Apache Hadoop. Section 8.3 gives more details on how to use it to define queues and the capacity of each queue (e.g. CPU, MEM, etc.)

This conf must be set on each worker node. Settings such as yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores describe the memory and vcores available on the node that reads them, so changing them only on the master node has no effect on the worker nodes.
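
One way to keep the workers in line with the master is to copy the two files to every host listed in the workers file. This is a minimal sketch, assuming the passwordless ssh access set up in section 1. The function name `sync_conf` and the `COPY` variable are hypothetical helpers.

```shell
# Copy yarn-site.xml and mapred-site.xml to every host listed in a workers file.
# COPY is a hypothetical hook to swap the copy command (it defaults to "scp").
# Usage: sync_conf <conf-dir> <workers-file>
sync_conf() {
  conf_dir="$1"
  workers_file="$2"
  # skip blank lines and comment lines in the workers file
  grep -Ev '^[[:space:]]*(#|$)' "$workers_file" | while read -r worker; do
    # stdin is redirected so the copy command cannot swallow the worker list
    ${COPY:-scp} "$conf_dir/yarn-site.xml" "$conf_dir/mapred-site.xml" \
      "hadoop@${worker}:${conf_dir}/" < /dev/null
  done
}

# Example: sync_conf /opt/hadoop/etc/hadoop /opt/hadoop/etc/hadoop/workers
```

Setting `COPY=echo` prints the copy commands instead of running them, which is a cheap way to review what would be sent.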

8.2 Enriched mapred-site.xml¶

Normally the value of HADOOP_MAPRED_HOME is the same as HADOOP_HOME

Here 10.50.5.67 is the ip address of the name node

<configuration>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>10.50.5.67:54311</value>
    </property>

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
    </property>

    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
    </property>

    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>256</value>
    </property>

    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
    </property>

    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
    </property>

    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
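
To see how these memory values interact with yarn-site.xml, the arithmetic can be sketched as below. It assumes, as I understand the scheduler's behaviour, that each container request is rounded up to a multiple of yarn.scheduler.minimum-allocation-mb. The function names are hypothetical.

```shell
# Round a memory request (MB) up to the next multiple of the minimum allocation (MB).
# Usage: round_up_mb <request-mb> <minimum-allocation-mb>
round_up_mb() {
  echo $(( ( ($1 + $2 - 1) / $2 ) * $2 ))
}

# Number of containers of a given request size that fit in one NodeManager.
# Usage: containers_per_node <node-memory-mb> <request-mb> <minimum-allocation-mb>
containers_per_node() {
  echo $(( $1 / $(round_up_mb "$2" "$3") ))
}

round_up_mb 256 512               # prints 512: a 256 MB map task occupies a 512 MB container
containers_per_node 2048 256 512  # prints 4: containers per worker with memory-mb=2048
```

Under that assumption, setting mapreduce.map.memory.mb below the minimum allocation does not pack more tasks onto a worker. The minimum allocation would have to be lowered as well.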

8.3 Capacity Scheduler¶

The CapacityScheduler is a pluggable scheduler for Yarn that allows multiple tenants to securely share a large cluster, so that their applications are allocated resources in a timely manner within the limits of their allocated capacities.

You can find the full doc here. https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues

Below is a minimal configuration example with a single queue called default, which can use 100 percent of the cluster resources. The queue settings go in etc/hadoop/capacity-scheduler.xml.

<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>
</configuration>

Below is a more complex example. It has two queues: prod (80%) and dev (20%). Each queue has its own job ACL and admin ACL.

<configuration>

  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>
      Maximum number of applications that can be pending and running.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run 
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    <description>
      The ResourceCalculator implementation to be used to compare 
      Resources in the scheduler.
      The default i.e. DefaultResourceCalculator only uses Memory while
      DominantResourceCalculator uses dominant-resource to compare 
      multi-dimensional resources such as Memory, CPU etc.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
    <description>
      The queues at this level (root is the root queue).
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>80</value>
    <description>prod queue target capacity.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>20</value>
    <description>dev queue target capacity.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.prod.user-limit-factor</name>
    <value>1</value>
    <description>
       Prod queue user limit a percentage from 0.0 to 1.0.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.prod.maximum-capacity</name>
    <value>80</value>
    <description>
      The maximum capacity of the prod queue. 
    </description>
  </property>


  <property>
    <name>yarn.scheduler.capacity.root.dev.user-limit-factor</name>
    <value>1</value>
    <description>
       Dev queue user limit a percentage from 0.0 to 1.0.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>20</value>
    <description>
      The maximum capacity of the dev queue. 
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.prod.state</name>
    <value>RUNNING</value>
    <description>
      The state of the prod queue. State can be one of RUNNING or STOPPED.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.dev.state</name>
    <value>RUNNING</value>
    <description>
      The state of the dev queue. State can be one of RUNNING or STOPPED.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.prod.acl_submit_applications</name>
    <value>*</value>
    <description>
      The ACL of who can submit jobs to the prod queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.prod.acl_administer_queue</name>
    <value>*</value>
    <description>
      The ACL of who can administer jobs on the prod queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.prod.acl_application_max_priority</name>
    <value>*</value>
    <description>
      The ACL of who can submit applications with configured priority.
      For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
    </description>
  </property>

   <property>
     <name>yarn.scheduler.capacity.root.prod.maximum-application-lifetime</name>
     <value>-1</value>
     <description>
        Maximum lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        This will be a hard time limit for all applications in this
        queue. If positive value is configured then any application submitted
        to this queue will be killed after exceeds the configured lifetime.
        User can also specify lifetime per application basis in
        application submission context. But user lifetime will be
        overridden if it exceeds queue maximum lifetime. It is point-in-time
        configuration.
        Note : Configuring too low value will result in killing application
        sooner. This feature is applicable only for leaf queue.
     </description>
   </property>

   <property>
     <name>yarn.scheduler.capacity.root.prod.default-application-lifetime</name>
     <value>-1</value>
     <description>
        Default lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        If the user has not submitted application with lifetime value then this
        value will be taken. It is point-in-time configuration.
        Note : Default lifetime can't exceed maximum lifetime. This feature is
        applicable only for leaf queue.
     </description>
   </property>

  <property>
    <name>yarn.scheduler.capacity.root.dev.acl_submit_applications</name>
    <value>*</value>
    <description>
      The ACL of who can submit jobs to the dev queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.dev.acl_administer_queue</name>
    <value>*</value>
    <description>
      The ACL of who can administer jobs on the dev queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.dev.acl_application_max_priority</name>
    <value>*</value>
    <description>
      The ACL of who can submit applications with configured priority.
      For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
    </description>
  </property>

   <property>
     <name>yarn.scheduler.capacity.root.dev.maximum-application-lifetime</name>
     <value>-1</value>
     <description>
        Maximum lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        This will be a hard time limit for all applications in this
        queue. If positive value is configured then any application submitted
        to this queue will be killed after exceeds the configured lifetime.
        User can also specify lifetime per application basis in
        application submission context. But user lifetime will be
        overridden if it exceeds queue maximum lifetime. It is point-in-time
        configuration.
        Note : Configuring too low value will result in killing application
        sooner. This feature is applicable only for leaf queue.
     </description>
   </property>

   <property>
     <name>yarn.scheduler.capacity.root.dev.default-application-lifetime</name>
     <value>-1</value>
     <description>
        Default lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        If the user has not submitted application with lifetime value then this
        value will be taken. It is point-in-time configuration.
        Note : Default lifetime can't exceed maximum lifetime. This feature is
        applicable only for leaf queue.
     </description>
   </property>

  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>
      Number of missed scheduling opportunities after which the CapacityScheduler 
      attempts to schedule rack-local containers.
      When setting this parameter, the size of the cluster should be taken into account.
      We use 40 as the default value, which is approximately the number of nodes in one rack.
      Note, if this value is -1, the locality constraint in the container request
      will be ignored, which disables the delay scheduling.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
    <value>-1</value>
    <description>
      Number of additional missed scheduling opportunities over the node-locality-delay
      ones, after which the CapacityScheduler attempts to schedule off-switch containers,
      instead of rack-local ones.
      Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
      attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
      after 40+20=60 missed opportunities.
      When setting this parameter, the size of the cluster should be taken into account.
      We use -1 as the default value, which disables this feature. In this case, the number
      of missed opportunities for assigning off-switch containers is calculated based on
      the number of containers and unique locations specified in the resource request,
      as well as the size of the cluster.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value></value>
    <description>
      A list of mappings that will be used to assign jobs to queues
      The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
      Typically this list will be used to map users to queues,
      for example, u:%user:%user maps all users to queues with the same name
      as the user.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>
      If a queue mapping is present, will it override the value specified
      by the user? This can be used by administrators to place jobs in queues
      that are different than the one specified by the user.
      The default is false.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
    <value>1</value>
    <description>
      Controls the number of OFF_SWITCH assignments allowed
      during a node's heartbeat. Increasing this value can improve
      scheduling rate for OFF_SWITCH containers. Lower values reduce
      "clumping" of applications on particular nodes. The default is 1.
      Legal values are 1-MAX_INT. This config is refreshable.
    </description>
  </property>


  <property>
    <name>yarn.scheduler.capacity.application.fail-fast</name>
    <value>false</value>
    <description>
      Whether RM should fail during recovery if previous applications'
      queue is no longer valid.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.workflow-priority-mappings</name>
    <value></value>
    <description>
      A list of mappings that will be used to override application priority.
      The syntax for this list is
      [workflowId]:[full_queue_name]:[priority][,next mapping]*
      where an application submitted (or mapped to) queue "full_queue_name"
      and workflowId "workflowId" (as specified in application submission
      context) will be given priority "priority".
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.workflow-priority-mappings-override.enable</name>
    <value>false</value>
    <description>
      If a priority mapping is present, will it override the value specified
      by the user? This can be used by administrators to give applications a
      priority that is different than the one specified by the user.
      The default is false.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>1</value>
    <description>
      This option configures the Configured Max Application Master Limit and all the related Max Application master
      resources Per user. 
    </description>
  </property>

</configuration>

# to check the existing queues
mapred queue -list

8.4 Start the yarn service¶

./sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers

# you should see the processes below running on your name node
jps
24737 SecondaryNameNode
24486 NameNode
25318 ResourceManager
26806 Jps

You can access the web interface via http://<name-node>:8088/cluster

In our case, it will be /cluster

9. Useful URLs¶

- hdfs UI (http://10.50.5.67:9870): monitor the hdfs cluster status, storage usage and data lists
- Yarn UI (http://10.50.5.67:8088): monitor the total and used cluster resources for running jobs, the job list and the different job queues
- spark UI (http://10.50.5.67:8088/proxy/): monitor the progress and resource utilisation of currently running spark jobs
- spark history server (http://10.50.5.67:18080): monitor the stats of terminated spark jobs