Hadoop Installation on Linux

This article is for readers who are new to Hadoop or have just started learning it. We will not cover Hadoop concepts in depth; this is a basic tutorial for setting up a single-node Hadoop installation on a Linux machine. I'm using Fedora, but you can follow the same steps on Ubuntu or other Debian-based systems too.

What is Big data and Hadoop?

Before going towards the installation process let’s have a look over what is Big Data and Hadoop.

Big data

Big data is exactly what the name suggests: a very large, complex collection of data. In other words, it refers to huge volumes of diverse data, both structured and unstructured.

Hadoop

If the definitions you find on the internet leave you confused, then in simple words: Hadoop is a framework written in Java that processes big data efficiently using distributed storage and distributed processing.

Prerequisites

  • VIRTUALBOX: Used to run a Linux virtual machine if your daily machine runs Windows.

  • JAVA: You need to install the Java 8 package on your system.

  • HADOOP: Here I'm using version 3.3.4, but you can use another release too.

Install JDK

Before installing the Java Runtime Environment (JRE) and Java Development Kit (JDK), update and upgrade the packages on your system.

sudo dnf upgrade

If you are on Ubuntu or another Debian-based system, run the following commands instead:

sudo apt-get update
sudo apt-get upgrade

Check if Java is already installed:

java -version; javac -version

If Java is not currently installed, then run the following command:

sudo dnf install java-1.8.0-openjdk.x86_64 -y

Ubuntu / Debian users run the following command:

sudo apt install openjdk-8-jdk -y

Install OpenSSH

The ssh command provides a secure encrypted connection between two hosts over an insecure network. This connection can also be used for terminal access, file transfers and for executing commands on the remote machine.

Install the OpenSSH server and client using the following command:

sudo dnf install openssh-server

Ubuntu / Debian users run the following command:

sudo apt install openssh-server openssh-client -y

Now it's time to generate an SSH key, because Hadoop requires SSH access to manage its nodes, whether remote or local. For our single-node Hadoop setup, we configure passwordless SSH access to localhost.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

ssh_key.png

Append the public key from id_rsa.pub to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

The user can now SSH in without entering a password every time. Verify everything is set up correctly by connecting to localhost over SSH:

ssh localhost
exit
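
If ssh localhost still prompts for a password, the permissions on your ~/.ssh files may be too open; tightening them usually resolves it:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys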

Now we have completed the basic requirements for Hadoop installation.

Download Hadoop

Visit the official Apache Hadoop project page and select the version of Hadoop you want to use. Click on the binary link to download it. Here I'm using Hadoop version 3.3.4.

hadoop site.png
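
Alternatively, you can fetch the archive directly from a terminal; the mirror URL below is an example for 3.3.4 and may differ for the mirror or version you choose:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz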

After the download of hadoop-3.3.4.tar.gz completes, extract it with the command below:

tar xzf hadoop-3.3.4.tar.gz

Now we need to move the extracted folder to the location where we want to set up Hadoop; here I'm using /usr/local/. The /usr/local hierarchy is intended for software installed locally by the system administrator, but you can use any other location too. Once you have moved the folder, modify its permissions:

sudo mv hadoop-3.3.4 /usr/local/hadoop
sudo chmod -R 777 /usr/local/hadoop/
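
chmod 777 keeps this tutorial simple, but a tighter alternative is to make your own user the owner of the directory (replace username with your username):

sudo chown -R username:username /usr/local/hadoop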

Single Node Hadoop Setup

This configuration, also called pseudo-distributed mode, runs each Hadoop daemon as a separate Java process. A Hadoop environment is set up by editing a set of config files:

  • bashrc

  • hadoop-env.sh

  • core-site.xml

  • hdfs-site.xml

  • mapred-site.xml

  • yarn-site.xml

Setting up the environment variables

To locate the correct Java path, run the following commands in your terminal window:

which javac
readlink -f /usr/bin/javac

java location.png

The part of the path just before /bin/javac needs to be assigned to the JAVA_HOME variable in the .bashrc file.

Open the .bashrc file using any editor such as vim, gedit, or nano; I prefer nano:

nano ~/.bashrc

Edit the .bashrc file, adding the following Hadoop environment variables at the bottom (set JAVA_HOME to the path you found above):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.345.b01-1.fc36.x86_64/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

bashrc.png
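
After saving the file, reload it so the new variables take effect in the current shell:

source ~/.bashrc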

Configuration Changes in hadoop-env.sh file

hadoop-env.sh also needs the Java path. We already found it while configuring the environment variables, so copy the same JAVA_HOME value here.

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
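
In the file, locate the commented-out JAVA_HOME line, uncomment it, and set it to the same path used in .bashrc, for example:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.345.b01-1.fc36.x86_64/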

hadoop-env_sh.png

Configuration Changes in core-site.xml file

Now we will configure core-site.xml. Open the file using the command below:

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Once the file opens, add the following property between the configuration tags. (fs.defaultFS is the current name for this setting; older guides use the deprecated fs.default.name.)

<configuration>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
</configuration>

core-site.png

Configuration Changes in hdfs-site.xml file

Now we will configure hdfs-site.xml. Open the file using the command below:

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories to your custom locations:

<configuration>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

hdfs-site.png
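
Formatting the NameNode later will normally create these directories, but you can also create them up front to be safe:

mkdir -p /usr/local/hadoop/hadoopdata/hdfs/namenode
mkdir -p /usr/local/hadoop/hadoopdata/hdfs/datanode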

Configuration Changes in mapred-site.xml file

Now we will configure mapred-site.xml. Open the file using the command below:

sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Once the file opens copy the below text inside the configuration tag.

<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>

mapred-site.png

Configuration Changes in yarn-site.xml file

Now we will configure yarn-site.xml, which holds the settings for YARN, Hadoop's resource management and job scheduling layer. Open the file using the command below:

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Once the file opens copy the below text inside the configuration tag.

<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

yarn-site.png

Format NameNode

The NameNode must be formatted before starting Hadoop services for the first time. Format it using the following command:

hdfs namenode -format

namenode_format.png

Running Hadoop

Start Hadoop Cluster

Navigate to the hadoop/sbin directory and execute the following command to start the NameNode, DataNode, YARN ResourceManager, and NodeManagers:

./start-all.sh

Start Hadoop Cluster.png
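
Recent Hadoop releases mark start-all.sh as deprecated; you can also start HDFS and YARN separately:

./start-dfs.sh
./start-yarn.sh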

To verify that all the Hadoop services/daemons started successfully, you can use the jps command.

Daemons are processes that run in the background. There are mainly 4 daemons that run for Hadoop:

  • NameNode | DataNode | ResourceManager | NodeManager

These 4 daemons must run for Hadoop to be functional. Apart from these, there can be a Secondary NameNode, a standby NameNode, a Job HistoryServer, etc.

jps

jps.png
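
The output should list the running daemons, something like this (the process IDs will differ, and a SecondaryNameNode may also appear):

12345 NameNode
12446 DataNode
12638 SecondaryNameNode
12825 ResourceManager
13002 NodeManager
13211 Jps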

Now you have successfully set up Hadoop on your system.

You can access the NameNode Web UI at http://localhost:9870/ and the YARN ResourceManager UI at http://localhost:8088/ in any browser.

HDFS commands

Here we will run some basic HDFS commands, with examples.

  • ls : this command lists the files and directories in HDFS. It shows the name, permissions, owner, size, and modification date of each file or directory in the specified path.
hadoop fs -ls <path>

ls.png
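
For example, to list the contents of the HDFS root directory:

hadoop fs -ls /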

  • mkdir : this command creates a new directory if it does not exist. If the directory already exists, it gives a “File already exists” error.

The -p option behaves much like Unix mkdir -p, creating parent directories along the path.

hadoop fs -mkdir [-p] <paths>

Creating a home directory (replace username with your own username):

hadoop fs -mkdir -p /user/username/sample

mkdir.png
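
As a quick end-to-end check, you can create a small local file, upload it into the new directory with hadoop fs -put, and list it (again, username is your own username):

echo "hello hadoop" > test.txt
hadoop fs -put test.txt /user/username/sample
hadoop fs -ls /user/username/sample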