This article is for those who are just getting started with Hadoop. We will not cover Hadoop concepts in detail; this is a basic tutorial for setting up Hadoop as a single node on a Linux machine. I'm using Fedora, but you can follow the same steps on Ubuntu or other Debian-based systems.
What is Big data and Hadoop?
Before going through the installation process, let's take a quick look at what Big Data and Hadoop are.
Big data
Big data is exactly what the name suggests: a big amount of data. More precisely, it refers to huge, complex sets of data, or a large amount of diverse data, both structured and unstructured.
Hadoop
If the definitions you find on the internet don't make Hadoop clear, then in simple words: Hadoop is a framework written in Java to process big data effectively using distributed storage and distributed processing.
Prerequisites
VIRTUAL BOX: used to run a Linux machine if you use Windows as your daily operating system.
JAVA: you need to install the Java 8 package on your system.
HADOOP: here I'm using version 3.3.4, but you can go with another version too.
Install JDK
Before installing the Java Runtime Environment (JRE) and Java Development Kit (JDK), update and upgrade the packages on your system.
sudo dnf upgrade
If you are on Ubuntu or a Debian-based Linux system, run the following commands instead.
sudo apt-get update
sudo apt-get upgrade
Check if Java is already installed:
java -version; javac -version
If Java is not currently installed, then run the following command:
sudo dnf install java-1.8.0-openjdk.x86_64 -y
Ubuntu / Debian users run the following command:
sudo apt install openjdk-8-jdk -y
Install OpenSSH
The ssh command provides a secure encrypted connection between two hosts over an insecure network. This connection can also be used for terminal access, file transfers and for executing commands on the remote machine.
Install the OpenSSH server and client using the following command:
sudo dnf install openssh-server
Ubuntu / Debian users run the following command:
sudo apt install openssh-server openssh-client -y
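Depending on your distribution, the SSH daemon may not be started automatically after installation. If not, you can enable and start it with systemd; the service is usually named sshd on Fedora and ssh on Ubuntu, so adjust the name if yours differs:
sudo systemctl enable --now sshd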
Now it's time to generate an SSH key, because Hadoop requires SSH access to manage its nodes, whether remote or local. For our single-node setup of Hadoop, we configure it so that we have passwordless access to localhost.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Copy the public key from id_rsa.pub to authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
The user is now able to SSH without entering a password every time. Verify that everything is set up correctly by SSHing into localhost:
ssh localhost
exit
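If ssh localhost still prompts for a password, the permissions on your .ssh directory or authorized_keys file may be too open. A common fix, assuming the default key locations used above, is:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys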
Now we have completed the basic requirements for Hadoop installation.
Download Hadoop
Visit the official Apache Hadoop project page, and select the version of Hadoop you want to implement. Click on binary to download. Here I'm using Hadoop Version 3.3.4
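If you prefer the command line, you can also fetch the tarball with wget. The URL below follows the usual Apache download layout for version 3.3.4; older releases are sometimes moved to archive.apache.org, so adjust the link if the download fails:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz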
Once the download of hadoop-3.3.4.tar.gz has completed, extract it with the command below.
tar xzf hadoop-3.3.4.tar.gz
Now we need to move the extracted folder to the location where we want to set it up; here I'm using /usr/local/. The /usr/local hierarchy is for use by the system administrator when installing software locally, but you can use any other location too. Once you have moved the folder, modify its permissions.
sudo mv hadoop-3.3.4 /usr/local/hadoop
sudo chmod 777 -R /usr/local/hadoop/
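If you would rather not make the folder world-writable, a less permissive alternative is to make your own user the owner of the directory instead:
sudo chown -R $USER:$USER /usr/local/hadoop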
Single Node Hadoop Setup
This configuration, also called pseudo-distributed mode, runs each Hadoop daemon as a separate Java process on the same machine. A Hadoop environment is set up by editing a set of config files:
bashrc
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
Setting up the environment variables
To locate the correct Java path, run the following commands in your terminal window:
which javac
readlink -f /usr/bin/javac
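On Fedora with OpenJDK 8, the output looks roughly like the line below; the exact version string on your machine will differ:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.345.b01-1.fc36.x86_64/bin/javac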
The part of the path just before the /bin/javac directory needs to be assigned to the $JAVA_HOME variable in the bashrc file.
Open the bashrc file using any editor like vim, gedit, or nano; I prefer nano as a text editor:
nano ~/.bashrc
Edit the bashrc file and add the following Hadoop environment variables at the bottom of the file:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.345.b01-1.fc36.x86_64/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
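After saving the file, reload it so the new variables take effect in your current shell:
source ~/.bashrc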
Configuration Changes in hadoop-env.sh file
To make changes in hadoop-env.sh we need the Java path, which we already obtained while configuring the environment variables. Copy the same JAVA_HOME location that you set in bashrc into this file.
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
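Inside hadoop-env.sh, find the JAVA_HOME line (it is usually commented out) and set it to the same path used in bashrc, adjusting it to the exact JDK build on your machine:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.345.b01-1.fc36.x86_64/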
Configuration Changes in core-site.xml file
Now we will configure core-site.xml. Open the file using the command below:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Once the file opens, copy the text below inside the configuration tags.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Configuration Changes in hdfs-site.xml file
Now we will configure hdfs-site.xml. Open the file using the command below.
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories to your custom locations:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
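Hadoop will normally create these directories the first time the NameNode is formatted and the DataNode starts, but if you point the properties at a custom location, make sure the directories exist and are writable by your user, for example:
mkdir -p /usr/local/hadoop/hadoopdata/hdfs/namenode
mkdir -p /usr/local/hadoop/hadoopdata/hdfs/datanode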
Configuration Changes in mapred-site.xml file
Now we will configure mapred-site.xml. Open the file using the command below.
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Once the file opens, copy the text below inside the configuration tags.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
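If MapReduce jobs later fail with classpath-related errors, note that the official Hadoop 3.x single-node guide also adds a classpath property to mapred-site.xml. It is not required just to start the daemons, but you may want to include it; verify the exact value against the documentation for your version:
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>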
Configuration Changes in yarn-site.xml file
Now we will configure yarn-site.xml, which configures YARN, the resource management layer that schedules and runs jobs in the Hadoop environment. Open the file using the command below:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Once the file opens, copy the text below inside the configuration tags.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
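The official pseudo-distributed guide also whitelists a set of environment variables so that YARN containers inherit them from the NodeManager. This is optional for starting the daemons; the value below is taken from the Hadoop 3.x single-node documentation, so double-check it against your version before relying on it:
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>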
Format NameNode
We need to format the NameNode before starting Hadoop services for the first time. Format it using the following command.
hdfs namenode -format
Running Hadoop
Start Hadoop Cluster
Navigate to the hadoop/sbin directory and execute the following command to start the NameNode, DataNode, YARN ResourceManager, and NodeManager:
./start-all.sh
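Note that start-all.sh is deprecated in recent Hadoop releases; an equivalent approach is to start HDFS and YARN separately from the same sbin directory:
./start-dfs.sh
./start-yarn.sh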
To verify that all the Hadoop services/daemons started successfully, you can use the jps command.
Daemons are processes that run in the background. There are mainly 4 daemons that must run for Hadoop to be functional:
- NameNode
- DataNode
- ResourceManager
- NodeManager
Apart from these, there can be a Secondary NameNode, a standby NameNode, a Job HistoryServer, etc.
jps
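The output should list the running daemons, roughly like the example below; the process IDs will differ on your machine and the order may vary:
11201 NameNode
11345 DataNode
11520 SecondaryNameNode
11713 ResourceManager
11854 NodeManager
12002 Jps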
Now you have successfully set up Hadoop on your system.
You can access the NameNode web UI at http://localhost:9870/ and the YARN ResourceManager web UI at http://localhost:8088/ in any browser.
HDFS commands
Here we will run basic HDFS commands with examples
- ls : lists the files and directories in HDFS. It shows the name, permissions, owner, size, and modification date for each file or directory in the specified directory (a combined usage example follows at the end of this section).
hadoop fs -ls <path>
- mkdir : this command is used to create a new directory if it does not exist. If the directory exists, it will give a “File already exists” error.
The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
hadoop fs -mkdir [-p] <paths>
Create your home directory in HDFS, replacing username with the username of your computer:
hadoop fs -mkdir -p /user/username/sample
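As a quick end-to-end check, you can copy a local file into the new directory and list it. The file name test.txt below is just a placeholder for any file on your machine:
hadoop fs -put test.txt /user/username/sample
hadoop fs -ls /user/username/sample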