Setting up Hadoop Environment on Ubuntu 20.04 LTS

In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox …

Grace Murray Hopper, American computer scientist

In the Apache Software Foundation’s projects, aside from the long-standing and widely used httpd web server, perhaps none is more widely adopted in industry than Hadoop. Hadoop provides a reliable and scalable distributed computing solution for massive data. Over the years we have heard occasional arguments about whether “Hadoop is dead”; however, the products offered by the major cloud service providers show that Hadoop and its surrounding ecosystem are still developing healthily and continue to extend into more fields. To this day, Hadoop remains the best solution for distributed processing of massive data.

I personally dislike tool-oriented articles filled with topics like “How to configure xxx”; such topics are better suited to the official documentation and beginner’s guides for those tools. Hadoop, however, has become the de facto standard of the big data ecosystem: it runs on a single node just as well as on clusters of hundreds or thousands of machines. Yet neither the official documentation nor most of the “configuration tutorials” written by the community explain to newcomers to the Hadoop ecosystem why we make these configurations and what they actually mean. That is why I wrote this article as a reference.

Preparation Before Installation

We’re setting up the basic Hadoop environment on Ubuntu 20.04 LTS. This environment consists of three parts: the core Hadoop framework, MapReduce for distributed data processing, and the Hadoop Distributed File System (HDFS) for distributed data storage. Together, they form the foundation for running the other components of the big data ecosystem.


Big Data – data that is so large or complex that it cannot be processed by a single node.

A definition of Big Data that I personally favor.

Hadoop is designed to process data of arbitrary complexity, and since the actual volume of data to be analyzed varies greatly – from what fits on a single machine to what requires a large data center – Hadoop offers many flexible configuration options. To keep things approachable for newcomers, we will assume a pseudo-distributed setup on a single physical machine.

As a best practice for permission isolation, we create a separate user account and group for Hadoop on the system:

sudo adduser hadoop

Hadoop’s core components and most of its peripheral projects are built on Java, so to run Hadoop we need a Java Runtime Environment (JRE) on the system. Several JRE options are available; we will use openjdk-8. The Hadoop official documentation lists the supported Java versions; in our testing, openjdk-11 caused an error when starting YARN. The Hadoop framework also communicates with local (or remote) machines over the SSH protocol, so we need an SSH server daemon on the local machine as well:

sudo apt install openjdk-8-jdk openssh-server -y
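Before continuing, it does not hurt to verify that both packages are in place (a quick, optional check):

java -version           # should report an openjdk 1.8.0_xxx build
systemctl status ssh    # the OpenSSH server should be active (running)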

Hadoop typically operates in a distributed environment. In such an environment, it uses the SSH protocol to communicate with itself (or other servers). To configure Hadoop to communicate securely with servers, we generate a public-private key pair for the local environment’s hadoop user to support public key authentication in SSH:

sudo -u hadoop ssh-keygen -b 4096 -C hadoop
sudo -u hadoop ssh-copy-id -p 22 localhost

The commands above generate a 4096-bit RSA key with a comment and then copy the public key to the local machine’s authorized_keys file so that SSH public key authentication can be used.
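If the key was copied successfully, the hadoop user should now be able to reach localhost over SSH without a password; a quick way to check:

sudo -u hadoop ssh -p 22 localhost 'echo key-based login works'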

Get Hadoop

Various distributions of Hadoop are typically obtained from the download page of the official Hadoop website, from the Apache Bigtop project, or as vendor redistributions such as Cloudera’s CDH. For the sake of explanation, we’ll use version 3.2.1 obtained from the official Hadoop website as an example. We will install the distribution into the /opt directory of the file system (another standard option is /usr/local), downloading the binary archive of the relevant version from any mirror site listed on the download page:

wget "https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz"

Then, extract it to /opt (or any other location you prefer), and subsequent commands assume that you have installed Hadoop (and its surrounding components) to the /opt directory:

sudo tar -zxvf hadoop-3.2.1.tar.gz -C /opt

To operate using a dedicated hadoop user, we grant the hadoop user ownership of the Hadoop software directory:

sudo chown hadoop:hadoop -R /opt/hadoop-3.2.1

At this point, all the preparation and installation work for Hadoop is complete. Next, we switch to the hadoop user and change into the installation directory to begin configuring Hadoop:

su hadoop
cd /opt/hadoop-3.2.1

Configure Hadoop

Hadoop is not a single piece of software but rather a collection of tools for big data storage and processing, so setting up the Hadoop environment means configuring its commonly used foundational components.

Configure Environment Variables

Clearly, we prefer typing commands like hdfs instead of /opt/hadoop-3.2.1/bin/hdfs. Execute the following commands as the hadoop user to add the Hadoop-related variables to the environment. If you prefer not to type them every time, append them to the end of the ~/.bashrc file:

export HADOOP_HOME=/opt/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

If you choose to add these commands to the ~/.bashrc file, you need to reload your .bashrc configuration using the following command:

source ~/.bashrc
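A quick way to confirm that the variables took effect and the Hadoop binaries are on the PATH:

hadoop version
# should print "Hadoop 3.2.1" followed by build information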

Configure the Foundational Framework of Hadoop

To ensure that the Hadoop components run smoothly, we must tell Hadoop where the Java Runtime Environment (JRE) is located. The foundational configuration file of the Hadoop framework is ./etc/hadoop/hadoop-env.sh (relative to the installation directory). We only need to modify the JAVA_HOME line. If you connect to the (local) server over SSH in a non-default way (e.g., on a non-standard port), you will also need to adjust the HADOOP_SSH_OPTS line accordingly:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_SSH_OPTS="-p 22"
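If you are unsure which path to use for JAVA_HOME, one way to find it (assuming javac from the openjdk-8-jdk package is on the PATH) is:

readlink -f "$(which javac)" | sed 's:/bin/javac::'
# typically prints /usr/lib/jvm/java-8-openjdk-amd64 on Ubuntu 20.04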

Most components in the Hadoop ecosystem are configured using .xml files. They adhere to a unified format: configuration items are placed between <configuration> and </configuration> tags in each configuration file, and each configuration item is presented as follows:

<property>
  <name>Configuration Key</name>
  <value>Value</value>
</property>

We configure the core variables of Hadoop, which are located in the ./etc/hadoop/core-site.xml configuration file:

Config Item    | Value
fs.defaultFS   | hdfs://localhost/

… which means replacing the <configuration> section of the ./etc/hadoop/core-site.xml configuration file with the following content:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

This will set the default file system for Hadoop to the local HDFS. The Apache Hadoop project provides a comprehensive list of configuration options.

Configure HDFS (Hadoop Distributed File System)

Before using the Hadoop Distributed File System (HDFS), we need to configure it explicitly, specifying where the NameNode and the DataNode store their data. We will keep both on the local file system, under ~/hdfs:

mkdir -p ~/hdfs/namenode ~/hdfs/datanode

We will add the following properties to the HDFS configuration file ./etc/hadoop/hdfs-site.xml to reflect this:

Config Item      | Value
dfs.replication  | 1
dfs.name.dir     | /home/hadoop/hdfs/namenode
dfs.data.dir     | /home/hadoop/hdfs/datanode
Add these configuration items to the file in the same way as for the ./etc/hadoop/core-site.xml file above; the assembled result is shown below.
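For reference, the <configuration> section of ./etc/hadoop/hdfs-site.xml should end up looking like this (in Hadoop 3.x, dfs.name.dir and dfs.data.dir are deprecated but still working aliases of dfs.namenode.name.dir and dfs.datanode.data.dir):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/datanode</value>
  </property>
</configuration>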

Before starting the Hadoop environment for the first time, we need to initialize (format) the NameNode’s storage:

hdfs namenode -format

Forgetting this step will lead to a NameNode is not formatted error when HDFS starts.

Configure MapReduce (driven by YARN)

MapReduce is a simple programming model for data processing, and YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. We use a similar method as above to edit the ./etc/hadoop/mapred-site.xml file and configure the following items:

Config Item               | Value
mapreduce.framework.name  | yarn
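Assembled in the same way as before, the <configuration> section of ./etc/hadoop/mapred-site.xml becomes:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>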

Similarly, we edit the ./etc/hadoop/yarn-site.xml file:

Config Item                    | Value
yarn.nodemanager.aux-services  | mapreduce_shuffle
yarn.resourcemanager.hostname  | localhost
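And the corresponding <configuration> section of ./etc/hadoop/yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
</configuration>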

Start the Hadoop cluster

After completing the above configurations, we start the relevant Hadoop services (on Hadoop 3.x, mr-jobhistory-daemon.sh is deprecated in favor of mapred --daemon start historyserver, but it still works):

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

This will start the following daemon processes: one NameNode, one Secondary NameNode, and one DataNode (HDFS); one ResourceManager and one NodeManager (YARN); and one JobHistoryServer (MapReduce). To verify that the Hadoop services started successfully, we use the jps command that ships with the JDK, which lists all running Java processes:

jps
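If all services came up, the output should look roughly like the following (the process IDs will of course differ on your machine):

34096 NameNode
34270 DataNode
34501 SecondaryNameNode
34767 ResourceManager
34919 NodeManager
35301 JobHistoryServer
35405 Jps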

You can check the status of HDFS through the NameNode web UI on port 9870 of this host (the YARN ResourceManager UI listens on port 8088). If certain components are not working as expected, investigate the corresponding runtime logs for each service in the ./logs directory of the Hadoop installation directory.

Install Other Optional Components

Unlike the foundational framework of Hadoop, these components are typically optional and installed on an as-needed basis. Here, we introduce the installation and configuration of some tools.

Hive

Hive is a data warehouse framework built on top of Hadoop. It allows reading, writing, and managing large-scale datasets in distributed storage, using SQL syntax for querying. It translates SQL queries into a series of jobs running on the Hadoop cluster, which makes it particularly suitable for users who are familiar with SQL but not proficient in Java. Download a Hive distribution from the official Hive website; we will use version 3.1.2:

wget "http://mirrors.ibiblio.org/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz"

As with the Hadoop base installation, we unpack Hive into /opt and likewise change the owner of the apache-hive-x.y.z-bin directory:

sudo tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt
sudo chown hadoop:hadoop -R /opt/apache-hive-3.1.2-bin

If you prefer to back the Hive metastore with MariaDB (or MySQL) instead of the embedded Derby database used later in this tutorial, install MariaDB and, just as you would for a MySQL release, run the mysql_secure_installation command to secure it and set the database root password.
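If you do go the MariaDB route (again, optional; the rest of this tutorial uses Derby), the installation on Ubuntu looks roughly like this:

sudo apt install mariadb-server -y
sudo mysql_secure_installation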

Then, as the hadoop user, we add the Hive-related environment variable configuration to the end of the ~/.bashrc file:

export HIVE_HOME=/opt/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

After applying the changes with source ~/.bashrc, we configure the relational database used by Hive. Hive stores the warehouse data itself on HDFS, while the metadata describing it (the Metastore) is kept separately, typically in a relational database. We’ll use Apache Derby as an example:

schematool -dbType derby -initSchema

It’s important to note that in this tutorial example, Hadoop version 3.2.1 and Hive version 3.1.2 both include the Guava component, but their versions are incompatible. If you encounter a java.lang.NoSuchMethodError error when running Hive commands, consider using the following workaround:

rm /opt/apache-hive-3.1.2-bin/lib/guava-19.0.jar
cp /opt/hadoop-3.2.1/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/apache-hive-3.1.2-bin/lib/

At this point, we can use the hive command to access the data warehouse stored on HDFS and backed by the local Derby metastore.
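As a quick smoke test, you can run a trivial statement through the Hive CLI; after a successful schema initialization it should list at least the built-in default database:

hive -e "SHOW DATABASES;"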