Among the Apache Software Foundation’s projects, aside from the long-standing and widely used httpd web server, perhaps none is more widely used in the industry than Hadoop. Hadoop provides a reliable and scalable distributed computing solution for massive data. Over the years, we’ve heard arguments about whether “Hadoop is dead”; however, the products offered by the various cloud service providers indicate that Hadoop and its surrounding ecosystem are still developing healthily and continue to extend into more fields. To this day, Hadoop remains the best solution for distributed processing of massive data.
I personally dislike tool-oriented articles filled with topics like “how to configure xxx”; such topics are better suited to the official documentation and beginner’s guides for those tools. Hadoop, however, has become the “ecosystem standard” for big data: it supports single-node operation as well as clusters of hundreds or thousands of machines. Yet neither the official documentation nor most of the “configuration tutorials” written by the community explain to newcomers to the Hadoop ecosystem “why” we make these configurations and “what significance” they have. That is why I wrote this article as a reference.
Preparation Before Installation
We’re setting up the Hadoop basic environment on Ubuntu 20.04 LTS. This environment consists of three parts: the Hadoop big data analysis framework, MapReduce for distributed data analysis, and the Hadoop Distributed File System (HDFS) for distributed data storage. This forms the foundation for running other components in the big data ecosystem.
Hadoop is designed to handle data at any scale, and since the actual volume of data to be analyzed varies greatly – from what fits on a single machine to what requires a large-scale data center – Hadoop offers multiple flexible configuration options. For the sake of entry-level users, let’s assume we’re configuring a pseudo-distributed environment on a single physical machine.
Following the best practice of permission isolation, we create a separate user account and group for Hadoop in the system:
sudo adduser hadoop
Hadoop’s core components and peripheral projects are mostly based on Java. To run Hadoop, we need to install the Java Runtime Environment (JRE) on the system. There are multiple JRE options available, and we’ll use openjdk-8 as the JRE. The Hadoop official documentation outlines the supported Java versions. It has been tested that using openjdk-11 can lead to an error when starting YARN. The Hadoop framework communicates with local (or remote) computers using the SSH protocol, so we need to install an SSH server daemon on the local machine.
sudo apt install openjdk-8-jdk openssh-server -y
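As an optional sanity check before going further, you can confirm which Java runtime is active on the system; with openjdk-8 installed on Ubuntu 20.04, the reported version should start with 1.8:

java -version
update-alternatives --display java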
Hadoop typically operates in a distributed environment. In such an environment, it uses the SSH protocol to communicate with itself (or other servers). To configure Hadoop to communicate securely with servers, we generate a public-private key pair for the local environment’s hadoop user to support public key authentication in SSH:
sudo -u hadoop ssh-keygen -b 4096 -C hadoop
sudo -u hadoop ssh-copy-id -p 22 localhost
The command above generates an RSA key of 4096 bits with a comment, and then copies the key to the local computer for use in supporting SSH key pair authentication.
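To confirm that key-based login works before Hadoop relies on it, you can attempt a non-interactive connection as the hadoop user; this optional check should print the message below without prompting for a password (BatchMode makes ssh fail instead of asking for one):

sudo -u hadoop ssh -p 22 -o BatchMode=yes localhost echo "SSH key authentication OK"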
Get Hadoop
Various distributions of Hadoop are typically obtained from the download page of the Hadoop official website, the Apache Bigtop project, or re-distributions such as Cloudera Manager. For the sake of explanation, we’ll use version 3.2.1 obtained from the Hadoop official website as an example. We’re preparing to install the distribution to the /opt directory in the file system (another standard option is /usr/local). We download the binary archive of the relevant version from any mirror site on the download page:
wget "https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz"
Then, extract it to /opt (or any other location you prefer); subsequent commands assume that you have installed Hadoop (and its surrounding components) under the /opt directory:
sudo tar -zxvf hadoop-3.2.1.tar.gz -C /opt
To operate using a dedicated hadoop user, we grant the hadoop user ownership of the Hadoop software directory:
sudo chown hadoop:hadoop -R /opt/hadoop-3.2.1
So far, we have completed all the preparation and installation work for Hadoop. Next, we switch to the hadoop user and navigate to the installation directory to begin configuring Hadoop’s operation:
su hadoop
cd /opt/hadoop-3.2.1
Configure Hadoop
Hadoop is not a single software but rather a collection of tools for big data storage and processing. Therefore, it is necessary to configure the commonly used foundational components to set up the Hadoop environment.
Configure Environment Variables
Clearly, we prefer using commands like hdfs instead of /opt/hadoop-3.2.1/bin/hdfs to execute Hadoop commands. Execute the following commands as the hadoop user to add the Hadoop-related variables to the environment. If you prefer not to type them every time, you can add these commands to the end of the ~/.bashrc file:
export HADOOP_HOME=/opt/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
If you choose to add these commands to the ~/.bashrc file, you need to reload your .bashrc configuration using the following command:
source ~/.bashrc
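If the variables are set correctly, Hadoop’s command-line tools should now be resolvable from any directory; a quick optional check is to print the version and confirm where the hdfs command resolves to:

hadoop version
which hdfs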
Configure the Foundational Framework of Hadoop
To ensure the smooth operation of Hadoop components, it’s necessary to inform Hadoop about the location of the Java Runtime Environment (JRE). The foundational configuration file for the Hadoop framework is located at ./etc/hadoop/hadoop-env.sh. We only need to modify the JAVA_HOME line. If your SSH client connects to the (local) server in a different way (e.g., on a non-standard port), you will also need to modify the HADOOP_SSH_OPTS line accordingly:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_SSH_OPTS="-p 22"
Most components in the Hadoop ecosystem are configured using .xml files. They adhere to a unified format: configuration items are placed between the <configuration> and </configuration> tags in each configuration file, and each configuration item is presented as follows:
<property>
<name>Configuration Key</name>
<value>Value</value>
</property>
We configure the core variables of Hadoop, which are located in the ./etc/hadoop/core-site.xml configuration file:
Config Item | Value
fs.defaultFS | hdfs://localhost/
… which means replacing the <configuration> section of the ./etc/hadoop/core-site.xml configuration file with the following content:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
This will set the default file system for Hadoop to the local HDFS. The Apache Hadoop project provides a comprehensive list of configuration options.
Configure HDFS (Hadoop Distributed File System)
Before using the Hadoop Distributed File System (HDFS), we need to explicitly configure it to specify the storage locations for the NameNode and DataNodes. We plan to store the NameNode and DataNode data on the local file system, specifically under ~/hdfs:
mkdir -p ~/hdfs/namenode ~/hdfs/datanode
We will add the following properties to the HDFS configuration file ./etc/hadoop/hdfs-site.xml to reflect this:
Config Item | Value
dfs.replication | 1
dfs.name.dir | /home/hadoop/hdfs/namenode
dfs.data.dir | /home/hadoop/hdfs/datanode
We edit the file in the same way as the ./etc/hadoop/core-site.xml file above to add these configuration items to the configuration file.
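In XML form, and assuming the hadoop user’s home directory is /home/hadoop, the <configuration> section of ./etc/hadoop/hdfs-site.xml would look roughly like this:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/hdfs/datanode</value>
</property>
</configuration>

(In Hadoop 3.x, dfs.name.dir and dfs.data.dir are deprecated aliases for dfs.namenode.name.dir and dfs.datanode.data.dir; both spellings are accepted.)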
Before starting the Hadoop environment for the first time, we need to initialize the data for the NameNode node:
hdfs namenode -format
Forgetting to initialize the NameNode may lead to a NameNode is not formatted error.
Configure MapReduce (driven by YARN)
MapReduce is a simple programming model for data processing, and YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. We use a similar method as above to edit the ./etc/hadoop/mapred-site.xml file and configure the following items:
Config Item | Value
mapreduce.framework.name | yarn
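Expressed as XML, the <configuration> section of ./etc/hadoop/mapred-site.xml would then contain a single property:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>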
Similarly, we edit the ./etc/hadoop/yarn-site.xml file:
Config Item | Value
yarn.nodemanager.aux-services | mapreduce_shuffle
yarn.resourcemanager.hostname | localhost
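Likewise, the <configuration> section of ./etc/hadoop/yarn-site.xml would look like this:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>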
Start the Hadoop cluster
After completing the above configurations, we start the relevant services of Hadoop:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
This will start the following daemon processes: one NameNode, one Secondary NameNode, one DataNode (HDFS), one ResourceManager, one NodeManager (YARN), and one JobHistoryServer (MapReduce). To verify that the Hadoop-related services have started successfully, we use the jps command provided by the JDK, which displays all running Java processes:
jps
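If everything started correctly, the output should list roughly the processes named above, one per line with its process ID; the PIDs below are purely illustrative and will differ on your machine:

12050 NameNode
12211 DataNode
12440 SecondaryNameNode
12705 ResourceManager
12873 NodeManager
13152 JobHistoryServer
13288 Jps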
You can check the status of HDFS by visiting port 9870 on this host. If you find that certain components are not working as expected, you can investigate the corresponding runtime logs for each service in the ./logs directory under the Hadoop installation directory.
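If you prefer checking from the command line rather than a browser, and curl is installed, a request to the NameNode web UI should return HTTP status 200 once HDFS is up:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870/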
Install Other Optional Components
Unlike the foundational framework of Hadoop, these components are typically optional and installed on an as-needed basis. Here, we introduce the installation and configuration of some tools.
Hive
Hive is a data warehouse framework built on top of Hadoop. It allows for reading, writing, and managing large-scale datasets in a distributed storage architecture, using SQL syntax for querying. It translates SQL queries into a series of jobs running on the Hadoop cluster, making it particularly suitable for users familiar with SQL but not proficient in Java for data analysis. Download the latest version of Hive distribution from the official Hive website; we will use version 3.1.2:
wget "http://mirrors.ibiblio.org/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz"
Using the same method as for installing the foundational components of Hadoop, we unpack Hive into /opt and change the owner of the apache-hive-3.1.2-bin directory:
sudo tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt
sudo chown hadoop:hadoop -R /opt/apache-hive-3.1.2-bin
If you choose to back the Hive metastore with MariaDB rather than the embedded Derby database used below, configure MariaDB the same way you would configure a MySQL release: run the mysql_secure_installation command and set the root user password for the database.
Then, we modify the ~/.bashrc file as the hadoop user to add the environment variable configurations related to Hive execution. The following lines need to be added as the hadoop user:
export HIVE_HOME=/opt/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin
After applying the changes using source ~/.bashrc, we configure the relational database used by Hive. Hive stores the metadata describing the data warehouse’s data (which itself is stored on HDFS) and the data itself in different locations. Typically, the metadata (the Metastore) is stored in a relational database. We’ll use Apache Derby as an example:
schematool -dbType derby -initSchema
It’s important to note that, in this tutorial’s example, Hadoop 3.2.1 and Hive 3.1.2 both bundle the Guava component, but their versions are incompatible. If you encounter a java.lang.NoSuchMethodError when running Hive commands, consider using the following workaround:
rm /opt/apache-hive-3.1.2-bin/lib/guava-19.0.jar
cp /opt/hadoop-3.2.1/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/apache-hive-3.1.2-bin/lib/
At this point, we can use the hive command to access the data warehouse stored on HDFS and hosted by the local Derby metastore database.
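As a minimal smoke test, assuming the HDFS and YARN services started earlier are still running, you can execute a trivial statement with the -e option; it should complete without errors and list the built-in default database. Note that the embedded Derby metastore lives in the metastore_db directory created where schematool was run, so launch hive from that same directory:

hive -e "SHOW DATABASES;"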