Hadoop 2.0 Setup and Configuration with Cloudera 4.0
by James Madison

Background and Context

Versions

These instructions walk you through getting a Hadoop 2.0 environment up and running in the following modes:

  • Standalone -- Everything works without having to worry about starting daemons.
  • Single-node cluster -- All the daemons run as if in a cluster, but the only node is localhost.
  • Cluster -- Daemons run on multiple servers to achieve scale.  Not supported by my scripts as of Summer 2012.

Standalone is much easier, but single-node is more interesting, so I have written the instructions for single-node, with a note about one third of the way down describing a one-line change you can make to bail out to standalone if you're having trouble with single-node.

Cloudera 4.0 is recommended as this is an easy way to get a Hadoop 2.0 stack known to work as a whole.  The Hadoop stack is open and free under Cloudera.

These instructions are for Hadoop 2.0.  Version 2.0 is a big change from 1.0.  As of July 2012, that version is still in alpha.  This means that help on Hadoop 2.0 is hard to find online.  I still recommend going with 2.0 despite the possible pain as it is the version of the future.

Get a Good Book

Although I have tried to be extremely prescriptive in these instructions, you are well served to know something about how Hadoop works.  This will get you through the bumps where you may have to free-style anything I've missed.  I recommend "Hadoop: The Definitive Guide" by Tom White.

If you get White's book, you must get the 3rd edition as that is the one that covers Hadoop 2.0.  While it is a great book, it clearly reads like a book written for Hadoop 1.0 with 2.0 wedged in later.  Thus its flow is a bit awkward, and it's not entirely accurate for setting up a clustered environment, which is part of my motivation for publishing these instructions.

The MaxTemp demo you will be running below comes from White's book.

Flexibility in the Code

I tried to make all code relative to the environmental variables found in the jamHadoopEnv shell script and the $JAM_HADOOP variable you must set in your session.  I explain both later.  This should allow you to adjust locations as you see fit.  However, I only test my code with the directories shown, so I don't guarantee full environmental flexibility, but I did try.

Disclaimers and Licensing

Everything here comes with no warranty.  None of this work has anything to do with my employer.  You should review all code before running it.  Run everything with the least amount of system authority possible.  Use common sense with all strange code.

You may copy, modify, share with friends, print to a hard-copy and smoke, or otherwise do whatever you want with everything you find here that I have written.  It is free as in "free beer."

Installation

Download Software

For Java, the JDK is required.  The Java version used with these scripts and demo is:

jdk-7u5-linux-x64.tar.gz

 

For Hadoop 2.0, Cloudera is strongly recommended since they take care of many headaches related to getting versions of the Hadoop suite to work together.  The site is:

https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs

The Hadoop version used with these scripts and demo is:

hadoop-2.0.0-cdh4.0.0.tar.gz

 

Install the Software

Unzip and untar the downloads.  You do not yet need a variable pointing to HADOOP_INSTALL.  This will be taken care of by the scripts.  However, you do need to have JAVA_HOME set.  Set this in your environment, using whatever location you have for Java:

export JAVA_HOME=~/apps/java # Or the location you use.
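
Before moving on, it is worth confirming that JAVA_HOME points at a full JDK rather than a bare JRE, since the demo build later needs the compiler:

$ $JAVA_HOME/bin/java -version

$ $JAVA_HOME/bin/javac -version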

 

Install the Scripts and Demo

Get the scripts:

http://www.qa76.net/code/hadoop_2.0_basic_demo.tar.gz

Unzip and untar the file.  The location does not matter, but the location must be pointed to by an environmental variable, as follows:

export JAM_HADOOP=~/work/hadoop # Or the location you use.

 

To verify the install, you should find, among other things:

$JAM_HADOOP/setup/*.*

$JAM_HADOOP/MaxTemp/*.*
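
A one-line listing is enough to confirm the variable points at the unpacked material:

$ ls $JAM_HADOOP/setup $JAM_HADOOP/MaxTemp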

 

Take a few moments to review the setup directory, paying particular attention to its conf directory.  The files there are the core of what the scripts use to set things up.

Environment Configuration

Set Environmental Variables

Get the needed variables into the current session by running the following commands.  Either run them in every session or add them to your profile.  Note that this sets the Hadoop installation and configuration environmental variables.  Run:

export JAM_HADOOP=~/work/hadoop # Or the location you use

. $JAM_HADOOP/setup/bin/jamHadoopEnv

 

Observe carefully the contents of the jamHadoopEnv script.  This is the one that anchors the scripts and demo to your environment.  If you do not like any of the directory names, this is the place to change them.  The scripts and demo should pick up the changes, but I have only tested it with the values shown.  The contents of the jamHadoopEnv script are:

export HADOOP_VERSION=2.0.0-cdh4.0.0

export HADOOP_INSTALL=~/apps/hadoop-$HADOOP_VERSION

# export HADOOP_CONF_DIR=$JAM_HADOOP/setup/conf/cluster_none

export HADOOP_CONF_DIR=$JAM_HADOOP/setup/conf/cluster_single

export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"

export PATH=$PATH:$HADOOP_INSTALL/bin

export PATH=$PATH:$HADOOP_INSTALL/sbin

export PATH=$PATH:$JAM_HADOOP/setup/bin

 

Notice that commented-out line.  I'm writing these instructions in the sequence needed for a single-node cluster, since that is the more interesting case.  If you are impatient and want to just see the demo running without a cluster, or if you have attempted to get a cluster running and want to back off to the simpler non-clustered environment, you can do this:

Shortcut for the impatient or frustrated:

  • Comment out the cluster_single line in jamHadoopEnv.
  • Uncomment the cluster_none line in jamHadoopEnv.
  • Jump down to the hdfs format step and do that one command.
  • Jump to the demo application step, make it, then run the go command, which will pick up the cluster_none setting.

That should give you the same output as discussed below, but it's really just running as a plain Java application rather than against a set of cluster daemons.  Now, back to the single-node instructions...

Study the Configuration Directory

The blessing and curse of using my scripts is that you could easily miss something important because I've done it for you.  Such is the case right here--stop for a moment and go into the cluster_single directory.  Study each configuration file carefully.  Take some time to visit the Hadoop site and read the documentation for what is in those configuration files:

http://hadoop.apache.org/docs/r2.0.3-alpha/

The configuration links are on the left side, toward the bottom.  The links have the name of the XML file in them.  Again, review and learn.  There is power here.
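
For orientation, a single-node Hadoop 2.0 setup normally revolves around four XML files, so expect the listing to look roughly like this (the exact contents of cluster_single are what count):

$ ls $JAM_HADOOP/setup/conf/cluster_single # Typically core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.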

Modify $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh

This script must have JAVA_HOME set explicitly.  It will not automatically use the one set in the environment.  Add this line:

export JAVA_HOME=~/apps/java # Or the location you use.

 

The log directory used by this script must exist.  Either make the one it is looking for, or be lazy and just comment it out for now, which I recommend until you know what you're doing.  This causes the logs to go to the install directory, which is a terrible production practice, but fine as you learn.  Comment out this line as shown:

# export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
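
If you would rather keep the logs out of the install directory, the alternative (my suggestion, not something the stock file does for you) is to create a log directory and point the variable at it instead of commenting it out:

mkdir -p ~/logs/hadoop # Run once in your shell.

export HADOOP_LOG_DIR=~/logs/hadoop # In hadoop-env.sh, replacing the commented-out line.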

 

Modify $HADOOP_INSTALL/libexec/hadoop-config.sh

This script must have JAVA_HOME set explicitly.  The rationale is the same as stated above.  Be sure to add it before any other line referencing JAVA_HOME.  Add this line:

export JAVA_HOME=~/apps/java # Or the location you use.

 

JVM Heap Size

I found that the heap size for Java had to be increased.  This is done with the mapred.child.java.opts option, set to a value such as -Xmx1024m.  The accompanying configuration files have this setting and its required syntax.
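
If you want to confirm the setting really is present in whichever configuration set you are using, a quick grep over the configuration directory works (I am assuming the bundled configs keep it in one of the XML files there):

$ grep -A 2 "mapred.child.java.opts" $HADOOP_CONF_DIR/*.xml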

Hadoop Execution

Set Up Secure Shell

Hadoop running as a cluster (single or distributed) is a collection of several daemons.  To allow them to communicate with each other, passwordless secure shell must be set up.  Run this script:

$ jamHadoopSshSetup

 

The contents of this script are:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

ssh localhost # Just to test, exit once it works.
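
Once the script has run, you can confirm that passwordless login really is in place; the daemons will not type a password for you, so neither should this command need one:

$ ssh -o BatchMode=yes localhost true && echo "Passwordless ssh to localhost works"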

 

Start Hadoop Daemons

Start the several daemons by running this script:

$ jamHadoopStart

 

A large amount of information will go by.  Review it for error messages, but mostly it is a set of frighteningly large class paths.  Notice that it also prints the names of the log files.  These tend to be very useful.  If you find any errors, review the logs since they often have better information than the screen.
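
Assuming you took the lazy route earlier and left HADOOP_LOG_DIR commented out, the logs land under the install directory, so the newest files there are the ones to read:

$ ls -lt $HADOOP_INSTALL/logs | head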

The contents of this script are:

echo ================================================================================

start-dfs.sh

start-yarn.sh

echo ================================================================================

echo The Hadoop processes running are:

ps -ef | grep $USER | grep "java" | grep -v "grep" | awk -F'java -Dproc_' '{print $2}' | cut -d" " -f1

echo ================================================================================

 

The script should have started the daemons.  The excerpt it prints from the Linux ps list is a good sign that things worked if it contains the following five process names:

================================================================================

The Hadoop processes running are:

namenode

datanode

secondarynamenode

resourcemanager

nodemanager

================================================================================
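
Beyond the ps list, another quick check is to hit the web interfaces the daemons expose.  The ports below are the stock Hadoop 2.0 defaults; adjust them if the configuration files override them:

$ curl -s http://localhost:50070/ > /dev/null && echo "NameNode web UI is up"

$ curl -s http://localhost:8088/ > /dev/null && echo "ResourceManager web UI is up"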

 

Initialize the Hadoop Cluster

As with many storage systems, a formatting step must occur before the file system can be used.  Run:

$ hdfs namenode -format
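
To confirm the freshly formatted file system is usable, ask the namenode for a report.  If the daemons were already running when you formatted, you may need to stop and restart them (jamHadoopStop, then jamHadoopStart) before this works:

$ hdfs dfsadmin -report

$ hdfs dfs -ls /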

 

Shut Down the Hadoop Cluster

Don't do this now! But for closure, we'll address it here.  When you're ready to shut down the cluster, run:

$ jamHadoopStop
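
I have not listed the contents of jamHadoopStop here, but if you ever need to stop the daemons by hand, the stock Hadoop scripts that mirror the start sequence are (my assumption about what the wrapper ultimately calls):

$ stop-yarn.sh

$ stop-dfs.sh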

 

Demo Application

Prepare the Data

The data must get into the Hadoop cluster.  These steps take it from your local file system and tell Hadoop to spray it into the cluster.  On a single-node cluster this is all the same machine, but logically it's a very meaningful step.  Run:

$ hdfs dfs -mkdir input

$ hdfs dfs -copyFromLocal sample.txt input/sample.txt

$ hdfs dfs -ls input

 

If it worked, that last command should show you that you have a file in the Hadoop file system.
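
If you want to double-check that the contents survived the trip, you can also cat the file back out of HDFS:

$ hdfs dfs -cat input/sample.txt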

Prepare the Demo

The demo application is a humble little Java application that prints two rows when it works.  These represent the maximum temperatures for the years displayed.  See White's book for details as this is his demo code that I'm borrowing.

First "make" the application, where the make script is just a shell script:

$ cd $JAM_HADOOP

$ cd MaxTemp

$ make

 

Review the make file as you see fit.  It's standard Java work.  You should have an up-to-date MaxTemp.jar file when it all works.
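
In rough terms, a build like this is just a compile against the Hadoop client libraries followed by packaging into a jar, something along these lines (a sketch of the pattern, not the exact script; the source file names will differ):

$ mkdir -p classes

$ javac -classpath $(hadoop classpath) -d classes *.java

$ jar cf MaxTemp.jar -C classes .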

Run the Demo

You can run the application with a single command.  While still in the $JAM_HADOOP/MaxTemp directory, run:

$ go

 

There are a number of lines of overhead in the go script, mostly to help you troubleshoot, but the interesting lines are:

export HADOOP_CLASSPATH=$JAM_HADOOP/MaxTemp/MaxTemp.jar

hadoop MaxTemp input/sample.txt output

 

The first line allows Hadoop to find our demo application.  The second one is the magic moment—it invokes Hadoop, running the application on the cluster.  If all goes well, the output, besides several dozen lines from Hadoop, will just be two lines:

1949    111

1950    22

 

Celebrate! You're running a Hadoop single-node cluster.
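
The results also live in HDFS under the output directory if you want to poke at them directly.  Note that Hadoop refuses to overwrite an existing output directory, so remove it before rerunning the job.  The part file name below is the usual default and may differ:

$ hdfs dfs -ls output

$ hdfs dfs -cat output/part-r-00000

$ hdfs dfs -rm -r output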

Other Information

Cygwin

Cygwin is a great product, but I discourage using it for Hadoop.  I tried for several hours to get Hadoop working on Cygwin, but eventually gave up.  It was working reasonably well, but I hit an issue that a number of sites said was unresolved.  Maybe it was, maybe it wasn't.  In the end it was just easier to get a working Linux environment and build there.  Also, Hadoop is not supported for production on Cygwin, so you have to make the transition at some point.

Mahout

Mahout is the Hadoop machine learning library.  It is not in the Cloudera stack as of CDH version 4.0.  Hadoop is the engine of big data, but data mining, predictive analytics, and machine learning are the places in the data stack where the miracles happen, so it will be exciting to watch Mahout mature.

Bugs and Errata

I'm confident there are errors in these instructions and in the code.  Please let me know about any issues you find or any ideas for improvement at madjim@bigfoot.com.

 

