Hadoop and Map-Reduce in the CCL

Hadoop is an open-source implementation of the Map-Reduce concept of data processing. We have installed Hadoop on a 32-node cluster within the CSE department. Hadoop is good at a restricted class of data-intensive processing workloads. For more computation-intensive workloads, consider using our 600-CPU Condor Pool.

Getting Started

To get started with Hadoop, set the following environment variables:
setenv JAVA_HOME /afs/nd.edu/user37/ccl/software/external/java/jdk
setenv HADOOP_HOME /afs/nd.edu/user37/ccl/software/external/hadoop
setenv PATH ${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
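The setenv syntax above is for csh and tcsh. If your login shell is bash or another Bourne-style shell, the equivalent settings would look roughly like the following sketch, which uses the same paths given above:

```shell
# Bourne-shell (sh/bash) equivalent of the setenv lines above.
export JAVA_HOME=/afs/nd.edu/user37/ccl/software/external/java/jdk
export HADOOP_HOME=/afs/nd.edu/user37/ccl/software/external/hadoop
# Put the Hadoop and Java tools ahead of everything already on the PATH.
export PATH="${HADOOP_HOME}/bin:${JAVA_HOME}/bin:${PATH}"
```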
Now, try the following commands, which list the Hadoop filesystem,
make a private directory, and upload a file:
hadoop fs -ls /
hadoop fs -mkdir /YOURNAME
hadoop fs -put /usr/share/dict/linux.words /YOURNAME/words
hadoop fs -ls /YOURNAME
hadoop fs -cat /YOURNAME/words | less
Note that Hadoop has no meaningful access controls. Any data that you put into the system is essentially readable and writeable by anyone in the CSE department. Your private directory is simply there as a convenience. Be a good citizen, and do not mess around with other people's data.
Example of Map-Reduce Using Java

Here is a very brief introduction to Map-Reduce using Java. If you are not a Java programmer, see the next section on Streaming.

I have already uploaded the complete text of Tolstoy's "War and Peace" to the system under /public/warandpeace.txt. You are going to use WordCount.java to compute the frequency of words in the novel. Begin by downloading the source of WordCount.java to your machine. Now, compile it into wordcount.jar as follows:
mkdir wordcount_classes
javac -classpath ${HADOOP_HOME}/hadoop-*-core.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes .
To perform a Map-Reduce job, run hadoop with the jar option
and specify the input file and a new directory for output files:
hadoop jar wordcount.jar WordCount /public/warandpeace.txt /YOURNAME/outputs
Now, your outputs are stored under /YOURNAME/outputs in Hadoop:
hadoop fs -ls /YOURNAME/outputs
hadoop fs -cat /YOURNAME/outputs/part-00000
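The job output is plain text, so ordinary Unix tools can summarize it. The sketch below assumes that WordCount writes one word, a tab, and its count per line; that is the usual format for this example but is not confirmed here. The helper name top_words is hypothetical, and the sample data stands in for real job output.

```shell
# Hypothetical helper: print the N most frequent words (default 10) from
# "word<TAB>count" lines read on standard input.
top_words() {
    # Sort by the count field, largest first; break ties by word.
    sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "${1:-10}"
}

# Small local sample standing in for the real job output.
printf 'peace\t3\nwar\t5\nand\t9\n' | top_words 2

# Against the cluster, assuming the output path used above:
#   hadoop fs -cat /YOURNAME/outputs/part-00000 | top_words 10
```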
Example of Map-Reduce Using Streaming

Using the streaming mode, you can run Map-Reduce programs where the mapper and reducer are ordinary programs written in whatever language you like. Data is passed between programs in plain ASCII format, where each line consists of a key string, a tab character, a value string, and a newline.

For example, if you want to compute the frequency of words, you could write a mapper and reducer in Perl and test them locally like this:
cat words.txt | ./WordCountMap.pl | sort | ./WordCountReduce.pl > output.txt
Then, to run it all in Hadoop, run a command like this:
hadoop jar $HADOOP_HOME/contrib/hadoop-*-streaming.jar \
-input /public/gutenberg/\* \
-output /YOURNAME/output2 \
-mapper WordCountMap.pl \
-file WordCountMap.pl \
-reducer WordCountReduce.pl \
-file WordCountReduce.pl
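The WordCountMap.pl and WordCountReduce.pl scripts themselves are not shown here. As a rough illustration of the key, tab, value line format described above, the following sketch performs the same kind of word count locally with standard Unix tools. The function names wc_map and wc_reduce are hypothetical stand-ins for the Perl scripts, not the scripts themselves.

```shell
# Hypothetical stand-in for a streaming mapper: split the input into words
# and emit one "word<TAB>1" line per word.
wc_map() {
    tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Hypothetical stand-in for a streaming reducer. The input arrives sorted by
# key, so all lines for one word are adjacent and can be summed in one pass.
wc_reduce() {
    awk -F'\t' '
        NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
        { key = $1; sum += $2 }
        END { if (NR > 0) print key "\t" sum }
    '
}

# Same pipeline shape as the local test above: map, sort by key, reduce.
printf 'the cat\nthe dog\n' | wc_map | sort | wc_reduce
```

The sort step between the two functions plays the role of the shuffle that Hadoop performs between the map and reduce phases. The reducer relies on it, because it only compares each key with the one on the previous line.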