First HadoopApp on the NCIT Cluster

This example is based on the Hadoop WordCount tutorial at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. First, we need our MapReduce program. All files presented here are in this archive.

     package org.myorg;
     
     import java.io.IOException;
     import java.util.*;
     
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.conf.*;
     import org.apache.hadoop.io.*;
     import org.apache.hadoop.mapred.*;
     import org.apache.hadoop.util.*;
     
     public class WordCount {
     
        // Mapper: emits (word, 1) for every word in the input.
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }

        // Reducer (also used as the combiner): sums the counts for each word.
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");

          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);

          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);

          // args[0] = HDFS input directory, args[1] = HDFS output directory.
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          JobClient.runJob(conf);
        }
     }

Next, we compile it (build.sh). We use module load to set the required environment variables, such as HADOOP_HOME and HADOOP_VERSION, which the script relies on.


#!/bin/bash
#
# Original tutorial at:
# http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

. /opt/modules/Modules/3.2.5/init/bash

module load java/jdk1.6.0_23-64bit
module load libraries/hadoop-0.20.2

[[ ! -d build ]] && mkdir build;

javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar \
        -sourcepath src \
        -d build \
        src/org/myorg/WordCount.java
jar -cvf wordcount.jar -C build/ .

Let's run the build script:

[alexandru.herisanu@fep-53-1 ex3]$ ./build.sh

Note: src/org/myorg/WordCount.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
adding: org/myorg/WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: org/myorg/WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
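
Before submitting, you can sanity-check the archive with the standard JDK jar tool (listing only, no extraction); it should show the three classes added above:

jar -tf wordcount.jar    # should list org/myorg/WordCount*.class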

Now you want to upload your input files to HDFS (see here; a short sketch follows below). Then we run the job through the SGE integration: the HDFS filesystem is always up and running, but the job trackers are not, so they are started on demand for each job (run.sh).
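
A minimal sketch of the upload, assuming the hadoop module above is loaded on the front-end. The two sample lines come from the original WordCount tutorial and match the word counts shown at the end of this page; the file names and HDFS path are illustrative, so substitute your own directory:

# Sample input files from the WordCount tutorial (names are illustrative).
echo "Hello World Bye World" > file01
echo "Hello Hadoop Goodbye Hadoop" > file02

# Copy them into your HDFS job directory.
hadoop fs -mkdir /user/alexandru.herisanu/myjob
hadoop fs -put file01 file02 /user/alexandru.herisanu/myjob

# Check that they arrived.
hadoop fs -ls /user/alexandru.herisanu/myjob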

#!/bin/bash
#
# I presume you just compiled the program and uploaded the input files
# into /user/alexandru.herisanu/myjob (your own directory).
#

# Request 4 slots in the 'hadoop' parallel environment on ibm-nehalem.q.
# The ncit-hadoop JSV script hooks the job into the Hadoop/SGE integration,
# and hdfs_input points it at the HDFS input directory.
qsub -q ibm-nehalem.q -pe hadoop 4 -N HadoopExample -cwd \
        -jsv /opt/n1sge6/sge-6.2u5/ncit-hadoop/jsv.sh \
        -l hdfs_input=/user/alexandru.herisanu/myjob <<EOF

module load java/jdk1.6.0_23-64bit
module load libraries/hadoop-0.20.2

# \$TMPDIR/conf holds the per-job Hadoop configuration generated by the PE.
hadoop --config \$TMPDIR/conf jar wordcount.jar org.myorg.WordCount \
        /user/alexandru.herisanu/myjob /user/alexandru.herisanu/myjob/output
hadoop --config \$TMPDIR/conf fs -cat /user/alexandru.herisanu/myjob/output/part*

EOF
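
While the job runs, you can watch it with the usual SGE tools. Since we submitted with -N HadoopExample and -cwd, SGE drops the job's output files into the submit directory, suffixed with the job id (a sketch, assuming the standard SGE client tools):

qstat -u $USER              # job state: qw = queued, r = running

# SGE output files for job 149654 (this is what the cat below reads).
ls HadoopExample.*149654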

Let's run it. Once the job has finished, the output files for job 149654 hold the Hadoop job log, the word counts, and the Hadoop PE start/stop messages:

[alexandru.herisanu@fep-53-1 ex3]$ cat *149654

11/05/31 12:58:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/05/31 12:58:39 INFO mapred.FileInputFormat: Total input paths to process : 2
11/05/31 12:58:39 INFO mapred.JobClient: Running job: job_201105311258_0001
11/05/31 12:58:40 INFO mapred.JobClient:  map 0% reduce 0%
11/05/31 12:58:49 INFO mapred.JobClient:  map 33% reduce 0%
11/05/31 12:58:52 INFO mapred.JobClient:  map 66% reduce 0%
11/05/31 12:58:53 INFO mapred.JobClient:  map 100% reduce 0%
11/05/31 12:59:01 INFO mapred.JobClient:  map 100% reduce 100%
11/05/31 12:59:03 INFO mapred.JobClient: Job complete: job_201105311258_0001
11/05/31 12:59:03 INFO mapred.JobClient: Counters: 19
11/05/31 12:59:03 INFO mapred.JobClient:   Job Counters
11/05/31 12:59:03 INFO mapred.JobClient:     Launched reduce tasks=1
11/05/31 12:59:03 INFO mapred.JobClient:     Rack-local map tasks=2
11/05/31 12:59:03 INFO mapred.JobClient:     Launched map tasks=3
11/05/31 12:59:03 INFO mapred.JobClient:     Data-local map tasks=1
11/05/31 12:59:03 INFO mapred.JobClient:   FileSystemCounters
11/05/31 12:59:03 INFO mapred.JobClient:     FILE_BYTES_READ=79
11/05/31 12:59:03 INFO mapred.JobClient:     HDFS_BYTES_READ=55
11/05/31 12:59:03 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=266
11/05/31 12:59:03 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=41
11/05/31 12:59:03 INFO mapred.JobClient:   Map-Reduce Framework
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce input groups=5
11/05/31 12:59:03 INFO mapred.JobClient:     Combine output records=6
11/05/31 12:59:03 INFO mapred.JobClient:     Map input records=2
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce shuffle bytes=91
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce output records=5
11/05/31 12:59:03 INFO mapred.JobClient:     Spilled Records=12
11/05/31 12:59:03 INFO mapred.JobClient:     Map output bytes=82
11/05/31 12:59:03 INFO mapred.JobClient:     Map input bytes=51
11/05/31 12:59:03 INFO mapred.JobClient:     Combine input records=8
11/05/31 12:59:03 INFO mapred.JobClient:     Map output records=8
11/05/31 12:59:03 INFO mapred.JobClient:     Reduce input records=6
Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2
Starting Hadoop PE
$HADOOP_HOME = /opt/lib/hadoop/hadoop-0.20.2
starting jobtracker, logging to /export/home/ncit-cluster/prof/alexandru.herisanu/hadoop-alexandru.herisanu-jobtracker-nehalem-wn14.grid.pub.ro.out
modified context of job 149654
modified context of job 149654
Stopping Hadoop PE
stopping jobtracker
nehalem-wn11.grid.pub.ro: stopping tasktracker
nehalem-wn13.grid.pub.ro: stopping tasktracker
nehalem-wn14.grid.pub.ro: stopping tasktracker
nehalem-wn12.grid.pub.ro: stopping tasktracker
[alexandru.herisanu@fep-53-1 ex3]$
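
One gotcha if you want to re-run the example: Hadoop refuses to start a job whose output directory already exists, so remove it first (Hadoop 0.20 fs syntax; adjust the path to your directory):

module load libraries/hadoop-0.20.2

# A new run fails while the previous output directory is still there.
hadoop fs -rmr /user/alexandru.herisanu/myjob/output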

Done!