Hadoop Tutorial

In this tutorial you will create a slice composed of three virtual machines that form a Hadoop cluster. The tutorial will lead you through creating the slice, observing the properties of the slice, and running a Hadoop example that sorts a large dataset.

After completing the tutorial you should be able to:

  1. Use ExoGENI to create a virtual distributed computational cluster.
  2. Create post-boot scripts that use simple template replacement.
  3. Use the basic functionality of a Hadoop filesystem.
  4. Observe resource utilization of compute nodes in a virtual distributed computational cluster.

Create the Request:

(Note: a completed RDF request file can be found here.  The request file can be loaded into Flukes by choosing File -> Open request.)

1.  Start the Flukes application

2.  Create the Hadoop master node by selecting “Node” from the “Add Nodes” menu. Click in the field to place a VM in the request.

3.  Right-click the node and edit its properties.   Set:  (name: “master”,  node type: “XO medium”, image: “Hadoop 2.7.1 (Centos7)”, domain: “System select”).   The name must be “master”.

4. Create a group of Hadoop workers by choosing “Node Group” from the “Add Nodes” menu.  Click in the field to add the group to the request.

5.  Right-click the node group and edit its properties.   Set:  (name: “workers”,  node type: “XO medium”, image: “Hadoop 2.7.1 (Centos7)”, domain: “System select”, Number of Servers: “2”).   The name must be “workers”.

 

6.  Draw a link between the master and the workers.

7.   Edit the properties of the master and the workers group by right-clicking each and choosing “Edit Properties”.  Set the IP address of the master to 172.16.1.1/24 and the workers group to 172.16.1.100/24.  Note that the nodes in the workers group will be assigned sequential IP addresses starting at 172.16.1.100, as shown below.
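
For example, with the two-node workers group used in this tutorial, the addresses work out as follows (these values reappear in the /etc/hosts entries generated by the post-boot script later in the tutorial):

master (Link0)      172.16.1.1
workers/0 (Link0)   172.16.1.100
workers/1 (Link0)   172.16.1.101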

8.  Edit the post-boot script for the master node.   Right-click the master node and choose “Edit Properties”.  Click “PostBoot Script”.  A text box will appear.  Add the following script:

#!/bin/bash

#setup /etc/hosts
echo $master.IP("Link0") $master.Name() >> /etc/hosts
#set ( $sizeWorkerGroup = $workers.size() - 1 )
#foreach ( $j in [0..$sizeWorkerGroup] )
 echo $workers.get($j).IP("Link0") `echo $workers.get($j).Name() | sed 's/\//-/g'` >> /etc/hosts
#end

HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.1/etc/hadoop
CORE_SITE_FILE=${HADOOP_CONF_DIR}/core-site.xml
HDFS_SITE_FILE=${HADOOP_CONF_DIR}/hdfs-site.xml
MAPRED_SITE_FILE=${HADOOP_CONF_DIR}/mapred-site.xml
YARN_SITE_FILE=${HADOOP_CONF_DIR}/yarn-site.xml
SLAVES_FILE=${HADOOP_CONF_DIR}/slaves

echo "" > $CORE_SITE_FILE
cat > $CORE_SITE_FILE << EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>fs.default.name</name>
 <value>hdfs://$master.Name():9000</value>
</property>
</configuration>
EOF

echo "" > $HDFS_SITE_FILE
cat > $HDFS_SITE_FILE << EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
 <name>dfs.datanode.du.reserved</name>
 <!-- cluster variant -->
 <value>20000000000</value>
 <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.
 </description>
 </property>
<property>
 <name>dfs.replication</name>
 <value>2</value>
</property>
<property>
 <name>dfs.name.dir</name>
 <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
 <name>dfs.data.dir</name>
 <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
EOF

echo "" > $MAPRED_SITE_FILE
cat > $MAPRED_SITE_FILE << EOF
<configuration>
 <property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 </property>
</configuration>
EOF

echo "" > $YARN_SITE_FILE
cat > $YARN_SITE_FILE << EOF
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
 <property>
 <name>yarn.resourcemanager.resource-tracker.address</name>
 <value>master:8031</value>
 </property>
 <property>
 <name>yarn.resourcemanager.address</name>
 <value>master:8032</value>
 </property>
 <property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>master:8030</value>
 </property>
 <property>
 <name>yarn.resourcemanager.admin.address</name>
 <value>master:8033</value>
 </property>
 <property>
 <name>yarn.resourcemanager.webapp.address</name>
 <value>master:8088</value>
 </property>
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 </property>
</configuration>
EOF

echo "" > $SLAVES_FILE
cat > $SLAVES_FILE << EOF
#set ( $sizeWorkerGroup = $workers.size() - 1 )
#foreach ( $j in [0..$sizeWorkerGroup] )
 `echo $workers.get($j).Name() | sed 's/\//-/g'` 
#end
EOF


echo "" > /home/hadoop/.ssh/config
cat > /home/hadoop/.ssh/config << EOF
Host `echo $self.IP("Link0") | sed 's/.[0-9][0-9]*$//g'`.* master workers-* 0.0.0.0
 StrictHostKeyChecking no
 UserKnownHostsFile=/dev/null
EOF

chmod 600 /home/hadoop/.ssh/config
chown hadoop:hadoop /home/hadoop/.ssh/config
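
Note that Flukes performs the template replacement (the $master..., $workers..., #set, and #foreach directives) before the script is delivered to each node. For the two-worker slice in this tutorial, the /etc/hosts section of the script above expands to the following lines (the fully expanded script can be seen in the NEuca user data later in this tutorial):

echo 172.16.1.1 master >> /etc/hosts
echo 172.16.1.100 `echo workers/0 | sed 's/\//-/g'` >> /etc/hosts
echo 172.16.1.101 `echo workers/1 | sed 's/\//-/g'` >> /etc/hosts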

9.  Edit the post-boot script for the workers group and add the same script that you used for the master node.

10. Name your slice and submit it to ExoGENI.

11.  Wait for the resources to become Active.

Check the status of the VMs:

1. Log in to the master node.

2.  Observe the properties of the network interfaces. Note that eth0 carries the dataplane IP address (172.16.1.1/24) assigned in step 7 of the request, while ens3 is the node's management interface.

[root@master ~]# ifconfig 
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
 inet 10.103.0.9 netmask 255.255.255.0 broadcast 10.103.0.255
 inet6 fe80::f816:3eff:fe0c:3d02 prefixlen 64 scopeid 0x20<link>
 ether fa:16:3e:0c:3d:02 txqueuelen 1000 (Ethernet)
 RX packets 3518 bytes 823670 (804.3 KiB)
 RX errors 0 dropped 0 overruns 0 frame 0
 TX packets 1704 bytes 218947 (213.8 KiB)
 TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
 inet 172.16.1.1 netmask 255.255.255.0 broadcast 172.16.1.255
 inet6 fe80::fc16:3eff:fe00:2862 prefixlen 64 scopeid 0x20<link>
 ether fe:16:3e:00:28:62 txqueuelen 1000 (Ethernet)
 RX packets 3734 bytes 689012 (672.8 KiB)
 RX errors 0 dropped 0 overruns 0 frame 0
 TX packets 1930 bytes 215880 (210.8 KiB)
 TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
 inet 127.0.0.1 netmask 255.0.0.0
 inet6 ::1 prefixlen 128 scopeid 0x10<host>
 loop txqueuelen 0 (Local Loopback)
 RX packets 251 bytes 31259 (30.5 KiB)
 RX errors 0 dropped 0 overruns 0 frame 0
 TX packets 251 bytes 31259 (30.5 KiB)
 TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

3. Observe the contents of the NEuca user data file. Note that the boot script appears here with the template variables already replaced by the concrete values for this node.

[root@master ~]# neuca-user-data 
[global]
actor_id=b0e0e413-77ef-4775-b476-040ca8377d1d
slice_id=304eaa29-272d-48e2-8640-ec7300196ac9
reservation_id=c9fd992b-a67b-4cc6-8156-a062df9e8889
unit_id=e22734de-f187-41d4-b952-f07263fbfd43
;router= Not Specified
;iscsi_initiator_iqn= Not Specified
slice_name=pruth.hadoop.1
unit_url=http://geni-orca.renci.org/owl/c9d160a4-1ac0-4b9e-9a8f-8613077a6463#master
host_name=master
management_ip=128.120.83.33
physical_host=ucd-w5
nova_id=b8a37d51-bd50-40b6-89ff-6fbb87bb5fc1
[users]
root=no:ssh-dss AAAAB3NzaC1kc3MAAACBALYdIgCmJoZt+4YN7lUlDLR0ebwfsBd+d0Tw3O18JUc2bXoTzwWBGV+BkGtGljzmr1SDrRXkOWgA//pogG7B1vpizHV6K5MoFQDoSEy/64ycEago611xtMt13xuJei0pPyAphv/NrYlD1xZBMuEG9JTe8EK/H43ZhLcK4b1HwWrTAAAAFQDnZNZVDojT0aHHgJqBncy+iBHs9wAAAIBiMfYPoDgVASgknEzBschxTzTFuhof+lxBh0v5i9OsinMuRa1K5wBbA1eo63PKxywQSnODQhItme0Tn8Pp1ETpM0YkzE48K1NxW3l9iBipSRDMEh8aUlfX5R7xfRRY7tUNXlQQAzXYX8ZvXoA+mbZ9BkBXtSNI5uD1z3Gk5k/WQwAAAIBVIVuHJVgSiCw/m8yjCVH1QgO045ACf4l9/3HaoDwFaNrL1WKQvplhz/DVqtWq/2ZAIrwXr/0IgviRRZ/iVpul3s15ecTJzHAhHMaaDn4vuphH6xbs6JHLFyvBQGJy1euoY9BPqtTFZnH7KdWoChCQXfujDrtcx/5MfBn4tO5kQQ== pruth@dhcp152-54-9-28.europa.renci.org:
pruth=yes:ssh-dss AAAAB3NzaC1kc3MAAACBALYdIgCmJoZt+4YN7lUlDLR0ebwfsBd+d0Tw3O18JUc2bXoTzwWBGV+BkGtGljzmr1SDrRXkOWgA//pogG7B1vpizHV6K5MoFQDoSEy/64ycEago611xtMt13xuJei0pPyAphv/NrYlD1xZBMuEG9JTe8EK/H43ZhLcK4b1HwWrTAAAAFQDnZNZVDojT0aHHgJqBncy+iBHs9wAAAIBiMfYPoDgVASgknEzBschxTzTFuhof+lxBh0v5i9OsinMuRa1K5wBbA1eo63PKxywQSnODQhItme0Tn8Pp1ETpM0YkzE48K1NxW3l9iBipSRDMEh8aUlfX5R7xfRRY7tUNXlQQAzXYX8ZvXoA+mbZ9BkBXtSNI5uD1z3Gk5k/WQwAAAIBVIVuHJVgSiCw/m8yjCVH1QgO045ACf4l9/3HaoDwFaNrL1WKQvplhz/DVqtWq/2ZAIrwXr/0IgviRRZ/iVpul3s15ecTJzHAhHMaaDn4vuphH6xbs6JHLFyvBQGJy1euoY9BPqtTFZnH7KdWoChCQXfujDrtcx/5MfBn4tO5kQQ== pruth@dhcp152-54-9-28.europa.renci.org:
[interfaces]
fe163e002862=up:ipv4:172.16.1.1/24
[storage]
[routes]
[scripts]bootscript=#!/bin/bash
 #setup /etc/hosts
 echo 172.16.1.1 master >> /etc/hosts
 echo 172.16.1.100 `echo workers/0 | sed 's/\//-/g'` >> /etc/hosts
 echo 172.16.1.101 `echo workers/1 | sed 's/\//-/g'` >> /etc/hosts
 HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.1/etc/hadoop
 CORE_SITE_FILE=${HADOOP_CONF_DIR}/core-site.xml
 HDFS_SITE_FILE=${HADOOP_CONF_DIR}/hdfs-site.xml
 MAPRED_SITE_FILE=${HADOOP_CONF_DIR}/mapred-site.xml
 YARN_SITE_FILE=${HADOOP_CONF_DIR}/yarn-site.xml
 SLAVES_FILE=${HADOOP_CONF_DIR}/slaves
 echo "" > $CORE_SITE_FILE
 cat > $CORE_SITE_FILE << EOF
 <?xml version="1.0" encoding="UTF-8"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <configuration>
 <property>
 <name>fs.default.name</name>
 <value>hdfs://master:9000</value>
 </property>
 </configuration>
 EOF
 echo "" > $HDFS_SITE_FILE
 cat > $HDFS_SITE_FILE << EOF
 <?xml version="1.0" encoding="UTF-8"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <configuration>
 <property>
 <name>dfs.datanode.du.reserved</name>
 <!-- cluster variant -->
 <value>20000000000</value>
 <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.
 </description>
 </property>
 <property>
 <name>dfs.replication</name>
 <value>2</value>
 </property>
 <property>
 <name>dfs.name.dir</name>
 <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
 </property>
 <property>
 <name>dfs.data.dir</name>
 <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
 </property>
 </configuration>
 EOF
 echo "" > $MAPRED_SITE_FILE
 cat > $MAPRED_SITE_FILE << EOF
 <configuration>
 <property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 </property>
 </configuration>
 EOF
 echo "" > $YARN_SITE_FILE
 cat > $YARN_SITE_FILE << EOF
 <?xml version="1.0"?>
 <configuration>
 <!-- Site specific YARN configuration properties -->
 <property>
 <name>yarn.resourcemanager.resource-tracker.address</name>
 <value>master:8031</value>
 </property>
 <property>
 <name>yarn.resourcemanager.address</name>
 <value>master:8032</value>
 </property>
 <property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>master:8030</value>
 </property>
 <property>
 <name>yarn.resourcemanager.admin.address</name>
 <value>master:8033</value>
 </property>
 <property>
 <name>yarn.resourcemanager.webapp.address</name>
 <value>master:8088</value>
 </property>
 <property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 </property>
 </configuration>
 EOF
 echo "" > $SLAVES_FILE
 cat > $SLAVES_FILE << EOF
 `echo workers/0 | sed 's/\//-/g'` 
 `echo workers/1 | sed 's/\//-/g'` 
 EOF
 echo "" > /home/hadoop/.ssh/config
 cat > /home/hadoop/.ssh/config << EOF
 Host `echo 172.16.1.1 | sed 's/.[0-9][0-9]*$//g'`.* master workers-* 0.0.0.0
 StrictHostKeyChecking no
 UserKnownHostsFile=/dev/null
 EOF
 chmod 600 /home/hadoop/.ssh/config
chown hadoop:hadoop /home/hadoop/.ssh/config

4. Test for connectivity between the VMs.

[root@master ~]# ping workers-0
PING workers-0 (172.16.1.100) 56(84) bytes of data.
64 bytes from workers-0 (172.16.1.100): icmp_seq=1 ttl=64 time=1.21 ms
64 bytes from workers-0 (172.16.1.100): icmp_seq=2 ttl=64 time=0.967 ms
64 bytes from workers-0 (172.16.1.100): icmp_seq=3 ttl=64 time=1.03 ms
64 bytes from workers-0 (172.16.1.100): icmp_seq=4 ttl=64 time=1.06 ms
^C
--- workers-0 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 0.967/1.069/1.212/0.089 ms
[root@master ~]# ping workers-1
PING workers-1 (172.16.1.101) 56(84) bytes of data.
64 bytes from workers-1 (172.16.1.101): icmp_seq=1 ttl=64 time=1.35 ms
64 bytes from workers-1 (172.16.1.101): icmp_seq=2 ttl=64 time=1.06 ms
64 bytes from workers-1 (172.16.1.101): icmp_seq=3 ttl=64 time=0.922 ms
64 bytes from workers-1 (172.16.1.101): icmp_seq=4 ttl=64 time=1.00 ms
^C
--- workers-1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 0.922/1.083/1.351/0.163 ms
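
The workers-0 and workers-1 names resolve because of the /etc/hosts entries written by the post-boot script. As a quick optional check you can view them directly; given the two-worker group used in this tutorial, the output should contain lines like:

[root@master ~]# grep workers /etc/hosts
172.16.1.100 workers-0
172.16.1.101 workers-1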

Start the Hadoop services (HDFS and YARN):

1.  Log in to the master node as root

2.  Switch to the hadoop user

[root@master ~]# su hadoop -
[hadoop@master root]$

3.  Change to the hadoop user’s home directory

[hadoop@master root]$ cd ~

4.  Format the HDFS file system

[hadoop@master ~]$ hdfs namenode -format
15/09/17 18:50:22 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/172.16.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.1
...

...
15/09/17 18:50:24 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/09/17 18:50:24 INFO util.ExitUtil: Exiting with status 0
15/09/17 18:50:24 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/172.16.1.1
************************************************************/

5.  Start the HDFS service

[hadoop@master ~]$ start-dfs.sh
Starting namenodes on [master]
master: Warning: Permanently added 'master,172.16.1.1' (ECDSA) to the list of known hosts.
master: starting namenode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-namenode-master.out
workers-1: Warning: Permanently added 'workers-1,172.16.1.101' (ECDSA) to the list of known hosts.
workers-0: Warning: Permanently added 'workers-0,172.16.1.100' (ECDSA) to the list of known hosts.
workers-1: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-workers-1.out
workers-0: starting datanode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-datanode-workers-0.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop-2.7.1/logs/hadoop-hadoop-secondarynamenode-master.out

6.  Start the YARN service

[hadoop@master ~]$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.7.1/logs/yarn-hadoop-resourcemanager-master.out
workers-1: Warning: Permanently added 'workers-1,172.16.1.101' (ECDSA) to the list of known hosts.
workers-0: Warning: Permanently added 'workers-0,172.16.1.100' (ECDSA) to the list of known hosts.
workers-1: starting nodemanager, logging to /home/hadoop/hadoop-2.7.1/logs/yarn-hadoop-nodemanager-workers-1.out
workers-0: starting nodemanager, logging to /home/hadoop/hadoop-2.7.1/logs/yarn-hadoop-nodemanager-workers-0.out
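
Optionally, verify that the expected daemons are running with the JDK's jps tool (an extra check, assuming jps is on the hadoop user's PATH in this image). Based on the log messages above, the master should be running a NameNode, SecondaryNameNode, and ResourceManager, and each worker a DataNode and NodeManager.

[hadoop@master ~]$ jps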

7.   Check that HDFS has started

[hadoop@master ~]$ hdfs dfsadmin -report
Configured Capacity: 14824083456 (13.81 GB)
Present Capacity: 8522616832 (7.94 GB)
DFS Remaining: 8522567680 (7.94 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (2):

Name: 172.16.1.101:50010 (workers-1)
Hostname: workers-1
Decommission Status : Normal 
Configured Capacity: 7412041728 (6.90 GB) 
DFS Used: 24576 (24 KB) 
Non DFS Used: 3150721024 (2.93 GB)
DFS Remaining: 4261296128 (3.97 GB)
DFS Used%: 0.00%
DFS Remaining%: 57.49%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Sep 17 18:55:41 UTC 2015

Name: 172.16.1.100:50010 (workers-0)
Hostname: workers-0 
Decommission Status : Normal
Configured Capacity: 7412041728 (6.90 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 3150745600 (2.93 GB)
DFS Remaining: 4261271552 (3.97 GB)
DFS Used%: 0.00%
DFS Remaining%: 57.49%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Sep 17 18:55:41 UTC 2015

 Simple HDFS tests:

1.  Create a small test file

[hadoop@master ~]$ echo Hello ExoGENI World > hello.txt

2. Put the test file into the Hadoop filesystem

[hadoop@master ~]$ hdfs dfs -put hello.txt /hello.txt

3. Check for the file’s existence

[hadoop@master ~]$ hdfs dfs -ls /
Found 1 items
-rw-r--r-- 2 hadoop supergroup 20 2015-09-17 19:11 /hello.txt

4. Check the contents of the file

[hadoop@master ~]$ hdfs dfs -cat /hello.txt
Hello ExoGENI World
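
If you want to explore further, here are a few more basic HDFS operations to try (these are standard hdfs dfs sub-commands, not part of the original exercise; the /testdir path is just an example):

[hadoop@master ~]$ hdfs dfs -mkdir /testdir
[hadoop@master ~]$ hdfs dfs -cp /hello.txt /testdir/hello-copy.txt
[hadoop@master ~]$ hdfs dfs -ls /testdir
[hadoop@master ~]$ hdfs dfs -rm -r /testdir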

Run the Hadoop Sort Testcase:

Test the true power of the Hadoop filesystem by creating and sorting a large random dataset. It may be useful and interesting to log in to the master and/or worker VMs and use tools such as top, iotop, and iftop to observe the resource utilization of each VM during the sort test (see the example below). Note: on these VMs, iotop and iftop must be run as root.
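
For example, from a second terminal on one of the worker VMs you could watch CPU, disk, and network activity while the jobs below run (a minimal sketch; the prompt assumes you are root on workers-0, and eth0 is the dataplane interface shown in the ifconfig output earlier):

[root@workers-0 ~]# top
[root@workers-0 ~]# iotop -o
[root@workers-0 ~]# iftop -i eth0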

1.  Create a 1 GB random data set.  After the data is created, use the ls functionality to confirm that the data exists (see the check after the job output below). Note that the data is composed of several files in a directory.

[hadoop@master ~]$ hadoop jar /home/hadoop/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teragen 10000000 /input
15/09/17 19:15:41 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.1.1:8032
15/09/17 19:15:42 INFO terasort.TeraSort: Generating 10000000 using 2
15/09/17 19:15:42 INFO mapreduce.JobSubmitter: number of splits:2
15/09/17 19:15:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442515994234_0001
15/09/17 19:15:43 INFO impl.YarnClientImpl: Submitted application application_1442515994234_0001
15/09/17 19:15:43 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1442515994234_0001/
15/09/17 19:15:43 INFO mapreduce.Job: Running job: job_1442515994234_0001
15/09/17 19:15:53 INFO mapreduce.Job: Job job_1442515994234_0001 running in uber mode : false
15/09/17 19:15:53 INFO mapreduce.Job: map 0% reduce 0%
15/09/17 19:16:05 INFO mapreduce.Job: map 1% reduce 0%
15/09/17 19:16:07 INFO mapreduce.Job: map 2% reduce 0%
15/09/17 19:16:08 INFO mapreduce.Job: map 3% reduce 0%
15/09/17 19:16:13 INFO mapreduce.Job: map 4% reduce 0%
15/09/17 19:16:17 INFO mapreduce.Job: map 5% reduce 0%
15/09/17 19:16:22 INFO mapreduce.Job: map 6% reduce 0%
...
15/09/17 19:23:10 INFO mapreduce.Job: map 96% reduce 0%
15/09/17 19:23:15 INFO mapreduce.Job: map 97% reduce 0%
15/09/17 19:23:19 INFO mapreduce.Job: map 98% reduce 0%
15/09/17 19:23:24 INFO mapreduce.Job: map 99% reduce 0%
15/09/17 19:23:28 INFO mapreduce.Job: map 100% reduce 0%
15/09/17 19:23:35 INFO mapreduce.Job: Job job_1442515994234_0001 completed successfully
15/09/17 19:23:35 INFO mapreduce.Job: Counters: 31
 File System Counters
 FILE: Number of bytes read=0
 FILE: Number of bytes written=229878
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=167
 HDFS: Number of bytes written=1000000000
 HDFS: Number of read operations=8
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=4
 Job Counters 
 Launched map tasks=2
 Other local map tasks=2
 Total time spent by all maps in occupied slots (ms)=916738
 Total time spent by all reduces in occupied slots (ms)=0
 Total time spent by all map tasks (ms)=916738
 Total vcore-seconds taken by all map tasks=916738
 Total megabyte-seconds taken by all map tasks=938739712
 Map-Reduce Framework
 Map input records=10000000
 Map output records=10000000
 Input split bytes=167
 Spilled Records=0
 Failed Shuffles=0
 Merged Map outputs=0
 GC time elapsed (ms)=1723
 CPU time spent (ms)=21600
 Physical memory (bytes) snapshot=292618240
 Virtual memory (bytes) snapshot=4156821504
 Total committed heap usage (bytes)=97517568
 org.apache.hadoop.examples.terasort.TeraGen$Counters
 CHECKSUM=21472776955442690
 File Input Format Counters 
 Bytes Read=0
 File Output Format Counters 
 Bytes Written=1000000000
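
As noted above, use the ls functionality to confirm that the generated data exists; teragen writes its output as a directory of part files, so expect several entries under /input (the exact names and sizes will vary, so check on your own slice):

[hadoop@master ~]$ hdfs dfs -ls /input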

2. Sort the data

[hadoop@master ~]$ hadoop jar /home/hadoop/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar terasort /input /output
15/09/17 19:25:25 INFO terasort.TeraSort: starting
15/09/17 19:25:27 INFO input.FileInputFormat: Total input paths to process : 2
Spent 216ms computing base-splits.
Spent 3ms computing TeraScheduler splits.
Computing input splits took 220ms
Sampling 8 splits of 8
Making 1 from 100000 sampled records
Computing parititions took 5347ms
Spent 5569ms computing partitions.
15/09/17 19:25:32 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.1.1:8032
15/09/17 19:25:33 INFO mapreduce.JobSubmitter: number of splits:8
15/09/17 19:25:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442515994234_0002
15/09/17 19:25:34 INFO impl.YarnClientImpl: Submitted application application_1442515994234_0002
15/09/17 19:25:34 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1442515994234_0002/
15/09/17 19:25:34 INFO mapreduce.Job: Running job: job_1442515994234_0002
15/09/17 19:25:42 INFO mapreduce.Job: Job job_1442515994234_0002 running in uber mode : false
15/09/17 19:25:42 INFO mapreduce.Job: map 0% reduce 0%
15/09/17 19:25:57 INFO mapreduce.Job: map 11% reduce 0%
15/09/17 19:26:00 INFO mapreduce.Job: map 17% reduce 0%
15/09/17 19:26:07 INFO mapreduce.Job: map 21% reduce 0%
15/09/17 19:26:10 INFO mapreduce.Job: map 25% reduce 0%
15/09/17 19:26:13 INFO mapreduce.Job: map 34% reduce 0%
15/09/17 19:26:14 INFO mapreduce.Job: map 37% reduce 0%
15/09/17 19:26:15 INFO mapreduce.Job: map 41% reduce 0%
15/09/17 19:26:16 INFO mapreduce.Job: map 44% reduce 0%
15/09/17 19:26:17 INFO mapreduce.Job: map 57% reduce 0%
15/09/17 19:26:18 INFO mapreduce.Job: map 58% reduce 0%
15/09/17 19:26:20 INFO mapreduce.Job: map 62% reduce 0%
15/09/17 19:26:22 INFO mapreduce.Job: map 62% reduce 8%
15/09/17 19:26:26 INFO mapreduce.Job: map 66% reduce 8%
15/09/17 19:26:27 INFO mapreduce.Job: map 67% reduce 8%
15/09/17 19:26:28 INFO mapreduce.Job: map 67% reduce 13%
15/09/17 19:26:35 INFO mapreduce.Job: map 72% reduce 13%
15/09/17 19:26:37 INFO mapreduce.Job: map 73% reduce 13%
15/09/17 19:26:38 INFO mapreduce.Job: map 79% reduce 13%
15/09/17 19:26:42 INFO mapreduce.Job: map 84% reduce 13%
15/09/17 19:26:44 INFO mapreduce.Job: map 84% reduce 17%
15/09/17 19:26:45 INFO mapreduce.Job: map 86% reduce 17%
15/09/17 19:26:48 INFO mapreduce.Job: map 89% reduce 17%
15/09/17 19:26:51 INFO mapreduce.Job: map 94% reduce 17%
15/09/17 19:26:53 INFO mapreduce.Job: map 96% reduce 17%
15/09/17 19:26:54 INFO mapreduce.Job: map 100% reduce 17%
...
15/09/17 19:34:21 INFO mapreduce.Job: map 100% reduce 87%
15/09/17 19:34:24 INFO mapreduce.Job: map 100% reduce 93%
15/09/17 19:34:27 INFO mapreduce.Job: map 100% reduce 99%
15/09/17 19:34:28 INFO mapreduce.Job: map 100% reduce 100%
15/09/17 19:34:28 INFO mapreduce.Job: Job job_1442515994234_0002 completed successfully
15/09/17 19:34:29 INFO mapreduce.Job: Counters: 51
 File System Counters
 FILE: Number of bytes read=2080000144
 FILE: Number of bytes written=3121046999
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=1000000816
 HDFS: Number of bytes written=1000000000
 HDFS: Number of read operations=27
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=2
 Job Counters 
 Killed map tasks=3
 Launched map tasks=11
 Launched reduce tasks=1
 Data-local map tasks=9
 Rack-local map tasks=2
 Total time spent by all maps in occupied slots (ms)=455604
 Total time spent by all reduces in occupied slots (ms)=495530
 Total time spent by all map tasks (ms)=455604
 Total time spent by all reduce tasks (ms)=495530
 Total vcore-seconds taken by all map tasks=455604
 Total vcore-seconds taken by all reduce tasks=495530
 Total megabyte-seconds taken by all map tasks=466538496
 Total megabyte-seconds taken by all reduce tasks=507422720
 Map-Reduce Framework
 Map input records=10000000
 Map output records=10000000
 Map output bytes=1020000000
 Map output materialized bytes=1040000048
 Input split bytes=816
 Combine input records=0
 Combine output records=0
 Reduce input groups=10000000
 Reduce shuffle bytes=1040000048
 Reduce input records=10000000
 Reduce output records=10000000
 Spilled Records=30000000
 Shuffled Maps =8
 Failed Shuffles=0
 Merged Map outputs=8
 GC time elapsed (ms)=2276
 CPU time spent (ms)=86920
 Physical memory (bytes) snapshot=1792032768
 Virtual memory (bytes) snapshot=18701041664
 Total committed heap usage (bytes)=1277722624
 Shuffle Errors
 BAD_ID=0
 CONNECTION=0
 IO_ERROR=0
 WRONG_LENGTH=0
 WRONG_MAP=0
 WRONG_REDUCE=0
 File Input Format Counters 
 Bytes Read=1000000000
 File Output Format Counters 
 Bytes Written=1000000000
15/09/17 19:34:29 INFO terasort.TeraSort: done

3.  Look at the output

List the output directory:

[hadoop@master ~]$ hdfs dfs -ls /output
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2015-09-17 19:34 /output/_SUCCESS
-rw-r--r-- 10 hadoop supergroup 0 2015-09-17 19:25 /output/_partition.lst
-rw-r--r-- 1 hadoop supergroup 1000000000 2015-09-17 19:34 /output/part-r-00000

Get the sorted output file:

[hadoop@master ~]$ hdfs dfs -get /output/part-r-00000 part-r-00000

Look at the output file:

[hadoop@master ~]$ hexdump part-r-00000 | less

Does the output match your expectations?
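
Each TeraSort record is binary data, so the hexdump will look like noise, but the 10-byte keys at the start of each 100-byte record should now appear in sorted order. For an automated check, the same examples jar also provides a teravalidate job that verifies the sorted output (an optional extra beyond the original tutorial steps; /validate is just a report directory name chosen here):

[hadoop@master ~]$ hadoop jar /home/hadoop/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teravalidate /output /validate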

 
