Thursday, September 30, 2010

Hadoop with openSUSE -- English Version

This document is for openSUSE users who want to use Hadoop.

Environment setup note:

OS: openSUSE 11.2 ( sure, of course ^^)
HD: 80GB

Prepare two PCs: one for the single-host practice and both for the cluster practice.
Set the IP addresses according to your own environment.

Server1:
10.10.x.y    server.digitalairlines.com    server

Server2:
10.10.v.w    server2.digitalairlines.com    server2



Partition
  • swap 1GB
  • /         73.5GB


User Admin
  • User: root  password: linux
  • User: max  password:  linux


Software
  • select Base Development Packages
  • update openSUSE packages
  • install the  java-1_6_0-sun-devel  package ( I found OpenJDK has problems  ^^||)( it is available in the update repositories)


Services ( Daemons)
  • Activate  sshd  and enable it at boot (an optional status check follows the commands below)
    • #rcsshd  start
    • #chkconfig  sshd  on
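(Optional check) To confirm the daemon is running and enabled at boot, you can use:
    • #rcsshd  status
    • #chkconfig  sshd
The second command should print  sshd  on.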


For mass deployment (cloning)
  • In  /etc/fstab  and  /boot/grub/menu.lst, refer to the hard disk by device name ( /dev/sda1 ) instead of  /dev/disk/by-id   -- required if you want to clone your hard disks to deploy them!! (see the example below)
  • Delete  /etc/udev/rules.d/70-persistent-net.rules  for the network interface card ( if you don’t delete it, the NIC on a cloned disk will be named eth1 instead of eth0 )
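For example, a line in  /etc/fstab  such as (the by-id name here is only illustrative; match the device numbers to your own partition layout):
/dev/disk/by-id/ata-SAMPLE_DISK_SERIAL-part1   swap   swap   defaults   0 0
would become:
/dev/sda1   swap   swap   defaults   0 0
The kernel and root entries in  /boot/grub/menu.lst  need the same kind of change.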



Prepare the software


---------------------------------------- Practice  ------------------------------------------
Hadoop with single host

At Server1
Please log in as  max  with password  linux
Please note that the shell prompt is shown as  >

Step 1. Create an SSH key for passwordless SSH login
Use the non-interactive method to create the DSA key pair on Server1

>ssh-keygen  -N ''  -d  -q  -f  ~/.ssh/id_dsa

Copy the public key to  authorized_keys
>cp  ~/.ssh/id_dsa.pub   ~/.ssh/authorized_keys

>ssh-add   ~/.ssh/id_dsa
Identity added: /root/.ssh/id_dsa (/root/.ssh/id_dsa)
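If  ssh-add  reports that it cannot connect to the authentication agent (this depends on how you logged in), start an agent for the current shell first; a minimal example:
>eval  $(ssh-agent)
>ssh-add   ~/.ssh/id_dsa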


Test the SSH login without a password -- using the key
>ssh  localhost
The authenticity of host 'localhost (: :1)' can't be established.
RSA key fingerprint is 05:22:61:78:05:04:7e:d1:81:67:f2:d5:8a:42:bb:9f.
Are you sure you want to continue connecting (yes/no)?   Please type  yes

Log out of SSH
>exit

Step 2.  Install  Hadoop
Extract the Hadoop package (we prepared it at  /opt/OSSF) -- please use sudo to do it
(because a regular user has no write permission in the  /opt  folder)

>sudo  tar  zxvf   /opt/OSSF/hadoop-0.20.2.tar.gz   -C   /opt

It will ask for the  root  password; please enter  linux

Change the owner of  /opt/hadoop-0.20.2  to  max  and the group to  users
> sudo  chown   -R  max:users   /opt/hadoop-0.20.2/

Create  /var/hadoop Folder
> sudo  mkdir   /var/hadoop

Change the owner of  /var/hadoop  to  max  and the group to  users
> sudo  chown  -R  max:users   /var/hadoop/


Step 3.  Set up Hadoop Configuration


3-1. Set up environment with  hadoop-env.sh
>vi   /opt/hadoop-0.20.2/conf/hadoop-env.sh
#Please add these settings (adjust to your environment)
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-sun
export HADOOP_HOME=/opt/hadoop-0.20.2
export HADOOP_CONF_DIR=/opt/hadoop-0.20.2/conf
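(Optional check, assuming the  java-1_6_0-sun-devel  package installed into the path above) Verify that the JAVA_HOME path really exists before continuing:
>ls  -d  /usr/lib/jvm/java-1.6.0-sun
>/usr/lib/jvm/java-1.6.0-sun/bin/java  -version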

3-2.  Add the following to  core-site.xml  between <configuration> and </configuration>
(you can copy and paste it ^^)
>vi   /opt/hadoop-0.20.2/conf/core-site.xml

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop/hadoop-${user.name}</value>
</property>

</configuration>

3-3. Add the following to  hdfs-site.xml  (to set the replication factor) between <configuration> and </configuration>
(you can copy and paste it ^^)
>vi   /opt/hadoop-0.20.2/conf/hdfs-site.xml

<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

</configuration>

3-4. Add the following to  mapred-site.xml  (for the JobTracker) between <configuration> and </configuration>
(you can copy and paste it ^^)

>vi   /opt/hadoop-0.20.2/conf/mapred-site.xml

<configuration>

<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>

</configuration>

Step 4. Format  HDFS
>/opt/hadoop-0.20.2/bin/hadoop   namenode   -format
10/07/20 00:51:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = server/127.0.0.2
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/07/20 00:51:13 INFO namenode.FSNamesystem: fsOwner=max,users,video
10/07/20 00:51:13 INFO namenode.FSNamesystem: supergroup=supergroup
10/07/20 00:51:13 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/07/20 00:51:14 INFO common.Storage: Image file of size 93 saved in 0 seconds.
10/07/20 00:51:14 INFO common.Storage: Storage directory /var/hadoop/hadoop-max/dfs/name has been successfully formatted.
10/07/20 00:51:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at server/127.0.0.2
************************************************************/

Step 5. Start  hadoop
>/opt/hadoop-0.20.2/bin/start-all.sh
starting namenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-namenode-server.out
localhost: starting datanode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-datanode-server.out
localhost: starting secondarynamenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-secondarynamenode-server.out
starting jobtracker, logging to /opt/hadoop-0.20.2/logs/hadoop-max-jobtracker-server.out
localhost: starting tasktracker, logging to /opt/hadoop-0.20.2/logs/hadoop-max-tasktracker-server.out
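(Optional check) The  jps  tool from the Sun JDK lists the running Java daemons; after start-all.sh you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself):
>jps
(If  jps  is not in your PATH, run it as  /usr/lib/jvm/java-1.6.0-sun/bin/jps )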

Step 6. Check the Hadoop status
You can open these pages in a web browser:

Hadoop Map/Reduce Administration (JobTracker)
http://localhost:50030

Hadoop Task Tracker
http://localhost:50060

Hadoop DFS (NameNode)
http://localhost:50070
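(Optional check) You can also query the HDFS status from the command line; the report should show one live datanode:
>/opt/hadoop-0.20.2/bin/hadoop   dfsadmin   -report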



Lab 2  HDFS command practice

1. Show the  hadoop  command help
>/opt/hadoop-0.20.2/bin/hadoop   fs

Use the  hadoop  command to list HDFS
( since we haven’t uploaded any files to HDFS yet, it will show an error message )
>/opt/hadoop-0.20.2/bin/hadoop   fs   -ls


2. Upload the  /opt/hadoop-0.20.2/conf  folder to  HDFS  and name it  input
The syntax is:
#hadoop command                              upload           Local-Dir              HDFS-Folder-Name
>/opt/hadoop-0.20.2/bin/hadoop   fs   -put   /opt/hadoop-0.20.2/conf   input


3. Please check  HDFS  again
3-1 check the HDFS
> /opt/hadoop-0.20.2/bin/hadoop  fs   -ls
Found 1 items
drwxr-xr-x   - max supergroup       0 2010-07-18 21:16 /user/max/input

If you don’t specify a path, the default path is   /user/username
You can also use an absolute path, for example
> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/

Tip: You can check the  /var/hadoop  folder before and after uploading to HDFS
(you will see the local folder change)
>ls  -lh  /var/hadoop/hadoop-max/dfs/data/current/

3-2 List  input  folder on HDFS
>/opt/hadoop-0.20.2/bin/hadoop   fs    -ls   input
Found 13 items
-rw-r--r--   1 max supergroup    3936 2010-07-21 16:00 /user/max/input/capacity-scheduler.xml
-rw-r--r--   1 max supergroup     535 2010-07-21 16:00 /user/max/input/configuration.xsl
-rw-r--r--   1 max supergroup     379 2010-07-21 16:00 /user/max/input/core-site.xml
-rw-r--r--   1 max supergroup    2367 2010-07-21 16:00 /user/max/input/hadoop-env.sh
-rw-r--r--   1 max supergroup    1245 2010-07-21 16:00 /user/max/input/hadoop-metrics.properties
-rw-r--r--   1 max supergroup    4190 2010-07-21 16:00 /user/max/input/hadoop-policy.xml
-rw-r--r--   1 max supergroup     254 2010-07-21 16:00 /user/max/input/hdfs-site.xml
-rw-r--r--   1 max supergroup    2815 2010-07-21 16:00 /user/max/input/log4j.properties
-rw-r--r--   1 max supergroup     270 2010-07-21 16:00 /user/max/input/mapred-site.xml
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/masters
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/slaves
-rw-r--r--   1 max supergroup    1243 2010-07-21 16:00 /user/max/input/ssl-client.xml.example
-rw-r--r--   1 max supergroup    1195 2010-07-21 16:00 /user/max/input/ssl-server.xml.example

4. Download files from  HDFS  to local
Please check your local folder first
>ls

Use the command  “ hadoop  fs  -get ”  to download it
>/opt/hadoop-0.20.2/bin/hadoop   fs   -get   input    fromHDFS

Please check your local folder again
>ls
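(Optional check) Since  fromHDFS  is a copy of the conf folder we uploaded, a plain diff should show no differences:
>diff  -r   /opt/hadoop-0.20.2/conf   fromHDFS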


5. Use -cat to check the file on HDFS
>/opt/hadoop-0.20.2/bin/hadoop   fs   -cat   input/slaves
localhost

6. Delete files on  HDFS  with  -rm  ( for directories, use  -rmr )
Check the files in the  input  folder first; you will see that  /user/max/input/slaves  exists
> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/input
Found 13 items
-rw-r--r--   1 max supergroup    3936 2010-07-21 16:00 /user/max/input/capacity-scheduler.xml
-rw-r--r--   1 max supergroup     535 2010-07-21 16:00 /user/max/input/configuration.xsl
-rw-r--r--   1 max supergroup     379 2010-07-21 16:00 /user/max/input/core-site.xml
-rw-r--r--   1 max supergroup    2367 2010-07-21 16:00 /user/max/input/hadoop-env.sh
-rw-r--r--   1 max supergroup    1245 2010-07-21 16:00 /user/max/input/hadoop-metrics.properties
-rw-r--r--   1 max supergroup    4190 2010-07-21 16:00 /user/max/input/hadoop-policy.xml
-rw-r--r--   1 max supergroup     254 2010-07-21 16:00 /user/max/input/hdfs-site.xml
-rw-r--r--   1 max supergroup    2815 2010-07-21 16:00 /user/max/input/log4j.properties
-rw-r--r--   1 max supergroup     270 2010-07-21 16:00 /user/max/input/mapred-site.xml
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/masters
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/slaves
-rw-r--r--   1 max supergroup    1243 2010-07-21 16:00 /user/max/input/ssl-client.xml.example
-rw-r--r--   1 max supergroup    1195 2010-07-21 16:00 /user/max/input/ssl-server.xml.example

Use  hadoop fs -rm  to delete the file named  slaves
>/opt/hadoop-0.20.2/bin/hadoop   fs   -rm   input/slaves
Deleted hdfs://localhost:9000/user/max/input/slaves

Check the files in the  input  folder again; you will see that  /user/max/input/slaves  no longer exists
> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/input
Found 12 items
-rw-r--r--   1 max supergroup    3936 2010-07-22 15:08 /user/max/input/capacity-scheduler.xml
-rw-r--r--   1 max supergroup     535 2010-07-22 15:08 /user/max/input/configuration.xsl
-rw-r--r--   1 max supergroup     379 2010-07-22 15:08 /user/max/input/core-site.xml
-rw-r--r--   1 max supergroup    2367 2010-07-22 15:08 /user/max/input/hadoop-env.sh
-rw-r--r--   1 max supergroup    1245 2010-07-22 15:08 /user/max/input/hadoop-metrics.properties
-rw-r--r--   1 max supergroup    4190 2010-07-22 15:08 /user/max/input/hadoop-policy.xml
-rw-r--r--   1 max supergroup     254 2010-07-22 15:08 /user/max/input/hdfs-site.xml
-rw-r--r--   1 max supergroup    2815 2010-07-22 15:08 /user/max/input/log4j.properties
-rw-r--r--   1 max supergroup     270 2010-07-22 15:08 /user/max/input/mapred-site.xml
-rw-r--r--   1 max supergroup      10 2010-07-22 15:08 /user/max/input/masters
-rw-r--r--   1 max supergroup    1243 2010-07-22 15:08 /user/max/input/ssl-client.xml.example
-rw-r--r--   1 max supergroup    1195 2010-07-22 15:08 /user/max/input/ssl-server.xml.example

Use  hadoop  fs  -rmr  to delete a folder
>/opt/hadoop-0.20.2/bin/hadoop   fs   -rmr   input
Deleted hdfs://localhost:9000/user/max/input



Lab 3  Hadoop example practice

1. The grep example

1-1 Upload the  /opt/hadoop-0.20.2/conf  folder to  HDFS  and name it  source
The syntax is:
#hadoop                                           upload      LocalFolder                       HDFS-Folder-Name
>/opt/hadoop-0.20.2/bin/hadoop   fs   -put   /opt/hadoop-0.20.2/conf           source

1-2 Check that the  source  folder was uploaded successfully

> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/
Found 1 items
drwxr-xr-x   - max supergroup       0 2010-07-23 15:13 /user/max/source


1-3 Use the grep example to search the files in the  source  folder for text matching the pattern  'dfs[a-z.]+' , and save the result to  output-1

>/opt/hadoop-0.20.2/bin/hadoop   jar   /opt/hadoop-0.20.2/hadoop-0.20.2-examples.jar grep   source   output-1    'dfs[a-z.]+'
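While the job is running, you can watch its progress on the JobTracker page ( http://localhost:50030 ) or, if you prefer the command line, with:
>/opt/hadoop-0.20.2/bin/hadoop   job   -list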

1-4  Check Result
>/opt/hadoop-0.20.2/bin/hadoop   fs   -ls    output-1
Found 2 items
drwxr-xr-x   - max supergroup       0 2010-07-20 00:33 /user/max/output-1/_logs
-rw-r--r--   1 max supergroup      96 2010-07-20 00:33 /user/max/output-1/part-00000

>/opt/hadoop-0.20.2/bin/hadoop  fs   -cat    output-1/part-00000
3    dfs.class
2    dfs.period
1    dfs.file
1    dfs.replication
1    dfs.servers
1    dfsadmin
1    dfsmetrics.log

2. wordcount practice

2-1 Count the word occurrences in the  source  folder and save the result to  output-2
>/opt/hadoop-0.20.2/bin/hadoop   jar /opt/hadoop-0.20.2/hadoop-0.20.2-examples.jar wordcount   source   output-2

2-2 Check result
>/opt/hadoop-0.20.2/bin/hadoop   fs   -ls    output-2
Found 2 items
drwxr-xr-x   - max supergroup       0 2010-07-20 02:00 /user/max/output-2/_logs
-rw-r--r--   1 max supergroup   10886 2010-07-20 02:01 /user/max/output-2/part-r-00000

Display the result with -cat
>/opt/hadoop-0.20.2/bin/hadoop   fs  -cat   output-2/part-r-00000
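If you only want the most frequent words, you can pipe the output through standard shell tools (just a local convenience, not part of Hadoop):
>/opt/hadoop-0.20.2/bin/hadoop   fs  -cat   output-2/part-r-00000  |  sort  -k2  -n  |  tail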





Lab 4  Hadoop Cluster  

-- Please do this on  Server 2 --
Log in as  max  with password  linux
1. Prepare the folders

>sudo  mkdir   /opt/hadoop-0.20.2
It will ask for the  root  password; please enter  linux

>sudo  mkdir   /var/hadoop
>sudo  chown   -R  max:users   /opt/hadoop-0.20.2/
>sudo  chown   -R  max:users   /var/hadoop

Set up name resolution ( this is very important )
>sudo   vi   /etc/hosts
Please comment out  server2’s  127.0.0.2 name-resolution line
#127.0.0.2    server2.digitalairlines.com    server2
Please add the  server1  and  server2  IP addresses ( adjust to your environment )
10.10.x.y    server.digitalairlines.com    server
10.10.v.w    server2.digitalairlines.com    server2
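(Optional check) After saving /etc/hosts, confirm that both names resolve to the expected addresses:
>ping  -c  1   server
>ping  -c  1   server2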


-----------------------------------------------------------------------------------------------------------------------

-- Please do this on  Server 1 --

1-1 stop hadoop
>/opt/hadoop-0.20.2/bin/stop-all.sh

1-2 Delete the old Hadoop data
>rm  -rf   /var/hadoop/*

1-3 Modify the NameNode configuration
>vi   /opt/hadoop-0.20.2/conf/core-site.xml
Please change
                            <value>hdfs://localhost:9000</value>
to  server1’s  IP
                            <value>hdfs://Srv1’s ip:9000</value>

***You can use  “>ip address show”  or  “/sbin/ifconfig”  to display the IP address***

1-4  Modify  HDFS replication setting
>vi   /opt/hadoop-0.20.2/conf/hdfs-site.xml
Please change
            <value>1</value>
to
            <value>2</value>

1-5 Modify the JobTracker setting
>vi  /opt/hadoop-0.20.2/conf/mapred-site.xml
Please change
<value>localhost:9001</value>
to
<value>Srv1’s ip:9001</value>

1-6  Set up the  slaves  file (the hosts listed as slaves take the  datanode  and  tasktracker  roles); an example of the finished file is shown after the note below
>vi  /opt/hadoop-0.20.2/conf/slaves
Please delete  localhost
Please add  Srv1’s ip
Please add  Srv2’s ip

***The  IP address might look like  10.10.x.y ***
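After the edit,  conf/slaves  should contain only the two datanode addresses, one per line, for example (using the placeholder addresses from this document):
10.10.x.y
10.10.v.w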

1-7 Set up name resolution
>sudo   vi   /etc/hosts
Please comment out  server1’s  127.0.0.2 name-resolution line
#127.0.0.2    server.digitalairlines.com    server
Please add the  server1  and  server2  IP addresses for name resolution
10.10.x.y    server.digitalairlines.com    server
10.10.v.w    server2.digitalairlines.com    server2

1-8 Modify  ssh configuration
>sudo   vi   /etc/ssh/ssh_config
Uncomment  StrictHostKeyChecking  and set it to  no
# StrictHostKeyChecking ask

StrictHostKeyChecking  no

1-9 Copy the SSH key to the other node

>scp   -r   ~/.ssh   Srv2-IP:~/
Warning: Permanently added '10.10.v.w' (RSA) to the list of known hosts.
Password: please enter  max’s  password ( linux )

Test SSH login without a password -- using the key
Connect to  server1
>ssh    Srv1’s IP
>exit
Connect to  server2
>ssh    Srv2’s IP
>exit

1-10 Copy  hadoop to Server 2
>scp   -r   /opt/hadoop-0.20.2/*    Srv2-IP:/opt/hadoop-0.20.2/

1-11 Format HDFS
>/opt/hadoop-0.20.2/bin/hadoop   namenode   -format


1-12 Start DFS ( it uses  /opt/hadoop-0.20.2/conf/slaves  to decide which datanodes to start )
>/opt/hadoop-0.20.2/bin/start-dfs.sh
Please check that there are 2 datanodes:
starting namenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-namenode-linux-7tce.out
10.10.x.y: starting datanode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-datanode-linux-7tce.out
10.10.v.w: starting datanode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-datanode-server2.out
localhost: starting secondarynamenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-secondarynamenode-linux-7tce.out

Please Check  “ http://Srv1’s IP:50070/ ”
Please check  “Live Nodes” -- It should be  2 
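(Optional check) The same information is available from the command line; the report should list 2 live datanodes:
>/opt/hadoop-0.20.2/bin/hadoop   dfsadmin   -report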

1-13 Start JobTracker 
>/opt/hadoop-0.20.2/bin/start-mapred.sh

Please Check  “ http://Srv1’s IP:50030/ ”
Please Check  “Nodes” -- It should be  2 

Now you can run the example programs from Lab 3 again to verify the cluster.