Thursday, September 30, 2010

Hadoop with openSUSE -- English Version

This document is for openSUSE users who want to use Hadoop.

Environment setup note:

OS: openSUSE 11.2 ( sure, of course ^^)
HD: 80GB

Prepare two PCs: one for the single-host practice and both for the cluster practice.
Set the IP addresses according to your own environment.

Server1:
10.10.x.y    server.digitalairlines.com    server

Server2:
10.10.v.w    server2.digitalairlines.com    server2



Partition
  • swap 1GB
  • /         73.5GB


User Admin
  • User: root  password: linux
  • User: max  password:  linux


Software
  • select Base Development Packages
  • update openSUSE packages
  • install the  java-1_6_0-sun-devel  package ( I found OpenJDK has problems  ^^||)( it is available in the update repositories)


Services ( Daemons)
  • Activate  sshd  and enable it at boot (an optional status check follows the commands below)
    • #rcsshd  start
    • #chkconfig  sshd  on
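(Optional check) To confirm the daemon is running and enabled at boot, you can use:
    • #rcsshd  status
    • #chkconfig  sshd
The second command should print  sshd  on.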


For mass deployment (cloning)
  • In  /etc/fstab  and  /boot/grub/menu.lst, refer to the hard disk by device name ( /dev/sda1 ) instead of  /dev/disk/by-id   -- required if you want to clone your hard disks to deploy them!! (see the example below)
  • Delete  /etc/udev/rules.d/70-persistent-net.rules  for the network interface card ( if you don’t delete it, the NIC on a cloned disk will be named eth1 instead of eth0 )
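For example, a line in  /etc/fstab  such as (the by-id name here is only illustrative; match the device numbers to your own partition layout):
/dev/disk/by-id/ata-SAMPLE_DISK_SERIAL-part1   swap   swap   defaults   0 0
would become:
/dev/sda1   swap   swap   defaults   0 0
The kernel and root entries in  /boot/grub/menu.lst  need the same kind of change.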



Prepare the software


---------------------------------------- Practice  ------------------------------------------
Hadoop with single host

At Server1
Please log in as  max  with password  linux
Please note that the shell prompt is shown as  >

Step 1. Create an SSH key for passwordless SSH login
Use the non-interactive method to create the DSA key pair on Server1

>ssh-keygen  -N ''  -d  -q  -f  ~/.ssh/id_dsa

Copy the public key to  authorized_keys
>cp  ~/.ssh/id_dsa.pub   ~/.ssh/authorized_keys

>ssh-add   ~/.ssh/id_dsa
Identity added: /root/.ssh/id_dsa (/root/.ssh/id_dsa)
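If  ssh-add  reports that it cannot connect to the authentication agent (this depends on how you logged in), start an agent for the current shell first; a minimal example:
>eval  $(ssh-agent)
>ssh-add   ~/.ssh/id_dsa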


Test the SSH login without a password -- using the key
>ssh  localhost
The authenticity of host 'localhost (: :1)' can't be established.
RSA key fingerprint is 05:22:61:78:05:04:7e:d1:81:67:f2:d5:8a:42:bb:9f.
Are you sure you want to continue connecting (yes/no)?   Please type  yes

Log out of SSH
>exit

Step 2.  Install  Hadoop
Extract the Hadoop package (we prepared it at  /opt/OSSF) -- please use sudo to do it
(because a regular user has no write permission in the  /opt  folder)

>sudo  tar  zxvf   /opt/OSSF/hadoop-0.20.2.tar.gz   -C   /opt

It will ask for the  root  password; please enter  linux

Change the owner of  /opt/hadoop-0.20.2  to  max  and the group to  users
> sudo  chown   -R  max:users   /opt/hadoop-0.20.2/

Create  /var/hadoop Folder
> sudo  mkdir   /var/hadoop

Change the owner of  /var/hadoop  to  max  and the group to  users
> sudo  chown  -R  max:users   /var/hadoop/


Step 3.  Set up Hadoop Configuration


3-1. Set up environment with  hadoop-env.sh
>vi   /opt/hadoop-0.20.2/conf/hadoop-env.sh
#Please add these settings (adjust to your environment)
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-sun
export HADOOP_HOME=/opt/hadoop-0.20.2
export HADOOP_CONF_DIR=/opt/hadoop-0.20.2/conf
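(Optional check, assuming the  java-1_6_0-sun-devel  package installed into the path above) Verify that the JAVA_HOME path really exists before continuing:
>ls  -d  /usr/lib/jvm/java-1.6.0-sun
>/usr/lib/jvm/java-1.6.0-sun/bin/java  -version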

3-2.  Add the following to  core-site.xml  between <configuration> and </configuration>
(you can copy and paste it ^^)
>vi   /opt/hadoop-0.20.2/conf/core-site.xml

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop/hadoop-${user.name}</value>
</property>

</configuration>

3-3. Add the following to  hdfs-site.xml  (to set the replication factor) between <configuration> and </configuration>
(you can copy and paste it ^^)
>vi   /opt/hadoop-0.20.2/conf/hdfs-site.xml

<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

</configuration>

3-4. Add the following to  mapred-site.xml  (for the JobTracker) between <configuration> and </configuration>
(you can copy and paste it ^^)

>vi   /opt/hadoop-0.20.2/conf/mapred-site.xml

<configuration>

<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>

</configuration>

Step 4. Format  HDFS
>/opt/hadoop-0.20.2/bin/hadoop   namenode   -format
10/07/20 00:51:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = server/127.0.0.2
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/07/20 00:51:13 INFO namenode.FSNamesystem: fsOwner=max,users,video
10/07/20 00:51:13 INFO namenode.FSNamesystem: supergroup=supergroup
10/07/20 00:51:13 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/07/20 00:51:14 INFO common.Storage: Image file of size 93 saved in 0 seconds.
10/07/20 00:51:14 INFO common.Storage: Storage directory /var/hadoop/hadoop-max/dfs/name has been successfully formatted.
10/07/20 00:51:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at server/127.0.0.2
************************************************************/

Step 5. Start  hadoop
>/opt/hadoop-0.20.2/bin/start-all.sh
starting namenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-namenode-server.out
localhost: starting datanode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-datanode-server.out
localhost: starting secondarynamenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-secondarynamenode-server.out
starting jobtracker, logging to /opt/hadoop-0.20.2/logs/hadoop-max-jobtracker-server.out
localhost: starting tasktracker, logging to /opt/hadoop-0.20.2/logs/hadoop-max-tasktracker-server.out
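(Optional check) The  jps  tool from the Sun JDK lists the running Java daemons; after start-all.sh you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself):
>jps
(If  jps  is not in your PATH, run it as  /usr/lib/jvm/java-1.6.0-sun/bin/jps )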

Step 6. Check the Hadoop status
You can open these pages in a web browser:

Hadoop Map/Reduce Administration (JobTracker)
http://localhost:50030

Hadoop Task Tracker
http://localhost:50060

Hadoop DFS (NameNode)
http://localhost:50070
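(Optional check) You can also query the HDFS status from the command line; the report should show one live datanode:
>/opt/hadoop-0.20.2/bin/hadoop   dfsadmin   -report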



Lab 2  HDFS command practice

1. Show the  hadoop  command help
>/opt/hadoop-0.20.2/bin/hadoop   fs

Use the  hadoop  command to list HDFS
( since we haven’t uploaded any files to HDFS yet, it will show an error message )
>/opt/hadoop-0.20.2/bin/hadoop   fs   -ls


2. Upload the  /opt/hadoop-0.20.2/conf  folder to  HDFS  and name it  input
The syntax is:
#hadoop command                              upload           Local-Dir              HDFS-Folder-Name
>/opt/hadoop-0.20.2/bin/hadoop   fs   -put   /opt/hadoop-0.20.2/conf   input


3. Please check  HDFS  again
3-1 check the HDFS
> /opt/hadoop-0.20.2/bin/hadoop  fs   -ls
Found 1 items
drwxr-xr-x   - max supergroup       0 2010-07-18 21:16 /user/max/input

If you don’t specify a path, the default path is   /user/username
You can also use an absolute path, for example
> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/

Tip: You can check the  /var/hadoop  folder before and after uploading to HDFS
(you will see the local folder change)
>ls  -lh  /var/hadoop/hadoop-max/dfs/data/current/

3-2 List  input  folder on HDFS
>/opt/hadoop-0.20.2/bin/hadoop   fs    -ls   input
Found 13 items
-rw-r--r--   1 max supergroup    3936 2010-07-21 16:00 /user/max/input/capacity-scheduler.xml
-rw-r--r--   1 max supergroup     535 2010-07-21 16:00 /user/max/input/configuration.xsl
-rw-r--r--   1 max supergroup     379 2010-07-21 16:00 /user/max/input/core-site.xml
-rw-r--r--   1 max supergroup    2367 2010-07-21 16:00 /user/max/input/hadoop-env.sh
-rw-r--r--   1 max supergroup    1245 2010-07-21 16:00 /user/max/input/hadoop-metrics.properties
-rw-r--r--   1 max supergroup    4190 2010-07-21 16:00 /user/max/input/hadoop-policy.xml
-rw-r--r--   1 max supergroup     254 2010-07-21 16:00 /user/max/input/hdfs-site.xml
-rw-r--r--   1 max supergroup    2815 2010-07-21 16:00 /user/max/input/log4j.properties
-rw-r--r--   1 max supergroup     270 2010-07-21 16:00 /user/max/input/mapred-site.xml
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/masters
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/slaves
-rw-r--r--   1 max supergroup    1243 2010-07-21 16:00 /user/max/input/ssl-client.xml.example
-rw-r--r--   1 max supergroup    1195 2010-07-21 16:00 /user/max/input/ssl-server.xml.example

4. Download files from  HDFS  to local
Please check your local folder first
>ls

Use the command  “ hadoop  fs  -get ”  to download it
>/opt/hadoop-0.20.2/bin/hadoop   fs   -get   input    fromHDFS

Please check your local folder again
>ls
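(Optional check) Since  fromHDFS  is a copy of the conf folder we uploaded, a plain diff should show no differences:
>diff  -r   /opt/hadoop-0.20.2/conf   fromHDFS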


5. Use -cat to check the file on HDFS
>/opt/hadoop-0.20.2/bin/hadoop   fs   -cat   input/slaves
localhost

6. Delete files on  HDFS  with  -rm  ( for directories, use  -rmr )
Check the files in the  input  folder first; you will see that  /user/max/input/slaves  exists
> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/input
Found 13 items
-rw-r--r--   1 max supergroup    3936 2010-07-21 16:00 /user/max/input/capacity-scheduler.xml
-rw-r--r--   1 max supergroup     535 2010-07-21 16:00 /user/max/input/configuration.xsl
-rw-r--r--   1 max supergroup     379 2010-07-21 16:00 /user/max/input/core-site.xml
-rw-r--r--   1 max supergroup    2367 2010-07-21 16:00 /user/max/input/hadoop-env.sh
-rw-r--r--   1 max supergroup    1245 2010-07-21 16:00 /user/max/input/hadoop-metrics.properties
-rw-r--r--   1 max supergroup    4190 2010-07-21 16:00 /user/max/input/hadoop-policy.xml
-rw-r--r--   1 max supergroup     254 2010-07-21 16:00 /user/max/input/hdfs-site.xml
-rw-r--r--   1 max supergroup    2815 2010-07-21 16:00 /user/max/input/log4j.properties
-rw-r--r--   1 max supergroup     270 2010-07-21 16:00 /user/max/input/mapred-site.xml
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/masters
-rw-r--r--   1 max supergroup      10 2010-07-21 16:00 /user/max/input/slaves
-rw-r--r--   1 max supergroup    1243 2010-07-21 16:00 /user/max/input/ssl-client.xml.example
-rw-r--r--   1 max supergroup    1195 2010-07-21 16:00 /user/max/input/ssl-server.xml.example

Use  hadoop fs -rm  to delete the file named  slaves
>/opt/hadoop-0.20.2/bin/hadoop   fs   -rm   input/slaves
Deleted hdfs://localhost:9000/user/max/input/slaves

Check the files in the  input  folder again; you will see that  /user/max/input/slaves  no longer exists
> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/input
Found 12 items
-rw-r--r--   1 max supergroup    3936 2010-07-22 15:08 /user/max/input/capacity-scheduler.xml
-rw-r--r--   1 max supergroup     535 2010-07-22 15:08 /user/max/input/configuration.xsl
-rw-r--r--   1 max supergroup     379 2010-07-22 15:08 /user/max/input/core-site.xml
-rw-r--r--   1 max supergroup    2367 2010-07-22 15:08 /user/max/input/hadoop-env.sh
-rw-r--r--   1 max supergroup    1245 2010-07-22 15:08 /user/max/input/hadoop-metrics.properties
-rw-r--r--   1 max supergroup    4190 2010-07-22 15:08 /user/max/input/hadoop-policy.xml
-rw-r--r--   1 max supergroup     254 2010-07-22 15:08 /user/max/input/hdfs-site.xml
-rw-r--r--   1 max supergroup    2815 2010-07-22 15:08 /user/max/input/log4j.properties
-rw-r--r--   1 max supergroup     270 2010-07-22 15:08 /user/max/input/mapred-site.xml
-rw-r--r--   1 max supergroup      10 2010-07-22 15:08 /user/max/input/masters
-rw-r--r--   1 max supergroup    1243 2010-07-22 15:08 /user/max/input/ssl-client.xml.example
-rw-r--r--   1 max supergroup    1195 2010-07-22 15:08 /user/max/input/ssl-server.xml.example

Use  hadoop  fs  -rmr  to delete a folder
>/opt/hadoop-0.20.2/bin/hadoop   fs   -rmr   input
Deleted hdfs://localhost:9000/user/max/input



Lab 3  Hadoop example practice

1. The grep example

1-1 Upload the  /opt/hadoop-0.20.2/conf  folder to  HDFS  and name it  source
The syntax is:
#hadoop                                           upload      LocalFolder                       HDFS-Folder-Name
>/opt/hadoop-0.20.2/bin/hadoop   fs   -put   /opt/hadoop-0.20.2/conf           source

1-2 Check that the  source  folder was uploaded successfully

> /opt/hadoop-0.20.2/bin/hadoop   fs   -ls   /user/max/
Found 1 items
drwxr-xr-x   - max supergroup       0 2010-07-23 15:13 /user/max/source


1-3 Use the grep example to search the files in the  source  folder for text matching the pattern  'dfs[a-z.]+' , and save the result to  output-1

>/opt/hadoop-0.20.2/bin/hadoop   jar   /opt/hadoop-0.20.2/hadoop-0.20.2-examples.jar grep   source   output-1    'dfs[a-z.]+'
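While the job is running, you can watch its progress on the JobTracker page ( http://localhost:50030 ) or, if you prefer the command line, with:
>/opt/hadoop-0.20.2/bin/hadoop   job   -list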

1-4  Check Result
>/opt/hadoop-0.20.2/bin/hadoop   fs   -ls    output-1
Found 2 items
drwxr-xr-x   - max supergroup       0 2010-07-20 00:33 /user/max/output-1/_logs
-rw-r--r--   1 max supergroup      96 2010-07-20 00:33 /user/max/output-1/part-00000

>/opt/hadoop-0.20.2/bin/hadoop  fs   -cat    output-1/part-00000
3    dfs.class
2    dfs.period
1    dfs.file
1    dfs.replication
1    dfs.servers
1    dfsadmin
1    dfsmetrics.log

2. wordcount practice

2-1 Count the word occurrences in the  source  folder and save the result to  output-2
>/opt/hadoop-0.20.2/bin/hadoop   jar /opt/hadoop-0.20.2/hadoop-0.20.2-examples.jar wordcount   source   output-2

2-2 Check result
>/opt/hadoop-0.20.2/bin/hadoop   fs   -ls    output-2
Found 2 items
drwxr-xr-x   - max supergroup       0 2010-07-20 02:00 /user/max/output-2/_logs
-rw-r--r--   1 max supergroup   10886 2010-07-20 02:01 /user/max/output-2/part-r-00000

Display the result with -cat
>/opt/hadoop-0.20.2/bin/hadoop   fs  -cat   output-2/part-r-00000
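If you only want the most frequent words, you can pipe the output through standard shell tools (just a local convenience, not part of Hadoop):
>/opt/hadoop-0.20.2/bin/hadoop   fs  -cat   output-2/part-r-00000  |  sort  -k2  -n  |  tail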





Lab 4  Hadoop Cluster  

-- Please do this on  Server 2 --
Log in as  max  with password  linux
1. Prepare the folders

>sudo  mkdir   /opt/hadoop-0.20.2
It will ask for the  root  password; please enter  linux

>sudo  mkdir   /var/hadoop
>sudo  chown   -R  max:users   /opt/hadoop-0.20.2/
>sudo  chown   -R  max:users   /var/hadoop

Set up name resolution ( this is very important )
>sudo   vi   /etc/hosts
Please comment out  server2’s  127.0.0.2 name-resolution line
#127.0.0.2    server2.digitalairlines.com    server2
Please add the  server1  and  server2  IP addresses ( adjust to your environment )
10.10.x.y    server.digitalairlines.com    server
10.10.v.w    server2.digitalairlines.com    server2
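(Optional check) After saving /etc/hosts, confirm that both names resolve to the expected addresses:
>ping  -c  1   server
>ping  -c  1   server2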


-----------------------------------------------------------------------------------------------------------------------

-- Please do this on  Server 1 --

1-1 stop hadoop
>/opt/hadoop-0.20.2/bin/stop-all.sh

1-2 Delete the old Hadoop data
>rm  -rf   /var/hadoop/*

1-3 Modify the NameNode configuration
>vi   /opt/hadoop-0.20.2/conf/core-site.xml
Please change
                            <value>hdfs://localhost:9000</value>
to  server1’s  IP
                            <value>hdfs://Srv1’s ip:9000</value>

***You can use  “>ip address show”  or  “/sbin/ifconfig”  to display the IP address***

1-4  Modify  HDFS replication setting
>vi   /opt/hadoop-0.20.2/conf/hdfs-site.xml
Please change
            <value>1</value>
to
            <value>2</value>

1-5 Modify the JobTracker setting
>vi  /opt/hadoop-0.20.2/conf/mapred-site.xml
Please change
<value>localhost:9001</value>
to
<value>Srv1’s ip:9001</value>

1-6  Set up the  slaves  file (the hosts listed as slaves take the  datanode  and  tasktracker  roles); an example of the finished file is shown after the note below
>vi  /opt/hadoop-0.20.2/conf/slaves
Please delete  localhost
Please add  Srv1’s ip
Please add  Srv2’s ip

***The  IP address might look like  10.10.x.y ***
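After the edit,  conf/slaves  should contain only the two datanode addresses, one per line, for example (using the placeholder addresses from this document):
10.10.x.y
10.10.v.w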

1-7 Set up name resolution
>sudo   vi   /etc/hosts
Please comment out  server1’s  127.0.0.2 name-resolution line
#127.0.0.2    server.digitalairlines.com    server
Please add the  server1  and  server2  IP addresses for name resolution
10.10.x.y    server.digitalairlines.com    server
10.10.v.w    server2.digitalairlines.com    server2

1-8 Modify  ssh configuration
>sudo   vi   /etc/ssh/ssh_config
Uncomment  StrictHostKeyChecking  and set it to  no
# StrictHostKeyChecking ask

StrictHostKeyChecking  no

1-9 Copy the SSH key to the other node

>scp   -r   ~/.ssh   Srv2-IP:~/
Warning: Permanently added '10.10.v.w' (RSA) to the list of known hosts.
Password: please enter  max’s  password ( linux )

Test SSH login without a password -- using the key
Connect to  server1
>ssh    Srv1’s IP
>exit
Connect to  server2
>ssh    Srv2’s IP
>exit

1-10 Copy  hadoop to Server 2
>scp   -r   /opt/hadoop-0.20.2/*    Srv2-IP:/opt/hadoop-0.20.2/

1-11 Format HDFS
>/opt/hadoop-0.20.2/bin/hadoop   namenode   -format


1-12 Start DFS ( it uses  /opt/hadoop-0.20.2/conf/slaves  to decide which datanodes to start )
>/opt/hadoop-0.20.2/bin/start-dfs.sh
Please check that there are 2 datanodes:
starting namenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-namenode-linux-7tce.out
10.10.x.y: starting datanode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-datanode-linux-7tce.out
10.10.v.w: starting datanode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-datanode-server2.out
localhost: starting secondarynamenode, logging to /opt/hadoop-0.20.2/logs/hadoop-max-secondarynamenode-linux-7tce.out

Please Check  “ http://Srv1’s IP:50070/ ”
Please check  “Live Nodes” -- It should be  2 
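(Optional check) The same information is available from the command line; the report should list 2 live datanodes:
>/opt/hadoop-0.20.2/bin/hadoop   dfsadmin   -report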

1-13 Start JobTracker 
>/opt/hadoop-0.20.2/bin/start-mapred.sh

Please Check  “ http://Srv1’s IP:50030/ ”
Please Check  “Nodes” -- It should be  2 

Now you can run the example programs from Lab 3 again to verify the cluster.