2011-09-06

Hadoop-Filesystem-Tools

If you’re like me, you get frustrated by the amount of typing that is required to copy a file from your Hadoop filesystem to your local filesystem, e.g.:

hdfs dfs -get hdfs://xxx/very/long/path/to/a/file \
    /yyy/very/long/path/to/a/file

Also, if you are like me, you want the directory structures of the two filesystems to be mirror-images. This means you typically have to type a common path component twice, which is redundant, time consuming, and error prone.

To address this issue (and to exercise my Bash scripting skills), I hacked together a collection of shell scripts, collectively called HDFS-Tools, that automate this process. HDFS-Tools simplify the management of files in your Hadoop Filesystem by helping to synchronize a local copy of the filesystem with HDFS.

How Does It Work?

To enable HDFS-Tools, you first designate a directory to hold the root of the local copy; this is done by setting the HDFS_PREFIX environment variable. Paths relative to HDFS_PREFIX in the local copy are the same as in HDFS.
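
For example, assuming you keep the local mirror at ~/Data/Hdfs-2011-08-28 (the directory used throughout the examples below), you would add something like this to your shell startup:

export HDFS_PREFIX=~/Data/Hdfs-2011-08-28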

Once this is done, copying data between HDFS and your local copy is simply a matter of getting or putting a file; e.g.:

hget <path> 

HDFS-Tools handles the expansion of path arguments into the conventional command format, using HDFS_PREFIX and your HDFS configuration. Furthermore, with some code from rapleaf's dev blog, these commands have been augmented with filename auto-completion. Together, these features make hget, hput, etc., more convenient than using:

hdfs dfs -get <hdfs_path> <local_path>

Say goodbye to the frustration of typing long paths in HDFS. Indeed, you rarely need to type more than the commands themselves.
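
As a concrete illustration, with HDFS_PREFIX set to ~/Data/Hdfs-2011-08-28 and an HDFS rooted at hdfs://localhost:9000 (the setup used in the examples below), the following two commands should be equivalent; the file path is borrowed from the examples purely for illustration:

hget user/stu/input/hdfs-site.xml

hdfs dfs -get hdfs://localhost:9000/user/stu/input/hdfs-site.xml \
    ~/Data/Hdfs-2011-08-28/user/stu/input/hdfs-site.xml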

Filename Auto-Completion

Auto-completion is available for hls, hget, and hput by pressing <TAB>. There may be a delay before results are displayed while the query to the remote HDFS is issued. When the CWD is below HDFS_PREFIX, filename auto-completion displays paths relative to the CWD; otherwise, they are relative to HDFS_PREFIX. In the latter case, the paths are displayed with a / prefix.

Auto-completion for directories is a little clunky because a space character is appended to the result. In order to extend the path further, you must type <backspace><TAB>.
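
For the curious, Bash completion of this kind is wired up with the complete builtin, and henv presumably does something along these lines. The sketch below is not the actual henv code, and _hdfs_complete_sketch is a made-up name:

# Offer HDFS paths matching the current word as completions
_hdfs_complete_sketch() {
    local cur=${COMP_WORDS[COMP_CWORD]}
    # Query HDFS (the source of the delay noted above) and keep only the
    # path column of the listing
    COMPREPLY=( $(hdfs dfs -ls "${cur}*" 2>/dev/null | awk '{print $NF}') )
}
complete -F _hdfs_complete_sketch hls hget hput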

Details

HDFS-Tools consists of the following:

hpwd
List the corresponding path in HDFS. When the current working directory resides under HDFS_PREFIX, the hpwd command lists the corresponding location in HDFS. The result has the form: hdfs://host/path. The command hpwd -r lists only the path component, while hpwd -p lists only the hdfs://host/ component.
hls
List files from HDFS. hls [path ..] lists files from HDFS that correspond to path; e.g. hdfs://host/[path ..]. When the current working directory resides under HDFS_PREFIX, the path is relative to it; e.g. hdfs://host/CWD/[path ..]. A recursive directory listing is produced with a -r flag.
hget
Retrieve files from HDFS. hget [path ..] copies the corresponding files from HDFS to the local filesystem. Directories will not be created unless the -p flag is present. Local files will not be overwritten, unless the -f flag is included.
hput
Copy files to HDFS. hput [path ..] copies local files to the corresponding locations in HDFS. HDFS files will not be overwritten, unless the -f flag is included.
hconnect
Connect to a remote HDFS. hconnect opens or closes an ssh tunnel for communication with remote HDFS.
henv
This is a configuration script for HDFS-Tools auto-completion.
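
To make the path expansion concrete, here is a rough sketch of the kind of logic hpwd performs. This is not the actual HDFS-Tools source; hdfs_root and hpwd_sketch are hypothetical names, and the sketch assumes the HDFS root can be read from fs.default.name in $HADOOP_CONF_DIR/core-site.xml:

# Extract the HDFS root, e.g. hdfs://localhost:9000, from the active config
hdfs_root() {
    sed -n 's|.*<value>\(hdfs://[^<]*\)</value>.*|\1|p' \
        "$HADOOP_CONF_DIR/core-site.xml" | head -1
}

# Map the current working directory to the corresponding HDFS location
hpwd_sketch() {
    local rel=""
    case "$PWD" in
        "$HDFS_PREFIX"*) rel="${PWD#$HDFS_PREFIX}" ;;  # below the local mirror
    esac
    echo "$(hdfs_root)${rel:-/}"
}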

Notes

  • Use option -h to display help for a command, and -v for extra debugging information.
  • When the current working directory is outside of HDFS_PREFIX, HDFS-Tools behave as though they have been invoked with the current working directory set to HDFS_PREFIX.
  • One drawback of HDFS-Tools is that filename globbing is not supported, so you cannot do things like hget '[io]*'.

Installation & Setup

HDFS-Tools are available on GitHub.

Note: HDFS-Tools are configured for use with Hadoop 0.21.0.

Bare Minimum

  1. Install these scripts somewhere on your path
  2. HDFS_PREFIX – Select the local directory where you wish to mirror HDFS
  3. HADOOP_CONF_DIR – Select the directory containing the active configuration, in order to lookup information on HDFS
  4. Add the following line to your .bash_profile (a complete example follows this list)
    source <HDFS-TOOLS>/henv
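
Putting the bare minimum together, the relevant .bash_profile lines might look like the following; the directories are placeholders for wherever you keep the local mirror, your Hadoop configuration, and the HDFS-Tools scripts:

export HDFS_PREFIX=~/Data/Hdfs-2011-08-28       # local mirror of HDFS
export HADOOP_CONF_DIR=/usr/local/hadoop/conf   # active Hadoop configuration
source ~/bin/hdfs-tools/henv                    # commands + auto-completion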
    

For Remote Connections

  1. HDFS_USER – Set the user name used to connect to the remote hadoop filesystem
  2. HDFS_HOST – Set the host
  3. HDFS_PORT – Set the port

hconnect opens an ssh tunnel to the remote host using ssh -ND $HDFS_PORT $HDFS_USER@$HDFS_HOST
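
For example, using the values that appear in the hconnect examples below (substitute your own user, host, and port):

export HDFS_USER=sta2013
export HDFS_HOST=rodin.med.cornell.edu
export HDFS_PORT=2600

# hconnect will then start the tunnel with, in effect:
ssh -ND "$HDFS_PORT" "$HDFS_USER@$HDFS_HOST"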

Examples Part 1

The first set of examples demonstrates the behavior of HDFS-Tools with CWD=HDFS_PREFIX, where HDFS_PREFIX=~/Data/Hdfs-2011-08-28.

List Files

  1. –> hls

    Found 3 items
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:50 /Users
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:51 /jobtracker
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:51 /user
    

  2. –> hls -v user/stu

    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=/user/stu/input/hdfs-site.xml
    Found 2 items
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:45 /user/stu/input
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:51 /user/stu/output
    

  3. –> hls -v not/a/valid/file

    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=not/a/valid/file
    ls: Cannot access hdfs://localhost:9000//not/a/valid/file: No such file or directory.
    

Get Files

  1. –> hget /user/stu/output

    hget > Local path already exists /Users/stu/Data/Hdfs-2011-08-28/user/stu/output/
    

  2. –> hget -vf /user/stu/output

    hget > Local path already exists /Users/stu/Data/Hdfs-2011-08-28/user/stu/output/
    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=user/stu/output/
    LOCAL_URL=/Users/stu/Data/Hdfs-2011-08-28/user/stu/output/
    LOCAL_DIR=/Users/stu/Data/Hdfs-2011-08-28/user/stu
    

Put Files

  1. –> hput /user/stu/output

    put: Target hdfs://localhost:9000/user/stu/output is a directory
    

  2. –> hput -vf /user/stu/output

    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=user/stu/output
    LOCAL_URL=/Users/stu/Data/Hdfs-2011-08-28/user/stu/output
    HDFS_DIR=user/stu
    

Tab Completion

  1. –> hls <TAB>

    Users       jobtracker  user
    –> hls *
    

  2. –> hget u<TAB>

    –> hget user/stu *
    

  3. –> hput user/stu<TAB>

    /user/stu/input   /user/stu/output
    –> hput /user/stu/ *
    

  4. –> hput user/stu/<TAB>

    /user/stu/input   /user/stu/output
    –> hput /user/stu/*
    

Examples Part 2

When the CWD is located below HDFS_PREFIX, HDFS-Tools use relative paths. For example, with CWD=$HDFS_PREFIX/user/stu:

  1. –> hget <TAB>
    input   output
    –> hget *
    

Examples Part 3

When the CWD is not below HDFS_PREFIX, HDFS-Tools behave as though they were invoked from HDFS_PREFIX. The only difference is that paths on the command line are prefixed with /. In this case, we are using CWD=~.

  1. –> hls

    Found 3 items
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:50 /Users
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:51 /jobtracker
    drwxr-xr-x   – stu supergroup          0 2011-09-03 21:51 /user
    

  2. –> hls <TAB>

    /Users       /jobtracker  /user
    –> hls /*
    

  3. –> hput /use<TAB>

    –> hput /user/ *
    

  4. –> hget /user/stu/input

    hget > Local path already exists /Users/stu/Data/Hdfs-2011-08-28/user/stu/input
    

Examples Part 4

  1. –> hconnect -c

    ENABLED:  0
    RUNNING PROCESS:
    

  2. –> hconnect -t

    ENABLED:  0
    PID:
    ssh -ND 2600 sta2013@rodin.med.cornell.edu
    Started HDFS tunnel with PID: ‘7647’
    

  3. –> hconnect -c

    ENABLED:  1
    RUNNING PROCESS:  7647 ssh -ND 2600 sta2013@rodin.med.cornell.edu
    

  4. –> hconnect

    ENABLED:  1
    PID:  7647
    Stopping HDFS tunnel with PID: ‘7647’
    kill -9 7647
    
