
2011-09-06

Hadoop-Filesystem-Tools

If you’re like me, you get frustrated by the amount of typing that is required to copy a file from your Hadoop filesystem to your local filesystem, e.g.:

hdfs dfs -get hdfs://xxx/very/long/path/to/a/file \
    /yyy/very/long/path/to/a/file

Also, if you are like me, you want the directory structures of the two filesystems to be mirror images. This means you typically have to type a common path component twice, which is redundant, time-consuming, and error-prone.

To address this issue (and to exercise my Bash scripting skills), I hacked together a collection of shell scripts that automate this process, together called HDFS-Tools. The HDFS-Tools simplify the management of files in your Hadoop Filesystem by helping to synchronize a local copy of the filesystem with HDFS.

How Does It Work?

To enable HDFS-Tools, one must first designate a directory to hold the root of the local copy; this is done by setting the HDFS_PREFIX environment variable. Paths relative to HDFS_PREFIX in the local copy are the same as in HDFS.
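
For example, the setup might look like this (the mirror directory here is just an illustration; any local directory will do):

# Illustrative setup: mirror HDFS under ~/Data/Hdfs (any local directory works).
export HDFS_PREFIX=~/Data/Hdfs

# A file at hdfs://host/user/stu/input/data.txt then corresponds to the local
# copy at $HDFS_PREFIX/user/stu/input/data.txt.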

Once this is done, copying data between HDFS and your local copy is simply a matter of getting or putting a file; e.g.:

hget <path> 

HDFS-Tools deal with the task of expanding the path arguments to create the conventional command format, using HDFS_PREFIX and your HDFS configuration. Furthermore, with some code from rapleaf’s dev blog, these commands have been augmented with filename auto-completion. Together, these features make hget, hput, etc., more convenient than using:

hdfs dfs -get <hdfs_path> <local_path>
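
For intuition, the expansion can be pictured roughly as follows. This is a hypothetical sketch rather than the actual HDFS-Tools code, and it assumes the current directory sits under HDFS_PREFIX:

# Rough sketch of what hget <path> expands to (hypothetical, simplified).
REL=${PWD#$HDFS_PREFIX}                    # current directory relative to HDFS_PREFIX
HDFS_URL="hdfs://localhost:9000$REL/$1"    # host and port come from the Hadoop configuration
LOCAL_URL="$HDFS_PREFIX$REL/$1"
hdfs dfs -get "$HDFS_URL" "$LOCAL_URL"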

Say goodbye to the frustration of typing long paths in HDFS. Indeed, you rarely need to type more than the commands themselves.

Filename Auto-Completion

Auto-completion is available for hls, hget, and hput by pressing <TAB>. There may be a delay before results are displayed, while the query to the remote HDFS is issued. When the CWD is below HDFS_PREFIX, filename auto-completion displays paths relative to the CWD; otherwise, they are relative to HDFS_PREFIX. In the latter case, the paths are displayed with a / prefix.

Auto-completion for directories is a little clunky because a space character is appended to the result. In order to extend the path further, you must type <backspace><TAB>.
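
The wiring behind this is roughly the following. This is a hypothetical sketch of a Bash completion function, not the actual henv code adapted from rapleaf's post:

# Hypothetical sketch of filename completion for the HDFS-Tools commands.
_hdfs_complete() {
    local cur=${COMP_WORDS[COMP_CWORD]}
    # Ask HDFS for entries matching the partial path; this round trip to the
    # (possibly remote) filesystem is the source of the delay noted above.
    local matches
    matches=$(hdfs dfs -ls "${cur}*" 2>/dev/null | grep -v '^Found' | awk '{print $NF}')
    COMPREPLY=( $(compgen -W "$matches" -- "$cur") )
}
complete -F _hdfs_complete hls hget hput

Registering the function with complete -o nospace would be one way for a script like this to avoid the appended space.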

Details

HDFS-Tools consists of the following:

hpwd
List the corresponding path in HDFS. When the current working directory resides under HDFS_PREFIX, the hpwd command lists the corresponding location in HDFS. The result has the form hdfs://host/path. The command hpwd -r lists only the path component, while hpwd -p lists only the hdfs://host/ component. (A rough sketch of this behavior follows this list.)
hls
List files from HDFS. hls [path ..] lists files from HDFS that correspond to path; e.g. hdfs://host/[path ..]. When the current working directory resides under HDFS_PREFIX, the path is relative to it; e.g. hdfs://host/CWD/[path ..]. A recursive directory listing is produced with a -r flag.
hget
Retrieve files from HDFS. hget [path ..] copies the corresponding files from HDFS to the local filesystem. Directories will not be created unless the -p flag is present. Local files will not be overwritten, unless the -f flag is included.
hput
Copy files to HDFS. hput [path ..] copies local files to the corresponding locations in HDFS. HDFS files will not be overwritten, unless the -f flag is included.
hconnect
Connect to a remote HDFS. hconnect opens or closes an ssh tunnel for communication with remote HDFS.
henv
This is a configuration script for HDFS-Tools auto-completion.
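
To make the path handling concrete, hpwd can be approximated as follows. This is a hypothetical sketch that assumes the filesystem URI can be read from core-site.xml in HADOOP_CONF_DIR; it is not the actual HDFS-Tools source:

# Hypothetical approximation of hpwd (not the actual HDFS-Tools script).
HDFS_ROOT=$(sed -n 's|.*<value>\(hdfs://[^<]*\)</value>.*|\1|p' \
    "$HADOOP_CONF_DIR/core-site.xml" | head -n 1)   # hdfs://host:port from the config
REL_PATH=${PWD#$HDFS_PREFIX}                        # current directory relative to HDFS_PREFIX
case "$1" in
    -r) echo "$REL_PATH" ;;                         # path component only
    -p) echo "$HDFS_ROOT/" ;;                       # hdfs://host/ component only
    *)  echo "$HDFS_ROOT$REL_PATH" ;;               # full hdfs://host/path
esac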

Notes

  • Use option -h to display help for a command, and -v for extra debugging information.
  • When the current working directory is outside of HDFS_PREFIX, HDFS-Tools behave as though they have been invoked with the current working directory set to HDFS_PREFIX.
  • One drawback of HDFS-Tools is that filename globbing is not supported, so you cannot do things like hget '[io]*'.

Installation & Setup

HDFS-Tools are available on GitHub.

Note: HDFS-Tools are configured for use with Hadoop 0.21.0.

Bare Minimum

  1. Install these scripts somewhere on your path
  2. HDFS_PREFIX – Select the local directory where you wish to mirror HDFS
  3. HADOOP_CONF_DIR – Select the directory containing the active configuration, in order to lookup information on HDFS
  4. Add the following line to your .bash_profile
    source <HDFS-TOOLS>/henv
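
Putting the steps together, the relevant additions to a .bash_profile might look like this (the paths are only illustrative):

# Illustrative .bash_profile additions for HDFS-Tools (example paths).
export PATH="$HOME/bin/hdfs-tools:$PATH"       # step 1: put the scripts on your path
export HDFS_PREFIX="$HOME/Data/Hdfs"           # step 2: local mirror of HDFS
export HADOOP_CONF_DIR="$HOME/hadoop/conf"     # step 3: active Hadoop configuration
source "$HOME/bin/hdfs-tools/henv"             # step 4: enable auto-completion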
    

For Remote Connections

  1. HDFS_USER – Set the user name used to connect to the remote hadoop filesystem
  2. HDFS_HOST – Set the host
  3. HDFS_PORT – Set the port

hconnect opens an ssh tunnel to the remote host using ssh -ND $HDFS_PORT $HDFS_USER@$HDFS_HOST
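
For example (hypothetical values; the flag behavior shown matches Examples Part 4 below):

# Hypothetical remote-connection settings; substitute your own values.
export HDFS_USER=stu
export HDFS_HOST=hadoop.example.com
export HDFS_PORT=2600

hconnect -c    # check whether the tunnel is running
hconnect -t    # start the tunnel: ssh -ND $HDFS_PORT $HDFS_USER@$HDFS_HOST
hconnect       # open or close the tunnel, depending on its current state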

Examples Part 1

The first set of examples demonstrates the behavior of HDFS-Tools with CWD=HDFS_PREFIX, where HDFS_PREFIX=~/Data/Hdfs-2011-08-28.

List Files

  1. –> hls

    Found 3 items
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:50 /Users
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:51 /jobtracker
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:51 /user
    

  2. –> hls -v user/stu

    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=/user/stu/input/hdfs-site.xml
    Found 2 items
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:45 /user/stu/input
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:51 /user/stu/output
    

  3. –> hls -v not/a/valid/file

    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=not/a/valid/file
    ls: Cannot access hdfs://localhost:9000//not/a/valid/file: No such file or directory.
    

Get Files

  1. –> hget /user/stu/output

    hget > Local path already exists /Users/stu/Data/Hdfs-2011-08-28/user/stu/output/
    

  2. –> hget -vf /user/stu/output

    hget > Local path already exists /Users/stu/Data/Hdfs-2011-08-28/user/stu/output/
    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=user/stu/output/
    LOCAL_URL=/Users/stu/Data/Hdfs-2011-08-28/user/stu/output/
    LOCAL_DIR=/Users/stu/Data/Hdfs-2011-08-28/user/stu
    

Put Files

  1. –> hput /user/stu/output

    put: Target hdfs://localhost:9000/user/stu/output is a directory
    

  2. –> hput -vf /user/stu/output

    HDFS_PREFIX=/Users/stu/Data/Hdfs-2011-08-28
    HDFS_PWD=
    HDFS_URL=user/stu/output
    LOCAL_URL=/Users/stu/Data/Hdfs-2011-08-28/user/stu/output
    HDFS_DIR=user/stu
    

Tab Completion

  1. –> hls <TAB>

    Users       jobtracker  user
    –> hls *
    

  2. –> hget u<TAB>

    –> hget user/stu *
    

  3. –> hput user/stu<TAB>

    /user/stu/input   /user/stu/output
    –> hput /user/stu/ *
    

  4. –> hput user/stu/<TAB>

    /user/stu/input   /user/stu/output
    –> hput /user/stu/*
    

Examples Part 2

When the CWD is located below HDFS_PREFIX, HDFS-Tools use relative paths. For example, with CWD=$HDFS_PREFIX/user/stu:

  1. –> hget <TAB>
    input   output
    –> hget *
    

Examples Part 3

When the CWD is not below HDFS_PREFIX, HDFS-Tools behave as though they were invoked from HDFS_PREFIX. The only difference is that paths on the command line are prefixed with /. In this case, we are using CWD=~.

  1. –> hls

    Found 3 items
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:50 /Users
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:51 /jobtracker
    drwxr-xr-x   - stu supergroup          0 2011-09-03 21:51 /user
    

  2. –> hls <TAB>

    /Users       /jobtracker  /user
    –> hls /*
    

  3. –> hput /use<TAB>

    –> hput /user/ *
    

  4. –> hget /user/stu/input

    hget > Local path already exists /Users/stu/Data/Hdfs-2011-08-28/user/stu/input
    

Examples Part 4

  1. –> hconnect -c

    ENABLED:  0
    RUNNING PROCESS:
    

  2. –> hconnect -t

    ENABLED:  0
    PID:
    ssh -ND 2600 sta2013@rodin.med.cornell.edu
    Started HDFS tunnel with PID: ‘7647’
    

  3. –> hconnect -c

    ENABLED:  1
    RUNNING PROCESS:  7647 ssh -ND 2600 sta2013@rodin.med.cornell.edu
    

  4. –> hconnect

    ENABLED:  1
    PID:  7647
    Stopping HDFS tunnel with PID: ‘7647’
    kill -9 7647
    

2011-05-17

A-Centralized-VCS-For-Config-Files

Version Control Systems are programs that maintain a history of edits or changes to a collection of files.

I often want to use versioning for individual isolated files, or collections of files, that live somewhere in my directory tree. However, because management of many independent repositories is hard, I am reluctant to create and maintain a repository wherever the file lives if the directory itself is not under active development.

Some examples of files like this are dotfiles in my home directory, and other preference or configuration files throughout my directory tree.

Instructions

Here’s a solution based on a suggestion by Casey Dahlin on Doug Warner’s blog. Comments from Peter Manis, Benjamin Schmidt, and Martin Geisler were also helpful.

The idea is to maintain a single, centralized repository for these files that is easy to manage. Files can be added manually, and regular commits can be made on an ongoing basis, or even automated.
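
For example, regular commits could be automated by running something like the following from cron or another scheduler (illustrative; the commit message is arbitrary):

# Illustrative automation: commit whatever has changed in the $HOME repo.
cd "$HOME" && hg commit -A -m "automatic commit $(date +%F)"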

> initialization

First, create the repo in your home directory

cd $HOME
hg init

> safety precautions

Peter Manis points out that the hg purge command can remove all files in the working directory that have not been added to the repo! He advises explicitly disabling this command for the repo by adding the following to the project-level hgrc file located at $HOME/.hg/hgrc:

[extensions]
hgext.purge = !

> list, add, forget, remove, commit, status

You can list, add, forget, remove, and commit files to the repo with the following commands

hg manifest
hg add <files>
hg forget <files>
hg remove <files>
hg commit -m "Added/removed/changed file(s)"

The status command reports on all files in the working directory whether they have been added to the repo or not. This can take a long time.

hg status
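
If the full scan is too slow, one option is to restrict the report to tracked files (modified, added, removed, deleted), which skips the listing of unknown files:

# Report only on tracked files; unknown files are not listed.
hg status -mard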

> the default repo

To access the centralized repo from directories other than your $HOME directory, set the default path in your user-level .hgrc file, located at $HOME/.hgrc

[paths]
default = $HOME

The centralized repo is accessible only if the current working directory (PWD) is not itself inside the working directory of another repo.

Warning: A danger of using this preference is that you may end up using the centralized repo when you intended to use a local one, for example if you accidentally call hg from a non-versioned directory.

To identify which repo you are working on, just type

hg showconfig bundle.mainreporoot

An easier solution is to add the following to your user-level .hgrc file

[hooks]
pre-status = echo "======\nMAIN REPO ROOT = $PWD\n======"
pre-manifest = echo "======\nMAIN REPO ROOT = $PWD\n======"

This way, whenever you type hg status or hg manifest, you will be told which repo is active.

> the .hgignore file

An alternative strategy for managing repo files is to create an .hgignore file listing the files that you do not wish to be tracked, and then add and commit everything else.

A simple .hgignore file looks like this.

syntax: glob
*~
.*.swp
*.swp
*.o
.hg

syntax: regexp
^[a-zA-Z]

.file1
^\.file2\/file3

This file excludes several standard temporary files, every path beginning with a letter (i.e., ordinary non-dot files and directories), any file matching “.file1”, and paths matching “.file2/file3” relative to the repo’s root (the working directory).

After editing the .hgignore file to your liking, you can preview your choices with

# 1. show "added" files (will be included in the next commit)
hg status -a
# 2. show "unknown" files (will not be included in the next commit)
hg status -u

Then, you can use the following shorthand to 1) add all unknown files and commit the changes to the repo, and 2) view the resulting contents:

hg commit -A -m "Added/removed/changed file(s)"
hg manifest

> .file & .directory sizes

One trick for building the .hgignore file is to detect and exclude LARGE dotfiles and directories. At first, I tried to find these using

ls -lSd .* | head -20

However, ls does not include the contents of directories when reporting sizes. To see the largest items, including the total size of each directory, use the following:

for X in $(du -s .[a-zA-Z]* | sort -nr | cut -f 2); do du -hs $X ; done | head -20

This will produce sorted output, for example

8.2G    .Trash
 99M    .dropbox
 21M    .m2
 19M    .groovy
6.0M    .macports
3.7M    .vim
1.8M    .fontconfig
976K    .ipython
...

> resetting the repo

If you are not happy with the current manifest, and are willing to start again from scratch, use the following commands. WARNING: This will erase any history!

cd $HOME
\rm -rf .hg
hg init
hg commit -A -m "Added/removed/changed file(s)"
hg manifest

This can be repeated while refining the .hgignore file, in order to initialize the repo with the desired contents.

Glossary

Current Working Directory
PWD
Working Directory
“To introduce a little terminology, the .hg directory is the “real” repository, and all of the files and directories that coexist with it are said to live in the working directory. An easy way to remember the distinction is that the repository contains the history of your project, while the working directory contains a snapshot of your project at a particular point in history.”
Hg Init
Create a fresh repo. Fails if a repo already exists in the working directory.
Hg Manifest
List the files currently in the repo.
Hg Add
Add a file to the repo.
Hg Forget
Forget files previously added to the repo, before they have been committed.
Hg Remove
Remove files previously added to the repo, after they have been committed.
Hg Commit
Commit all changes to the repo.

2011-05-06

Cmdline-Convert-Word-Doc-2-Pdf

Have you ever been in need of a convenient way to convert a Word document (.doc) into a Portable Document Format (.pdf)?

I do this fairly often, which prompted me to write this script using Google Docs.

It works by uploading the input file to a collection named “GDocs2Pdf” in your Google Docs account, and then requesting a PDF version for download. If conversion is successful, the converted file is downloaded to the directory containing the input file, with a .pdf extension appended to the filename.

Once the script is installed and placed on your path, simply type gdocs2pdf.sh example-file.doc to run it.

While I regularly use this script to convert Word documents to PDF, other input formats are acceptable. Also, by modifying the --format=pdf argument on line 11, you can request alternative output file formats. Log into your Google Docs account to see which input and output formats are currently supported.
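
For example, to request a plain-text version instead (assuming txt is among the export formats your account supports), line 11 would become:

$GOOGLECL docs get --format=txt --title="$TITLE" --folder="$FOLDER" --dest="$DIR"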

Requirements

  1. A Google Docs account
  2. A collection named “GDocs2Pdf” in your account
  3. The GoogleCL command line utility installed on your system [1][2]
  4. A connection to the internet

Installation

  1. Paste the script into a file named “gdocs2pdf.sh”
  2. Modify the path to the GoogleCL utility on line 7 as appropriate
  3. Place it on your path, for example in your ~/bin directory
  4. Make it executable chmod u+x gdocs2pdf.sh

Script: gdocs2pdf.sh

Please check https://github.com/tub78/GDocs2Pdf for updates to this script.

#!/bin/bash
# usage: gdocs2pdf.sh <file>
[ $# -eq 1 ] || { echo "usage: $(basename "$0") <file>"; exit 1; }
TITLE=$(basename "$1")           # document title used in Google Docs
DIR=$(dirname "$1")              # destination directory for the converted file
FOLDER=GDocs2Pdf                 # Google Docs collection used for conversion
GOOGLECL=/usr/local/bin/google   # path to the GoogleCL utility
echo $GOOGLECL docs upload --title="$TITLE" --folder="$FOLDER" "$1"
$GOOGLECL docs upload --title="$TITLE" --folder="$FOLDER" "$1"
echo $GOOGLECL docs get --format=pdf --title="$TITLE" --folder="$FOLDER" --dest="$DIR"
$GOOGLECL docs get --format=pdf --title="$TITLE" --folder="$FOLDER" --dest="$DIR"

Notes

[1]
GoogleCL is a command line utility written in Python. See the GoogleCL webpage for installation instructions.
[2]
Configuration of GoogleCL is required to authorize access to your account. Configuration settings are initialized upon the first invocation of the tool, whether they are supplied as command line arguments or as answers to the interactive prompts. These defaults are used unless overridden with flagged arguments; to change the defaults, see GoogleCL's Configuration Options documentation.