Hadoop File System Commands

When working with Hadoop you often need to move files in and out of HDFS. The hadoop fs commands get tedious to type over and over, so I came up with some aliases to reduce my typing.

The commands that I have found myself using most often are:

Listing the directories and files:

hadoop fs -ls <filename(s)>

hls <filename(s)>

Viewing the contents of files:

hadoop fs -cat <filename(s)>

hcat <filename(s)>

I usually pipe that into an awk script to verify data.
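A minimal sketch of such a check, assuming tab-separated records. The function name count_bad_rows and the default of three fields are made up for illustration:

```shell
# Hypothetical helper: count rows whose field count differs from the expected one.
# Assumes tab-separated input; $1 is the expected number of fields (default 3).
count_bad_rows() {
  awk -F'\t' -v want="${1:-3}" 'NF != want { bad++ } END { print bad + 0 }'
}
```

Piping hadoop fs -cat <filename(s)> into count_bad_rows 3 prints the number of rows that do not have exactly three fields, so 0 means every row has the expected shape.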

Removing entries from HDFS:

hadoop fs -rm <filename(s)>

hrm <filename(s)>

Copying files to HDFS:

hadoop fs -copyFromLocal <local filename> <target directory or filename>

hcpto <local filename> <target directory or filename>

Copying from HDFS to the local system:

hadoop fs -copyToLocal <remote file(s)> <target directory or filename>

hcpfrom <remote file(s)> <target directory or filename>

Note: Replace <filename(s)> with one or more files or directories; you can use '*' as a wildcard to select several files at once. The copy commands take exactly one target directory or filename.

Here are the alias entries I use:

alias h='hadoop fs'
alias hcat='hadoop fs -cat '
alias hls='hadoop fs -ls '
alias hrm='hadoop fs -rm '
alias hcpto='hadoop fs -copyFromLocal'
alias hcpfrom='hadoop fs -copyToLocal'

Hadoop Debugging with Counters

I have been enjoying learning Hadoop and have had to debug issues within an existing job while enhancing it for new data points. Debugging a distributed system is quite a challenge, but I have found two ways to get meaningful feedback from the system. One is to use Counters; the other is to use MultipleOutputs to capture output from errors (this will be covered in a later post). This article shows how counters can be used.

With counters, you define a set of enum values that serve as the counter names:

public enum Counters {
  CHOCOLATE,
  VANILLA,
  MINT_CHOCO_CHIP
}

Each enum value is the identifier for one counter. The same code can be used within the Mapper or the Reducer, depending on where you want the counts. The difference is where in the final output of the Hadoop job the counts show up.

if (flavor.contains("CHOCOLATE")) {
    context.getCounter(Counters.CHOCOLATE).increment(1);
}
if (flavor.contains("VANILLA")) {
    context.getCounter(Counters.VANILLA).increment(1);
}
if (flavor.contains("MINT_CHOCO_CHIP")) {
    context.getCounter(Counters.MINT_CHOCO_CHIP).increment(1);
}

This way you will find a line for each counter that was incremented during the Hadoop process.
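Besides the job's own output, a single counter can also be read back from the command line with mapred job -counter. A minimal sketch, assuming the enum above is nested in a hypothetical class com.example.FlavorMapper (for enum counters, the group name is the enum's fully qualified class name). The wrapper name hcounter is also made up:

```shell
# Hypothetical wrapper around: mapred job -counter <job-id> <group-name> <counter-name>
# $1 = job ID, $2 = counter group, $3 = counter name.
hcounter() {
  mapred job -counter "$1" "$2" "$3"
}
```

With a real job ID in place of <job-id>, a call would look like hcounter <job-id> 'com.example.FlavorMapper$Counters' CHOCOLATE. The single quotes keep the shell from treating $Counters as a variable.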

I have used counters to highlight problems within the code, or just to confirm that a segment of code has actually run.