Archive for the 'BigData' Category

Hadoop: Checking for hung jobs

Thursday, September 17th, 2020

I was working on an EMR cluster and needed to test some issues with the steps in our process, but I didn't have access to the web tools for monitoring jobs. When I ran one of my Hadoop jobs, it just sat at the beginning of the process. It turned out some hung jobs were blocking it, and I had to use the YARN command-line tool to find out what was going on.

The following command lists the applications that are queued or currently running:

> yarn application -list
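If the cluster has a long application history, the list can be narrowed by state. For example, the -appStates filter (a standard option of the yarn application command) shows only what is currently running:

> yarn application -list -appStates RUNNING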

If you do find a hung application, the following command will end it:

> yarn application -kill <application-id>
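Before killing anything, it may be worth confirming the application really is stuck. The -status option reports the state and progress of a single application:

> yarn application -status <application-id>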


Hadoop File System Commands

Friday, August 24th, 2018

When working with Hadoop you often need to move files in and out of HDFS. The hadoop fs commands get tedious to type over and over, so I came up with some aliases to reduce my typing.

The commands that I have found myself using most often are:

Listing the directories and files:

hadoop fs -ls <filename(s)>

hls <filename(s)>

Viewing the contents of files:

hadoop fs -cat <filename(s)>

hcat <filename(s)>

I usually pipe that into an awk script to verify data (see the example after the alias list below).

Removing entries from HDFS:

hadoop fs -rm <filename(s)>

hrm <filename(s)>

Copying files to HDFS:

hadoop fs -copyFromLocal <local filename> <target directory or filename>

hcpto <local filename> <target directory or filename>

Copying from HDFS to the local system:

hadoop fs -copyToLocal <remote file(s)> <target directory or filename>

hcpfrom <remote file(s)> <target directory or filename>

Note: Replace <filename(s)> with whatever files or directories you need, including using '*' to match multiple files. Only the copy commands take a single entry for the target directory or filename.

Here are the alias entries I use:

alias h='hadoop fs'
alias hcat='hadoop fs -cat'
alias hls='hadoop fs -ls'
alias hrm='hadoop fs -rm'
alias hcpto='hadoop fs -copyFromLocal'
alias hcpfrom='hadoop fs -copyToLocal'
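With these aliases loaded (from ~/.bashrc, for example), day-to-day use looks like the following. The paths and the awk check are hypothetical illustrations, not from an actual job:

hls /data/input

hcat /data/input/part-* | awk -F'\t' 'NF < 2 { print "bad record: " $0 }'

hcpto results.csv /data/output/

hcpfrom /data/output/part-00000 ./local-results/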


Hadoop Debugging with Counters

Thursday, August 23rd, 2018

I have been enjoying learning Hadoop and recently had to debug issues in an existing job while enhancing it for new data points. Debugging a distributed system is quite a challenge, but I have found two ways to get meaningful feedback from it. One is using Counters; the other is using MultipleOutputs to capture output from errors (that will be covered in a later post). This article shows how counters can be used.

The counter approach uses an enum, where each value names one counter:

// Each enum value becomes a named counter in the job's output.
public enum Counters {
  CHOCOLATE,
  VANILLA,
  MINT_CHOCO_CHIP
}

Each enum value identifies the counter being incremented. The same code can be used in either the mapper or the reducer, depending on where you are looking for the counts; the difference is only where in the job's final output the counts show up.

if (flavor.contains("CHOCOLATE")) {
    context.getCounter(Counters.CHOCOLATE).increment(1);
}
if (flavor.contains("VANILLA")) {
    context.getCounter(Counters.VANILLA).increment(1);
}
if (flavor.contains("MINT_CHOCO_CHIP")) {
    context.getCounter(Counters.MINT_CHOCO_CHIP).increment(1);
}
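For placement context, here is a minimal mapper sketch. The class name, input format, and parsing are my own illustration rather than the original job, and it assumes the Counters enum above is in scope:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlavorMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input: one flavor name per line.
        String flavor = value.toString();
        if (flavor.contains("CHOCOLATE")) {
            context.getCounter(Counters.CHOCOLATE).increment(1);
        }
        if (flavor.contains("VANILLA")) {
            context.getCounter(Counters.VANILLA).increment(1);
        }
        context.write(new Text(flavor), ONE);
    }
}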

With this in place, the job's summary output includes a line for each counter that was incremented during the run.

I have used this to highlight problems in the code, or just to confirm that a segment of code actually ran.
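The totals are also available programmatically once a job finishes. As a sketch, assuming the newer org.apache.hadoop.mapreduce API and the driver's Job object named job:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

// In the driver, after job.waitForCompletion(true) has returned:
Counter chocolate = job.getCounters().findCounter(Counters.CHOCOLATE);
System.out.println("CHOCOLATE records counted: " + chocolate.getValue());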
