Wednesday, October 28, 2015

Getting Started with Apache Flume

I used Spring XD for data ingestion for a while, until I had to move to a different Hadoop distribution that didn't come with XD installed. I started using Flume. While XD makes life simpler by letting you define a stream and a tap in one line, Flume needs at minimum a conf file with three things defined (a source, a channel and a sink), but it comes with more bells and whistles.

Data Ingestion and Monitoring

Let's start off with a very simple way of getting data across (in a terminal), without being in the Hadoop ecosystem for a minute:

$ cp /src/path/file /dest/path/file
There is a source file (source), there is a destination (sink), and the means of getting the file across (channel) is the file system itself.

$ tail -F /log/file >> /mydir/logfile
Continuous, real-time fetching of a log file using tail - simple enough. Let's write a small utility that alerts when an error is found in the log (the alert can be anything from sending you an SMS to emailing you; here we'll just print it).

tail -F /path/to/log/latest.log | python monitor.py

The log file is the source, the Unix pipe is the channel and the Python script is the sink. The script keeps reading from standard input and alerts when an error is found - in under 10 lines, we have a real-time monitoring system.

-----------Python listing (monitor.py)-------------------
import sys

# Keep reading lines from standard input (fed by tail -F)
for line in sys.stdin:
    # Alert on any line containing "error", case-insensitively
    if "error" in line.lower():
        print("Alert! Error found")
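The check in the listing is easy to sanity-test without tailing a live log; a quick sketch (the function name and sample lines are ours, not from the post):

```python
def is_alert(line):
    """Same check as the listing above: case-insensitive match on "error"."""
    return "error" in line.lower()

# Exercise the check on a few made-up sample log lines
sample = ["INFO startup complete", "ERROR disk full", "WARN low memory"]
alerts = [line for line in sample if is_alert(line)]
print(alerts)  # only the ERROR line matches
```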

Getting started with Flume

If we were to extend the Python script to support various sources, channels and sinks, it's going to be far more than 10 lines. If the script also has to guarantee that every line from the source reaches the sink, in a distributed system, the line count grows quickly - enter Flume.

Flume ships with a variety of built-in sources, channels and sinks (the full list for your version is in the Flume User Guide).

Each source, channel and sink has quite a few properties that need to be set (some of them mandatory). To specify these properties, a configuration file is required at the minimum. In a full implementation, there is a configuration directory that specifies the Flume environment, log4j properties, etc.

In a nutshell, all three components need to be set, so the conf file goes like this:

Source = Exec Source
Sink = Logger
Channel = Memory

Exec Source => Memory => Logger

The Flume conf file below does the same as the tail -F we had above (without the monitoring part).

#File Name - simplelogger.conf

log.sources = src
log.sinks = snk
log.channels = chs

#Define Source

log.sources.src.type = exec
log.sources.src.command = tail -F /path/to/log/latest.log

#Define Sink

log.sinks.snk.type = logger

#Define Channels

log.channels.chs.type = memory

#Tie Source and Sink to Channel = chs
log.sources.src.channels = chs
log.sinks.snk.channel = chs
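For reference, landing the data in HDFS instead of the console would only change the sink section; a sketch using the standard HDFS sink properties, with a made-up path:

```
#Define Sink (HDFS variant - the path is an example)
log.sinks.snk.type = hdfs
log.sinks.snk.hdfs.path = hdfs://namenode:8020/flume/logs
log.sinks.snk.hdfs.fileType = DataStream
log.sinks.snk.channel = chs
```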

Nitty-Gritty: notice how all the declarations start with the same name, "log". Much like Java (where the file name has to match the class name), when the Flume agent is started, the agent name has to be the same as the one used in the conf file.

For the above config file, the flume agent to be started as:
flume-ng agent -f simplelogger.conf -n log

If the names don't match, a warning is thrown:
"WARN node.AbstractConfigurationProvider: No configuration found for this host:nnnn" and the agent doesn't do anything.

Nitty-Gritty (2) - While tying sources and sinks to channels, there is a very subtle difference in the property names. Not sure why this was introduced; if you know why, please leave a comment.

For the sink, it's singular (channel) - log.sinks.snk.channel = chs
For the source, it's plural (channels) - log.sources.src.channels = chs

Perhaps because a source can fan out events to more than one channel, while a sink drains exactly one?
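A source can indeed be tied to several channels at once, which fits the plural property name; a sketch with made-up agent, channel and sink names:

```
# One source feeding two channels, each drained by its own sink
fan.sources.src.channels = ch1 ch2
fan.sinks.snk1.channel = ch1
fan.sinks.snk2.channel = ch2
```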

Anyway, happy Fluming! This is a must-have tool for getting data from non-traditional sources into Hadoop or any other sink.
