I used spring-XD for data ingestion for a while until I had to move to a different distribution of Hadoop which didn't come with XD installed. I started using Flume and while XD makes life much simpler by giving the ability to define stream and tap in one line, Flume does need at the min a conf file with 3 things (source, channel and sink defined), but has more bells.
Data Ingestion and Monitoring
Let's start off with a very simple way of getting data across (in a terminal), with out being in Hadoop ecosystem for a min
$ cp /src/path/file /dest/path/file
There is a source file (source), there is destination (sink) and the means of getting the file across (channel) is the file-system itself
$ tail -F /log/file >> /mydir/logfile
Continuous, real time data fetching of a log file using tail - simple enough. Lets write a small utility which alerts (from sending you an SMS to emailing you, the alert can be anything), lets just print it as alter if there is error found in the log.
tail -F /path/to/log/latest.log | python errormonitor.py
line = sys.stdin.readline()
if "error" in line.lower():
print "Alert!, Error Found"
Getting started with Flume
If we were to extend the python script to support various sources, channels and sinks it's going to be more than 10 lines. If the script has to guarantee every line is from source to sink is reached, in a distributed system, the number of lines grows exponentially - Enter Flume.
Below are the sources, channels and sinks supported in Flume (version 126.96.36.199.2.4.2-2)
In a nut shell, all the 3 components needs to be set, so the conf file goes like this:
Source = Exec Source
Sink = HDFS
Channel = Memory
Exec Source => Memory => HDFS
The below flume conf file does the same as the tail -F we had above (with out monitoring)
#File Name - simplelogger.conf
log.sources = src
log.sinks = snk
log.channels = chs
log.sources.src.type = exec
log.sources.src.command = tail -F /path/to/log/latest.log
log.sinks.snk.type = logger
log.channels.chs.type = memory
#Tie Source and Sink to Channel
log.sinks.snk.channel = chs
log.sources.src.channels = chs
Nitty-Gritty: See how all the declaration start with same name "log", that's in line with "Java" (the file name to be the same as the class name) - so when the flume agent has to be started, the name should be same as the one used above:
For the above config file, the flume agent to be started as:
flume-ng agent -f simplelogger.conf -n log
"WARN node.AbstractConfigurationProvider: No configuration found for this host:nnnn" and the agent doesnt work.
Nitty-Gritty (2) - While setting the channel properties of sources and sinks, there is a very subtle difference, not sure why this was introduced, if you know why, please leave a comment
For the sink, it's singular (channel) - log.sinks.snk.channel = chs
For the source, it's plural (channels) - log.sources.src.channels = chs
Perhaps because there can be more than one source?
Anyway, happy fluming, this is a must tool to get data from non-traditional sources into Hadoop or any other sink.