Flume: sending data via stream

December 18, 2016

It is possible to capture streaming data in HDFS files. A tool to do this is Flume. The idea is that we have three elements: a source that provides a stream, a channel that transports the stream, and a sink where the stream ends up in a file.
This can already be seen if we look at the config file:

agent1.sources = netcat-source
agent1.sinks = hdfs-sink
agent1.channels = memory-channel

# Describe/configure the source
agent1.sources.netcat-source.type = netcat
agent1.sources.netcat-source.bind = 192.168.2.60
agent1.sources.netcat-source.port = 12345
agent1.sources.netcat-source.channels = memory-channel
# Describe the sink
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /loudacre/webtom/
agent1.sinks.hdfs-sink.channel = memory-channel
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
agent1.channels.memory-channel.type = memory
agent1.channels.memory-channel.capacity = 10000
agent1.channels.memory-channel.transactionCapacity = 10000

It all starts with a source: a netcat stream listening on 192.168.2.60, port 12345, labelled "netcat-source" in the config. Then we have a sink, labelled "hdfs-sink". Finally, we have the channel, labelled "memory-channel". This "memory-channel" is referenced by the source as the channel into which events are pushed, and by the sink as the channel from which it draws its data.
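The source-channel-sink wiring can be sketched in a few lines of Python (a toy illustration of the model, not Flume's actual API; the names `source`, `sink`, and `sink_output` are made up here):

```python
import queue
import threading

# The channel is a bounded in-memory buffer, like memory-channel above.
channel = queue.Queue(maxsize=10000)
sink_output = []  # stands in for the HDFS file the sink writes to

def source(events):
    """Push incoming events onto the channel."""
    for event in events:
        channel.put(event)
    channel.put(None)  # sentinel: end of stream

def sink():
    """Drain the channel and 'write' each event to the output."""
    while True:
        event = channel.get()
        if event is None:
            break
        sink_output.append(event)

t = threading.Thread(target=sink)
t.start()
source(["line 1", "line 2", "line 3"])
t.join()
# sink_output now holds ["line 1", "line 2", "line 3"]
```

The source and sink never talk to each other directly; both only know the channel, which is why the config names the channel in both the source and the sink sections.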
On another machine, we start the netcat stream with:

type "C:\Program Files (x86)\netcat\readme.txt" |   "C:\Program Files (x86)\netcat\nc.exe"  192.168.2.60 12345

This sends the content of a file as a stream to a netcat process, which forwards the stream to host 192.168.2.60 on port 12345, exactly the bind address and port that the config file specifies for the source of the stream.
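The same thing can be done without netcat, for instance with a short Python socket client (a sketch; `send_file` is a hypothetical helper, and the file path shown is just the one from the Windows command above):

```python
import socket

def send_file(path, host, port):
    # Open a TCP connection and stream the file's bytes,
    # equivalent to `nc host port < path`.
    with open(path, "rb") as f, socket.create_connection((host, port)) as conn:
        conn.sendall(f.read())

# Example, matching the host and port in the config:
# send_file(r"C:\Program Files (x86)\netcat\readme.txt", "192.168.2.60", 12345)
```

Anything that writes lines to that TCP port will feed the netcat source; the source does not care whether the client is nc, a script, or an application.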
The Flume agent is started with:

flume-ng agent --conf /etc/flume-ng/conf --conf-file /home/training/training_materials/dev1/exercises/flume/solution/bonus_netcat_tom.conf --name agent1 -Dflume.root.logger=INFO,console

We can then see the data being received in files in HDFS, under the /loudacre/webtom/ path that was configured for the sink.