It is possible to capture streaming data in HDFS files. A tool that does this is Apache Flume. The idea is that we have three elements: a source that provides a stream, a channel that transports the stream, and a sink where the stream ends up in a file.
This can already be seen if we look at the config file:
agent1.sources = netcat-source
agent1.sinks = hdfs-sink
agent1.channels = memory-channel

# Describe/configure the source
agent1.sources.netcat-source.type = netcat
agent1.sources.netcat-source.bind = 192.168.2.60
agent1.sources.netcat-source.port = 12345
agent1.sources.netcat-source.channels = memory-channel

# Describe the sink
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /loudacre/webtom/
agent1.sinks.hdfs-sink.channel = memory-channel
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
agent1.channels.memory-channel.type = memory
agent1.channels.memory-channel.capacity = 10000
agent1.channels.memory-channel.transactionCapacity = 10000
It all starts with the source, labelled “netcat-source”: a netcat source that listens on 192.168.2.60, port 12345, for incoming text. Then we have a sink, labelled “hdfs-sink”, that writes to the HDFS path /loudacre/webtom/. Finally, we have the channel, labelled “memory-channel”. This “memory-channel” is named in the source definition as the channel the stream is sent into, and in the sink definition as the channel the sink takes its data from.
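The wiring described above (every source and sink must point at a declared channel) can be checked with a few lines of Python. This is only a sketch to illustrate the naming scheme, not something Flume provides: the helper names parse_properties and check_wiring are made up for this example, and the config is cut down to the lines that matter for the wiring. Note that a source uses the key "channels" (plural), while a sink uses "channel" (singular).

```python
# Toy check of Flume agent wiring. Not part of Flume; helper names are hypothetical.
CONFIG = """\
agent1.sources = netcat-source
agent1.sinks = hdfs-sink
agent1.channels = memory-channel
# Source and sink both refer to the same channel
agent1.sources.netcat-source.channels = memory-channel
agent1.sinks.hdfs-sink.channel = memory-channel
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems: sources or sinks that use an undeclared channel."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channel")
        for channel in used:
            if channel not in declared:
                problems.append(f"source {source} uses undeclared channel {channel}")
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel", "")
        if channel not in declared:
            problems.append(f"sink {sink} uses undeclared channel {channel!r}")
    return problems


if __name__ == "__main__":
    # An empty list means every source and sink points at a declared channel.
    print(check_wiring(parse_properties(CONFIG), "agent1"))
```

For the config shown here the list of problems is empty; misspelling the channel name in the sink line would produce one entry.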
On another machine, we start the netcat stream with:
type "C:\Program Files (x86)\netcat\readme.txt" | "C:\Program Files (x86)\netcat\nc.exe" 192.168.2.60 12345
This pipes the content of a file to a netcat process, which sends it as a stream to host 192.168.2.60 on port 12345. This host and port are exactly the ones given in the config file for the source.
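If netcat is not installed on the sending machine, roughly the same thing can be done with a short Python script. This is a minimal sketch that assumes a plain TCP connection is all the netcat source needs; the function name send_file and the file name readme.txt in the usage comment are placeholders, while the host and port are the ones from the config file.

```python
import socket


def send_file(path, host, port):
    """Send the raw bytes of a file over one TCP connection; return the byte count."""
    with open(path, "rb") as f:
        data = f.read()
    # create_connection opens a TCP socket; the with-block closes it afterwards.
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(data)
    return len(data)


# Example call (placeholder file name, host and port from the config file):
# send_file("readme.txt", "192.168.2.60", 12345)
```

The script only sends; it does not read anything the receiving side may write back, which should be fine for a small file like this one.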
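The Flume agent has to be running before the stream is sent, because the netcat source is what listens on the port. The agent is started with: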
flume-ng agent --conf /etc/flume-ng/conf --conf-file home/training/training_materials/dev1/exercises/flume/solution/bonus_netcat_tom.conf --name agent1 -Dflume.root.logger=INFO,console