Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data (especially from web servers) to a centralized data store like HDFS or HBase.
Flume is mainly used in Big Data ecosystems for ingesting streaming log data into Hadoop.
| Component | Description |
|---|---|
| Source | Receives data from external sources (e.g., syslogs, files, etc.) |
| Channel | Acts as a buffer (e.g., memory or file-based) |
| Sink | Sends data to a final destination (e.g., HDFS, HBase) |
➡ Flow: Source → Channel → Sink
Create a config file like flume.conf:
agent1.sources = src1 agent1.channels = ch1 agent1.sinks = sink1 # Source: tail log file agent1.sources.src1.type = exec agent1.sources.src1.command = tail -F /var/log/syslog # Channel: memory buffer agent1.channels.ch1.type = memory agent1.channels.ch1.capacity = 1000 agent1.channels.ch1.transactionCapacity = 100 # Sink: HDFS agent1.sinks.sink1.type = hdfs agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/logs/ agent1.sinks.sink1.hdfs.fileType = DataStream agent1.sinks.sink1.hdfs.writeFormat = Text # Bind source/channel/sink agent1.sources.src1.channels = ch1 agent1.sinks.sink1.channel = ch1
Use the following command:
flume-ng agent --conf ./conf/ --conf-file flume.conf --name agent1 -Dflume.root.logger=INFO,console
Flume itself is configuration-driven, but you can extend or customize it using Java:
Custom Source / Sink in Java
public class MyCustomSource extends AbstractSource implements EventDrivenSource {
@Override
public void start() {
// logic to collect data
}
@Override
public void stop() {
// cleanup code
}
}
You register it in your Flume config:
agent1.sources.src1.type = com.mycompany.MyCustomSource
| Source | Channel | Sink |
|---|---|---|
| exec | memory | hdfs |
| netcat | file | logger |
| avro | jdbc | hbase |
| spooling dir | file roll |
wget https://downloads.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz
tar -xvzf apache-flume-1.11.0-bin.tar.gz
cd apache-flume-1.11.0-bin