🌀 Apache Flume Programming Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data (especially from web servers) to a centralized data store like HDFS or HBase.

Flume is mainly used in Big Data ecosystems for ingesting streaming log data into Hadoop.

✅ Core Use Case

Collect log data from web servers
Stream it into HDFS or HBase in near real-time
Scalable and fault-tolerant

⚙️ Flume Architecture Components

Component	Description
Source	Receives data from external sources (e.g., syslogs, files, etc.)
Channel	Acts as a buffer (e.g., memory or file-based)
Sink	Sends data to a final destination (e.g., HDFS, HBase)

➡ Flow: Source → Channel → Sink

📋 Example: Flume Agent Configuration

Create a config file like flume.conf:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: tail log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/syslog

# Channel: memory buffer
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 100

# Sink: HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/logs/
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

# Bind source/channel/sink
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1

▶️ Running Flume

Use the following command:

flume-ng agent --conf ./conf/ --conf-file flume.conf --name agent1 -Dflume.root.logger=INFO,console

🧠 Programming with Flume

Flume itself is configuration-driven, but you can extend or customize it using Java:

Custom Source / Sink in Java

public class MyCustomSource extends AbstractSource implements EventDrivenSource {
    @Override
    public void start() {
        // logic to collect data
    }

    @Override
    public void stop() {
        // cleanup code
    }
}

You register it in your Flume config:

agent1.sources.src1.type = com.mycompany.MyCustomSource

🔌 Built-in Sources, Channels, and Sinks

Source	Channel	Sink
exec	memory	hdfs
netcat	file	logger
avro	jdbc	hbase
spooling dir		file roll

📦 Installation Steps

Install Java 8+
Download Flume:
wget https://downloads.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz

Extract and configure:

tar -xvzf apache-flume-1.11.0-bin.tar.gz
cd apache-flume-1.11.0-bin

🧭 Learning Roadmap for Flume

🟢 Beginner

Understand architecture (Source → Channel → Sink)
Set up and run simple agents

🟡 Intermediate

Use different channel and sink types
Integrate with HDFS and Hive

🔴 Advanced

Write custom sources/sinks in Java
Optimize performance & fault tolerance
Build real-time ingestion pipelines with Flume + Kafka + Spark