Apache Sqoop Tutorial
1. Introduction to Sqoop
Apache Sqoop (SQL-to-Hadoop) is an open-source tool designed for efficiently transferring bulk data between relational databases (RDBMS) and the Hadoop ecosystem (HDFS, Hive, HBase, etc.).
It is widely used in Big Data projects where enterprises need to import and export structured data to and from Hadoop.
Key Features of Sqoop:
- Import data from RDBMS (MySQL, Oracle, PostgreSQL, SQL Server, etc.) into HDFS.
- Export data from HDFS back to RDBMS.
- Integration with Hive and HBase.
- Parallel data transfer using MapReduce.
- Fault tolerance and scalability.
2. Sqoop Architecture
- RDBMS: Source/Target database.
- Sqoop Client: Issues import/export commands.
- Sqoop Connectors: Pluggable components that, together with JDBC drivers, enable communication between the RDBMS and Hadoop.
- HDFS/Hive/HBase: Storage and processing layers in Hadoop.
- MapReduce Jobs: Sqoop internally generates MapReduce jobs for parallel data transfer.
3. Sqoop Installation
Prerequisites:
- Hadoop installed and running.
- Java installed.
- RDBMS (like MySQL).
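You can quickly confirm the Hadoop and Java prerequisites from the command line (assuming both are already on your PATH):
hadoop version
java -version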
Steps:
- Download Sqoop from the Apache Sqoop website.
- Extract the archive and configure environment variables in .bashrc:
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
- Place the JDBC driver (e.g., mysql-connector-java.jar) in $SQOOP_HOME/lib/.
- Test installation:
sqoop version
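Optionally, point Sqoop at your Hadoop and Hive installations via $SQOOP_HOME/conf/sqoop-env.sh. A minimal sketch; the paths below are assumptions and depend on where you installed Hadoop and Hive:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive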
4. Sqoop Commands
4.1 List Databases
sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root --password root
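Similarly, you can list the tables inside a specific database (assuming a database named testdb exists):
sqoop list-tables \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root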
4.2 Import Data from RDBMS to HDFS
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--target-dir /user/hadoop/employees \
-m 1
-m 1 → Number of mappers (degree of parallelism); --num-mappers 1 is the equivalent long form.
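Sqoop can also import the result of a free-form SQL query instead of a whole table. A sketch, assuming the employees table has id, name, and salary columns; the $CONDITIONS placeholder is mandatory with --query so Sqoop can split the work across mappers, and --split-by names the splitting column:
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--query 'SELECT id, name, salary FROM employees WHERE $CONDITIONS' \
--split-by id \
--target-dir /user/hadoop/employees_query \
-m 2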
4.3 Import Data to Hive
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--hive-import \
--create-hive-table \
--hive-table emp_hive
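Once the import finishes, the data is immediately queryable from Hive (a quick check, assuming the hive CLI is installed):
hive -e 'SELECT * FROM emp_hive LIMIT 5;'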
4.4 Export Data from HDFS to RDBMS
sqoop export \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees_export \
--export-dir /user/hadoop/employees \
-m 1
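By default, export issues INSERT statements. If some rows already exist in the target table, you can have Sqoop update them instead (a sketch, assuming employees_export has id as its primary key; --update-mode allowinsert inserts rows that do not match an existing key):
sqoop export \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees_export \
--export-dir /user/hadoop/employees \
--update-key id \
--update-mode allowinsert \
-m 1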
4.5 Incremental Import
For handling new records:
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--incremental append \
--check-column id \
--last-value 100
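Tracking --last-value by hand between runs is error-prone, so incremental imports are often wrapped in a saved Sqoop job; the metastore then records the last imported value and updates it automatically on every execution (a sketch using the same connection details as above; Sqoop may prompt for the database password at run time):
sqoop job --create emp_incremental \
-- import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root \
--table employees \
--incremental append \
--check-column id \
--last-value 0
sqoop job --exec emp_incremental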
5. Sqoop Use Cases
- Importing enterprise data into Hadoop for analytics.
- Exporting processed data back to RDBMS.
- Data migration between databases.
- ETL workflows in Big Data pipelines.
6. Sqoop Best Practices
- Use parallelism with an appropriate -m (--num-mappers) value for performance.
- Use incremental imports to keep Hadoop in sync with new or updated records.
- Compress data using --compress to save storage (see the example after this list).
- Validate imported/exported data.
- Monitor MapReduce jobs for optimization.
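As an example of the compression tip above, a Snappy-compressed import looks like this (a sketch; the Snappy codec must be available in your Hadoop installation):
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--target-dir /user/hadoop/employees_compressed \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
-m 1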
7. Sqoop vs Alternatives
- Sqoop: Best for structured data import/export.
- Apache Flume: Best for unstructured/log data ingestion.
- Apache Kafka: Best for real-time streaming data pipelines.
✅ With this, you can now efficiently import and export data between relational databases and the Hadoop ecosystem.