Apache Sqoop Tutorial
1. Introduction to Sqoop
Apache Sqoop (SQL-to-Hadoop) is an open-source tool designed for efficiently transferring bulk data between relational databases (RDBMS) and the Hadoop ecosystem (HDFS, Hive, HBase, etc.).
It is widely used in Big Data projects where enterprises need to import and export structured data to and from Hadoop.
Key Features of Sqoop:
- Import data from RDBMS (MySQL, Oracle, PostgreSQL, SQL Server, etc.) into HDFS.
- Export data from HDFS back to RDBMS.
- Integration with Hive and HBase.
- Parallel data transfer using MapReduce.
- Fault tolerance and scalability.
2. Sqoop Architecture
- RDBMS: Source/Target database.
- Sqoop Client: Issues import/export commands.
- Sqoop Connectors: Pluggable components that, together with JDBC drivers, enable communication between the RDBMS and Hadoop.
- HDFS/Hive/HBase: Storage and processing layers in Hadoop.
- MapReduce Jobs: Sqoop internally generates MapReduce jobs for parallel data transfer.
3. Sqoop Installation
Prerequisites:
- Hadoop installed and running.
- Java installed.
- RDBMS (like MySQL).
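You can quickly confirm the Hadoop and Java prerequisites from the command line (assuming both are already on your PATH):
hadoop version
java -version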
Steps:
- Download Sqoop from the Apache Sqoop website.
- Extract the archive and configure environment variables in .bashrc:
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
- Place the JDBC driver (e.g., mysql-connector-java.jar) in $SQOOP_HOME/lib/.
- Test installation:
sqoop version
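Optionally, point Sqoop at your Hadoop and Hive installations via $SQOOP_HOME/conf/sqoop-env.sh. A minimal sketch; the paths below are assumptions and depend on where you installed Hadoop and Hive:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive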
4. Sqoop Commands
4.1 List Databases
sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root --password root
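Similarly, you can list the tables inside a specific database (assuming a database named testdb exists):
sqoop list-tables \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root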
4.2 Import Data from RDBMS to HDFS
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--target-dir /user/hadoop/employees \
-m 1
-m 1 → Number of mappers (degree of parallelism); --num-mappers 1 is the equivalent long form.
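Sqoop can also import the result of a free-form SQL query instead of a whole table. A sketch, assuming the employees table has id, name, and salary columns; the $CONDITIONS placeholder is mandatory with --query so Sqoop can split the work across mappers, and --split-by names the splitting column:
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--query 'SELECT id, name, salary FROM employees WHERE $CONDITIONS' \
--split-by id \
--target-dir /user/hadoop/employees_query \
-m 2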
4.3 Import Data to Hive
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--hive-import \
--create-hive-table \
--hive-table emp_hive
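Once the import finishes, the data is immediately queryable from Hive (a quick check, assuming the hive CLI is installed):
hive -e 'SELECT * FROM emp_hive LIMIT 5;'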
4.4 Export Data from HDFS to RDBMS
sqoop export \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees_export \
--export-dir /user/hadoop/employees \
-m 1
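By default, export issues INSERT statements. If some rows already exist in the target table, you can have Sqoop update them instead (a sketch, assuming employees_export has id as its primary key; --update-mode allowinsert inserts rows that do not match an existing key):
sqoop export \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees_export \
--export-dir /user/hadoop/employees \
--update-key id \
--update-mode allowinsert \
-m 1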
4.5 Incremental Import
For handling new records:
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--incremental append \
--check-column id \
--last-value 100
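Tracking --last-value by hand between runs is error-prone, so incremental imports are often wrapped in a saved Sqoop job; the metastore then records the last imported value and updates it automatically on every execution (a sketch using the same connection details as above; Sqoop may prompt for the database password at run time):
sqoop job --create emp_incremental \
-- import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root \
--table employees \
--incremental append \
--check-column id \
--last-value 0
sqoop job --exec emp_incremental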
5. Sqoop Use Cases
- Importing enterprise data into Hadoop for analytics.
- Exporting processed data back to RDBMS.
- Data migration between databases.
- ETL workflows in Big Data pipelines.
6. Sqoop Best Practices
- Use parallelism with an appropriate -m (--num-mappers) value for performance.
- Use incremental imports to keep Hadoop in sync with new or updated records.
- Compress data using --compress to save storage (see the example after this list).
- Validate imported/exported data.
- Monitor MapReduce jobs for optimization.
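As an example of the compression tip above, a Snappy-compressed import looks like this (a sketch; the Snappy codec must be available in your Hadoop installation):
sqoop import \
--connect jdbc:mysql://localhost:3306/testdb \
--username root --password root \
--table employees \
--target-dir /user/hadoop/employees_compressed \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
-m 1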
7. Sqoop vs Alternatives
- Sqoop: Best for structured data import/export.
- Apache Flume: Best for unstructured/log data ingestion.
- Apache Kafka: Best for real-time streaming data pipelines.
✅ With this, you can now efficiently import and export data between relational databases and the Hadoop ecosystem.