Sqoop

Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer bulk data efficiently between Hadoop (HDFS) and external datastores such as relational databases (MySQL, Oracle, Postgres, Teradata, Netezza, etc.), enterprise data warehouses, and mainframes. Sqoop imports data from relational databases into HDFS and exports data from HDFS back into relational databases. Internally, it carries out each transfer as a Hadoop MapReduce job, which provides parallel operation as well as fault tolerance. Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported.
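For example, a basic import and the matching export look like the sketch below; the connection URL, credentials, and table names are hypothetical placeholders:

    # Import a table from the relational database into HDFS.
    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders \
        --target-dir /data/salesdb/orders

    # Export files from HDFS back into a relational table.
    sqoop export \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders_export \
        --export-dir /data/salesdb/orders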
 


When you decide to move data from an RDBMS to HDFS, the first product that comes into use is Apache Sqoop. When you request that data be brought into HDFS, the following things happen (a sketch of the resulting import command follows the list):
  1. Sqoop asks the relational database for metadata information.
  2. The relational database returns the requested metadata.
  3. Based on the metadata information, Sqoop generates Java classes.
  4. The table is partitioned on its primary key, because multiple mappers will be importing data at the same time.
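A minimal import that splits the work across four mappers on the primary key might look like this (connection URL, credentials, and table name are again hypothetical placeholders):

    # Sqoop first issues SELECT MIN(id), MAX(id) FROM orders to compute
    # four split ranges, then runs one mapper per range.
    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders \
        --split-by id \
        --num-mappers 4 \
        --target-dir /data/salesdb/orders
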
Sqoop provides many salient features, such as:
  • Full Load: Apache Sqoop can load a whole table with a single command. You can also load all the tables from a database with a single command (see the examples after this list).
  • Incremental Load: Apache Sqoop also provides the facility of incremental load, where you can import only the rows of a table that are new or updated since the last run.
  • Parallel import/export: Sqoop uses the YARN framework to import and export data, which provides fault tolerance on top of parallelism.
  • Import results of SQL query: You can also import the result set of an arbitrary SQL query into HDFS.
  • Compression: You can compress your data with the deflate (gzip) algorithm using the --compress argument, or specify any Hadoop compression codec with the --compression-codec argument. You can also load a compressed table into Apache Hive.
  • Connectors for all major RDBMS databases: Apache Sqoop provides connectors for multiple RDBMS databases, covering almost all of the commonly used ones.
  • Kerberos security integration: Kerberos is a computer network authentication protocol that works on the basis of 'tickets', allowing nodes communicating over a non-secure network to prove their identity to one another in a secure manner. Sqoop supports Kerberos authentication.
  • Load data directly into Hive/HBase: You can load data directly into Apache Hive for analysis, or dump your data into HBase, which is a NoSQL database.
  • Support for Accumulo: You can also instruct Sqoop to import a table into Accumulo rather than into a directory in HDFS.
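The commands below sketch the full-load and incremental-load features, reusing the same hypothetical connection details as above:

    # Full load: import every table of the database with one command.
    sqoop import-all-tables \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --warehouse-dir /data/salesdb

    # Incremental load: append only rows whose "id" exceeds the last imported value.
    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders \
        --incremental append \
        --check-column id \
        --last-value 10000 \
        --target-dir /data/salesdb/orders
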
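Importing the result of a free-form SQL query works similarly; the query must contain the literal token $CONDITIONS, which Sqoop replaces with each mapper's split predicate, and a --split-by column must be given (table and column names here are hypothetical):

    # Import a join result; $CONDITIONS marks where Sqoop injects the split predicate.
    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --query 'SELECT o.id, o.total, c.name FROM orders o JOIN customers c ON (o.customer_id = c.id) WHERE $CONDITIONS' \
        --split-by o.id \
        --target-dir /data/salesdb/order_report
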
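Compression is requested per job; a minimal sketch:

    # --compress alone uses the default deflate (gzip) codec;
    # --compression-codec selects any Hadoop codec, e.g. Snappy.
    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders \
        --compress \
        --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
        --target-dir /data/salesdb/orders_snappy
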
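Finally, loading directly into Hive or HBase instead of a plain HDFS directory looks like this (the Hive and HBase table names below are hypothetical):

    # Import straight into a Hive table; --hive-import creates the
    # table in the Hive metastore if it does not already exist.
    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders \
        --hive-import \
        --hive-table salesdb.orders

    # Import into an HBase table keyed on the "id" column.
    # (Accumulo is targeted analogously via --accumulo-table and --accumulo-column-family.)
    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders \
        --hbase-table orders \
        --column-family cf \
        --hbase-row-key id \
        --hbase-create-table
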
Working of Sqoop
  • When you run any Sqoop tool, Sqoop communicates with the database store and fetches metadata from the RDBMS. Sqoop uses this metadata to generate a Java class.
  • Sqoop internally creates the Java class using the JDBC API, compiles it with the JDK into a .class file, and packages the .class file into a .jar file (this step can also be run on its own, as sketched after this list).
  • After creating the jar file, Sqoop communicates with the database store again to find the split column. Based on the split column, Sqoop fetches the data from the database.
  • Finally, Sqoop places the retrieved data into HDFS.
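The code-generation step can be invoked directly with the codegen tool; a minimal sketch against the same hypothetical database and table as above:

    # Generate, compile, and package the Java class that maps rows of "orders";
    # Sqoop writes orders.java, orders.class, and orders.jar to a working directory.
    sqoop codegen \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username sqoop_user \
        --password-file /user/sqoop/db.password \
        --table orders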
