Migrating SQL Server Data to Hadoop with Sqoop

Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets (“big data”) across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Because Hadoop detects and handles hardware failures at the application layer, it can aggregate data from a large number of sources, including sources that are prone to hardware or network failure.

The Apache Hadoop™-based Services for Windows Azure (currently available in a Developer Preview release) are a set of developer services, powered by Hadoop, that can be used to build and deploy big-data analytics solutions on Windows Azure.

Apache Sqoop

Big data is typically a combination of structured and unstructured data, and may come from disparate data sources built on heterogeneous technologies. If you have structured data stored in SQL Server that you would like to analyze with Hadoop, you could include your SQL Server database as a Hadoop data source and allow MapReduce jobs to run against it directly. But this approach complicates your Hadoop jobs, and the heavy traffic from Hadoop cluster nodes may impact the performance of your production database. You could instead write a script to copy the SQL Server data to HDFS and then populate tables in Hive and HBase, but this can be inefficient and error-prone.

Apache Sqoop is the solution to these problems. Sqoop is a tool for efficiently transferring data between structured databases and Hadoop: it breaks the transfer into smaller pieces that are processed in parallel as MapReduce tasks and then aggregated in HDFS.
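As an illustration, a minimal Sqoop import from SQL Server into HDFS might look like the following sketch. The server name, database, credentials, table, and split column are placeholders, and the exact connection string and connector options depend on the SQL Server JDBC driver and Sqoop version you have installed:

    # Import a SQL Server table into HDFS; Sqoop splits the rows on the
    # --split-by column and runs the copy as four parallel map tasks.
    sqoop import \
      --connect "jdbc:sqlserver://<your-server>:1433;databaseName=<your-database>" \
      --username <sql-user> --password <sql-password> \
      --table Orders \
      --target-dir /user/hadoop/orders \
      --split-by OrderID \
      --num-mappers 4

Adding the --hive-import option to the same command tells Sqoop to create a matching Hive table and load the imported data into it, rather than leaving raw files in the target HDFS directory.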

For detailed guidance on how to use Sqoop to migrate SQL Server data to Hadoop on Windows Azure, see the article Hadoop on Windows Azure – Working with Big Data.