Big Data SQL: Approaches to Solving SQL on Big Data

There are several categories of workloads that SQL-on-big-data solutions must address: batch-oriented, interactive, and streaming. To add more complexity, the data for each of these workloads can be structured or semi-structured.

There are basically four different approaches to doing SQL on big data:

1. Build a translation layer that translates SQL queries into equivalent MapReduce code and executes it on the cluster. Apache Hive is the best-known example of a batch-oriented SQL-on-Hadoop tool. It uses MapReduce or Apache Tez as its intermediate processing layer and is used for running complex jobs, including ETL and production data “pipelines,” against massive data sets (a sketch of submitting such a batch query through Hive appears after this list).

2. Leverage existing relational engines, which incorporate more than 40 years of research and development in robust storage engines and query optimization. An example would be to embed MySQL or Postgres inside each data node of the Hadoop cluster and build a layer within it to access data from the underlying distributed file system. The RDBMS engine is collocated with the data node, communicates with it to read data from HDFS, and translates that data into its own proprietary format. Products such as Citus Data and Apache HAWQ take this architectural approach to SQL on Hadoop.

3. Build a new query engine that co-resides on the same nodes as the data nodes and works directly with the data on HDFS to execute SQL queries. This query engine uses a query splitter to route query fragments to one or more underlying data handlers (HDFS, HBase, relational stores, search indexes, etc.) to access and process the data.

Apache Drill and Impala were among the first engines in this space to run interactive SQL queries over data on HDFS. This category of SQL-on-Hadoop engine excels at ad hoc SQL queries, data exploration, and data discovery, and is used directly by data analysts to execute SQL code auto-generated by BI tools (a sketch of such an interactive query against Impala appears after this list).

4. Use existing analytic databases (deployed on a separate cluster from the Hadoop cluster) that interact with the data nodes in the Hadoop cluster, using a proprietary connector to pull data from HDFS but executing the SQL queries within the analytical engine itself. These external analytical engines can be integrated with the metadata in Hive or HCatalog to work seamlessly with the data in HDFS. Examples of such products include Vertica and Teradata (a sketch of querying HDFS-resident data through such an external engine appears below).
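
To make the first approach concrete, the following is a minimal sketch of submitting a batch HiveQL job through HiveServer2 using the PyHive client. The host, port, user, and table names (web_logs, daily_page_views) are illustrative placeholders rather than anything from the text above; Hive compiles the statement into MapReduce or Tez jobs that run on the cluster, so latency is measured in minutes, which is acceptable for ETL-style pipelines.

```python
# Minimal sketch of approach 1: a batch HiveQL job submitted via HiveServer2.
# Host, port, user, and table names are illustrative placeholders.
from pyhive import hive

conn = hive.connect(
    host="hiveserver2.example.com",  # placeholder HiveServer2 endpoint
    port=10000,
    username="etl_user",
    database="default",
)
cursor = conn.cursor()

# Hive translates this SQL into MapReduce or Tez jobs on the cluster;
# a full-table aggregation like this is a typical batch/ETL workload.
cursor.execute("""
    INSERT OVERWRITE TABLE daily_page_views
    SELECT page, to_date(event_time) AS day, COUNT(*) AS views
    FROM web_logs
    GROUP BY page, to_date(event_time)
""")

conn.close()
```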
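
For the third approach, here is a similar sketch of an ad hoc, interactive query against Impala using the impyla client; the connection details and the web_logs table are again placeholders. Impala's daemons read the HDFS data directly, without launching MapReduce jobs, which is what makes this kind of exploratory query fast enough to sit behind BI tools.

```python
# Minimal sketch of approach 3: an interactive, ad hoc query against Impala.
# Connection details and table/column names are illustrative placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cursor = conn.cursor()

# Impala executes this directly against data on HDFS through its own daemons,
# with no MapReduce job, so results typically come back in seconds.
cursor.execute("""
    SELECT page, COUNT(*) AS views
    FROM web_logs
    WHERE event_date = '2016-01-01'
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")

for page, views in cursor.fetchall():
    print(page, views)

conn.close()
```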
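
Finally, for the fourth approach, the sketch below assumes a Vertica cluster reachable from Python via the vertica_python driver, with an external table defined over files in HDFS. The connection settings, paths, and the exact external-table syntax are assumptions (they vary by Vertica version and by how its HDFS connector is configured); the point is simply that the SQL executes inside the external analytical engine while the data itself stays in HDFS.

```python
# Minimal sketch of approach 4, assuming a Vertica cluster and the
# vertica_python driver. Names, paths, and the external-table syntax are
# illustrative; consult the Vertica HDFS/HCatalog connector documentation.
import vertica_python

conn_info = {
    "host": "vertica.example.com",   # placeholder Vertica node
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "analytics",
}

conn = vertica_python.connect(**conn_info)
cursor = conn.cursor()

# Define an external table whose data remains in HDFS; Vertica pulls the
# files through its connector at query time (syntax is version-dependent).
cursor.execute("""
    CREATE EXTERNAL TABLE web_logs_ext (
        page VARCHAR(1024),
        event_date DATE
    ) AS COPY FROM 'hdfs:///data/web_logs/*' DELIMITER ','
""")

# The aggregation itself runs inside Vertica, not on the Hadoop cluster.
cursor.execute("""
    SELECT page, COUNT(*) AS views
    FROM web_logs_ext
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
for page, views in cursor.fetchall():
    print(page, views)

conn.close()
```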