BigData SQL: Why Is Interactive SQL So Important?

Interactive querying of big data is primarily for supporting data discovery, data exploration, and speed of thought-based data analysis capabilities. Allowing interactive ad hoc queries is essential, as more often than not, the user does not necessarily know all the questions to ask of the data ahead of time. As questions are asked and answers obtained, it raises new questions, based on thought process, to the answers to previous queries. During this process of asking new questions based on results obtained from previous queries, it is important to provide low latency; otherwise, the train of thought for a given user could be lost.
In the new world of big data, business intelligence (BI) and visual analytics tools are evolving to interactively work with big data ecosystems. Interactive workloads include the ability to execute ad hoc SQL queries on the data, through either front-end tools, APIs, or the command line. These interactive workload-class SQL engines execute ad hoc queries using a variety of optimization techniques on large-scale data sets, to return the results in the shortest possible time.
Currently, for interactive analysis, the pattern is to apply ETL (Extract, Transform, Load) to data from different source systems, prepare the data so that it is compatible for fast data access, and build a new data set that has aggregated information for best latency and I/O load characteristics. In other words, ETL is initially used to prepare the data to expedite data loading, preparing new data sets containing aggregated information, to make it efficient for BI tools to provide faster response time for downstream ad hoc analytic queries. This pattern has been used in the past in typical data warehouses and to build materialized views and data marts, but it inherently suffers from impedance mismatches and delays in the actual analysis, due to data preparation tasks.
At a high level, interactive SQL on Hadoop engines has the following challenges:
• Low latency : Response times for interactive queries should be significantly faster than with batch processes.
• ANSI SQL compatibility : Ensuring that interactive SQL engines on big data are used by the BI tools also presents the challenge of supporting the SQL ANSI standards and compatibility issues, to find broad applicability and usage.
• High concurrency : As these tools get better at providing low latency, more and more users adopt and use them, which gives rise to another challenge that these engines have to address: keeping low latency with high concurrency. It is an extremely challenging proposition to keep low-latency service-level agreements (SLAs) with high concurrency on large-scale data sets.
• Federated data sources : The ability to query data that is set across multiple different data sources, different formats, and different locations and form holistic results that select subsets of attributes from each of the data sets to get the answer.
Interactive SQL engines on big data is a fiercely competitive field in which both commercial and open source products are coming up with innovative ideas and architectures to address the challenges. This chapter will cover some of the popular SQLon-big-data engines: Spark SQL, Impala, Apache Drill, Jethro, and Vertica.