Bigdata SQL: How an SQL Engine Works in an Analytic Database

Analytic databases are used in Data Warehouse (DW) and BI applications to support low-latency complex analytic queries. These databases are based on Massively Parallel Processing (MPP) architectures.

MPP systems consist of large numbers of processors loosely coupled, with each processor having its own memory and storage attached to a network backplane. MPP systems are architected to be shared-nothing, with the processor and disk operating in parallel to divide the workload. Each processor communicates with its associated local disk to access the data and perform calculations. One processor is assigned the role of master, to coordinate and collect the intermediate results and assemble the query
response. A major weakness of this architecture is that it requires significant movement of data from disks to processors for executing the queries.

The interconnect between each processor-disk pair becomes the bottleneck, with data traffic adversely affecting query response timings. The inability of data transfer speeds to keep pace with growing data volumes creates a performance bottleneck that inhibits performance and scalability of the MPP architecture. Concurrency, i.e., multiple user queries, all coming at relatively the same time, causes lot of performance and
scheduling problems in MPP-based architectures. Typically in MPP systems, the data is automatically partitioned across multiple nodes
(worker nodes), based on a hashing algorithm on one or more columns in the data set.
The query first goes to the master node, where it is parsed and semantically analyzed and the query execution plan generated. The execution plan is relayed to each of the worker nodes, where some partition of the data set resides. Once the worker nodes are done executing the query in their partition of the data, results are transmitted to the master node, where a
coordinator consolidates the results and returns the result set to the client.