Latency : Does the product support the kind of latency SLAs you r business mandates? To make a decision requires careful consideration of the types of workloads, in addition toconcurrency tests. The best way to approach this is to conduct a POC (proof of concept) of the product, with your data sets and the cluster size that your organization can support, andbtain realistic numbers from the POC. This is a very critical step, and it is always wise to do it for your data with your workloads, rather than just to trust the TPC benchmarks orthe marketing materials of the product. One important point to remember when undertaking this POC is to make sure the product is tested for its latency at low, average, and worst-case concurrency levels, setting the expectations correctly.
Deployment modes : Does the product support the deployment models that your organization is comfortable with, whether they are cloud-, on-Premise-, or hybrid-based deployment models?
Data types : Does the product support all the data types that are part of the data sets within your organization?
Data formats : Does the product support the data formats that are currently being used in your data pipelines and ETL processes? If this is not the case, new ETL has to be written tonvert the data formats, and this could entail missing the SLAs. Data sources : Does the product work with all the data sources—streaming, files, directories, NoSQL, NewSQL, RDBMS, and mainframes—data sources that your organization relies on as raw data sources?
Dependence on HDFS : Does the product rely on Hadoopbased components, such as working with HDFS? Verifying this ensures that your organization is not adding accidental complexity to the overall deployments just to satisfy the requirements of the product you are selecting.
Schema requirements : Does the engine require schema definition up front, or can the schema be defined on read?
Query language support : This is a very important decision point. What type of query support does the engine offer out of the box, and does it satisfy the minimum query support criteria required for the system your organization is building? UDF/UDAF/UDTF support : To the preceding point, is it possible to build out unsupported features or functionality
using UDFs, UDAFs, UDTFs, and the language in which these
functionalities can be developed?
Support triggers : Does the system support triggers for alerts and notifications as a typical RDBMS would do, if this is an important criterion for your organization and workflows?
Security : What are the minimum security requirements, in terms of data encryption for data at rest, data in motion, and authentication and authorization policies and ACLs permitted by the system in consideration, and do they match what your current organizational policies are.
Concurrency : This is a very important consideration, in addition to that of latency. Understanding how the concurrency affects latency and the workload and whether the engine can be made scalable by adding more nodes with increasing concurrency must be thoroughly examined, and the system tuned, to get the best performance.
Compression : It is important to understand what kind of compression formats are supported by the engine and what the pros and cons of each of them are. Also important to evaluate are the performance characteristics of the engine for some of the chosen compression formats that are to be used by your organization.
Secondary indexes : Most SQL engines do not support secondary indexes from an architecture perspective. Depending on your use cases and your data, it is wise to include in your POC workloads that which would perform better with secondary index support. Then you should evaluate the performances of the engines without secondary index support and compare the trade-offs. Massive table join : Most MPP engines shy away or do not perform too well with joins across massive tables on a distributed platform. However, there could be cases in your organization in which you will require them. There are multiple ways to avoid this, but in the case where it cannot be avoided, it is best to learn the gotchas and test the robustness and latency of the SQL engine in consideration of such massive joins.
Optimizers : This is an internal component of most SQL engines, over which the end user has no control. However, it is still best to understand the kind of optimizations that the SQL
engine would achieve for the different workloads that your use cases require.
Performance tuning : Before selecting an SQL engine, it is vital to know the kinds of performance tuning knobs and controls available to the end user, to improve performance and identify
the bottlenecks and how they can be avoided.
Data import/export capabilities : It is important to know the different ways of ingesting and exporting data from the SQL engines and the latencies associated with them, making sure
that the processes are acceptable and the latencies are within the SLA bounds of your organization.
Data locality : How does the engine under consideration handle data locality issues? Does it totally disregard it, but still perform very well? This becomes especially important when
these engines are deployed on the cloud.
Hardware : The kind of hardware platform—in terms of CPU, memory, I/O and networking capabilities—that the SQL engine recommends for best practices and most optimum performance is an important consideration from budgetary, data growth and scalability perspectives.
Fault tolerance : Questions such as how is fault tolerance achieved by the architecture of the engine and its components are very important to consider clearly, in order to provide reliability and uptime-related metrics and SLAs to the end customers who are relying on the SQL engine for their activities.
CLI/API/ODBC/JDBC : What kind of APIs (REST) and driver support does the SQL engine have to connect with external applications? This may be a non-blocking issue in terms of
the product selection, but an important consideration when selecting the SQL engine of choice.
Tool support : Understanding the tooling support—admin, monitoring, maintenance, troubleshooting, data ingestion— of the SQL engine is very important for debugging and troubleshooting purposes, as well as for regular automation, monitoring, and admin-related functionalities.