Bigdata SQL: How to Choose an SQL-on-Big-Data Solution

Owing to the surfeit of products and tools in the SQL-on-Hadoop space, choosing the right one is not an easy task by any measure. Some of the points to consider when selecting a tool or product are listed below. The list includes questions that must be answered from the architectural requirements, service level agreements (SLAs), and deployment options for the tool.

• What are the latency requirements?

• What are the fault tolerance requirements?

• Deployment options: Does the tool have to be installed across all data nodes in the cluster? Does the tool require a separate cluster? Can the tool be used on the cloud? The answers can have budgeting, SLA, and security implications.

• Hardware requirements: Does the tool require special CPU chipsets, a minimum amount of memory, or specific HDD/SSD storage?

• How does the tool handle node failures? How does the tool handle adding new nodes? How does the tool handle adding new data sets?

• Processing requirements: Does the tool require special processing of the data before it can be used?

• Analytical/SQL feature capabilities: Not all tools are ANSI SQL compliant, and not all of them support analytic/window functions.

• Can the tool handle semi-structured/unstructured data?

• Can the tool handle streaming analytics with streaming data?

• Extensibility capabilities of the tool: How easy or difficult is it to add new features, such as UDFs (user-defined functions), to the tool?

• Pricing: Some tools are priced according to the number of nodes, others by the volume of data they ingest or process.

• How mature is the tool? How large are its community and customer base, and what feedback do existing customers give?

• Does it support real-time operational analytics?

• Can it perform reliable, real-time updates?

• Can it support concurrent user activity consistently with no deadlocks, etc.?

• Can it support authentication and integration with security frameworks?

• Does it provide connectivity through drivers/APIs?

• Can it handle compressed data?

• Does it support secondary indexes?

• What kind of join algorithms does it use to speed up large joins?

• What kind of SQL query and query execution optimization does it offer?
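
One way to put a checklist like the one above to work is a weighted scoring matrix. The following Python sketch is only an illustration and is not a method prescribed here: the criterion names, the weights, the 0-5 rating scale, the must-have rule, and the candidates tool_a and tool_b are all hypothetical placeholders that an evaluation team would replace with its own values.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float            # relative importance, chosen by the evaluator (assumption)
    must_have: bool = False  # a rating of 0 on a must-have disqualifies the tool


def score_tool(criteria, ratings):
    """Return the weighted average rating (0-5 scale), or None if disqualified."""
    total_weight = sum(c.weight for c in criteria)
    weighted = 0.0
    for c in criteria:
        rating = ratings.get(c.name, 0)  # an unrated criterion counts as 0
        if c.must_have and rating == 0:
            return None
        weighted += c.weight * rating
    return weighted / total_weight


# Hypothetical criteria drawn from a few of the questions in the list.
criteria = [
    Criterion("latency", 3.0, must_have=True),
    Criterion("window_functions", 2.0),
    Criterion("security_integration", 2.0, must_have=True),
    Criterion("pricing", 1.0),
]

# Hypothetical ratings (0 = unsupported, 5 = excellent) for two made-up tools.
candidates = {
    "tool_a": {"latency": 4, "window_functions": 5,
               "security_integration": 3, "pricing": 2},
    "tool_b": {"latency": 5, "window_functions": 2,
               "security_integration": 0, "pricing": 5},
}

results = {name: score_tool(criteria, r) for name, r in candidates.items()}

for name, score in results.items():
    print(name, "disqualified" if score is None else round(score, 2))
```

With these made-up numbers, tool_b is disqualified because it scores 0 on a must-have criterion, even though it rates well on latency and pricing. The weights should come from the architectural requirements and SLAs, not from the tools being compared.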