Bigdata SQL: Unstructured Data

Data that has no associated structure or metadata is classified as unstructured. This covers both textual data (e.g., e-mail, blogs, wiki posts, Word and PDF documents, social media tweets) and nontextual data (e.g., images, audio, videos).

More often than not, unstructured data is noisy, and one major challenge of working with it is cleaning it before it can be used for analytics. For example, before doing Natural Language Processing (NLP) on textual data, the text has to be tokenized (split into individual words or terms), stop words have to be removed, and stemming algorithms have to be applied. Only then is the data in a form to which more sophisticated algorithms can be applied to extract meaning from the textual content.
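The cleaning steps named above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library: the stop-word list is deliberately tiny, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm, so none of the names below come from a particular NLP library.

```python
import re

# Tiny illustrative stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "on"}

# Naive suffix list standing in for a real stemming algorithm (an assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]
```

Calling `preprocess("The users are posting tweets")` runs the three steps in order. Keeping each step as its own function makes it easy to swap in a proper stemmer or a fuller stop-word list later.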

Unlike SQL on structured data, SQL on semi-structured and unstructured data requires the data to be transformed into a structure that SQL engines can interpret and operate on. SQL stands for “Structured Query Language,” which signals that it is a language designed to work on structured data.
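As a rough sketch of what such a transformation can look like, the Python example below pulls fields out of free-text lines with a regular expression and loads them into an in-memory SQLite table so that ordinary SQL can run over them. The line format, the `posts` table, and its column names are assumptions made up for illustration, and SQLite merely stands in for a SQL engine here.

```python
import re
import sqlite3

# Hypothetical free-text messages; the format is an assumption for illustration.
RAW_LINES = [
    "2021-03-01 alice: loving the new release",
    "2021-03-01 bob: upgrade failed again",
    "2021-03-02 alice: fixed after a restart",
]

# Date, author, and message body, separated as "<date> <author>: <body>".
LINE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\w+): (.*)$")


def to_rows(lines):
    """Extract (day, author, body) tuples, skipping lines that do not match."""
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            rows.append(match.groups())
    return rows


def posts_per_author(lines):
    """Load the extracted rows into a table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (day TEXT, author TEXT, body TEXT)")
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", to_rows(lines))
    result = conn.execute(
        "SELECT author, COUNT(*) FROM posts GROUP BY author ORDER BY author"
    ).fetchall()
    conn.close()
    return result
```

The SQL statement itself is unremarkable. All of the work specific to unstructured data happens in `to_rows`, which is where a schema is imposed on the text.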

Technologies such as Apache Drill and SparkSQL have evolved, and are evolving further, to bring the rich features of SQL to semi-structured data like JSON. The architecture of SQL engines, and how they perform SQL over semi-structured and unstructured data, is discussed in more detail later.
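To make the challenge with semi-structured data concrete, the Python sketch below flattens nested JSON records whose fields vary from record to record into a single tabular shape, filling missing fields with `None`. It is an illustration of the general idea only, not a description of how Apache Drill or SparkSQL work internally. The sample records and the dotted column naming are assumptions.

```python
import json

# Hypothetical JSON lines; note that the two records do not share all fields.
RAW = [
    '{"id": 1, "user": {"name": "alice", "city": "Oslo"}}',
    '{"id": 2, "user": {"name": "bob"}, "tags": ["sql", "json"]}',
]


def flatten(value, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, item in value.items():
        path = f"{prefix}{key}"
        if isinstance(item, dict):
            flat.update(flatten(item, prefix=path + "."))
        else:
            flat[path] = item
    return flat


def to_table(json_lines):
    """Return (columns, rows), using the union of all keys as the columns."""
    records = [flatten(json.loads(line)) for line in json_lines]
    columns = sorted({key for record in records for key in record})
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows
```

The schema here is discovered from the data rather than declared up front, which is the main difference from loading a conventional relational table. Arrays such as `tags` are left as single values in this sketch.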