Hive is built for extensibility: it can interact with different data formats, and it lets end users plug in new data-transformation functionality through UDFs (user-defined functions), UDTFs (user-defined table functions), and UDAFs (user-defined aggregate functions). We will not discuss the details of UDFs, UDTFs, and UDAFs here, but we will take a look at the concept of SerDe (serializer/deserializer).
Serialization is the process of converting data into a byte stream that can be transmitted over the network or written to storage. Serialization is very important within the Hadoop ecosystem because a compact serialized form reduces the data footprint, resulting in less storage and faster data transfer. Extensibility components such as the SerDe and ObjectInspector interfaces give Hive the capability to integrate with different data types and legacy data.
Serialization converts structured data into its raw form, while deserialization reconstructs the structured form from the raw byte stream. Hive uses deserialization to read data from a table and serialization to write data to HDFS. Hive has a number of built-in SerDe classes, and it supports building custom serializers and deserializers for your own use case.
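The round trip described above can be sketched in a few lines. This is only an illustration of the idea, not Hive's implementation: the two-column row, the byte layout, and the function names are assumptions made for the example.

```python
import struct

# Hypothetical two-column row: (id: int, name: str). The layout is illustrative only.
HEADER = ">iH"  # 4-byte big-endian int for id, 2-byte unsigned length of the name


def serialize(row):
    """Pack a structured (id, name) row into raw bytes."""
    row_id, name = row
    name_bytes = name.encode("utf-8")
    return struct.pack(HEADER, row_id, len(name_bytes)) + name_bytes


def deserialize(raw):
    """Rebuild the structured (id, name) row from raw bytes."""
    row_id, name_len = struct.unpack_from(HEADER, raw, 0)
    start = struct.calcsize(HEADER)
    name = raw[start:start + name_len].decode("utf-8")
    return (row_id, name)


row = (42, "hive")
raw = serialize(row)         # structured form -> byte stream (the write path)
restored = deserialize(raw)  # byte stream -> structured form (the read path)
```

A custom SerDe plays the same two roles, except that the byte layout is whatever the underlying file format dictates.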
Hive was originally built to work with MapReduce file formats, such as SequenceFile and plain text files. The move to the ORC file format was conceived to reduce I/O by reading only the columns that a query requires and to support efficient columnar compression.
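To see why a columnar layout reduces I/O, consider a toy comparison. The sketch below is not the ORC format; the table, the tab-separated encoding, and the variable names are assumptions chosen only to contrast the two layouts.

```python
# Hypothetical table with three columns and three rows.
rows = [
    (1, "alice", "2024-01-01"),
    (2, "bob", "2024-01-02"),
    (3, "carol", "2024-01-03"),
]


def encode(value):
    return str(value).encode("utf-8")


# Row-oriented layout: the fields of each record are stored together.
row_layout = [b"\t".join(encode(v) for v in row) for row in rows]

# Column-oriented layout: the values of each column are stored together.
column_layout = [b"\t".join(encode(v) for v in column) for column in zip(*rows)]

# A query that needs only the first column:
# the row layout forces a scan of every record,
# the column layout touches just the one column.
bytes_read_row_layout = sum(len(record) for record in row_layout)
bytes_read_column_layout = len(column_layout[0])
```

Values of the same type also sit next to each other in the column layout, which is what makes per-column compression effective.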
Before we go on to the next part, a brief comment on terms such as InputFormat and OutputFormat is worthwhile. An InputFormat defines how data is read from a raw input file into the Mapper. Because raw data can be in any format, the InputFormat conveys the file format to the Mapper object. Typical InputFormats include TextInputFormat, KeyValueTextInputFormat, and NLineInputFormat.
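As a rough illustration of what an InputFormat hands to the Mapper, the sketch below imitates two of the formats named above. TextInputFormat presents each line keyed by its byte offset in the file, and KeyValueTextInputFormat splits each line into a key and a value at the first separator (a tab by default). The Python function names are hypothetical stand-ins, not Hadoop's API.

```python
import io


def text_records(stream):
    """Yield (byte_offset, line) pairs, in the spirit of TextInputFormat."""
    offset = 0
    for raw_line in stream:
        yield offset, raw_line.rstrip(b"\n").decode("utf-8")
        offset += len(raw_line)


def key_value_records(stream, separator="\t"):
    """Yield (key, value) pairs split at the first separator,
    in the spirit of KeyValueTextInputFormat."""
    for _, line in text_records(stream):
        key, _, value = line.partition(separator)
        yield key, value


data = b"k1\tv1\nk2\tv2\n"
by_offset = list(text_records(io.BytesIO(data)))
by_key = list(key_value_records(io.BytesIO(data)))
```

The same bytes produce different records depending on the reader, which is exactly the information the InputFormat contributes.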
The output side follows a similar concept: an OutputFormat defines how the Reducer writes out its data.
Any record read by the InputFormat class in Hive is converted by the deserializer to a Java object, and an ObjectInspector maps this Java object to Hive's internal type system.
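The read path just described (raw record, then deserializer, then inspector) can be sketched as follows. Every name here is hypothetical and mirrors only the roles involved, not Hive's Java interfaces; the comma-separated record format is likewise an assumption for the example.

```python
# Hypothetical stand-ins for the roles in the read path; not Hive's Java API.

def deserialize(record: bytes) -> list:
    """Deserializer: turn one raw record into an in-memory row object."""
    return record.decode("utf-8").split(",")


class StructInspector:
    """Inspector: knows the schema and how to pull typed fields out of a row object."""

    def __init__(self, schema):
        self.schema = schema  # list of (field_name, python_type) pairs

    def get_field(self, row, name):
        for index, (field_name, field_type) in enumerate(self.schema):
            if field_name == name:
                return field_type(row[index])
        raise KeyError(name)


inspector = StructInspector([("id", int), ("name", str)])
row = deserialize(b"7,hive")  # the record as handed over by the input format
row_id = inspector.get_field(row, "id")
row_name = inspector.get_field(row, "name")
```

Keeping the inspector separate from the row object means the engine can work with whatever in-memory representation a deserializer produces, as long as an inspector can describe it.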
By default, when no InputFormat, OutputFormat, or SerDe is specified, Hive falls back to a built-in SerDe class called LazySimpleSerDe. The ObjectInspector is the glue between Hive's data types and the file formats. Hive has a multitude of ObjectInspectors to support Java primitives and collections of Java objects.
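To give a feel for the default SerDe mentioned above, here is a minimal sketch of delimited text with lazy field parsing. It assumes Hive's default text conventions (Ctrl-A as the field delimiter and \N as the NULL marker); the class and function names are hypothetical and the real LazySimpleSerDe is considerably more involved.

```python
FIELD_DELIMITER = "\x01"  # Ctrl-A, the default field delimiter for Hive text tables
NULL_MARKER = "\\N"       # default text representation of NULL


class LazyRow:
    """Keeps the raw line and splits it only when a field is first requested."""

    def __init__(self, line):
        self._line = line
        self._fields = None  # nothing parsed yet

    def get(self, index):
        if self._fields is None:
            self._fields = self._line.split(FIELD_DELIMITER)
        value = self._fields[index]
        return None if value == NULL_MARKER else value


def serialize_row(values):
    """Write a row back out as delimited text."""
    return FIELD_DELIMITER.join(NULL_MARKER if v is None else str(v) for v in values)


line = serialize_row([1, "hive", None])
lazy = LazyRow(line)
parsed_before_access = lazy._fields is not None  # False: no work done until a field is read
name = lazy.get(1)
missing = lazy.get(2)
```

Deferring the parse is the point of the "lazy" in the name: rows that a query filters out early never pay the cost of full field conversion.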