Big Data SQL: How to Choose a File Format

Each file format is optimized for a particular goal, so the choice of format is driven by use case, environment, and workload. Factors to consider when choosing a file format include the following:
• Hadoop Distribution: Note that Cloudera and Hortonworks support and favor different formats.
• Schema Evolution: Consider whether the structure of the data will evolve over time.
• Processing Requirements: Consider the processing load on the data and the tools that will be used to process it.
• Read/Write Requirements: What are the read/write patterns? Is the data read-only, read-write, or write-only?
• Exporting/Extraction Requirements: Will the data be extracted from Hadoop for import into an external database engine or another platform?
• Storage Requirements: Is data volume a significant factor? Will compression significantly reduce your storage footprint?
If you are storing intermediate data between MapReduce jobs, SequenceFiles are preferred. If query performance is most important, ORC (Hortonworks/Hive) or Parquet (Cloudera/Impala) is optimal, but note that these files take longer to create and cannot be updated.
Avro is the right choice if the schema will change over time, but query performance will be slower than with ORC or Parquet. CSV files are excellent when extracting data from Hadoop to load into a database.
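To make the extraction point concrete, here is a minimal sketch, using only Python's standard-library csv module, of dumping rows to a headered CSV file that a typical database bulk loader can ingest (the rows, column names, and file name are all illustrative, not from the text):

```python
# Sketch: export rows as a CSV file suitable for a database bulk load.
# The data and file name below are made-up examples.
import csv

rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

# Write a header row plus one line per record; newline="" avoids
# blank lines on Windows.
with open("export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(rows)

# Round-trip check: read the file back as dictionaries.
with open("export.csv", newline="") as f:
    loaded = list(csv.DictReader(f))
```

Because CSV is plain text with no embedded schema or compression, almost every external engine can consume it directly, which is exactly why it suits extraction rather than analytics.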