Just some stats on compression codecs combined with file types (Hive STORED AS), to help check which one to choose.
To compress data in HDFS, there are two ways:
First, compress the source files (gzip/bzip2) and either store them in the Hive table, or copy them to HDFS and build a table on top of them (catalog). This is usually done with LOAD DATA (the combinations of LOAD DATA and storage type are provided below).
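A minimal sketch of this first approach, assuming a comma-delimited source file. The table names (gz_text_tbl, bz2_text_ext), the columns, and the HDFS paths are hypothetical and only there for illustration:

```sql
-- Hypothetical table, columns, and paths; adjust to the real data.
CREATE TABLE gz_text_tbl (
  id INT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA moves the file into the table's directory as-is; it does not
-- re-encode it. A TEXTFILE table can read .gz and .bz2 files directly.
LOAD DATA INPATH '/user/hive/staging/data.csv.gz' INTO TABLE gz_text_tbl;

-- Alternative: leave the compressed files in HDFS and catalog them in place.
CREATE EXTERNAL TABLE bz2_text_ext (
  id INT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/staging/bz2_data/';
```

One thing to keep in mind with this route: a gzip file is not splittable, so each .gz file is read by a single mapper, while bzip2 files can be split.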
Second, data can be compressed as map/reduce output (query output) by setting the required codec, and that compressed output is then inserted into the table:
insert overwrite table seqbzip_block_tbl select * from texttable;
The select from the text table generates compressed files as its output, and that output is used as the input to the target table.
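For reference, the session settings for this second approach might look like the sketch below. It assumes seqbzip_block_tbl was created STORED AS SEQUENCEFILE and that bzip2 with block compression is wanted (a guess based on the table name). The property names are the older mapred.* ones:

```sql
-- Compress the final output of the query.
SET hive.exec.compress.output=true;
-- Codec for the output files (assumption: bzip2, going by the table name).
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
-- For SequenceFile output: compress in blocks rather than per record.
SET mapred.output.compression.type=BLOCK;

INSERT OVERWRITE TABLE seqbzip_block_tbl SELECT * FROM texttable;
```

On newer Hadoop versions the same settings go by mapreduce.output.fileoutputformat.compress.codec and mapreduce.output.fileoutputformat.compress.type.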
Again, the different combinations and their results are provided in the table below (more of a truth table).
Finally, I ran a test to figure out the best combination, using a 9.8 GB file on a 24-node cluster. From a storage point of view, ORCFILE/GZipCodec seems to be the go-to choice. In my test, ORCFILE/GZipCodec was also the winner on time taken for a full table scan. The widely used combination to 'store' the data is ORC/Gzip, and for queries the codec for intermediate results seems to be Snappy.
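A rough sketch of that combination is shown below. It assumes the ORC/Gzip result corresponds to ORC's own ZLIB setting (the same deflate algorithm gzip uses), since ORC takes its compression from the orc.compress table property. The table name orc_zlib_tbl is hypothetical:

```sql
-- Compress intermediate (between-stage) data with Snappy.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Store the data as ORC with ZLIB compression (hypothetical table name).
CREATE TABLE orc_zlib_tbl
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB")
AS SELECT * FROM texttable;
```

The split reflects the trade-off: the stored data gets the heavier codec for size, while the short-lived intermediate data gets a faster one.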