Wednesday, October 21, 2015

Hive Compression options and examples


The Hadoop/Hive/MapReduce paradigm leads to lots of I/O, lots of network traffic between nodes, and inherently lots of storage (not to mention the redundant storage that aids fault tolerance).

It only makes sense to compress the data so we reduce all three of the factors mentioned above. The downside? The CPU cycles spent compressing and decompressing, and the memory that goes with it. From what I have observed on a 24-CPU, 126 GB RAM cluster, compression worked well for external tables (flat files coming from an RDBMS), and there were real performance gains.

To begin with, the following parameters in Hive deal with compression. This is also a good time to mention that there are 414 parameters you can set in Hive (version 0.14.0.2.2.4.2-2); a quick way to list them is sketched right after the parameters below.

hive.exec.compress.intermediate
hive.exec.compress.output

hive.exec.orc.compression.strategy
hive.exec.orc.default.compress

mapred.map.output.compression.codec
mapred.output.compression.codec
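
If you want to eyeball the full list yourself, something like this works from the shell (a rough sketch: set -v prints Hadoop and MapReduce properties too, so the exact count depends on how you filter):

$ hive -e 'set -v' 2>/dev/null | grep -c '^hive\.'
$ hive -e 'set -v' 2>/dev/null | grep '^hive\.' | grep compress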

Let's not worry about the ORC file format for now. That leaves us four compression parameters to play with (the compression codecs themselves - gzip, bzip2, Snappy, etc. - we'll talk about in a bit).

Check the defaults set in your installation for the above parameters; the config file is /etc/hive/conf/hive-site.xml.
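
You can also query any parameter straight from the Hive CLI instead of digging through the XML; set with just the parameter name echoes its current value:

hive> set hive.exec.compress.intermediate;
hive.exec.compress.intermediate=false
hive> set hive.exec.compress.output;
hive.exec.compress.output=false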

In my installation:

hive.exec.compress.intermediate is set to false, which means the files produced by intermediate map steps are not compressed - set this to true so that the intermediate data takes up less I/O and network bandwidth.

hive.exec.compress.output is false; this parameter decides whether the final output written to HDFS will be compressed.

The parameters mapred.map.output.compression.codec and mapred.output.compression.codec live in the config file /etc/hive/conf/mapred-site.xml. MapReduce is one type of application that can run on a Hadoop platform, so settings that apply to every app using the MapReduce framework go in here.
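
For reference, a codec entry in that file is the usual Hadoop property stanza (an illustrative snippet - the Snappy value here is just one possible choice):

<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>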

Noticing how the config is split between MapReduce compression and Hive's overall compression gives a fair idea of where and how compression applies.

Hive, as an application, can use the Tez and Spark engines as well (controlled by set hive.execution.engine=[mr,tez,spark]); compression for Tez and Spark is out of scope in this discussion. What this means is that MapReduce supports compression, and we have a few choices for setting the compression codec at the MapReduce framework level.
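
To confirm (or pin down) which engine your session is using before playing with the MapReduce settings below:

hive> set hive.execution.engine;
hive.execution.engine=mr
hive> set hive.execution.engine=mr;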

If I were to write out the flow chart, it would go something like this:

hive.exec.compress.intermediate   [true/false]
(should Hive ask the execution engine to compress intermediate results?)
        if true and the engine is MR (hive.execution.engine=mr)
                mapreduce.map.output.compress   [true/false]
                (should map output be compressed before it is sent across the network to the reducers?)
                if true
                        mapreduce.map.output.compress.codec   [Snappy/LZO/GzipCodec/BZip2Codec]
                        (the codec to be used for intermediate compression)

hive.exec.compress.output   [true/false]
(should Hive ask the execution engine to produce compressed final output?)
        if true and the engine is MR
                mapreduce.output.fileoutputformat.compress   [true/false]
                (should the job output be compressed?)
                if true
                        mapreduce.output.fileoutputformat.compress.type   [RECORD/BLOCK/NONE]
                        (at what level should the compression be done? applies to SequenceFile output)
                        mapreduce.output.fileoutputformat.compress.codec   [Snappy/LZO/GzipCodec/BZip2Codec]
                        (the codec to be used for the final output)
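
Translated into an actual session, the two branches of that chart look something like this (a sketch - the codec choices are just examples):

hive> set hive.exec.compress.intermediate=true;
hive> set mapreduce.map.output.compress=true;
hive> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;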


Let's take an example:

hive> create table alice (line String) row format delimited fields terminated by '\n';
OK
Time taken: 0.167 seconds

# Note: I have not compressed the input file alice.txt (since we are only looking at output and intermediate compression). If I were to compress the input as well, I would:
$ bzip2 alice.txt
hive> load data local inpath '/home/user/data/alice.txt.bz2' into table alice;
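
Hive reads that bzip2 file transparently - MapReduce picks the codec from the .bz2 extension and decompresses on the fly - and bzip2 has the added advantage of being splittable, so a large compressed file can still be fed to multiple mappers. A quick sanity check against the compressed table:

hive> select count(*) from alice;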

hive> load data local inpath '/home/user/data/alice.txt' into table alice;
Loading data to table fads.alice
Table fads.alice stats: [numFiles=1, totalSize=163742]
OK
Time taken: 0.756 seconds

hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

hive> INSERT OVERWRITE DIRECTORY '/hdfs/path/alice/output' SELECT * FROM alice;
......
.....
Stage-Stage-1: Map: 1   Cumulative CPU: 1.58 sec   HDFS Read: 163979 HDFS Write: 163742 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 580 msec

$ hadoop fs -ls output/
Found 1 items
-rw-r--r--   3 user hdfs      48962 2015-10-21 15:35 output/000000_0.bz2
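
To read the compressed result back from HDFS, just pipe it through bzip2 on the way out:

$ hadoop fs -cat output/000000_0.bz2 | bzip2 -dc | head -3

Also worth noticing the size: the bzip2 output is about 49 KB against the ~160 KB of raw text read in, roughly a 3x reduction on this file.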




