Monday, November 16, 2015

Hive Cheat Sheet


--Create Partitioned table from a non partitioned table
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- cities.pipes is pipe-delimited, so declare the field delimiter
create table user.city (id int, name string, countrycode string, district string, population string) row format delimited fields terminated by '|';
load data local inpath '/home/user/data/cities.pipes' into table user.city;

create external table part_city (id int, name string, district string, population int) partitioned by (countrycode string) location '/user/user/fads/city';

insert overwrite table part_city partition(countrycode) select id, name, district, population, countrycode from city;

***Careful, insert overwrite replaces any existing data in the target table/partitions; use insert into instead if you want to append
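
For reference, the append form of the same load (same part_city table and dynamic-partition settings as above):

insert into table part_city partition(countrycode) select id, name, district, population, countrycode from city;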


--Create Partitioned table from a non partitioned table  (Compressed)
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

create external table part_city (id string, name string, district string, population string) partitioned by (countrycode string) stored as orc location '/user/user/fads/city';
insert overwrite table part_city partition(countrycode) select id, name, district, population, countrycode from city;
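
Note that ORC manages its own compression through a table property (ZLIB by default), separately from the mapred.output.compression settings above. A minimal sketch, with a hypothetical table name and location:

create external table part_city_orc (id string, name string, district string, population string) partitioned by (countrycode string) stored as orc location '/user/user/fads/city_orc' tblproperties ("orc.compress"="ZLIB");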

--Equivalent of drop column: in columns() list only the columns you want to retain; the underlying data files are not deleted
alter table city replace columns(id string, name string, countrycode string, district string, population string);
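
For example, to actually drop the district column (a hypothetical variation on the statement above), list everything except it:

alter table city replace columns(id string, name string, countrycode string, population string);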

--Create Table from existing table
create table dupcity like city;
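
like copies only the schema, not the rows. To copy the data as well, insert into the copy, or use create-table-as-select (dupcity2 is a hypothetical name):

insert into table dupcity select * from city;
create table dupcity2 as select * from city;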

--Get insights into how the table was created
show create table city;
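
describe formatted gives a similar breakdown, including location, SerDe and storage properties:

describe formatted city;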


--Use UDF, transform and create a new table
add file /path/to/udf.py;
insert into table temp select transform(id, name, countrycode, district, population) using 'python udf.py' AS (id, name, countrycode, district, population) from city;
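
The target table has to exist before the insert; a minimal sketch of it, assuming plain string columns:

create table temp (id string, name string, countrycode string, district string, population string);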


The examples use the city data found in the MySQL example databases.

Python code for the UDF used (it expands a few abbreviated country codes):

import sys

# Map of abbreviated country codes to full names
lkp = {'VAT': 'Vatican', 'SYR': 'Syria', 'BRA': 'Brazil', 'PSE': 'Palestine'}

for line in sys.stdin:
    fields = line.strip('\n').split('\t')
    # lkp.get(field, field) swaps in the full name when the field is a known code, else keeps it as-is
    print '\t'.join(map(lkp.get, fields, fields))

Tuesday, November 10, 2015

Comparison of Compression codecs in Hive


Some quick stats on compression codecs paired with Hive file types (the stored as clause), to help decide which combination to choose.

To compress data in HDFS, there are 2 ways:

First, compress the source files themselves (gz/bzip2) and either load them into the Hive table or copy them to HDFS and build a table over them (catalog), usually using load data (the combinations of load data and storage type are provided in the table below).
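
A minimal sketch of this first approach (the file name, the gzip step and the path are assumptions; texttable is the text-format table used later in this post):

--Compress the source file beforehand, e.g. gzip /home/user/data/cities.pipes
--Hive reads the .gz file transparently on select (gzip files are not splittable, though)
load data local inpath '/home/user/data/cities.pipes.gz' into table texttable;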

Secondly, data can be compressed at the map/reduce (query) output stage: set the required codec and use that compressed output to insert into the table.

set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
set mapred.output.compression.type=BLOCK;

insert overwrite table seqbzip_block_tbl select * from texttable;
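
For the insert above to work, the target would have been created as a SequenceFile table; a sketch, assuming the same columns as the text table:

create table seqbzip_block_tbl (id string, name string, countrycode string, district string, population string) stored as sequencefile;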

The select from the text table generates compressed files as its output, and that output is used as the data of the target table.
Again, the different combinations and their results are provided in the table below (more of a truth table).

Finally, I ran a test to figure out the best combination using a 9.8 GB file on a 24-node cluster. From a storage point of view, ORCFILE/GzipCodec seems to be the go-to choice, and my test indicates it is also the winner on full table scan time. The widely used combination to 'store' the data is ORC/Gzip, while for the intermediate results of queries the codec of choice seems to be Snappy.
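
A rough sketch of that recommendation (the city_orc name is hypothetical; property names follow the older mapred.* style used elsewhere in this post):

--Store the data as ORC; ZLIB is the ORC default codec
create table city_orc stored as orc tblproperties ("orc.compress"="ZLIB") as select * from city;

--Compress intermediate query results with Snappy
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;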