Wednesday, October 28, 2015

Hive String to Date using Python UDF

After trying to extract the date from various formats using HiveQL and Pig, it was time for a UDF. I don't know much Java, so writing a UDF class and registering it as a temporary function didn't seem appealing. Thankfully, we have Python!

So the input was something like this (a sample; the actual data cannot be pasted here):

1904287	Customer 1	Jan 11, 2003
96391595	Customer 2	6/17/1969
2236067	Customer 3	08/22/54

hive> create table cust (id int, name String, mydate String) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
Time taken: 0.149 seconds
hive>


hive> load data local inpath '/home/my/data/sample.txt' into table cust;
Loading data to table fads.cust
Table fads.cust stats: [numFiles=1, numRows=0, totalSize=91, rawDataSize=0]
OK
Time taken: 0.736 seconds

hive> select * from cust;
OK
1904287 Customer 1      Jan 11, 2003
96391595        Customer 2      6/17/1969
2236067 Customer 3      08/22/54
Time taken: 0.043 seconds, Fetched: 3 row(s)


So now the fun begins: the data is loaded. We write a Python program to clean up these dates. Note that the program only handles certain date formats; it is shown as an example rather than a full-fledged data conversion program, and no exceptions are handled either.

import re
import sys

# Map month abbreviations to month numbers: Jan -> 1 ... Dec -> 12
monthsstr = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
monthdict = {}
for index, m in enumerate(monthsstr.split(), 1):
    monthdict[m] = index

def correct_date(mydate):
    # Normalize all separators (runs of spaces, commas, slashes,
    # hyphens) to a single space, then split into three parts.
    year = 0
    sdate = re.sub('( +|,|/|-)', ' ', mydate)
    splitdate = sdate.split()

    # Expand a 2-digit year into a 4-digit one.
    if len(splitdate[2].strip()) == 2:
        year = int(splitdate[2])
        if year > 14:
            year += 1900
        else:
            year += 2000
        mydate = splitdate[0] + ' ' + splitdate[1] + ' ' + str(year)

    # Replace a spelled-out month (e.g. "Jan") with its number.
    if re.search('[a-zA-Z]', sdate):
        month = monthdict[splitdate[0]]
        if len(splitdate[2].strip()) == 4:
            mydate = str(month) + ' ' + splitdate[1] + ' ' + splitdate[2]
        else:
            mydate = str(month) + ' ' + splitdate[1] + ' ' + str(year)

    return re.sub(' ', '/', mydate.strip())

# Hive's TRANSFORM feeds each row to stdin as tab-separated fields.
for line in sys.stdin:
    fields = line.split('\t')
    print fields[0], ',', fields[1], ',', correct_date(fields[2].strip())
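As an aside, the hand-rolled parsing above could also be done with Python's datetime.strptime by trying a list of candidate formats. This is a minimal sketch (not part of the original script); note that %y applies the POSIX two-digit-year pivot (69-99 map to the 1900s, 00-68 to the 2000s), so "08/22/54" comes out as 2054.

```python
from datetime import datetime

# Candidate formats matching the sample data. Order matters:
# "%m/%d/%y" must be tried before "%m/%d/%Y" so a 2-digit year
# is not parsed as the literal year 54.
FORMATS = ("%b %d, %Y", "%m/%d/%y", "%m/%d/%Y")

def parse_date(raw):
    """Return the date as MM/DD/YYYY, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None

print(parse_date("Jan 11, 2003"))  # 01/11/2003
print(parse_date("08/22/54"))      # 08/22/2054 (POSIX 2-digit pivot)
```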

We read the data from stdin because Hive's TRANSFORM clause writes each selected row to the script's stdin as a tab-separated field list. We split on the tab, take the third field, do our date conversion, and write the row back to stdout.
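That per-row contract can be sketched on its own. The helper below is hypothetical (not part of the original script): each input line is one row with columns joined by tabs, and each line written out becomes one output row.

```python
def transform_rows(lines, convert):
    """Mimic the Hive TRANSFORM contract: each input line is one row
    with tab-separated columns; re-emit the first two columns plus
    the converted third column, comma-separated like the script does."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        out.append("%s , %s , %s" % (fields[0], fields[1], convert(fields[2])))
    return out

# One sample row, exactly as Hive would pipe it to the script's stdin.
rows = ["1904287\tCustomer 1\tJan 11, 2003\n"]
print(transform_rows(rows, str.strip))
```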

hive> add file /home/pgms/python/cdate.py;
Added resources: [/home/pgms/python/cdate.py]

hive> select transform(id, name, mydate) using 'python cdate.py' as mdate from cust;
*log4j details here***
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.76 sec   HDFS Read: 328 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 760 msec
OK
1904287 , Customer 1 , 1/11/2003
96391595 , Customer 2  , 6/17/1969
2236067 , Customer 3 , 08/22/2054
Time taken: 20.125 seconds, Fetched: 3 row(s)

The dates are converted; now we store the results in HDFS:

hive> insert overwrite DIRECTORY  '/user/myuser/output' select transform(id, name, mydate) using 'python cdate.py' as mdate from cust;

 
Moving data to: /user/myuser/output
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.49 sec   HDFS Read: 328 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 490 msec
OK
Time taken: 13.334 seconds
hive>


$ hadoop fs -cat output/000000_0
1904287 , Customer 1 , 1/11/2003
96391595 , Customer 2  , 6/17/1969
2236067 , Customer 3 , 08/22/2054

Twisted, but you can do things in Hadoop using Python and not just Java.



