Wednesday, October 28, 2015

Hive String to Date using Python UDF

After trying to extract the date from various formats using HiveQL and Pig, it was time for a UDF. I don't know much Java, so writing a UDF class and registering it as a temporary function didn't seem appealing. Thankfully, we have Python!

So the input was something like this (a sample; the actual data cannot be pasted here):

1904287	Customer 1	Jan 11, 2003
96391595	Customer 2	6/17/1969
2236067	Customer 3	08/22/54

hive> create table cust (id int, name String, mydate String) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
Time taken: 0.149 seconds
hive>


hive> load data local inpath '/home/my/data/sample.txt' into table cust;
Loading data to table fads.cust
Table fads.cust stats: [numFiles=1, numRows=0, totalSize=91, rawDataSize=0]
OK
Time taken: 0.736 seconds

hive> select * from cust;
OK
1904287 Customer 1      Jan 11, 2003
96391595        Customer 2      6/17/1969
2236067 Customer 3      08/22/54
Time taken: 0.043 seconds, Fetched: 3 row(s)


So now the fun begins: the data is loaded. We write a Python program to clean up these dates. Note that the program only handles certain date formats; it is shown as an example rather than a full-fledged data conversion program, and no exceptions are handled either.

import re
import sys

# Map month abbreviations to month numbers: Jan -> 1 ... Dec -> 12
monthsstr = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
monthdict = {}
for index, m in enumerate(monthsstr.split(), 1):
    monthdict[m] = index

def correct_date(mydate):
    # Normalize all separators (runs of spaces, commas, slashes,
    # hyphens) to a single space, then split into three parts.
    year = 0
    sdate = re.sub('( +|,|/|-)', ' ', mydate)
    splitdate = sdate.split()

    # Expand a 2-digit year into a 4-digit one.
    if len(splitdate[2].strip()) == 2:
        year = int(splitdate[2])
        if year > 14:
            year += 1900
        else:
            year += 2000
        mydate = splitdate[0] + ' ' + splitdate[1] + ' ' + str(year)

    # Replace a spelled-out month (e.g. "Jan") with its number.
    if re.search('[a-zA-Z]', sdate):
        month = monthdict[splitdate[0]]
        if len(splitdate[2].strip()) == 4:
            mydate = str(month) + ' ' + splitdate[1] + ' ' + splitdate[2]
        else:
            mydate = str(month) + ' ' + splitdate[1] + ' ' + str(year)

    return re.sub(' ', '/', mydate.strip())

# Hive's TRANSFORM feeds each row to stdin as tab-separated fields.
for line in sys.stdin:
    fields = line.split('\t')
    print fields[0], ',', fields[1], ',', correct_date(fields[2].strip())
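As an aside, the hand-rolled parsing above could also be done with Python's datetime.strptime by trying a list of candidate formats. This is a minimal sketch (not part of the original script); note that %y applies the POSIX two-digit-year pivot (69-99 map to the 1900s, 00-68 to the 2000s), so "08/22/54" comes out as 2054.

```python
from datetime import datetime

# Candidate formats matching the sample data. Order matters:
# "%m/%d/%y" must be tried before "%m/%d/%Y" so a 2-digit year
# is not parsed as the literal year 54.
FORMATS = ("%b %d, %Y", "%m/%d/%y", "%m/%d/%Y")

def parse_date(raw):
    """Return the date as MM/DD/YYYY, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None

print(parse_date("Jan 11, 2003"))  # 01/11/2003
print(parse_date("08/22/54"))      # 08/22/2054 (POSIX 2-digit pivot)
```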

We read the data from stdin because Hive's TRANSFORM clause writes each selected row to the script's stdin as a tab-separated field list. We split on the tab, take the third field, do our date conversion, and write the row back to stdout.
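That per-row contract can be sketched on its own. The helper below is hypothetical (not part of the original script): each input line is one row with columns joined by tabs, and each line written out becomes one output row.

```python
def transform_rows(lines, convert):
    """Mimic the Hive TRANSFORM contract: each input line is one row
    with tab-separated columns; re-emit the first two columns plus
    the converted third column, comma-separated like the script does."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        out.append("%s , %s , %s" % (fields[0], fields[1], convert(fields[2])))
    return out

# One sample row, exactly as Hive would pipe it to the script's stdin.
rows = ["1904287\tCustomer 1\tJan 11, 2003\n"]
print(transform_rows(rows, str.strip))
```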

hive> add file /home/pgms/python/cdate.py;
Added resources: [/home/pgms/python/cdate.py]

hive> select transform(id, name, mydate) using 'python cdate.py' as mdate from cust;
*log4j details here***
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.76 sec   HDFS Read: 328 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 760 msec
OK
1904287 , Customer 1 , 1/11/2003
96391595 , Customer 2  , 6/17/1969
2236067 , Customer 3 , 08/22/2054
Time taken: 20.125 seconds, Fetched: 3 row(s)

The dates are converted; now we store the results in HDFS:

hive> insert overwrite DIRECTORY  '/user/myuser/output' select transform(id, name, mydate) using 'python cdate.py' as mdate from cust;

 
Moving data to: /user/myuser/output
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.49 sec   HDFS Read: 328 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 490 msec
OK
Time taken: 13.334 seconds
hive>


$ hadoop fs -cat output/000000_0
1904287 , Customer 1 , 1/11/2003
96391595 , Customer 2  , 6/17/1969
2236067 , Customer 3 , 08/22/2054

Twisted, but you can do things in Hadoop using Python and not just Java.



