Friday, October 16, 2015

Another word count - Hive UDF with Python


Word count seems to be the go-to example for getting started with regular expressions, Python and map/reduce, and the reason is clear: it has become the "Hello World" of map/reduce.

Input: alice.txt, downloaded from http://www.gutenberg.org. To keep the output readable, I show only the top 10 rows, using LIMIT in Hive and head on the command line.

Steps:
1. Create a table in Hive to store the book and load the text file
2. Create a Python script to count the words and print the result
3. Use the Python script as a UDF (why? regular expressions, loops and other goodies become available)

1. Create a table in Hive to store the book and load the text file

hive> create table alice (line String) row format delimited lines terminated by '\n';
OK
Time taken: 0.196 seconds

hive> load data local inpath '/home/username/python/alice.txt' into table alice;
Loading data to table database.alice
Table database.alice stats: [numFiles=1,  totalSize=163771]
OK
Time taken: 0.757 seconds

2. Create a Python script to count the words and print the result (I have saved it as 1udf.py)

------------------------------------------------------------------------------------
import sys
import re

count = {}
# a SELECT ... TRANSFORM in Hive pipes each row to the script, so read from stdin
for line in sys.stdin:
    # lowercase, split on single spaces, then strip punctuation from each token
    words = map(lambda x: re.sub(r'[^\w\s]', '', x), line.lower().strip().split(' '))
    for w in words:
        count[w] = count.get(w, 0) + 1

# print words sorted by frequency, highest first (Python 2 print statement)
for x in sorted(count.items(), key=lambda z: z[1], reverse=True):
    print x[0], '\t', x[1]
------------------------------------------------------------------------------------
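The script above is Python 2 (print statement, map returning a list). A Python 3 equivalent can be sketched with collections.Counter; this is my rewrite, not the version used in the post:

```python
import re
import sys
from collections import Counter

def count_words(lines):
    """Lowercase, split on single spaces, strip punctuation, and tally."""
    count = Counter()
    for line in lines:
        count.update(re.sub(r'[^\w\s]', '', w)
                     for w in line.lower().strip().split(' '))
    return count

if __name__ == '__main__':
    # same contract as 1udf.py: rows on stdin, tab-separated counts on stdout
    for word, n in count_words(sys.stdin).most_common():
        print(word + '\t' + str(n))
```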
The script can be tested like so:
$ cat alice.txt  | python 1udf.py | head -n 10
the     1804
        1251
and     912
to      800
a       685
of      625
it      541
she     538
said    462
you     429
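Notice the second-highest "word" is a blank token with count 1251. That is most likely because the script splits on a literal space: `str.split(' ')` emits empty strings for blank lines and consecutive spaces, unlike `split()` with no argument. A quick check of that behavior:

```python
# split on a literal space keeps empty tokens; split() with no argument drops them
line = "down  the rabbit-hole"      # note the double space after "down"
print(line.split(' '))              # ['down', '', 'the', 'rabbit-hole']
print(line.split())                 # ['down', 'the', 'rabbit-hole']
print("".split(' '))                # [''] -- a blank line yields one empty token
```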

3. Use the python script as part of TRANSFORM 

hive> add file /home/username/python/1udf.py; (Important! See errors encountered)

Added resources: [/home/username/python/1udf.py]

hive> select transform(line) using 'python 1udf.py' as t_line, count from alice limit 10;

Stage-Stage-1: Map: 2   Cumulative CPU: 4.9 sec   HDFS Read: 164257 HDFS Write: 126 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 900 msec
OK
the     1804
        1251
and     912
to      800
a       685
of      625
it      541
she     538
said    462
you     429
Time taken: 16.577 seconds, Fetched: 10 row(s)
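Under the hood, TRANSFORM works like Hadoop streaming: Hive writes each input row to the script's stdin and parses the script's stdout back into columns (here t_line and count, split on tab). The handshake can be simulated locally with subprocess; the inline counter below is a stand-in for 1udf.py, not Hive's actual plumbing:

```python
import subprocess
import sys

# A tiny stand-in for 1udf.py: count words read from stdin, print "word\tcount" lines
udf = ("import sys, collections;"
       "c = collections.Counter(w for l in sys.stdin for w in l.split());"
       "print('\\n'.join(k + '\\t' + str(v) for k, v in c.most_common()))")

rows = "the cat sat\nthe cat\n"  # two "table rows", newline-delimited, as Hive sends them
out = subprocess.run([sys.executable, '-c', udf], input=rows,
                     capture_output=True, text=True).stdout
print(out)
```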

Using the Tez engine:

hive> set hive.execution.engine=tez;
hive> select transform(line) using 'python 1udf.py' as t_line , count from alice limit 10;
Query ID = username_20151016120505_20e5f39d-7dcc-4dd2-8247-baa8e57b49b9
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1443556515457_25523)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 5.32 s
--------------------------------------------------------------------------------
OK
the     1804
        1251
and     912
to      800
a       685
of      625
it      541
she     538
said    462
you     429
Time taken: 13.724 seconds, Fetched: 10 row(s)



Errors Encountered:
FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. An error occurred when trying to close the Operator running your custom script
Reason: I ran the query with the absolute path to the script, as in "select transform(line) using 'python /home/username/python/1udf.py' as t_line, count from alice limit 10;"
Solution: run "add file /home/username/python/1udf.py;" first so that the script is available in the Hive cache,
then drop the absolute path from the query and use just the script name:

hive> select transform(line) using 'python 1udf.py' as t_line, count from alice limit 10;
