Friday, October 16, 2015

Another word count - Hive UDF with Python


Word count seems to be the go-to example for getting started with regular expressions, Python and map/reduce, and the reason is clear: it has become the "Hello World" of map/reduce.

Input: alice.txt, downloaded from http://www.gutenberg.org. To keep the output readable, I show only the top 10 rows, using LIMIT in Hive and head on the command line.

Steps:
1. Create a table in Hive to store the book and load the text file
2. Create a Python script to count the words and print the result
3. Use the Python script as a UDF (why? regular expressions, loops and other goodies become available)

1. Create a table in Hive to store the book and load the text file

hive> create table alice (line String) row format delimited lines terminated by '\n';
OK
Time taken: 0.196 seconds

hive> load data local inpath '/home/username/python/alice.txt' into table alice;
Loading data to table database.alice
Table database.alice stats: [numFiles=1,  totalSize=163771]
OK
Time taken: 0.757 seconds

2. Create a Python script to count the words and print the result (I have saved it as 1udf.py)

------------------------------------------------------------------------------------
import sys
import re

count = {}
# a SELECT ... TRANSFORM in Hive pipes each row to the script, so read from stdin
for line in sys.stdin:
    # lowercase, split on single spaces, then strip punctuation from each token
    words = map(lambda x: re.sub(r'[^\w\s]', '', x), line.lower().strip().split(' '))
    for w in words:
        count[w] = count.get(w, 0) + 1

# print words sorted by frequency, highest first (Python 2 print statement)
for x in sorted(count.items(), key=lambda z: z[1], reverse=True):
    print x[0], '\t', x[1]
------------------------------------------------------------------------------------
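The script above is Python 2 (print statement, map returning a list). A Python 3 equivalent can be sketched with collections.Counter; this is my rewrite, not the version used in the post:

```python
import re
import sys
from collections import Counter

def count_words(lines):
    """Lowercase, split on single spaces, strip punctuation, and tally."""
    count = Counter()
    for line in lines:
        count.update(re.sub(r'[^\w\s]', '', w)
                     for w in line.lower().strip().split(' '))
    return count

if __name__ == '__main__':
    # same contract as 1udf.py: rows on stdin, tab-separated counts on stdout
    for word, n in count_words(sys.stdin).most_common():
        print(word + '\t' + str(n))
```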
The script can be tested like so:
$ cat alice.txt  | python 1udf.py | head -n 10
the     1804
        1251
and     912
to      800
a       685
of      625
it      541
she     538
said    462
you     429
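Notice the second-highest "word" is a blank token with count 1251. That is most likely because the script splits on a literal space: `str.split(' ')` emits empty strings for blank lines and consecutive spaces, unlike `split()` with no argument. A quick check of that behavior:

```python
# split on a literal space keeps empty tokens; split() with no argument drops them
line = "down  the rabbit-hole"      # note the double space after "down"
print(line.split(' '))              # ['down', '', 'the', 'rabbit-hole']
print(line.split())                 # ['down', 'the', 'rabbit-hole']
print("".split(' '))                # [''] -- a blank line yields one empty token
```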

3. Use the python script as part of TRANSFORM 

hive> add file /home/username/python/1udf.py; (Important! See errors encountered)

Added resources: [/home/username/python/1udf.py]

hive> select transform(line) using 'python 1udf.py' as t_line, count from alice limit 10;

Stage-Stage-1: Map: 2   Cumulative CPU: 4.9 sec   HDFS Read: 164257 HDFS Write: 126 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 900 msec
OK
the     1804
        1251
and     912
to      800
a       685
of      625
it      541
she     538
said    462
you     429
Time taken: 16.577 seconds, Fetched: 10 row(s)
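Under the hood, TRANSFORM works like Hadoop streaming: Hive writes each input row to the script's stdin and parses the script's stdout back into columns (here t_line and count, split on tab). The handshake can be simulated locally with subprocess; the inline counter below is a stand-in for 1udf.py, not Hive's actual plumbing:

```python
import subprocess
import sys

# A tiny stand-in for 1udf.py: count words read from stdin, print "word\tcount" lines
udf = ("import sys, collections;"
       "c = collections.Counter(w for l in sys.stdin for w in l.split());"
       "print('\\n'.join(k + '\\t' + str(v) for k, v in c.most_common()))")

rows = "the cat sat\nthe cat\n"  # two "table rows", newline-delimited, as Hive sends them
out = subprocess.run([sys.executable, '-c', udf], input=rows,
                     capture_output=True, text=True).stdout
print(out)
```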

Using the Tez engine:

hive> set hive.execution.engine=tez;
hive> select transform(line) using 'python 1udf.py' as t_line , count from alice limit 10;
Query ID = username_20151016120505_20e5f39d-7dcc-4dd2-8247-baa8e57b49b9
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1443556515457_25523)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 5.32 s
--------------------------------------------------------------------------------
OK
the     1804
        1251
and     912
to      800
a       685
of      625
it      541
she     538
said    462
you     429
Time taken: 13.724 seconds, Fetched: 10 row(s)



Errors Encountered:
FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. An error occurred when trying to close the Operator running your custom script
Reason: I ran the query with the absolute path to the script, as in "select transform(line) using 'python /home/username/python/1udf.py' as t_line, count from alice limit 10;"
Solution: run "add file /home/username/python/1udf.py;" first so that the script is available in the Hive cache,
then drop the absolute path from the query and use just the script name:

hive> select transform(line) using 'python 1udf.py' as t_line, count from alice limit 10;
