For some reason, word count seems to be the go-to first example for regular expressions, Python, and MapReduce. Well, the reason is clear: it has become the "Hello World" of MapReduce.
Input: alice.txt, downloaded from http://www.gutenberg.org. To keep the output readable, I show only the top 10 rows using limit and head.
Steps:
1. Create a table in hive to store the book and load the text file
2. Create a python script to count the words and display
3. Use the python script as a UDF (why? regular expressions, loops, and other goodies are available)
1. Create a table in hive to store the book and load the text file
hive> create table alice (line String) row format delimited lines terminated by '\n';
OK
Time taken: 0.196 seconds
hive> load data local inpath '/home/username/python/alice.txt' into table alice;
Loading data to table database.alice
Table database.alice stats: [numFiles=1, totalSize=163771]
OK
Time taken: 0.757 seconds
2. Create a python script to count the words and display
------------------------------------------------------------------------------------
import sys
import re

count = {}
# a select statement in hive prints to stdout, so the script reads from stdin
for line in sys.stdin:
    words = map(lambda x: re.sub(r'[^\w\s]', '', x), line.lower().strip().split(' '))
    for w in words:
        count[w] = count.get(w, 0) + 1

for x in sorted(count.items(), key=lambda z: z[1], reverse=True):
    print x[0], '\t', x[1]
------------------------------------------------------------------------------------
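For reference, here is the same logic as a Python 3-compatible function (a sketch only; the script above targets Python 2, and the helper name word_counts is my own):

```python
import re
from collections import Counter

def word_counts(lines):
    """Mirror the script above: lowercase each line, strip punctuation
    from each space-separated token, and tally, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(re.sub(r'[^\w\s]', '', w)
                      for w in line.lower().strip().split(' '))
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Note: tokens that were pure punctuation (or gaps left by doubled
# spaces) become empty strings, which is why a blank "word" shows up
# in the real output with a large count.
```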
The script can be tested from the shell like so (the blank word on the second row of output comes from empty tokens left by blank lines, doubled spaces, and punctuation-only tokens):
$ cat alice.txt | python 1udf.py | head -n 10
the 1804
1251
and 912
to 800
a 685
of 625
it 541
she 538
said 462
you 429
3. Use the python script as part of TRANSFORM
hive> add file /home/username/python/1udf.py; (Important! See errors encountered)
Added resources: [/home/username/python/1udf.py]
hive> select transform(line) using 'python 1udf.py' as t_line, count from alice limit 10;
Stage-Stage-1: Map: 2 Cumulative CPU: 4.9 sec HDFS Read: 164257 HDFS Write: 126 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 900 msec
OK
the 1804
1251
and 912
to 800
a 685
of 625
it 541
she 538
said 462
you 429
Time taken: 16.577 seconds, Fetched: 10 row(s)
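TRANSFORM simply streams each input row to the script's stdin and splits each stdout line on tab to recover the output columns. That hand-off can be simulated locally with subprocess (a sketch; the inline script is a Python 3 rewrite standing in for 1udf.py, and the sample rows are made up):

```python
import subprocess
import sys
import tempfile

# Hive's TRANSFORM pipes rows to the script's stdin, one per line, and
# turns each tab-separated stdout line back into columns. Simulate that:
SCRIPT = r"""
import sys, re
count = {}
for line in sys.stdin:
    for w in (re.sub(r'[^\w\s]', '', x) for x in line.lower().strip().split(' ')):
        count[w] = count.get(w, 0) + 1
for word, n in sorted(count.items(), key=lambda kv: kv[1], reverse=True):
    print(word + '\t' + str(n))
"""

with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write(SCRIPT)
    path = f.name

rows = "the cat and the hat\nthe cat\n"  # stand-in for the table's lines
out = subprocess.run([sys.executable, path], input=rows,
                     capture_output=True, text=True).stdout
rows_out = [line.split('\t') for line in out.splitlines()]
print(rows_out[0])  # most frequent word first: ['the', '3']
```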
Using the Tez engine:
hive> set hive.execution.engine=tez;
hive> select transform(line) using 'python 1udf.py' as t_line , count from alice limit 10;
Query ID = username_20151016120505_20e5f39d-7dcc-4dd2-8247-baa8e57b49b9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1443556515457_25523)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 5.32 s
--------------------------------------------------------------------------------
OK
the 1804
1251
and 912
to 800
a 685
of 625
it 541
she 538
said 462
you 429
Time taken: 13.724 seconds, Fetched: 10 row(s)
Errors encountered:
FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. An error occurred when trying to close the Operator running your custom script
Reason: I used the absolute path to the script in the query: "select transform(line) using 'python /home/username/python/1udf.py' as t_line, count from alice limit 10;"
Solution: run add file /home/username/python/1udf.py first, so that the script is shipped to hive's distributed cache, then drop the absolute path in the query and use just the script name:
select transform(line) using 'python 1udf.py' as t_line, count from alice limit 10;