How to use a file in a Hadoop streaming job using Python?

I want to read a list from a file in my streaming job. Here is my simple mapper.py:

    #!/usr/bin/env python
    import sys
    import json

    def read_file():
        id_list = []
        # read ids from a file
        f = open('../user_ids', 'r')
        for line in f:
            line = line.strip()
            id_list.append(line)
        return id_list

    if __name__ == '__main__':
        id_list = set(read_file())
        # input comes from STDIN (standard input)
        for line in sys.stdin:
            # remove leading and trailing whitespace
            line = line.strip()
            line = json.loads(line)
            user_id = line['user']['id']
            if str(user_id) in id_list:
                print '%s\t%s' % (user_id, line)

and here is my reducer.py:

    #!/usr/bin/env python
    from operator import itemgetter
    import sys

    current_id = None
    current_list = []
    id = None

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # parse the input we got from mapper.py
        id, line = line.split('\t', 1)
        # this IF-switch only works because Hadoop sorts map output
        # by key (here: id) before it is passed to the reducer
        if current_id == id:
            current_list.append(line)
        else:
            if current_id:
                # write result to STDOUT
                print '%s\t%s' % (current_id, current_list)
            current_id = id
            current_list = [line]

    # do not forget to output the last id if needed!
    if current_id == id:
        print '%s\t%s' % (current_id, current_list)

Now, to run it, I use:

    hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
        -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
        -input test/input.txt -output test/output -file '../user_ids'

The job launches:

    13/11/07 05:04:52 INFO streaming.StreamJob:  map 0%  reduce 0%
    13/11/07 05:05:21 INFO streaming.StreamJob:  map 100%  reduce 100%
    13/11/07 05:05:21 INFO streaming.StreamJob: To kill this job, run:

but I get an error message:

    job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
    LastFailedTask: task_201309172143_1390_m_000001
    13/11/07 05:05:21 INFO streaming.StreamJob: killJob...

When I do not read the identifiers from the file ../user_ids, the job runs without errors. I think the problem is that it cannot find my ../user_ids file. I also tried a location in HDFS and it still didn't work. Thank you for your help.

+7
python hadoop hadoop-streaming
2 answers
    hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
        -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
        -input test/input.txt -output test/output -file '../user_ids'

Does ../user_ids exist on your local file path when you submit the job? If so, you need to change the mapper code to account for the fact that this file will be available in the mapper's local working directory at runtime:

 f = open('user_ids','r') 
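To make the same mapper work both locally and under streaming, a small helper can try the working-directory basename first and fall back to the original relative path. This is a minimal sketch; the name open_shipped_file is my own, not part of Hadoop or the question's code:

```python
import os

def open_shipped_file(path):
    # Hadoop streaming symlinks a -file side file into the task's working
    # directory under its basename, so try that first; fall back to the
    # original relative path so the mapper also runs in local tests.
    base = os.path.basename(path)
    if os.path.exists(base):
        return open(base, 'r')
    return open(path, 'r')
```

In read_file(), the call open('../user_ids', 'r') would then become open_shipped_file('../user_ids').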
+10

Try giving the full path to the file, or, when running the hadoop command, make sure you are in the same directory as the user_ids file.
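For local runs, one way to avoid depending on the launch directory is to resolve the relative path against the script's own location. A sketch, assuming mapper.py and user_ids keep the same relative layout; under streaming with -file, the basename in the working directory is still what the task sees:

```python
import os

# resolve ../user_ids relative to this script's directory, not the
# directory the hadoop command happens to be launched from
script_dir = os.path.dirname(os.path.abspath(__file__))
user_ids_path = os.path.normpath(os.path.join(script_dir, '..', 'user_ids'))
```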

+1
