Purpose: to classify each tweet as positive or negative and write it to the output file, which will contain the username, original tweet and mood of the tweet.
The code:
    import re,math
    input_file="raw_data.csv"
    fileout=open("Output.txt","w")
    wordFile=open("words.txt","w")
    expression=r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    fileAFINN = 'AFINN-111.txt'
    afinn = dict(map(lambda (w, s): (w, int(s)), [ws.strip().split('\t') for ws in open(fileAFINN)]))
    pattern=re.compile(r'\w+')
    pattern_split = re.compile(r"\W+")
    words = pattern_split.split(input_file.lower())
    print "File processing started"
    with open(input_file,'r') as myfile:
        for line in myfile:
            line = line.lower()
            line=re.sub(expression," ",line)
            words = pattern_split.split(line.lower())
            sentiments = map(lambda word: afinn.get(word, 0), words)
Problem: Currently, the Output.txt file looks like this:
    abc some tweet text 0
    bcd some more tweets 1
    efg some more tweet 0
Question 1: How do I add a comma between the user ID, the tweet text and the mood? The output should look like this:
    abc,some tweet text,0
    bcd,some other tweet,1
    efg,more tweets,0
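To make the target format concrete, here is a minimal sketch of it, assuming Python 3 and that the user ID, tweet text and mood are already available for each tweet. `format_rows` is a hypothetical helper used only for illustration; it is not part of the code above. It uses the standard `csv` module so that a tweet containing a comma is quoted rather than splitting into extra columns.

```python
import csv
import io


def format_rows(rows):
    """Return (user, tweet, mood) tuples as comma-separated text, one row per line."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps plain newlines instead of the csv default "\r\n"
    writer = csv.writer(buffer, lineterminator="\n")
    for user, tweet, mood in rows:
        writer.writerow([user, tweet, mood])
    return buffer.getvalue()


# Hypothetical sample rows mirroring the desired output
rows = [
    ("abc", "some tweet text", 0),
    ("bcd", "some other tweet", 1),
    ("efg", "more tweets", 0),
]
print(format_rows(rows), end="")
```

To write straight to a file instead of a string, the same `csv.writer` can be given a file opened with `newline=""`.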
Question 2: The tweets are in Bahasa Melayu (BM), but the AFINN word list I use contains only English words, so the classification is incorrect. Do you know of any BM sentiment dictionary that I can use?
Question 3: How do I package this code as a JAR file?
Thanks.