I have 400 million tweets (in fact, I think it's almost like 450, but it doesn't matter), in the form:
T "timestamp"
U "username"
W "actual tweet"
I want to write them to a file initially in the form "username \ t tweet", and then upload them to the database. The problem is that before uploading to db there are several things I need to do: 1. Preprocess the tweet in advance to remove RT @ [names] and URLs 2. Extract the username from "http: // twitter. com / username ".
I am using python and this is the code. Please let me know how this can be done faster :)
'''The aim is to take all the tweets of a user and store them in a table. Do this for all the users and then lets see what we can do with it
What you wanna do is that you want to get enough information about a user so that you can profile them better. So , lets get started
'''
def regexSub(line):
line = re.sub(regRT,'',line)
line = re.sub(regAt,'',line)
line = line.lstrip(' ')
line = re.sub(regHttp,'',line)
return line
def userName(line):
return line.split('http://twitter.com/')[1]
import sys,os,itertools,re
data = open(sys.argv[1],'r')
processed = open(sys.argv[2],'w')
global regRT
regRT = 'RT'
global regHttp
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
global regAt
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')
for line1,line2,line3 in itertools.izip_longest(*[data]*3):
line1 = line1.split('\t')[1]
line2 = line2.split('\t')[1]
line3 = line3.split('\t')[1]
try:
tweet=regexSub(line3)
user = userName(line2)
except:
print 'Line2 is ',line2
print 'Line3 is',line3
processed.write(user.strip("\n")+"\t"+tweet)
I ran the code as follows:
python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt
This is the result that I get: (Warning: it is quite large, and I did not filter out small values, thinking that they might also have some value)
Sat Jan 7 03:28:51 2012 profile_dump
3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
528840744 166.402 0.000 166.402 0.000 {method 'split' of 'str' objects}
396630560 81.300 0.000 81.300 0.000 {method 'get' of 'dict' objects}
396630560 326.349 0.000 439.737 0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558 255.662 0.000 1297.705 0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558 602.307 0.000 602.307 0.000 {built-in method sub}
264420442 32.087 0.000 32.087 0.000 {isinstance}
132210186 34.700 0.000 34.700 0.000 {method 'lstrip' of 'str' objects}
132210186 27.296 0.000 27.296 0.000 {method 'strip' of 'str' objects}
132210186 181.287 0.000 1513.691 0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186 79.950 0.000 79.950 0.000 {method 'write' of 'file' objects}
132210186 55.900 0.000 113.960 0.000 TwitterScripts/Preprocessing.py:10(userName)
313/304 0.000 0.000 0.000 0.000 {len}
, (, 1, 3 ..)
, , . !