Preprocessing 400 Million Tweets in Python - Faster

I have 400 million tweets (actually closer to 450 million, but it doesn't matter), in this form:

T    "timestamp"
U    "username"
W    "actual tweet"

I want to write them out to a file in the form "username\ttweet" first, and then load them into a database. The problem is that before loading into the DB there are a couple of things I need to do:

1. Preprocess the tweet to remove the "RT @[name]" parts and the URLs.
2. Extract the username from "http://twitter.com/username".

I am using Python, and this is the code. Please let me know how this can be made faster :)

'''The aim is to take all the tweets of a user and store them in a table. Do this for all the users, and then let's see what we can do with it.
   What you want is to get enough information about each user to profile them better. So, let's get started.
'''
def regexSub(line):
    line = re.sub(regRT,'',line)
    line = re.sub(regAt,'',line)
    line = line.lstrip(' ')
    line = re.sub(regHttp,'',line)
    return line
def userName(line):
    return line.split('http://twitter.com/')[1]


import sys,os,itertools,re
data = open(sys.argv[1],'r')
processed = open(sys.argv[2],'w')
global regRT 
regRT = 'RT'
global regHttp 
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
global regAt 
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')

for line1,line2,line3 in itertools.izip_longest(*[data]*3):
    line1 = line1.split('\t')[1]
    line2 = line2.split('\t')[1]
    line3 = line3.split('\t')[1]

    #print 'line1',line1
    #print 'line2=',line2
    #print 'line3=',line3
    #print 'line3 before preprocessing',line3
    try:
        tweet=regexSub(line3)
        user = userName(line2)
    except:
        print 'Line2 is ',line2
        print 'Line3 is',line3
        continue   # otherwise a failed record writes the stale user/tweet below

    #print 'line3 after processig',line3
    processed.write(user.strip("\n")+"\t"+tweet)

I ran the code as follows:

python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt

This is the result I get (warning: it is quite long, and I did not filter out the small values, since they might matter too):

Sat Jan  7 03:28:51 2012    profile_dump

         3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
528840744  166.402    0.000  166.402    0.000 {method 'split' of 'str' objects}
396630560   81.300    0.000   81.300    0.000 {method 'get' of 'dict' objects}
396630560  326.349    0.000  439.737    0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558  255.662    0.000 1297.705    0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558  602.307    0.000  602.307    0.000 {built-in method sub}
264420442   32.087    0.000   32.087    0.000 {isinstance}
132210186   34.700    0.000   34.700    0.000 {method 'lstrip' of 'str' objects}
132210186   27.296    0.000   27.296    0.000 {method 'strip' of 'str' objects}
132210186  181.287    0.000 1513.691    0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186   79.950    0.000   79.950    0.000 {method 'write' of 'file' objects}
132210186   55.900    0.000  113.960    0.000 TwitterScripts/Preprocessing.py:10(userName)
  313/304    0.000    0.000    0.000    0.000 {len}


+5

4 answers

This screams for multiprocessing.

You have a pipeline that splits naturally into steps. One Process reads the raw file and puts batches of lines on a queue.

A Process that takes each batch and splits out the tab-separated fields.

A Process that takes the (T, U, W) triples, cleans the tweet, extracts the username, and puts the result on an output queue.

Etc...

A Read-Transform-Write job is exactly the shape multiprocessing handles well. Each step runs in its own process, on its own core, and the queues between them keep the data flowing.

Tune the number of worker processes to the number of cores you have; beyond that, extra processes just add overhead.

Whether it pays off depends on whether you are CPU-bound or I/O-bound here. Only measuring will tell; profile it again afterwards.
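A minimal sketch of such a pipeline, assuming Python 2.7 as in the question; the stage functions, queue sizes and worker count are placeholders to tune, and the combined regex is only a stand-in for the real cleanup:

from multiprocessing import Process, Queue
import itertools
import re
import sys

SENTINEL = None   # marks end-of-stream on a queue

# Stand-in cleanup: the question's three substitutions folded into one pass.
regAll = re.compile(r'RT|@\w+|http://\S+')

def reader(path, out_q, n_workers):
    # Stage 1: group the raw file into line triples and enqueue (U, W).
    data = open(path)
    for t, u, w in itertools.izip_longest(*[data]*3):
        out_q.put((u, w))
    data.close()
    for _ in xrange(n_workers):
        out_q.put(SENTINEL)   # one sentinel per worker

def worker(in_q, out_q):
    # Stage 2: extract the username and clean the tweet text.
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        u, w = item
        try:
            user = u.split('\t')[1].split('http://twitter.com/')[1].strip()
            tweet = regAll.sub('', w.split('\t')[1]).lstrip()
        except (IndexError, AttributeError):
            continue   # skip malformed records
        out_q.put(user + '\t' + tweet)

def writer(in_q, path, n_workers):
    # Stage 3: a single process owns the output file.
    out = open(path, 'w')
    finished = 0
    while finished < n_workers:
        item = in_q.get()
        if item is SENTINEL:
            finished += 1
        else:
            out.write(item)
    out.close()

if __name__ == '__main__':
    n = 4   # tune to your core count
    raw_q, done_q = Queue(maxsize=10000), Queue(maxsize=10000)
    procs = [Process(target=reader, args=(sys.argv[1], raw_q, n)),
             Process(target=writer, args=(done_q, sys.argv[2], n))]
    procs += [Process(target=worker, args=(raw_q, done_q)) for _ in xrange(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

The one-sentinel-per-worker pattern is what lets every stage shut down cleanly once the reader runs out of input.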

+7

The regex passes are not the only cost; you are scanning every tweet several times, and at 400 million tweets that adds up fast.

If you really want speed, this is the kind of job lex + yacc were made for: a scanner makes a single pass over the input. There is a Python lex + yacc (PLY), and if that is still not fast enough, the classic C tools are.

At this scale the read/write side matters too, so measure your disk throughput before blaming the CPU.

It is more work up front than tweaking regexes, but if you will be running jobs like this regularly, it pays for itself.
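For illustration, a rough single-pass scanner written with PLY; the token names and patterns here are my own guesses, not a grammar taken from the answer:

import ply.lex as lex

tokens = ('URL', 'MENTION', 'RT', 'TEXT')

# Rules are tried in definition order; rules that return nothing
# discard their match, which deletes URLs, mentions and RT markers.
def t_URL(t):
    r'http://\S+'
    pass   # discard

def t_MENTION(t):
    r'@\w+'
    pass   # discard

def t_RT(t):
    r'RT\b'
    pass   # discard

def t_TEXT(t):
    r'\S+'
    return t   # keep everything else

t_ignore = ' \t\n'   # skip whitespace between tokens

def t_error(t):
    t.lexer.skip(1)   # drop any stray character

lexer = lex.lex()

def clean(tweet):
    # One pass over the tweet; rebuild it from the kept tokens.
    lexer.input(tweet)
    return ' '.join(tok.value for tok in iter(lexer.token, None))

# clean('RT @bob check http://t.co/x out')  ->  'check out'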

+3

Watch out for str.lstrip, which does not do what you probably expect; it strips any of the given characters, not a prefix:

>>> 'http://twitter.com/twitty'.lstrip('http://twitter.com/')
'y'

From the docs:

S.lstrip([chars]) -> string or unicode

Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
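To drop an exact prefix, a tiny helper along these lines is safe (my illustration, not part of the standard library; str.removeprefix only exists from Python 3.9):

def strip_prefix(s, prefix):
    # Removes prefix only when it is really there; unlike lstrip,
    # this never eats characters out of what follows.
    if s.startswith(prefix):
        return s[len(prefix):]
    return s

# strip_prefix('http://twitter.com/twitty', 'http://twitter.com/')
# returns 'twitty', not 'y'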
+3

Before you reach for multiprocessing, take a look at regexSub. You run several substitutions over every line, and each pass builds a new intermediate string.

Combine them all into a single regex, something like:

regAll = re.compile(r'RT|(^[ \t]+)|((http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?)|...')

(The goal is to replace not only everything you currently do with re.sub, but the lstrip as well.) I have ended the template with ...: you will need to fill in the details yourself.

Then replace regexSub as follows:

line = regAll.sub('', line)

Of course, only profiling will show whether it is actually faster, but I expect it will be, since fewer intermediate strings are created.
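One possible way to fill in that template, purely as an assumption about what the final pattern could look like:

import re

# Guessed completion: bare 'RT' also hits RT inside words, exactly
# like the original regRT; use r'\bRT\b' if that matters to you.
regAll = re.compile(r'RT|(^[ \t]+)|(http://[a-zA-Z0-9./]+)|(@\w+)')

def regexSub(line):
    # One pass instead of three subs plus an lstrip.
    return regAll.sub('', line)

# regexSub('RT @bob check http://t.co/x\n') returns '  check \n';
# the leftover inner spaces come from the removed tokens.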

+1