I have a bunch of plaintext tweets, a sample of which is shown below. I want to extract only the text part of each tweet.
FILE DATA SAMPLE -
Fri Nov 13 20:27:16 +0000 2015 4181010297 rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who going to next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue? @ golden bee
This is my attempt at the preprocess stage -
import glob

for filename in glob.glob('*.txt'):
    with open("plain text - preprocesshurricane.txt", 'a') as outfile, open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp = tweet.split(' ')
            text = ""
            for i in temp:
                x = str(i)
                # keep only purely alphabetic tokens
                if x.isalpha():
                    text += x + ' '
            print(text)
OUTPUT -
Fri Nov rt treating one of you lads to this denim simply follow rt to
Fri Nov this album is so proud of i loved this it really is the
Fri Nov international break is garbage boring and your players get
Fri Nov get weather updates from the weather
Fri Nov woah what happened to twitter this update is
Fri Nov completed the daily quest in paradise island
Fri Nov new henderson memorial public
Fri Nov going to next
Fri Nov why so golden
This output is not what I want, because:
1. It drops any numbers / digits that appear in the text part of the tweet.
2. Each line still starts with "Fri Nov".
Could you suggest a better way to achieve this? I'm not too familiar with regex, but I guess we could use something like re.search(r'2015(magic to remove tweetID)\w*', tweet)
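One possible approach, sketched under the assumption that every line starts with a timestamp in exactly the format shown in the sample, followed by a numeric tweet ID and then the text. The names TWEET_RE and extract_text are made up for illustration. Instead of filtering tokens with isalpha(), the pattern anchors on the fixed prefix and captures everything after it, so digits and punctuation in the tweet text survive.

```python
import re

# Assumed line layout (taken from the sample, not from any spec):
#   <Day> <Mon> <DD> <HH:MM:SS> <+ZZZZ> <YYYY> <tweet id> <text>
TWEET_RE = re.compile(
    r'^\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4} \d{4}'  # timestamp
    r'\s+\d+'    # numeric tweet ID
    r'\s+(.*)$'  # everything that remains is the tweet text
)


def extract_text(line):
    """Return the text part of one line, or None if the line does not match."""
    match = TWEET_RE.match(line.strip())
    return match.group(1) if match else None


sample = ("Fri Nov 13 20:27:20 +0000 2015 3168571911 "
          "get weather updates from the weather channel. 15:27:19")
print(extract_text(sample))
# prints: get weather updates from the weather channel. 15:27:19
```

Calling extract_text(tweet) in place of the inner token loop, and writing any non-None result to outfile, should cover both problems listed above. Lines that do not fit the assumed layout return None and can be skipped or logged.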
python string text
Mayur h