I am trying to write a parser for a WhatsApp conversation log. A minimal log file is included at the end of the question.
There are two types of messages in this log: regular, where the syntax is
date time: Name: Message
As you can see, the Message can continue on a new line, and the Name can contain a colon (:).
The second type of message is the "event" message, which can be of the following types:
date time: Name joined
date time: Name left
date time: Name was removed
date time: Name changed the subject to "GroupName"
date time: Name changed the group icon
I tried to write some regular expressions, but ran into several difficulties: how to handle multi-line messages, how to parse the Name field (since splitting on : does not work), how to write a regular expression that only recognizes messages from senders who are currently in the group, and finally, how to parse the event messages (for example, just checking the last word of the line is not a good idea, because a regular message can also end in "joined" or "left").
How can I parse such a log file and put everything into a dictionary?
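Ideally, the result would be a dict keyed by sender, where each sender has an "Events" list (joined, left, etc.) and a "Messages" list, for example: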
>>> datab[Sender1]['Events']
[('Joined', date1, time1), ('Left', date2, time2)]
>>> datab[Sender2]['Messages']
[(date1, time1, Message1), (date2, time2, Message2)]
29/03/14 15:48:05: John Smith changed the subject to "Test"
29/03/14 16:10:39: John Smith joined
29/03/14 16:10:40: Person:2 joined
29/03/14 16:10:40: John Smith: Hello!
29/03/14 16:11:40: Person:2: some random words,
29/03/14 16:12:40: Person3 joined
29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.
29/03/14 16:14:43: Person:2: Test message using as last word joined
29/03/14 16:15:57: Person3 left
29/03/14 16:17:16: Person3 joined
29/03/14 16:18:21: Person:2 changed the group icon
29/03/14 16:19:16: Person3 was removed
29/03/14 16:20:43: Person:2: Test message using as last word left
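For reference, here is a minimal sketch of one possible approach, not a definitive implementation. It assumes that every new entry starts with a `dd/mm/yy hh:mm:ss: ` timestamp, that the only event phrases are the five listed above, and that a sender appears in some event line before sending a message. The names `HEADER`, `EVENT` and `parse_log` are made up for illustration. Known member names are checked before the event pattern, which is what keeps a message ending in "joined" or "left" from being read as an event.

```python
import re
from collections import defaultdict

# Assumed header format: "dd/mm/yy hh:mm:ss: <rest of the line>"
HEADER = re.compile(r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}): (.*)$")
# Event phrases copied from the list in the question.
EVENT = re.compile(
    r'^(?P<name>.+?) (?P<event>joined|left|was removed|changed the group icon'
    r'|changed the subject to "(?P<subject>.*)")$'
)


def parse_log(lines):
    """Return {sender: {"Events": [...], "Messages": [...]}} for an iterable of lines."""
    datab = defaultdict(lambda: {"Events": [], "Messages": []})
    members = set()  # names currently in the group, learned from event lines
    last = None      # sender whose latest message may still receive continuation lines

    for raw in lines:
        line = raw.rstrip("\r\n")
        header = HEADER.match(line)
        if header is None:
            # No timestamp: treat the line as a continuation of the previous message.
            if last is not None:
                date, time, text = datab[last]["Messages"][-1]
                datab[last]["Messages"][-1] = (date, time, text + "\n" + line)
            continue

        date, time, rest = header.groups()

        # 1) Message from a current member. The longest matching name wins,
        #    so names that contain ":" (e.g. "Person:2") are handled.
        sender = max(
            (m for m in members if rest.startswith(m + ": ")), key=len, default=None
        )
        if sender is not None:
            datab[sender]["Messages"].append((date, time, rest[len(sender) + 2:]))
            last = sender
            continue

        # 2) Event line.
        event = EVENT.match(rest)
        if event is not None:
            name, phrase = event.group("name"), event.group("event")
            if event.group("subject") is not None:
                label = "Changed the subject"
            else:
                label = phrase.capitalize()
            datab[name]["Events"].append((label, date, time))
            if phrase in ("left", "was removed"):
                members.discard(name)
            else:
                members.add(name)
            last = None
            continue

        # 3) Fallback for an unknown sender: split on the first ": ".
        #    This guess is wrong if the unknown name itself contains ": ".
        name, sep, text = rest.partition(": ")
        if sep:
            datab[name]["Messages"].append((date, time, text))
            last = name
        else:
            last = None

    return dict(datab)
```

It can be called with any iterable of lines, for example an open file object. Known limitations of this sketch: a continuation line that itself looks like a timestamp starts a new entry, and a message from a sender who has never appeared in an event line is misread as an event if it happens to end with one of the event phrases.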