How to strip color codes used by mIRC users?

I am writing an IRC bot in Python using irclib, and I am trying to log the messages on certain channels. The problem is that some mIRC users and some bots write using color codes.
Any idea how I can strip those parts and leave only the plain ASCII text of the message?

+6
python irc
7 answers

Regular expressions are your cleanest bet, in my opinion. If you have not used them before, this is a good resource. For more information on the Python regex library, go here.

    import re
    regex = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?", re.UNICODE)

The regular expression looks for ^C (this is \x03 in ASCII; you can confirm by running chr(3) at the Python prompt), then optionally one or two digits [0-9], then optionally a comma followed by another one or two digits [0-9].

(?: ...) says not to capture what was matched inside the parentheses (since we don't need to refer back to it), ? means match 0 or 1 times, and {n,m} means match between n and m repetitions of the preceding item. Finally, \d means match [0-9].

The rest can be worked out from the links I referred to above.

    >>> regex.sub("", "blabla \x035,12to be colored text and background\x03 blabla")
    'blabla to be colored text and background blabla'

chaos' solution is similar, but it may end up consuming more than two digits, and it also won't remove any loose ^C characters that may be floating around (such as the one that closes the color command).
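
As a rough sketch of how the pieces described above fit together, the same pattern can be written with re.VERBOSE so that each part carries its own comment. The names COLOR_CODE and strip_colors are made up for this illustration and do not come from the original answer.

```python
import re

# Same pattern as in the answer, spelled out piece by piece.
# COLOR_CODE and strip_colors are hypothetical names used only in this sketch.
COLOR_CODE = re.compile(r"""
    \x03                # ^C, the mIRC color control character
    (?:                 # optional, non-capturing group for the color numbers
        \d{1,2}         # foreground: one or two digits
        (?:,\d{1,2})?   # optional comma plus one or two background digits
    )?
""", re.VERBOSE)


def strip_colors(message):
    """Return message with mIRC color codes removed."""
    return COLOR_CODE.sub("", message)


print(strip_colors("blabla \x035,12to be colored text and background\x03 blabla"))
```

Because the digits are capped at two, a message such as "\x03123" keeps its trailing "3", which is most likely the behavior you want when real text starts with a digit.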

+12

The second-rated and subsequent suggestions are defective because they look for digits after just any character, not specifically after the color code character.

I have improved and combined all the posts, with the following results:

  • the reverse character is removed
  • color codes are removed without leaving digits in the text.

Solution:

regex = re.compile(r"\x1f|\x02|\x12|\x0f|\x16|\x03(?:\d{1,2}(?:,\d{1,2})?)?", re.UNICODE)
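
To illustrate the defect mentioned at the top of this answer: in a pattern made of alternatives, the optional digits group binds only to the alternative it directly follows. The sketch below compares the combined pattern with a variant where the group trails \x0f instead of \x03. The names GOOD, DEFECTIVE and sample are invented for this example.

```python
import re

# Digits group attached to \x03, as in the combined pattern above.
GOOD = re.compile(r"\x1f|\x02|\x12|\x0f|\x16|\x03(?:\d{1,2}(?:,\d{1,2})?)?")
# Digits group attached to \x0f, which is not the color character.
DEFECTIVE = re.compile(r"\x1f|\x02|\x03|\x16|\x0f(?:\d{1,2}(?:,\d{1,2})?)?")

# A color code (^C 4,1), some text, a \x0f, then ordinary text starting with digits.
sample = "\x034,1red on black\x0f42 apples"

print(repr(GOOD.sub("", sample)))       # color numbers removed, "42" kept
print(repr(DEFECTIVE.sub("", sample)))  # "4,1" left behind, "42" eaten
```

The second pattern strips the bare ^C but leaves "4,1" in the text, and it swallows the legitimate "42" that follows \x0f.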

+7
    p = re.compile("\x03\d+(?:,\d+)?")
    p.sub('', text)
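
A small sketch of how this pattern differs from one that caps the digits at two and makes them optional. The names UNBOUNDED, BOUNDED and text are hypothetical, and the sample message is invented for the comparison.

```python
import re

# Pattern from this answer: any number of digits, and digits are required.
UNBOUNDED = re.compile(r"\x03\d+(?:,\d+)?")
# Variant with at most two digits per color and an optional digits group.
BOUNDED = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?")

# Color 04 immediately followed by text that starts with digits,
# then a bare ^C that ends the coloring.
text = "\x0304" + "2024 release\x03 notes"

print(repr(UNBOUNDED.sub("", text)))
print(repr(BOUNDED.sub("", text)))
```

With the unbounded pattern the "2024" is consumed along with the color number and the closing ^C stays in the string. The bounded pattern removes only "\x0304" and the lone ^C.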
+1

As I found this question useful, I figured I would contribute.

I added a few things to the regex:

 regex = re.compile("\x1f|\x02|\x03|\x16|\x0f(?:\d{1,2}(?:,\d{1,2})?)?", re.UNICODE) 

\x16 removes the "reverse" character. \x0f gets rid of another bold character.

+1

AutoDl-irssi had a very good one written in Perl; here it is in Python:

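    import re

    def stripMircColorCodes(line):
        # Strip "foreground,background" color codes first, then foreground-only ones.
        line = re.sub(r"\x03\d\d?,\d\d?", "", line)
        line = re.sub(r"\x03\d\d?", "", line)
        # Remove every remaining control character (bold, underline, lone ^C, ...).
        line = re.sub(r"[\x01-\x1F]", "", line)
        return line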

+1

I even had to add '\x0f', whatever it is used for.

    regex = re.compile("\x0f|\x1f|\x02|\x03(?:\d{1,2}(?:,\d{1,2})?)?", re.UNICODE)
    regex.sub('', msg)
0

I know I posted asking for a regex solution because it could be cleaner, but I have created a non-regex solution that works fine.

    def _skip_digits(data, pos, limit=2):
        # Return the index just past up to `limit` ASCII digits starting at pos.
        stop = min(pos + limit, len(data))
        while pos < stop and data[pos] in '0123456789':
            pos += 1
        return pos

    def colourstrip(data):
        find = data.find('\x03')
        while find > -1:
            # Skip the ^C itself and up to two foreground digits.
            end = _skip_digits(data, find + 1)
            # A comma counts as part of the code only after a foreground
            # number and only when at least one background digit follows it.
            if end > find + 1 and data[end:end+1] == ',':
                bg_end = _skip_digits(data, end + 1)
                if bg_end > end + 1:
                    end = bg_end
            data = data[:find] + data[end:]
            find = data.find('\x03', find)
        # Remove the remaining formatting characters.
        for char in ('\x1d', '\x1f', '\x16', '\x0f'):
            data = data.replace(char, '')
        return data

    datastring = '\x0312,4This is coolour \x032,4This is too\x03'
    print(colourstrip(datastring))

Thanks for the help, everyone.

0
