Python check if utf-8 string is uppercase

I am having trouble with .isupper() when I have a UTF-8 encoded string. I have a lot of text files that I am converting to XML. While the text is very variable, the format is static: words in all caps should be wrapped in <title> tags and everything else in <p> tags. It is considerably more complex than this, but this should be sufficient for my question.

My problem is that this is a UTF-8 file. This is a must, as there will be many non-English characters in the final output. This may be the time for a brief example:

inputText.txt

RÉSUMÉ

Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.

Desired output

    <title>RÉSUMÉ</title>
    <p>Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip. </p>

Code example

    #!/usr/local/bin/python2.7
    # yes this is an alt-install of python

    import codecs
    import sys
    import re
    from xml.dom.minidom import Document

    def main():
        fn = sys.argv[1]
        input = codecs.open(fn, 'r', 'utf-8')
        output = codecs.open('desiredOut.xml', 'w', 'utf-8')
        doc = Document()
        doc = parseInput(input, doc)
        print>>output, doc.toprettyxml(indent=' ', encoding='UTF-8')

    def parseInput(input, doc):
        tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n']  # remove blank lines

        for i in range(len(tokens)):
            # THIS IS MY PROBLEM. .isupper() is never true.
            if str(tokens[i]).isupper():
                title = doc.createElement('title')
                tText = str(tokens[i]).strip('[\']')
                titleText = doc.createTextNode(tText.title())
                doc.appendChild(title)
                title.appendChild(titleText)
            else:
                p = doc.createElement('p')
                pText = str(tokens[i]).strip('[\']')
                paraText = doc.createTextNode(pText)
                doc.appendChild(p)
                p.appendChild(paraText)

        return doc

    if __name__ == '__main__':
        main()

Ultimately it is pretty straightforward, and I would welcome criticism or suggestions on my code. Who wouldn't? In particular, I am unhappy with str(tokens[i]); perhaps there is a better way to loop through a list of strings?

But the purpose of this question is to figure out the most efficient way to check whether a UTF-8 string is uppercase. Perhaps I should look into crafting a regex for this.

Note that I did not run this code, so it may not work as posted. I pulled the pieces out of working code and may have mistyped something. Let me know and I will fix it. Finally, note that I am not using lxml.

+7
3 answers

The primary reason your published code fails (even with only ASCII characters!) is that, in Python 2.7, re.split() will not split on a zero-width match. r'\b' matches zero characters:

    >>> re.split(r'\b', 'foo-BAR_baz')
    ['foo-BAR_baz']
    >>> re.split(r'\W+', 'foo-BAR_baz')
    ['foo', 'BAR_baz']
    >>> re.split(r'[\W_]+', 'foo-BAR_baz')
    ['foo', 'BAR', 'baz']

In addition, you need flags=re.UNICODE to ensure that the Unicode definitions of \b, \W, etc. are used. And using str() where you did is, at best, unnecessary.

So it was not really a Unicode problem as such. However, some answerers tried to address it as a Unicode problem, with varying degrees of success. Here is my take on the Unicode issue:
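As an illustration of the splitting behavior described above, here is a minimal Python 3 sketch (not the Python 2.7 of the question). In Python 3, str patterns are Unicode-aware by default, so flags=re.UNICODE is implied; the re.ASCII flag is used here only to show what ASCII-only word matching does to accented text. The helper names split_words and split_words_ascii are made up for this example.

```python
import re

def split_words(text):
    """Split on runs of non-word characters or underscores (Unicode-aware)."""
    return [piece for piece in re.split(r"[\W_]+", text) if piece]

def split_words_ascii(text):
    """Same split, but with ASCII-only definitions of \\w and \\W."""
    return [piece for piece in re.split(r"[\W_]+", text, flags=re.ASCII) if piece]

print(split_words("foo-BAR_baz"))
print(split_words("R\u00c9SUM\u00c9 du_jour"))
print(split_words_ascii("R\u00c9SUM\u00c9"))
```

With ASCII-only matching, the accented letters count as non-word characters and the word is cut apart at each of them, which is why the Unicode definitions matter for this kind of input.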

The general solution to this kind of problem is to follow the standard advice that applies to all text problems: decode your input from byte strings to Unicode strings as early as possible. Do all processing in Unicode. Encode your output from Unicode to byte strings as late as possible.

So: byte_string.decode('utf8').isupper() is the way to go. Hacks such as byte_string.decode('ascii', 'ignore').isupper() should be avoided; they can be all of (complicated, unnecessary, failure-prone). See below.

Some code:

    # coding: ascii
    import unicodedata

    tests = (
        (u'\u041c\u041e\u0421\u041a\u0412\u0410', True),  # capital of Russia, all uppercase
        (u'R\xc9SUM\xc9', True),  # RESUME with accents
        (u'R\xe9sum\xe9', False),  # Resume with accents
        (u'R\xe9SUM\xe9', False),  # ReSUMe with accents
    )

    for ucode, expected in tests:
        print
        print 'unicode', repr(ucode)
        for uc in ucode:
            print 'U+%04X %s' % (ord(uc), unicodedata.name(uc))
        u8 = ucode.encode('utf8')
        print 'utf8', repr(u8)
        actual1 = u8.decode('utf8').isupper()  # the natural way of doing it
        actual2 = u8.decode('ascii', 'ignore').isupper()  # @jathanism
        print expected, actual1, actual2

Output from Python 2.7.1:


    unicode u'\u041c\u041e\u0421\u041a\u0412\u0410'
    U+041C CYRILLIC CAPITAL LETTER EM
    U+041E CYRILLIC CAPITAL LETTER O
    U+0421 CYRILLIC CAPITAL LETTER ES
    U+041A CYRILLIC CAPITAL LETTER KA
    U+0412 CYRILLIC CAPITAL LETTER VE
    U+0410 CYRILLIC CAPITAL LETTER A
    utf8 '\xd0\x9c\xd0\x9e\xd0\xa1\xd0\x9a\xd0\x92\xd0\x90'
    True True False

    unicode u'R\xc9SUM\xc9'
    U+0052 LATIN CAPITAL LETTER R
    U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
    U+0053 LATIN CAPITAL LETTER S
    U+0055 LATIN CAPITAL LETTER U
    U+004D LATIN CAPITAL LETTER M
    U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
    utf8 'R\xc3\x89SUM\xc3\x89'
    True True True

    unicode u'R\xe9sum\xe9'
    U+0052 LATIN CAPITAL LETTER R
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    U+0073 LATIN SMALL LETTER S
    U+0075 LATIN SMALL LETTER U
    U+006D LATIN SMALL LETTER M
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8 'R\xc3\xa9sum\xc3\xa9'
    False False False

    unicode u'R\xe9SUM\xe9'
    U+0052 LATIN CAPITAL LETTER R
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    U+0053 LATIN CAPITAL LETTER S
    U+0055 LATIN CAPITAL LETTER U
    U+004D LATIN CAPITAL LETTER M
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8 'R\xc3\xa9SUM\xc3\xa9'
    False False True

The only differences with Python 3.x are syntactic; the principle (do all processing in Unicode) remains the same.
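For readers on Python 3, the same comparison might be sketched roughly as follows. This is an illustrative port, not part of the original answer; the function names is_upper_utf8 and is_upper_ascii_hack are invented here, and bytes plays the role of the Python 2 byte string.

```python
def is_upper_utf8(byte_string):
    """Decode the UTF-8 bytes to str first, then ask str.isupper()."""
    return byte_string.decode("utf-8").isupper()

def is_upper_ascii_hack(byte_string):
    """The approach to avoid: silently drops every non-ASCII byte before checking."""
    return byte_string.decode("ascii", "ignore").isupper()

tests = [
    ("\u041c\u041e\u0421\u041a\u0412\u0410", True),  # capital of Russia, all uppercase
    ("R\xc9SUM\xc9", True),    # RESUME with accents
    ("R\xe9sum\xe9", False),   # Resume with accents
    ("R\xe9SUM\xe9", False),   # ReSUMe with accents
]

results = []
for text, expected in tests:
    u8 = text.encode("utf-8")
    row = (expected, is_upper_utf8(u8), is_upper_ascii_hack(u8))
    results.append(row)
    print(ascii(text), *row)
```

The three columns correspond to the expected, actual1 and actual2 values in the Python 2 listing: decoding as UTF-8 agrees with the expected answer in every case, while the ASCII hack is wrong for the first and last strings.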

+9

As one comment above illustrates, it is not true for every character that one of the checks islower() and isupper() will be true and the other false. For example, unified Han characters are considered "letters" but are not lowercase, not uppercase, and not titlecase.

So your stated requirement, to treat uppercase and lowercase text differently, should be clarified. I assume the distinction is between uppercase letters and all other characters. Perhaps this is splitting hairs, but you are talking about non-English text here.

First, I recommend using Unicode strings (the built-in unicode()) exclusively for the string-processing parts of your code. Discipline your mind to think of "regular" strings as byte strings, because that is exactly what they are. All string literals not written u"like this" are byte strings.
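A small Python 3 sketch can make this concrete. The helper case_report is a made-up name for illustration; it only reports what unicodedata and the str methods say about a single character.

```python
import unicodedata

def case_report(ch):
    """Report the Unicode category and case-related checks for one character."""
    return {
        "category": unicodedata.category(ch),
        "alpha": ch.isalpha(),
        "upper": ch.isupper(),
        "lower": ch.islower(),
    }

han = case_report("\u4e2d")     # a unified Han character
e_acute = case_report("\xc9")   # LATIN CAPITAL LETTER E WITH ACUTE
print(han)
print(e_acute)
```

The Han character is alphabetic, yet neither isupper() nor islower() is true for it, whereas the accented capital behaves like any other uppercase letter.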
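To show why the requirement needs pinning down, here is a hedged Python 3 sketch contrasting str.isupper() with one possible stricter rule. The function all_letters_upper is hypothetical and is only one way the requirement could be read; it is not something the question specifies.

```python
def all_letters_upper(text):
    """Stricter reading: at least one letter, and every letter is uppercase.

    Unlike str.isupper(), caseless letters (such as Han characters)
    make this return False.
    """
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)

mixed = "R\xc9SUM\xc9 \u4e2d\u6587"   # uppercase Latin followed by two Han characters
print(mixed.isupper(), all_letters_upper(mixed))
print("R\xc9SUM\xc9".isupper(), all_letters_upper("R\xc9SUM\xc9"))
```

str.isupper() ignores caseless characters as long as at least one cased character is present, so the two rules disagree on the mixed string. Which behavior is wanted depends on how titles can look in the real input.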

This line of code:

 tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] 

will become:

 tokens = [re.split(u'\\b', unicode(line.strip(), 'UTF-8')) for line in input if line != '\n'] 

You would also test tokens[i].isupper() rather than str(tokens[i]).isupper(). Based on what you have posted, it seems likely that other parts of your code will also need to change to work with character strings instead of byte strings.

+2

Simple solution, I think:

 tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] #remove blank lines 

becomes

 tokens = [line.strip() for line in input if line != '\n'] 

Then, as far as I can tell, I no longer need str() or unicode():

    if tokens[i].isupper():
        # do stuff

The word "token" and the re.split on word boundaries are a holdover from when I was messing around with nltk earlier this week. But ultimately I am processing lines, not tokens/words. This may change, but for now it seems to work. I will leave this question open for now, hoping for alternative solutions and comments.
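Put together, the line-based approach might look roughly like this in Python 3. This is a sketch under assumptions, not the original working code: lines_to_xml is a made-up function name, and the <doc> root element is an assumption added because a minidom Document accepts only one top-level element.

```python
from xml.dom.minidom import Document

def lines_to_xml(lines, root_name="doc"):
    """Wrap all-uppercase lines in <title> and every other line in <p>.

    root_name is a hypothetical wrapper element, not part of the question.
    """
    doc = Document()
    root = doc.createElement(root_name)
    doc.appendChild(root)
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        tag = "title" if text.isupper() else "p"
        elem = doc.createElement(tag)
        elem.appendChild(doc.createTextNode(text))
        root.appendChild(elem)
    return doc

sample = ["R\xc9SUM\xc9\n", "\n", "Bacon ipsum dolor sit amet.\n"]
xml_text = lines_to_xml(sample).documentElement.toxml()
print(xml_text)
```

For a real file, the lines could come from open(path, encoding='utf-8'), which yields already-decoded str objects, so no str() or explicit decoding is needed before calling isupper().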

0