I am having trouble with .isupper() when I have a UTF-8 encoded string. I have a lot of text files that I am converting to XML. Although the text is highly variable, the format is static: words in all caps should be wrapped in <title> tags and everything else in <p> tags. It is considerably more complex than that, but this should be sufficient for my question.
My problem is that this is a UTF-8 file. That is a must, since there will be quite a few non-English characters in the final output. Time for a quick example:
inputText.txt
RÉSUMÉ
Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.
desiredOut.xml
<title>RÉSUMÉ</title> <p>Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip. </p>
Code example
    #!/usr/local/bin/python2.7
    # yes, this is an alt-install of python

    import codecs
    import sys
    import re
    from xml.dom.minidom import Document

    def main():
        fn = sys.argv[1]
        input = codecs.open(fn, 'r', 'utf-8')
        output = codecs.open('desiredOut.xml', 'w', 'utf-8')
        doc = Document()
        doc = parseInput(input, doc)
        print>>output, doc.toprettyxml(indent=' ', encoding='UTF-8')

    def parseInput(input, doc):
        # split each non-blank line on word boundaries (blank lines are removed)
        tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n']
        for i in range(len(tokens)):
            # THIS IS MY PROBLEM. .isupper() is never true.
            if str(tokens[i]).isupper():
                title = doc.createElement('title')
                tText = str(tokens[i]).strip('[\']')
                titleText = doc.createTextNode(tText.title())
                doc.appendChild(title)
                title.appendChild(titleText)
            else:
                p = doc.createElement('p')
                pText = str(tokens[i]).strip('[\']')
                paraText = doc.createTextNode(pText)
                doc.appendChild(p)
                p.appendChild(paraText)
        return doc

    if __name__ == '__main__':
        main()
Ultimately, it's pretty straightforward. I would welcome critiques or suggestions on my code. Who wouldn't? In particular, I'm unhappy with str(tokens[i]); maybe there is a better way to loop through a list of strings?
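For reference, here is a minimal sketch of what looping without str(tokens[i]) might look like. Note that tokens[i] is a list (the result of re.split), so str() on it yields the list's printed representation rather than the line's text. The sketch works on each decoded line directly instead. The function name build_doc and the root element name 'document' are assumptions made for illustration (minidom accepts only one top-level element per Document), and the title text is kept as-is rather than passed through .title(), to match the desired output above.

```python
from xml.dom.minidom import Document

def build_doc(lines):
    """Wrap all-caps lines in <title> and every other non-blank line in <p>."""
    doc = Document()
    # Hypothetical root element: a minidom Document allows only one top-level element.
    root = doc.createElement('document')
    doc.appendChild(root)
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        # isupper() is called on the line's own text, not on str() of a list
        tag = 'title' if text.isupper() else 'p'
        node = doc.createElement(tag)
        node.appendChild(doc.createTextNode(text))
        root.appendChild(node)
    return doc

# Already-decoded sample lines, standing in for what codecs.open(..., 'utf-8') yields
sample = [u"R\u00c9SUM\u00c9\n", u"\n", u"Bacon ipsum dolor sit amet.\n"]
doc = build_doc(sample)
xml_text = doc.documentElement.toxml()
print(xml_text)
```

Iterating over the lines themselves also removes the need for range(len(...)) and for stripping brackets and quotes back off the text afterwards.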
But the purpose of this question is to figure out the most efficient way to check if the utf-8 string is uppercase. Perhaps I should study creating a regex for this.
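As a rough sketch of the check itself, assuming the text has already been decoded from UTF-8: .isupper() on a decoded string follows Unicode case rules, so accented capitals count as uppercase. The second function spells out the same idea with Unicode categories from the standard unicodedata module, as one possible alternative to a regex. Both function names are hypothetical.

```python
import unicodedata

def is_all_caps(text):
    """True if text has at least one cased character and none are lowercase."""
    return text.isupper()

def is_all_caps_by_category(text):
    """True if text contains letters and every letter is in category 'Lu'."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith('L')]
    return bool(letters) and all(unicodedata.category(ch) == 'Lu' for ch in letters)

raw = u"R\u00c9SUM\u00c9".encode('utf-8')  # the bytes as they would sit in the file
decoded = raw.decode('utf-8')              # decode first, then test

print(is_all_caps(decoded))                # True
print(is_all_caps_by_category(decoded))    # True
print(is_all_caps(u"R\u00e9sum\u00e9"))    # False
```

The two functions are not identical in every case: the category version ignores digits and punctuation entirely and rejects any letter that is not strictly uppercase, so which one fits depends on how lines such as "CHAPTER 2" should be treated.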
Mind you, I have not run this code, so it may not work as written. I pulled the pieces out of working code and may have mistyped something. Let me know and I will fix it. Finally, note that I am not using lxml.