In Python, what is the most efficient way to cut a UTF-8 string for REST delivery?

Question

I will start by saying that I understand what UTF-8 encoding is: it is essentially, but not quite, Unicode, and ASCII is a smaller character set. I also understand that if I have:

    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
    print len(se_body)            #will return the number of characters in the string, in my case '1500'
    print sys.getsizeof(se_body)  #will return the number of bytes, which will be 3050

My code uses a RESTful API that I do not control. The API's job is to scan the text passed in a parameter for Bible references, and it has an interesting quirk: it accepts only 2000 characters at a time. If I send more than 2000 characters, my API call returns a 404. Again, to emphasize, I am using someone else's API, so please do not tell me to "fix the server side". I can't :)
My solution is to split the string into chunks of fewer than 2000 characters, let the service scan each chunk, and then reassemble and tag the results as needed. I would like to be kind to the service and send as few chunks as possible, which means that each chunk should be large.
My problem arises when I pass a string containing Hebrew or Greek characters. (Yes, biblical answers often use Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always pass it safely, but that seems really small. In most cases, I should be able to make the chunks larger.
My question is this: without resorting to too many heroics, what is the most efficient way to chunk UTF-8 text to the right size?

Here is the code:

    # -*- coding: utf-8 -*-
    import requests
    import json

    biblia_apikey = '************'
    refparser_url = "http://api.biblia.com/v1/bible/scan/?"
    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as &quot;rest&quot; in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to &quot;cease doing&quot;. &gt; וַיִּשְׁבֹּת or by its root: &gt; שָׁבַת Here&#39;s BlueletterBible&#39;s concordance entry: [Strong&#39;s H7673][1] It is actually the same root word that is conjugated to mean &quot;[to go on strike][2]&quot; in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God &quot;rested&quot; in the sense of relieving exhaustion, as we would normally understand the term in English. The word &quot;rest&quot; in that sense is &gt; נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah&#39;s name). More here: [Strong&#39;s H5117][3] Jesus&#39; words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a &quot;work&quot; that does not cease). The institution of the Sabbath was not merely just so the Israelites would &quot;rest&quot; from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (&quot;works&quot;) reach God&#39;s standard: &gt; Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped &quot;working&quot;, being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to &quot;stop doing&quot; and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&amp;t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&amp;t=KJV"
    se_body = se_body.decode('utf-8')

    nchunk_start=0
    nchunk_size=1500
    found_refs = []

    while nchunk_start < len(se_body):
        body_chunk = se_body[nchunk_start:nchunk_size]
        if (len(body_chunk.strip())<4):
            break;

        refparser_params = {'text': body_chunk, 'key': biblia_apikey }
        headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
        refparse = requests.get(refparser_url, params = refparser_params, headers=headers)

        if (refparse.status_code == 200):
            foundrefs = json.loads(refparse.text)
            for foundref in foundrefs['results']:
                foundref['textIndex'] += nchunk_start
                found_refs.append( foundref )
        else:
            print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
            print "  returned text is: =>{0}<=".format(refparse.text)

        nchunk_start += (nchunk_size-50) #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks

    for ref in found_refs:
        print ref
        print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]

I know how to slice a string ( body_chunk = se_body[nchunk_start:nchunk_size] ), but I'm not sure how I would go about slicing the same string according to its length in UTF-8 bytes.

When I'm done, I need to go back and mark up the references that were found (I'm going to add SPAN tags). This is what the output looks like for now:

    {u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
    Genesis 2:2
    {u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
    Genesis 8:9

python string rest unicode utf-8

Affable geek Mar 28 '14 at 16:54

1 answer

jfs · Accepted Answer · 2014-03-28T20:05:30+0000

There are several different notions of size:

The size in memory, as returned by sys.getsizeof(), for example:
```
>>> import sys
>>> sys.getsizeof(b'a')
38
>>> sys.getsizeof(u'Α')
56
```
i.e., a bytestring containing the single byte b'a' may require 38 bytes in memory.
You should not worry about this unless your local machine has a memory problem.

The number of bytes in the text encoded as utf-8:

    >>> unicode_text = u'Α' # greek letter
    >>> bytestring = unicode_text.encode('utf-8')
    >>> len(bytestring)
    2

The number of Unicode code points in the text:

    >>> unicode_text = u'Α' # greek letter
    >>> len(unicode_text)
    1

In general, you may also be interested in the number of grapheme clusters ("visual symbols") in the text:

    >>> unicode_text = u'\u0435\u0308' # cyrillic letter е followed by a combining diaeresis
    >>> len(unicode_text) # number of Unicode codepoints
    2
    >>> import regex # $ pip install regex
    >>> chars = regex.findall(u'\\X', unicode_text)
    >>> chars
    [u'\u0435\u0308']
    >>> len(chars) # number of "user-perceived characters"
    1
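This matters for pointed Hebrew, where each vowel point is its own code point attached to a base letter. If the third-party regex module is not available, a rough approximation is to group each base character with the combining marks that follow it, using only the standard library. This is a minimal sketch in Python 3 (the snippets above are Python 2). The helper name split_clusters is hypothetical, and the grouping rule is a simplification of the full grapheme cluster rules in Unicode Standard Annex #29.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper, approximation only: real grapheme cluster
    segmentation (UAX #29) handles more cases than combining marks.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Cyrillic е followed by a combining diaeresis forms a single cluster.
cyrillic = split_clusters("\u0435\u0308")

# Three Hebrew letters carrying three points in total (six code points).
hebrew = split_clusters("\u05e9\u05b8\u05c1\u05d1\u05b7\u05ea")
```

Cutting only between such clusters keeps a vowel point from being separated from its letter at a chunk boundary.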

If the API limit is defined in terms of the second of these sizes (the number of bytes in the UTF-8 encoded bytestring), then you can use the answers to the question linked by @Martijn Pieters: "Truncating unicode so it fits a maximum size when encoded for wire transfer". The first answer should work:

 truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
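The one-liner above yields only the first piece. To cover the whole text, the same idea can be repeated from where the previous piece ended. The following is a minimal sketch in Python 3 (the one-liner above is Python 2), assuming the limit really is counted in UTF-8 bytes. The name chunk_utf8 and the max_bytes parameter are hypothetical, and the sketch does not overlap chunks the way the question's code does.

```python
def chunk_utf8(text, max_bytes=2000):
    """Split text into pieces whose UTF-8 encoding is at most max_bytes long.

    Hypothetical helper: assumes the service counts UTF-8 bytes.
    """
    if max_bytes < 4:
        # A single code point can need up to 4 bytes in UTF-8.
        raise ValueError("max_bytes must be at least 4")
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Continuation bytes have the bit pattern 10xxxxxx. If the byte at
        # `end` is one, the cut falls inside a code point, so back up.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


# 1500 Greek capital alphas take 3000 bytes in UTF-8.
pieces = chunk_utf8("\u0391" * 1500, max_bytes=1999)
```

Because each piece is a str again, the running total of len(piece) gives the character offset to add to any index the service reports for that piece. A cut made this way can still separate a combining mark from its base letter or split a reference, so an overlap like the one in the question may still be wanted.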

There is also the possibility that the length is limited by the length of the URL:

    >>> import urllib
    >>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
    '%D0%B5%CC%88'

To truncate it:

    import re
    import urllib

    urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
    # remove `%` or `%X` at the end
    urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded)
    truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
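In Python 3 the quoting functions live in urllib.parse, and unquote() can decode for you. The sketch below is a Python 3 rendering of the same truncation, wrapped in a hypothetical helper named truncate_for_url. It assumes the limit applies to the percent-encoded text alone. The exact length that goes over the wire can differ slightly, since HTTP clients may encode spaces as + rather than %20.

```python
import re
from urllib.parse import quote, unquote


def truncate_for_url(text, max_len=2000):
    """Return the longest prefix whose percent-encoded form fits in max_len.

    Hypothetical helper: assumes the limit applies to the quoted text only.
    """
    urlencoded = quote(text.encode("utf-8"))[:max_len]
    # Drop a dangling `%` or `%X` left by cutting through an escape.
    urlencoded = re.sub(r"%[0-9a-fA-F]?$", "", urlencoded)
    # errors="ignore" discards a partial multi-byte sequence at the end.
    return unquote(urlencoded, encoding="utf-8", errors="ignore")


# Each "е + combining diaeresis" pair quotes to 12 characters: %D0%B5%CC%88
short = truncate_for_url("\u0435\u0308" * 10, max_len=14)
```

Here the cut at 14 characters lands inside the second pair's first escape, so only the first pair survives.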

The problem with the length of the URL can be solved using the 'X-HTTP-Method-Override' HTTP header, which lets you send the request as a POST (with the parameters in the request body rather than the URL) while the service treats it as a GET, if the service supports it. Here is an example of code that uses the Google Translate API.
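As an illustration of what such a request looks like, here is a minimal Python 3 sketch that builds, but does not send, a POST carrying the override header, using only the standard library. Whether the Biblia service honors this header is an assumption that would have to be checked against its documentation. The helper name build_override_request and the placeholder key are hypothetical, and the URL is the one from the question.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_override_request(url, params):
    """Build (without sending) a POST that asks the server to treat it as a GET.

    Hypothetical helper: only useful if the service supports the header.
    """
    # The parameters travel in the body, so the URL stays short.
    body = urlencode(params).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={
            "X-HTTP-Method-Override": "GET",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


request = build_override_request(
    "http://api.biblia.com/v1/bible/scan/",
    {"text": "Genesis 2:2", "key": "YOUR_API_KEY"},  # placeholder key
)
```

Passing the resulting object to urllib.request.urlopen() would send it. That step is left out here on purpose.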

If this is allowed in your case, you can compress the HTML text by decoding the HTML character references (such as &quot;) and using the NFC Unicode normalization form to combine some Unicode code points:

    import unicodedata
    from HTMLParser import HTMLParser

    unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
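In Python 3 the HTMLParser module was renamed, and the standard library offers html.unescape() for the same job. This is a minimal sketch of the equivalent step; the helper name compact is hypothetical. How much NFC saves depends on the text, so the gain is best measured rather than assumed.

```python
import html
import unicodedata


def compact(text):
    """Decode HTML character references, then apply NFC normalization.

    Hypothetical helper mirroring the Python 2 snippet above.
    """
    return unicodedata.normalize("NFC", html.unescape(text))


# &quot; shrinks from six characters to one, and "e" plus a combining
# acute accent composes into the single code point U+00E9.
raw = "&quot;rest&quot; e\u0301"
small = compact(raw)
```

Note that any character index the service returns then refers to the compacted text, not to the original HTML-escaped string, so offsets have to be applied to the same text that was sent.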
