There are several possible sizes:
1. The size in memory returned by sys.getsizeof(), for example:

>>> import sys
>>> sys.getsizeof(b'a')
38
>>> sys.getsizeof(u'Α')
56
i.e., a bytestring such as b'a', containing a single byte, may require 38 bytes in memory.
You should not worry about this size unless your machine is running out of memory.
2. The number of bytes in the text encoded as UTF-8:

>>> unicode_text = u'Α'
>>> len(unicode_text.encode('utf-8'))
2
3. The number of Unicode code points in the text:

>>> unicode_text = u'Α'
>>> len(unicode_text)
1
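Note that on a narrow Python 2 build, code points outside the Basic Multilingual Plane are stored as surrogate pairs, so len() counts two per such character:

>>> len(u'\U0001d11e')  # MUSICAL SYMBOL G CLEF, on a narrow build
2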
In general, you may also be interested in the number of grapheme clusters ("visual symbols") in the text:

>>> unicode_text = u'\u0435\u0308'  # 'е' plus a combining diaeresis: one symbol, two code points
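A minimal sketch of counting them, assuming the third-party regex module (which supports the \X grapheme-cluster pattern) is installed:

>>> import regex  # pip install regex
>>> len(unicode_text)                        # two code points
2
>>> len(regex.findall(r'\X', unicode_text))  # but one grapheme cluster
1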
If the API limit is defined in terms of point 2 (the number of bytes in the UTF-8-encoded bytestring), then you can use the answers to the question Truncating unicode so it fits a maximum size when encoded for wire transfer, where @Martijn Pieters' first answer should work:
truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
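To see why the 'ignore' error handler is needed: the byte slice can cut a multi-byte character in half, and 'ignore' drops the incomplete tail. An illustrative two-character string, sliced to 3 of its 4 UTF-8 bytes:

>>> u'\u0391\u0392'.encode('utf-8')[:3]
'\xce\x91\xce'
>>> u'\u0391\u0392'.encode('utf-8')[:3].decode('utf-8', 'ignore')
u'\u0391'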
It is also possible that the limit is on the length of the URL:
>>> import urllib
>>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
'%D0%B5%CC%88'
To truncate it:
import re
import urllib

urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
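Note that a bare slice can also cut a %XX escape in half. A sketch of one way to clean up the tail (the regex, the UTF-8 round-trip, and the function name are illustrative assumptions):

import re
import urllib

def truncate_urlencoded(unicode_text, limit):
    # percent-encode and slice to the byte limit (illustrative helper)
    urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:limit]
    # drop a dangling incomplete escape such as '%' or '%D'
    urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded)
    # drop a trailing partial multi-byte character via a UTF-8 round-trip
    raw = urllib.unquote(urlencoded)
    return urllib.quote(raw.decode('utf-8', 'ignore').encode('utf-8'))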
The problem with the URL length can be worked around with the 'X-HTTP-Method-Override' HTTP header, which lets you send the request as a POST (avoiding the URL length limit) while the service treats it as a GET, if the service supports it. Here is an example of code that uses the Google Translate API.
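A minimal sketch of such a request with urllib2 (the endpoint and the parameter name are placeholders, not the real Google Translate API):

import urllib
import urllib2

params = urllib.urlencode({'q': unicode_text.encode('utf-8')})
request = urllib2.Request('https://example.com/api', data=params)  # passing data makes it a POST
request.add_header('X-HTTP-Method-Override', 'GET')  # ask the server to treat it as a GET
response = urllib2.urlopen(request)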
If it is acceptable in your case, you can shorten the HTML text by unescaping HTML character references and applying the NFC Unicode normalization form to combine some code points:
import unicodedata
from HTMLParser import HTMLParser

unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
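For example, with an illustrative input: the character reference &#776; unescapes to a combining diaeresis, and NFC then folds u'o' plus the combining mark into the single code point u'\xf6':

>>> import unicodedata
>>> from HTMLParser import HTMLParser
>>> unicodedata.normalize('NFC', HTMLParser().unescape(u'o&#776;'))
u'\xf6'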
jfs