I recently ran into this with the clipboard and Microsoft Excel.
With the ever-growing amount of multilingual content used in data science, there is simply no safe way to assume UTF-8 anymore (in my case, Excel produced UTF-16 because most of my data was in Traditional Chinese).
According to Microsoft Docs, the following byte order mark (BOM) signatures are used on Windows:
|----------------------|-------------|-----------------------|
| Encoding             | BOM         | Python encoding kwarg |
|----------------------|-------------|-----------------------|
| UTF-8                | EF BB BF    | 'utf-8'               |
| UTF-16 big-endian    | FE FF       | 'utf-16-be'           |
| UTF-16 little-endian | FF FE       | 'utf-16-le'           |
| UTF-32 big-endian    | 00 00 FE FF | 'utf-32-be'           |
| UTF-32 little-endian | FF FE 00 00 | 'utf-32-le'           |
|----------------------|-------------|-----------------------|
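You can see these signatures for yourself with a quick sanity check in the interpreter: Python's BOM-adding codecs prepend exactly the bytes from the table, with `'utf-16'`/`'utf-32'` using the machine's native byte order.

```python
# Quick sanity check of the table above: the BOM-adding codecs prepend the
# signature bytes; 'utf-16' and 'utf-32' use the machine's native byte order
# (little-endian on most PCs), while the explicit '-le'/'-be' codecs add no BOM.
for enc in ('utf-8-sig', 'utf-16', 'utf-32'):
    print(enc, 'A'.encode(enc)[:4])
```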
I came up with the following approach, which seems to work well for detecting the encoding from a byte order mark at the beginning of the file:
```python
def guess_encoding_from_bom(filename, default='utf-8'):
    msboms = dict((bom['sig'], bom) for bom in (
        {'name': 'UTF-8', 'sig': b'\xEF\xBB\xBF', 'encoding': 'utf-8'},
        {'name': 'UTF-16 big-endian', 'sig': b'\xFE\xFF', 'encoding': 'utf-16-be'},
        {'name': 'UTF-16 little-endian', 'sig': b'\xFF\xFE', 'encoding': 'utf-16-le'},
        {'name': 'UTF-32 big-endian', 'sig': b'\x00\x00\xFE\xFF', 'encoding': 'utf-32-be'},
        {'name': 'UTF-32 little-endian', 'sig': b'\xFF\xFE\x00\x00', 'encoding': 'utf-32-le'}))

    with open(filename, 'rb') as f:
        sig = f.read(4)

    # Try the longest signatures first so the 4-byte UTF-32 BOMs are not
    # mistaken for their 2-byte UTF-16 prefixes.
    for sl in range(4, 0, -1):
        if sig[0:sl] in msboms:
            return msboms[sig[0:sl]]['encoding']
    return default
```
I realize this means opening the file twice (once in binary mode to sniff the BOM, and once as encoded text), but the API does not really make it easy to do otherwise in this particular case.
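For completeness, a minimal usage sketch of that two-open pattern (the file name here is just a placeholder): sniff the BOM first, then reopen the file as text with the detected encoding.

```python
# Minimal usage sketch; 'clipboard_dump.tsv' is a hypothetical file
# exported from Excel.
filename = 'clipboard_dump.tsv'
encoding = guess_encoding_from_bom(filename)
with open(filename, encoding=encoding) as f:
    text = f.read()

# The explicit-endian codecs ('utf-16-le', etc.) do not strip the BOM,
# so it decodes to a leading U+FEFF character; drop it if present.
text = text.lstrip('\ufeff')
```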
Anyway, I think this is a little more reliable than just assuming UTF-8, even though it obviously falls short of full automatic encoding detection...