Reading a UTF-8 character from a byte stream

Given a stream of bytes (generator, file, etc.), how can I read a single utf-8 character?

  • This operation should consume the bytes of this character from the stream.
  • This operation should not consume stream bytes that exceed the first character.
  • This operation should succeed with respect to any Unicode character.

I could write my own UTF-8 decoding function, but I would prefer not to reinvent the wheel: I am sure this functionality already exists somewhere, since it is needed for parsing UTF-8 strings.

1 answer

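Wrap the stream in TextIOWrapper with encoding='utf-8', then call .read(1) on it. Note that TextIOWrapper reads ahead in chunks by default, so the underlying stream may be consumed past the first character unless the chunk size is reduced, as in the example.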

This assumes you start with a BufferedIOBase or a duck-type-compatible object (that is, one with a read() method). If you have a generator or an iterator, you may need to adapt the interface.
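One possible way to adapt an iterator is a small io.RawIOBase subclass. The following is only a minimal sketch, assuming the iterator yields bytes objects; the class name IterStream is hypothetical and not part of the standard library.

```python
import io


class IterStream(io.RawIOBase):
    """Hypothetical adapter: expose an iterable of bytes chunks as a raw stream."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Pull chunks from the iterator until there is data, skipping empty ones.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of stream
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# Usage: the euro sign is split across two chunks on purpose.
chunks = (c for c in [b"\xe2\x82", b"\xac", b"A"])
text = io.TextIOWrapper(io.BufferedReader(IterStream(chunks)), encoding="utf-8")
print(text.read(1))  # €
```

Wrapping the adapter in io.BufferedReader gives TextIOWrapper the buffered interface it expects. Keep in mind that both layers may pull more chunks from the iterator than the first character strictly needs.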

Example:

    from io import TextIOWrapper

    with open('/path/to/file', 'rb') as f:
        wf = TextIOWrapper(f, 'utf-8')
        wf._CHUNK_SIZE = 1  # Implementation detail, may not work everywhere
        wf.read(1)  # gives next utf-8 encoded character
        f.read(1)   # gives next byte
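If relying on the private _CHUNK_SIZE attribute is undesirable, an alternative (not part of the answer above) is to feed the stream to an incremental decoder from the codecs module one byte at a time. This is a sketch under the assumption that the stream has a read(n) method returning bytes; the helper name read_utf8_char is hypothetical.

```python
import codecs
import io


def read_utf8_char(stream):
    """Read one UTF-8 encoded code point from a binary stream.

    Returns '' at end of stream. Raises UnicodeDecodeError if the input
    is malformed or ends in the middle of a character.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of stream: final=True raises if a character is incomplete.
            return decoder.decode(b"", final=True)
        char = decoder.decode(byte)
        if char:
            return char


# Usage: only the bytes of the first character are consumed.
stream = io.BytesIO("€A".encode("utf-8"))
print(read_utf8_char(stream))  # €
print(stream.read(1))          # b'A'
```

Because the decoder sees a single byte per call, the stream is never read past the end of a valid character. Here "character" means one Unicode code point, so a base letter followed by a combining mark is returned in two calls.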