Character encoding of Microsoft Word DOC and DOCX files?

I am not very familiar with the encoding used by Microsoft Word. If someone wants to save a .doc or .docx file from Word, what is the standard encoding that is used?

I assume this is not UTF-8, since the resulting text (pasted into a text file encoded as UTF-8) does not show certain punctuation marks correctly (e.g. quotation marks).

For example, an opening smart quote from Word, when pasted into a UTF-8 text file, shows up as the character ì. If Word really does use UTF-8, how does it manage to display the actual character correctly?

Edit

After a little digging, I see that a Microsoft Word .docx file is actually a compressed (zip) archive. Unzipping it yields several XML files.

However, the failure of the UTF-8 encoded text file to show these smart quotes correctly is still perplexing. Any info would be helpful.

ms-word utf-8 character-encoding
1 answer

These days, a .docx file is really a zip archive containing a collection of XML files. One of these files is document.xml, which starts with the following line (i.e. the XML prolog):

 <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 

As you can see, this is UTF-8 encoding.
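If you want to check the prolog yourself without unzipping by hand, something along these lines should work. This is only a sketch: `build_sample_docx` and `declared_encoding` are names I made up, and the archive built here is a tiny hand-made stand-in rather than a file saved by Word. In a real .docx the main part lives at `word/document.xml`.

```python
import io
import re
import zipfile
from typing import Optional


def build_sample_docx() -> bytes:
    """Build a tiny zip that mimics the layout of a .docx (stand-in content, not a valid Word file)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        "<document><body>placeholder</body></document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml.encode("utf-8"))
    return buf.getvalue()


def declared_encoding(docx_bytes: bytes, member: str = "word/document.xml") -> Optional[str]:
    """Return the encoding named in the XML prolog of one archive member, if any."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        head = zf.read(member)[:200]  # the prolog sits at the very start of the part
    match = re.search(rb'encoding="([^"]+)"', head)
    return match.group(1).decode("ascii") if match else None


print(declared_encoding(build_sample_docx()))  # UTF-8
```

For a real document you would pass the bytes of the .docx file (read in binary mode) instead of the sample archive.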

EDIT

UTF-8 can encode the full Unicode character set. Just for the sake of completeness: this does not mean that every Unicode character can actually be used in an XML file. Even a CDATA section has its limitations. But, having said all this, storing quotation marks is not a problem.

And more importantly, the file format has nothing to do with the copy-and-paste behavior of the application itself.

However, this is how Word stores quotation mark characters:

[Image: XML and hex]
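On the copy-and-paste point: wrong characters usually mean that bytes written in one encoding were read back in another. The sketch below shows that mechanism. Which encodings were actually involved in the question is my assumption, since the question does not say which editor or platform was used; `misdecode` is a made-up helper. One pairing that happens to produce ì is an opening curly quote written as Windows-1252 and read as Mac Roman.

```python
def misdecode(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one codec and decode the same bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


opening_quote = "\u201c"  # left double quotation mark

# Assumed scenario: clipboard bytes in Windows-1252, editor reading them as Mac Roman.
print(misdecode(opening_quote, "cp1252", "mac_roman"))

# Another common mismatch: UTF-8 bytes read as Windows-1252.
print(misdecode(opening_quote, "utf-8", "cp1252"))
```

Neither case says anything about how Word stores the character inside the .docx; both only concern what happens to the bytes after they leave Word.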

CORRECTION

It’s a little confusing, but I just realized that by “smart quotes” you are probably referring to the mechanism Word uses to produce curly quotes. In my previous answer, I thought you meant plain quotation marks, and that is something completely different. Sorry for the confusion.

Well, anyway, here are the Unicode code points for these smart quotes:

[Image: the Unicode smart quotes]

Let's put them in a plain text file encoded in UTF-8. The result is nothing surprising:

  • U+2018 is encoded in UTF-8 as E2 80 98
  • U+2019 is encoded in UTF-8 as E2 80 99
  • U+201C is encoded in UTF-8 as E2 80 9C
  • U+201D is encoded in UTF-8 as E2 80 9D
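The byte sequences in the list above are easy to reproduce. A minimal sketch (`utf8_hex` is just a helper name I picked):

```python
SMART_QUOTES = ["\u2018", "\u2019", "\u201c", "\u201d"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


for ch in SMART_QUOTES:
    print(f"U+{ord(ch):04X} is encoded in UTF-8 as {utf8_hex(ch)}")
```

Each of these characters takes three bytes in UTF-8, so a tool that reads the file one byte per character will show three wrong characters instead of one quote.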

So, I went one step further and put them in a Word file. I entered one line with normal quotes, and another with smart quotes.

 "this is a test"
 “this is another test”

Then I saved the document and looked at how it was stored in Word's XML structure. And indeed, the quotes are stored exactly as expected.
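The same check can be scripted. The sketch below does not use a file saved by Word: it builds a small stand-in archive with the two test lines in `word/document.xml`, then reads the part back and looks at both the parsed text and the raw bytes. `make_docx` and `read_paragraphs` are hypothetical helper names, and the XML is far more minimal than what Word really writes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace of the main WordprocessingML part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_docx(lines):
    """Build a stand-in archive with one paragraph per line (not a complete Word file)."""
    paragraphs = "".join(
        f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{paragraphs}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml.encode("utf-8"))
    return buf.getvalue()


def read_paragraphs(docx_bytes):
    """Return (list of paragraph texts, raw bytes of word/document.xml)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        raw = zf.read("word/document.xml")
    root = ET.fromstring(raw)  # the parser honours the encoding named in the prolog
    texts = [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in root.iter(f"{{{W_NS}}}p")
    ]
    return texts, raw


lines = ['"this is a test"', "\u201cthis is another test\u201d"]
texts, raw = read_paragraphs(make_docx(lines))
print(texts)
print(b"\xe2\x80\x9c" in raw)  # is the opening curly quote stored as E2 80 9C?
```

To inspect a real document, read the .docx in binary mode and pass its bytes to `read_paragraphs`.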

