Your file is stored in UTF-16 (Unicode), little-endian judging by the bytes. The first character in your file is "L", which is code point U+004C (0x4C). The first 4 bytes of your file are FF FE 4C 00: the byte order mark (BOM) FF FE, followed by the letter L encoded in UTF-16 as the two bytes 4C 00.
fgets is not Unicode-aware, so it looks for the newline character '\n' as the single byte 0x0A. That byte will most likely be the first byte of a UTF-16 newline (the two bytes 0A 00), but it can also occur in many characters other than the newline, such as U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) or anything in the Gurmukhi or Gujarati scripts (U+0A00 to U+0AFF).
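To make the point about stray 0x0A bytes concrete, here is a minimal sketch. The helper name first_newline_byte is made up for this illustration; it only mimics where a byte-oriented reader like fgets would end the line, and it assumes the input is UTF-16 little-endian as in your file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a library function): returns the index of the
   first 0x0A byte in buf, or len if there is none. A byte-oriented reader
   such as fgets ends the line right after that byte, whether or not it
   belongs to a real UTF-16 newline. */
size_t first_newline_byte(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    const unsigned char *p = memchr(buf, 0x0A, len);
    return p ? (size_t)(p - buf) : len;
}
```

For the UTF-16LE bytes 4C 00 0A 00 ("L" plus a real newline) this returns 2. For 4C 00 0A 01 ("L" plus U+010A) it also returns 2, and for 4C 00 85 0A ("L" plus the Gujarati letter U+0A85) it returns 3, even though neither of those inputs contains a newline.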
In any case, the data that ends up in the buffer contains many embedded zero bytes and looks something like FF FE 4C 00 47 00 4F 00 4F 00 0A 00. NUL (0x00) is the C string terminator, so when you try to print the buffer with printf, it stops at the first NUL byte, and all you see is \377\376L. \377\376 is the octal representation of the bytes FF FE.
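You can check this without touching a file. The sketch below hard-codes the example bytes from the paragraph above (the names utf16_line and visible_length are mine, purely for illustration) and measures how much of the buffer a C string function actually sees.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The example bytes from above: BOM, then "LGOO" and a newline in UTF-16LE.
   The final 0x00 is where fgets writes its string terminator. */
static const char utf16_line[] = {
    (char)0xFF, (char)0xFE,   /* byte order mark */
    0x4C, 0x00,               /* L */
    0x47, 0x00,               /* G */
    0x4F, 0x00,               /* O */
    0x4F, 0x00,               /* O */
    0x0A, 0x00                /* newline byte, then terminator */
};

/* Hypothetical helper: how many bytes a C string function sees before the
   first NUL. printf("%s", buf) stops at the same position. */
size_t visible_length(const char *buf)
{
    assert(buf != NULL);
    return strlen(buf);
}
```

visible_length(utf16_line) is 3 even though the array holds 12 bytes: the two BOM bytes and 'L' come before the first 0x00, which matches the \377\376L you are seeing.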
The fix for this is to convert your text file to a byte-oriented encoding such as ISO 8859-1 or UTF-8. Note that single-byte encodings such as ISO 8859-1 cannot encode the full range of Unicode characters, while UTF-8 can, so if you need Unicode, I highly recommend using UTF-8. Alternatively, you can convert your program to work with wide characters, but then you can no longer use many standard library functions (for example, fgets and printf), and you need to use wchar_t everywhere instead of char.
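If you would rather do the conversion inside the program than convert the file beforehand, the following is a rough sketch of a UTF-16LE to UTF-8 converter. utf16le_to_utf8 is a hypothetical helper written for this answer, not a standard library function. It assumes little-endian input that has been read as raw bytes (for example with fread rather than fgets), and it replaces unpaired surrogates with U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts UTF-16LE bytes to a NUL-terminated UTF-8
   string. A leading BOM (FF FE) is skipped and a trailing odd byte is
   ignored. Returns the UTF-8 length without the terminator, or (size_t)-1
   if out is too small. */
size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    size_t i = 0, o = 0;

    /* Skip the little-endian byte order mark if present. */
    if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        i = 2;

    while (i + 1 < in_len) {
        /* Assemble one 16-bit code unit, low byte first. */
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: needs a low surrogate right after it. */
            uint32_t lo = 0;
            if (i + 1 < in_len)
                lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD; /* unpaired high surrogate */
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;     /* stray low surrogate */
        }

        /* Number of UTF-8 bytes this code point needs. */
        size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + n + 1 > out_cap)
            return (size_t)-1; /* no room for the bytes plus terminator */

        if (n == 1) {
            out[o++] = (char)cp;
        } else if (n == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (n == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }

    if (o >= out_cap)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

Once the data is UTF-8 it contains no embedded zero bytes, so printf and the other char-based functions behave as expected again. Converting the file once, up front, is still the simpler route if you control how the file is produced.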
Adam Rosenfield