Decode a file stream using UTF-8

I have a very large XML document (about 120 MB) and I don't want to load it into memory all at once. My goal is to check whether the file is valid UTF-8.

Any ideas for a quick check that does not read the entire file into memory as a byte[]?

I am using VSTS 2008 and C#.

An exception is thrown when I use XmlDocument to load an XML document that contains invalid byte sequences, but no exception is thrown when I read the entire contents into a byte array and then decode it as UTF-8. Any ideas?

Here is a screenshot showing the contents of my XML file, or you can download a copy of the file from here


EDIT 1:

    class Program
    {
        public static byte[] RawReadingTest(string fileName)
        {
            byte[] buff = null;
            try
            {
                FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
                BinaryReader br = new BinaryReader(fs);
                long numBytes = new FileInfo(fileName).Length;
                buff = br.ReadBytes((int)numBytes);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            return buff;
        }

        static void XMLTest()
        {
            try
            {
                XmlDocument xDoc = new XmlDocument();
                xDoc.Load("c:\\abc.xml");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        static void Main()
        {
            try
            {
                XMLTest();
                Encoding ae = Encoding.GetEncoding("utf-8");
                string filename = "c:\\abc.xml";
                ae.GetString(RawReadingTest(filename));
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            return;
        }
    }

EDIT 2: With new UTF8Encoding(true, true) an exception is thrown, but with new UTF8Encoding(false, true) no exception is thrown. This confuses me: the second parameter should be the one that controls whether an exception is thrown on invalid byte sequences, so why does the first parameter matter?

    public static void TestTextReader2()
    {
        try
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader("c:\\a.xml", new UTF8Encoding(true, true)))
            {
                int bufferSize = 10 * 1024 * 1024; // could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                int actualsize = sr.Read(buffer, 0, bufferSize);
                while (actualsize > 0)
                {
                    actualsize = sr.Read(buffer, 0, bufferSize);
                }
            }
        }
        catch (Exception e)
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
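To make the roles of the two constructor parameters concrete, here is a minimal sketch that decodes a byte array directly, with no StreamReader involved. The class name `Utf8ParameterDemo` and the sample bytes are made up for illustration. The first parameter (`encoderShouldEmitUTF8Identifier`) only decides whether `GetPreamble()` returns a byte-order mark; the second (`throwOnInvalidBytes`) decides whether invalid input throws.

```csharp
using System;
using System.Text;

public static class Utf8ParameterDemo
{
    // Hypothetical sample data: 0xFF can never appear in well-formed UTF-8.
    public static readonly byte[] Invalid = { 0x41, 0xFF, 0x42 };

    // Returns true if decoding the bytes with the given encoding throws.
    public static bool Throws(Encoding encoding, byte[] bytes)
    {
        try
        {
            encoding.GetString(bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

Calling `Throws` with `new UTF8Encoding(true, true)` and with `new UTF8Encoding(false, true)` should give the same answer for the same bytes, while `Encoding.UTF8` substitutes U+FFFD instead of throwing. If the two behave differently through StreamReader, one possible explanation (not verified against the file in the question) is that StreamReader's byte-order-mark detection replaces the supplied encoding when the file starts with a BOM; the constructor overload that takes `detectEncodingFromByteOrderMarks` can be set to false to rule that out.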
+6
c# validation encoding utf-8
3 answers
    var buffer = new char[32768];
    using (var stream = new StreamReader(pathToFile, new UTF8Encoding(true, true)))
    {
        while (true)
            try
            {
                if (stream.Read(buffer, 0, buffer.Length) == 0)
                    return GoodUTF8File;
            }
            catch (ArgumentException)
            {
                return BadUTF8File;
            }
    }
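A variation on the same idea, as a sketch only: decode the raw bytes chunk by chunk with a `Decoder`, which avoids StreamReader entirely and keeps memory use at one buffer. The class name `Utf8Validator`, the method name `IsValidUtf8`, and the 64 KB buffer size are assumptions made for this example, not part of the answer above.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8Validator
{
    // Returns true if every byte in the stream decodes as well-formed UTF-8.
    public static bool IsValidUtf8(Stream input)
    {
        // throwOnInvalidBytes: true makes the decoder raise DecoderFallbackException.
        Encoding strict = new UTF8Encoding(false, true);
        Decoder decoder = strict.GetDecoder();
        byte[] bytes = new byte[64 * 1024];
        char[] chars = new char[strict.GetMaxCharCount(bytes.Length)];
        try
        {
            int read;
            while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
            {
                // flush: false keeps a trailing partial sequence for the next chunk.
                decoder.GetChars(bytes, 0, read, chars, 0, false);
            }
            // flush: true reports a sequence that is cut off at the end of the input.
            decoder.GetChars(bytes, 0, 0, chars, 0, true);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

For a file, this could be called as `Utf8Validator.IsValidUtf8(File.OpenRead(path))` inside a using block. A leading byte-order mark is itself valid UTF-8, so it does not cause a failure here.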
+5

@George2 I think they mean a solution similar to the following (which I have not tested yet).

Handling the transition between buffers (i.e. caching extra bytes or partial characters between reads) is the responsibility of StreamReader and an internal detail of its implementation.

    using System;
    using System.IO;
    using System.Text;

    class Test
    {
        public static void Main()
        {
            try
            {
                // Create an instance of StreamReader to read from a file.
                // The using statement also closes the StreamReader.
                // UTF8Encoding(true, true) throws on invalid bytes;
                // Encoding.UTF8 would silently replace them with U+FFFD.
                using (StreamReader sr = new StreamReader("TestFile.txt", new UTF8Encoding(true, true)))
                {
                    const int bufferSize = 1000; // could be anything
                    char[] buffer = new char[bufferSize];
                    // Read from the file until the end of the file is reached.
                    // Read(buffer, index, count) returns 0 only at the end of the stream.
                    while (sr.Read(buffer, 0, bufferSize) > 0)
                    {
                        // successfully decoded another buffer's worth of data
                    }
                }
            }
            catch (Exception e)
            {
                // Let the user know what went wrong.
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
            }
        }
    }
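The point about partial characters between reads can be seen in isolation with a `Decoder`, which keeps leftover bytes from one call to the next. This is a small illustrative sketch; the class name `SplitCharacterDemo` is made up, and the sample character "é" (U+00E9, encoded as the bytes C3 A9) is chosen only because it takes two bytes.

```csharp
using System.Text;

public static class SplitCharacterDemo
{
    // Feeds the two bytes of U+00E9 to one decoder in separate calls.
    // Returns { chars from first call, chars from second call, decoded code unit }.
    public static int[] DecodeInTwoChunks()
    {
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();
        char[] chars = new char[2];
        // The lone lead byte is buffered inside the decoder; nothing is emitted yet.
        int first = decoder.GetChars(new byte[] { 0xC3 }, 0, 1, chars, 0);
        // The continuation byte completes the character.
        int second = decoder.GetChars(new byte[] { 0xA9 }, 0, 1, chars, 0);
        return new[] { first, second, (int)chars[0] };
    }
}
```

The first call yields no characters and the second yields one, so a character that straddles a buffer boundary is neither lost nor reported as invalid.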
+3

Wouldn't that work?

    StreamReader reader = new StreamReader(file);
    // You get the default encoding
    Console.WriteLine(reader.CurrentEncoding.ToString());
    reader.Read();
    // You get the right encoding.
    Console.WriteLine(reader.CurrentEncoding.ToString());
    reader.Close();

If not, can someone explain why?
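As a hedged illustration of what `CurrentEncoding` does and does not tell you: with StreamReader's default settings it reflects byte-order-mark detection, and invalid UTF-8 bytes are replaced rather than reported. The class name `CurrentEncodingDemo` and the in-memory sample bytes are assumptions made for this sketch.

```csharp
using System.IO;
using System.Text;

public static class CurrentEncodingDemo
{
    // Reads everything with StreamReader's defaults and reports the encoding it settled on.
    public static string ReadAll(byte[] bytes, out string encodingName)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes)))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            encodingName = reader.CurrentEncoding.WebName;
            return text;
        }
    }
}
```

Passing the bytes 41 FF 42 returns "A", U+FFFD, "B" with the encoding still reported as utf-8 and no exception, so this approach detects a BOM-declared encoding but does not validate the content.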

0