I have a very large XML document (about 120 MB) and I don't want to load it into memory all at once. My goal is to check whether this file is valid UTF-8.
Any ideas for a quick check that does not read the entire file into memory as a byte[]?
I am using VSTS 2008 and C#.
When I use XmlDocument to load an XML document that contains invalid byte sequences, an exception is thrown, but when I read the entire contents into a byte array and then decode it as UTF-8, there is no exception. Any ideas?
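One possible way to do the check asked about above without holding the whole file in memory is to feed the file to a strict UTF-8 decoder in fixed-size chunks. This is a minimal sketch, not a tested solution for the file in question: the class name `Utf8Check`, the method names, and the 64 KB buffer size are hypothetical choices made for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: returns true if the stream contains only valid UTF-8.
    public static bool IsValidUtf8(Stream input)
    {
        // Second argument (throwOnInvalidBytes) = true, so the decoder raises
        // DecoderFallbackException instead of substituting U+FFFD.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        Decoder decoder = strict.GetDecoder();

        byte[] bytes = new byte[64 * 1024]; // arbitrary chunk size
        char[] chars = new char[strict.GetMaxCharCount(bytes.Length)];

        try
        {
            int read;
            while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
            {
                // flush = false: a multi-byte sequence split across two chunks
                // is kept in the decoder and completed by the next call.
                decoder.GetChars(bytes, 0, read, chars, 0, false);
            }

            // flush = true: a sequence cut off at the end of the data is reported.
            decoder.GetChars(bytes, 0, 0, chars, 0, true);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Convenience wrapper for a file path.
    public static bool IsValidUtf8File(string path)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            return IsValidUtf8(fs);
        }
    }
}
```

A call such as `Utf8Check.IsValidUtf8File("c:\\abc.xml")` would then use a constant amount of memory regardless of file size. The decoded characters are discarded; only the absence of an exception matters.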
Here is a screenshot showing the contents of my XML file, or you can download a copy of the file from here

EDIT 1:
    using System;
    using System.IO;
    using System.Text;
    using System.Xml;

    class Program
    {
        // Reads the whole file into a byte array.
        public static byte[] RawReadingTest(string fileName)
        {
            byte[] buff = null;
            try
            {
                using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
                using (BinaryReader br = new BinaryReader(fs))
                {
                    long numBytes = new FileInfo(fileName).Length;
                    buff = br.ReadBytes((int)numBytes);
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            return buff;
        }

        // Loading through XmlDocument throws on the invalid byte sequence.
        static void XMLTest()
        {
            try
            {
                XmlDocument xDoc = new XmlDocument();
                xDoc.Load("c:\\abc.xml");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        static void Main()
        {
            try
            {
                XMLTest();
                Encoding ae = Encoding.GetEncoding("utf-8");
                string filename = "c:\\abc.xml";
                // No exception is thrown here, even for the same file.
                ae.GetString(RawReadingTest(filename));
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
EDIT 2: When I use new UTF8Encoding(true, true) an exception is thrown, but when I use new UTF8Encoding(false, true) no exception is thrown. I am confused, because it should be the second parameter that controls whether an exception is thrown (when there are invalid byte sequences). Why does the first parameter matter?
    public static void TestTextReader2()
    {
        try
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader(
                "c:\\a.xml",
                new UTF8Encoding(true, true)))
            {
                int bufferSize = 10 * 1024 * 1024; // could be anything
                char[] buffer = new char[bufferSize];

                // Read from the file until the end of the file is reached.
                int actualsize = sr.Read(buffer, 0, bufferSize);
                while (actualsize > 0)
                {
                    actualsize = sr.Read(buffer, 0, bufferSize);
                }
            }
        }
        catch (Exception e)
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
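To narrow down the question in EDIT 2, it may help to look at the two constructor arguments in isolation, without StreamReader in between. The sketch below only shows what the encoding objects themselves do: the first argument decides whether `GetPreamble()` returns a byte-order mark, and the second decides whether decoding invalid bytes throws. The class and method names are hypothetical, and the two-byte sample `0xC3 0x28` is just one example of an invalid sequence. Whether StreamReader's own byte-order-mark handling explains the difference seen with the file is an assumption that this sketch does not test.

```csharp
using System;
using System.Text;

public static class Utf8CtorDemo
{
    // Length of the byte-order mark the encoding advertises via GetPreamble().
    public static int PreambleLength(bool emitIdentifier)
    {
        return new UTF8Encoding(emitIdentifier, true).GetPreamble().Length;
    }

    // True if decoding the bytes directly raises DecoderFallbackException.
    public static bool ThrowsOnDecode(bool emitIdentifier, byte[] data)
    {
        try
        {
            new UTF8Encoding(emitIdentifier, true).GetString(data);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        byte[] invalid = { 0xC3, 0x28 }; // lead byte followed by a non-continuation byte

        Console.WriteLine(PreambleLength(true));           // 3
        Console.WriteLine(PreambleLength(false));          // 0
        Console.WriteLine(ThrowsOnDecode(true, invalid));  // True
        Console.WriteLine(ThrowsOnDecode(false, invalid)); // True
    }
}
```

Used directly, both encodings reject the invalid bytes, so any difference observed through StreamReader would have to come from how the reader treats the preamble and the start of the file, not from the decoder's error handling.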
c# validation encoding utf-8
George2