C#: cycling through encodings

I read files in different formats and languages, and I'm currently using a small encoding library to try to detect the correct encoding (http://www.codeproject.com/KB/recipes/DetectEncoding.aspx).

It works pretty well, but it still guesses wrong some of the time (multilingual files).

Most of my potential users have very little understanding of encodings (the best I can hope for is "it has something to do with characters") and are unlikely to be able to pick the correct encoding from a list, so I would like to let them cycle through different encodings until the right one is found, simply by clicking a button.

Display problems? Click here to try a different encoding! (Well, that's the concept anyway.)

What would be the best way to implement something like this?


Edit: Looks like I didn't express myself clearly enough. By "cycling through encodings" I don't mean "how do I loop over encodings?"

What I had in mind was "how do I let the user try different encodings in sequence without reloading the file?"

The idea is more like this: let's say the file was loaded with the wrong encoding and some strange characters are displayed. The user clicks a "Next Encoding" or "Previous Encoding" button, and the string is converted to a different encoding. The user just keeps clicking until the correct encoding is found (whichever encoding looks right to the user will do fine). As long as the user can click "Next", he has a reasonable chance of solving his problem.

What I have found so far involves converting the string to bytes using the current encoding, then converting those bytes to the next encoding, converting those bytes to chars, and then converting the chars to a string... Doable, but I wonder whether there is an easier way to do it.

For example, if there were a method that would read a string and return it using a different encoding, something like "render(string, encoding)".


Thanks for the answers!

+6
c# utf-8 character-encoding
6 answers

Read the file as bytes and use the Encoding.GetString method.

    // Requires: using System; using System.Text;
    byte[] data = System.IO.File.ReadAllBytes(path);
    Console.WriteLine(Encoding.UTF8.GetString(data));
    Console.WriteLine(Encoding.UTF7.GetString(data));
    Console.WriteLine(Encoding.ASCII.GetString(data));

That way, you only have to load the file once. Every encoding can be applied to the original bytes of the file. The user can select the correct one, and you can use the result of Encoding.GetEncoding(...).GetString(data) for further processing.
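To tie this back to the "Next Encoding" / "Previous Encoding" buttons from the question, here is a minimal sketch of a helper that keeps the bytes and decodes them again on every click. The class name EncodingCycler and its members are hypothetical (they do not come from the framework or from this answer), and the list of candidate encodings is whatever the application chooses to offer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: holds the raw file bytes and re-decodes them on demand,
// so the file never has to be read from disk a second time.
public sealed class EncodingCycler
{
    private readonly byte[] data;
    private readonly IReadOnlyList<Encoding> candidates;
    private int index;

    public EncodingCycler(byte[] data, IReadOnlyList<Encoding> candidates)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (candidates == null || candidates.Count == 0)
            throw new ArgumentException("At least one encoding is required.", nameof(candidates));
        this.data = data;
        this.candidates = candidates;
    }

    // The encoding currently selected.
    public Encoding Current => candidates[index];

    // The original bytes decoded with the current encoding.
    public string Text => Current.GetString(data);

    // Moves to the next encoding, wrapping around at the end of the list.
    public string Next()
    {
        index = (index + 1) % candidates.Count;
        return Text;
    }

    // Moves to the previous encoding, wrapping around at the start of the list.
    public string Previous()
    {
        index = (index - 1 + candidates.Count) % candidates.Count;
        return Text;
    }
}
```

A button handler would then only need something like view.Text = cycler.Next(); since decoding always starts from the untouched bytes, a wrong guess never damages anything.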

+14

(removed the original answer following the question update)

For example, if there were a method that would read a string and return it using a different encoding, something like "render(string, encoding)".

I don't think you can reuse the string data. If the encoding was wrong, the string should be considered corrupt: it can very easily contain gibberish in among the plausible-looking characters. In particular, many encodings tolerate the presence or absence of a BOM/preamble, but would you re-encode with it? Without it?

If you are happy to take the risk (I wouldn't be), you could simply re-encode your local string with the last encoding:
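As a small illustration of why the string can't be trusted, the sketch below decodes some bytes with a given encoding, encodes the resulting string again with the same encoding, and checks whether the original bytes came back. The class and method names are hypothetical and exist only for this example.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Decodes `original` with `guess`, re-encodes the resulting string with the
    // same encoding, and reports whether the bytes survived unchanged.
    public static bool SurvivesRoundTrip(byte[] original, Encoding guess)
    {
        string text = guess.GetString(original);
        byte[] back = guess.GetBytes(text);

        if (back.Length != original.Length) return false;
        for (int i = 0; i < back.Length; i++)
        {
            if (back[i] != original[i]) return false;
        }
        return true;
    }
}
```

For the two bytes 0xC3 0xA9 (the UTF-8 form of "é"), this returns false with Encoding.ASCII, because both bytes are replaced by '?' during decoding, but true with ISO-8859-1, which maps every byte to exactly one character. So whether the string is recoverable depends on which wrong encoding happened to be used, and that is not something to rely on.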

    // I DON'T RECOMMEND THIS!!!!
    byte[] preamble = lastEncoding.GetPreamble(),
           content = lastEncoding.GetBytes(text);
    byte[] raw = new byte[preamble.Length + content.Length];
    Buffer.BlockCopy(preamble, 0, raw, 0, preamble.Length);
    Buffer.BlockCopy(content, 0, raw, preamble.Length, content.Length);
    text = nextEncoding.GetString(raw);

In fact, I believe the best thing you can do is to keep the original byte[] and keep offering different renderings (via different encodings) until the user likes one. Something like:

    using System;
    using System.IO;
    using System.Text;
    using System.Windows.Forms;

    class MyForm : Form
    {
        [STAThread]
        static void Main()
        {
            Application.EnableVisualStyles();
            Application.Run(new MyForm());
        }

        ComboBox encodings;
        TextBox view;
        Button load, next;
        byte[] data = null; // raw file contents, never modified after loading

        // Decode the original bytes with the currently selected encoding.
        void ShowData()
        {
            if (data != null && encodings.SelectedIndex >= 0)
            {
                try
                {
                    Encoding enc = Encoding.GetEncoding(
                        (string)encodings.SelectedValue);
                    view.Text = enc.GetString(data);
                }
                catch (Exception ex)
                {
                    view.Text = ex.ToString();
                }
            }
        }

        public MyForm()
        {
            load = new Button();
            load.Text = "Open...";
            load.Dock = DockStyle.Bottom;
            Controls.Add(load);

            next = new Button();
            next.Text = "Next...";
            next.Dock = DockStyle.Bottom;
            Controls.Add(next);

            view = new TextBox();
            view.ReadOnly = true;
            view.Dock = DockStyle.Fill;
            view.Multiline = true;
            Controls.Add(view);

            encodings = new ComboBox();
            encodings.Dock = DockStyle.Bottom;
            encodings.DropDownStyle = ComboBoxStyle.DropDown;
            encodings.DataSource = Encoding.GetEncodings();
            encodings.DisplayMember = "DisplayName";
            encodings.ValueMember = "Name";
            Controls.Add(encodings);

            next.Click += delegate
            {
                // Wrap around instead of running past the last item,
                // which would throw ArgumentOutOfRangeException.
                if (encodings.Items.Count > 0)
                {
                    encodings.SelectedIndex =
                        (encodings.SelectedIndex + 1) % encodings.Items.Count;
                }
            };
            encodings.SelectedValueChanged += delegate { ShowData(); };
            load.Click += delegate
            {
                using (OpenFileDialog dlg = new OpenFileDialog())
                {
                    if (dlg.ShowDialog(this) == DialogResult.OK)
                    {
                        data = File.ReadAllBytes(dlg.FileName);
                        Text = dlg.FileName;
                        ShowData();
                    }
                }
            };
        }
    }

+4

Could you let the user enter a few words (with "special" characters) that should appear in the file?

You could then try all the encodings yourself and see whether those words turn up.
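A minimal sketch of that idea, assuming the raw bytes of the file are still available: decode them with each candidate encoding and keep the encodings whose output contains every expected word. The names KnownWordSearch and Matching are hypothetical, and the candidate list is passed in by the caller (it could come from Encoding.GetEncodings(), for instance).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class KnownWordSearch
{
    // Returns the candidate encodings under which the decoded text
    // contains all of the words the user expects to see in the file.
    public static List<Encoding> Matching(
        byte[] data, IEnumerable<Encoding> candidates, params string[] expectedWords)
    {
        var matches = new List<Encoding>();
        foreach (Encoding candidate in candidates)
        {
            string text = candidate.GetString(data);
            bool containsAll = true;
            foreach (string word in expectedWords)
            {
                // Ordinal comparison: the exact characters must be present.
                if (text.IndexOf(word, StringComparison.Ordinal) < 0)
                {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) matches.Add(candidate);
        }
        return matches;
    }
}
```

Several encodings may match when the words only contain characters they share, so the result is best treated as a shortlist for the user rather than a final answer.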

0

Beware of the infamous Notepad bug. It is going to bite you whatever you try, though... You can find some good discussions about encodings and their challenges on MSDN (and other places).

0

You have to keep the original data as a byte array or MemoryStream, which you can then decode with the new encoding. Once you have converted your data to a string, you can't reliably get back to the original representation.

0

How about something like this:

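    // Requires: using System.IO; using System.Text;

    // Holds the raw file contents so they can be decoded again without re-reading the file.
    private MemoryStream stream = null;

    public string LoadFile(string path)
    {
        stream = GetMemoryStream(path);
        return TryEncoding(Encoding.UTF8);
    }

    public string TryEncoding(Encoding e)
    {
        // Rewind so that every attempt decodes from the first byte.
        stream.Seek(0, SeekOrigin.Begin);
        // The reader is deliberately not disposed: disposing it would close the stream.
        StreamReader reader = new StreamReader(stream, e);
        return reader.ReadToEnd();
    }

    private MemoryStream GetMemoryStream(string path)
    {
        byte[] buffer = System.IO.File.ReadAllBytes(path);
        return new MemoryStream(buffer);
    }

Use LoadFile for the first attempt; then use TryEncoding for the subsequent ones.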

0
