Manual conversion of ASCII and .NET characters

I am writing code to clean up user input on my ASP.NET site. I need to strip out all occurrences of the ASCII characters 145, 146, 147, and 148, which are sometimes entered by my Mac users when they copy and paste content written in the word processor on their Macs.

My problem is with the following three lines, which I assume should all output the same text.

    string test1 = Convert.ToChar(147).ToString();
    string test2 = String.Format("'{0}'", Convert.ToChar(147));
    char[] characters = System.Text.Encoding.ASCII.GetChars(new byte[] { 147 });
    string test3 = new string(characters);

However, when I set the ASP TextBox equal to the following

 txtShowValues.Text = test1 + "*" + test2 + "*" + test3; 

I get an empty value for test1, test2 works correctly, and test3 produces a "?".

Can someone explain what is happening differently in each case? I hope this will help me understand how .NET handles ASCII values for characters above 128 so that I can write a good cleanup script.

EDIT
The values I mentioned (145 - 148) are curly quotes: single left, single right, double left, double right.

By “working correctly” I mean that it displays a curly quote in my browser.

SECOND EDIT
The following code (mentioned in an answer) also displays curly quotes. So perhaps the problem was the use of ASCII in test3.

    char[] characters2 = System.Text.Encoding.Default.GetChars(new byte[] { 147 });
    string test4 = new string(characters2);

THIRD EDIT
I found a Mac that I could borrow, and was able to reproduce the problem. When I copy and paste text containing quotes from Word into my web application on the Mac, it pastes curly quotes (147 and 148). When I click save, the curly quotes are stored in the database, so I will use the code you helped me with to clean up that content.

FOURTH EDIT
I spent some time writing more sample code based on the answers here and noticed that the problem has something to do with multi-line text boxes in ASP.NET. There was good information here, so I decided to just ask a new question: ASP.NET A multi-line text box allowing input over UTF-8

+6
character-encoding ascii
3 answers

Character 147 is U+0093 SET TRANSMIT STATE. Like all Unicode characters in the range 0-255, it matches the ISO-8859-1 character of the same number. ISO-8859-1 assigns 147 to this invisible control code.

What you are thinking of is not "ASCII", or even "ISO-8859-1", but Windows code page 1252. This is a non-standard encoding that is similar to 8859-1 but assigns the characters 128-159 to various typographic extensions such as smart quotes, instead of the largely useless control codes. In code page 1252, character 147 is “, aka U+201C LEFT DOUBLE QUOTATION MARK.

If you want to convert bytes in a Windows code page (often misleadingly known as "ANSI") to Unicode characters, you need to specify the code page you want, for example:

    System.Text.Encoding.GetEncoding(1252).GetChars(new byte[] { 147 })

System.Text.Encoding.Default will give you the default encoding of your server. For a server in the Western European locale, this will be 1252. Elsewhere it will not be. As a rule, it is not a good idea for a server application to depend on the system default code page.
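To make the three readings of byte 147 concrete, here is a minimal sketch that decodes the same byte as ASCII, ISO-8859-1, and code page 1252. The class and method names (Byte147Demo, Decode147) are made up for illustration. It assumes modern .NET (Core / 5+), where code page 1252 has to be enabled through CodePagesEncodingProvider; on the classic .NET Framework that line should not be needed.

```csharp
using System;
using System.Text;

public static class Byte147Demo
{
    // Hypothetical helper: decodes the single byte 147 with the given
    // encoding and returns the resulting code point as "U+XXXX".
    public static string Decode147(Encoding encoding)
    {
        string s = encoding.GetString(new byte[] { 147 });
        return "U+" + ((int)s[0]).ToString("X4");
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where Windows code
        // pages are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // ASCII has no character 147, so it decodes to "?".
        Console.WriteLine(Decode147(Encoding.ASCII));                      // U+003F
        // ISO-8859-1 maps 147 to the invisible control code.
        Console.WriteLine(Decode147(Encoding.GetEncoding("iso-8859-1")));  // U+0093
        // Code page 1252 maps 147 to the left curly double quote.
        Console.WriteLine(Decode147(Encoding.GetEncoding(1252)));          // U+201C
    }
}
```

Naming the code page explicitly, rather than relying on Encoding.Default, keeps the result the same regardless of the server's locale.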

In any case, you should not be receiving bytes like 147 representing “ in the input to a web application. That will only happen if your page is itself encoded in code page 1252 (and, just to confuse and mislead even more, when you say your page is in ISO-8859-1, browsers will silently use code page 1252 instead). Your page may also be in 1252 if you did not specify any encoding for it (the browser guessed; browsers in other locales will guess different code pages, and the whole thing becomes a big mess).

Make sure you use UTF-8 for all encodings in your web application, and mark your pages as such. Today, all web applications should be using UTF-8.
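As a quick illustration of why a UTF-8 page never sends a lone byte 147 for a curly quote, the sketch below prints the UTF-8 bytes of the two double curly quotes. The class and helper names (Utf8QuoteDemo, ToHex) are hypothetical; only standard framework calls are used.

```csharp
using System;
using System.Text;

public static class Utf8QuoteDemo
{
    // Hypothetical helper: encodes the text and returns the bytes as
    // hyphen-separated hex.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // U+201C LEFT DOUBLE QUOTATION MARK is three bytes in UTF-8.
        Console.WriteLine(ToHex("\u201C", Encoding.UTF8)); // E2-80-9C
        // U+201D RIGHT DOUBLE QUOTATION MARK likewise.
        Console.WriteLine(ToHex("\u201D", Encoding.UTF8)); // E2-80-9D
    }
}
```

Once the request is decoded as UTF-8, the server sees the real characters U+201C and U+201D, and any cleanup can work on characters instead of code page bytes.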

+10

.NET uses Unicode (strings are UTF-16), which is the same as ASCII only for values below 128.

ASCII does not define values ​​above 127.

I think you might be thinking of ANSI, which defines values above 127 as (mostly) the letters needed for most European languages, or of OEM (the original IBM PC character set), which defines characters above 127 as (mostly) symbol characters.

The way characters above 127 are interpreted is called a code page or an encoding (hence System.Text.Encoding). So you can probably get test3 to work if you use a different encoding, perhaps System.Text.Encoding.Default.

Edit: Well, now that we know the encoding you want is ANSI, it is clear what is happening.

The rule for character conversion is that a character that cannot be encoded is replaced with some other character, usually a box. But ASCII has no box character, so it uses ? instead. This explains test3.

test1 and test2 use Convert.ToChar with an integer constant, which interprets the input as a Unicode character, not an ANSI character, so no conversion is applied. Unicode character 147 is a non-printable character.
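Both effects can be checked with a few lines. This is only a sketch with made-up names (ConvertToCharDemo, AsciiRoundTrip): it shows that Convert.ToChar(147) yields the control character U+0093, and that pushing a curly quote through ASCII replaces it with a question mark.

```csharp
using System;
using System.Text;

public static class ConvertToCharDemo
{
    // Hypothetical helper: encodes to ASCII and decodes again, so anything
    // ASCII cannot represent comes back as '?'.
    public static string AsciiRoundTrip(string text)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
    }

    public static void Main()
    {
        // Convert.ToChar treats 147 as a Unicode code point, no code page involved.
        char c = Convert.ToChar(147);
        Console.WriteLine((int)c);             // 147, i.e. U+0093
        Console.WriteLine(char.IsControl(c));  // True: an invisible control code

        // The curly quotes have no ASCII equivalent and are replaced.
        Console.WriteLine(AsciiRoundTrip("\u201Chi\u201D")); // ?hi?
    }
}
```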

+3

I get question marks for all three of them in a console application (.NET 3.5SP1). As far as I know, they should all be equivalent. John Knoller is right about ASCII versus ANSI.

Have you tried calling GetBytes() on one of the Encoding classes for the source string, then iterating over the result and dropping the values you don't want (by copying the "good" bytes to another buffer)?

E.g. (using LINQ):

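    // Requires: using System.Linq;
    // Note: Encoding.ASCII turns every character above 127 into '?' (63), so the
    // filter below only matches with an 8-bit encoding such as code page 1252.
    byte[] original = System.Text.Encoding.ASCII.GetBytes(badString);
    byte[] clean = (from b in original where b < 145 || b > 148 select b).ToArray();
    string cleanString = System.Text.Encoding.ASCII.GetString(clean);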

ASCII is probably the wrong encoding to use here, to be honest; if the source text is Unicode, it could well do bad things (if you are getting UTF-16, for example).
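An alternative that avoids byte encodings altogether is to clean the string at the character level. The following is a rough sketch, not a definitive implementation: QuoteCleaner and NormalizeQuotes are hypothetical names, and replacing curly quotes with straight ones (instead of deleting them) is an assumption about what the cleanup should do. It also catches U+0091 to U+0094, which is what bytes 145-148 turn into if they were decoded as ISO-8859-1.

```csharp
using System;
using System.Text;

public static class QuoteCleaner
{
    // Hypothetical helper: replaces curly quotes with their straight
    // ASCII equivalents and leaves every other character untouched.
    public static string NormalizeQuotes(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            switch (c)
            {
                case '\u2018': case '\u2019': // real curly single quotes
                case '\u0091': case '\u0092': // bytes 145/146 misread as ISO-8859-1
                    sb.Append('\'');
                    break;
                case '\u201C': case '\u201D': // real curly double quotes
                case '\u0093': case '\u0094': // bytes 147/148 misread as ISO-8859-1
                    sb.Append('"');
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(NormalizeQuotes("\u201CHi\u201D, it\u2019s me")); // "Hi", it's me
    }
}
```

Because this works on characters, it behaves the same whether the text arrived as UTF-8 or as code page 1252, as long as it was decoded with the correct encoding first.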

0