
"â€™" appears on the page instead of "'"

â€™ is displayed on my page instead of '.

I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:

 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 


In addition, my browser is configured for Unicode (UTF-8) :


So what is the problem, and how can I fix it?

+84
encoding utf-8 mojibake
Mar 19 '10 at 13:04
11 answers

Make sure the browser and editor use UTF-8 encoding instead of ISO-8859-1 / Windows-1252.

Or use &rsquo; .
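If you go the entity route, the replacement can be automated. Below is a minimal Python sketch, assuming you only want to rewrite non-ASCII characters and leave everything else alone; `to_entities` is a hypothetical helper name, not something from the question. It uses the standard library's `html.entities.codepoint2name` table and falls back to a numeric reference when no named entity exists.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML character reference.

    Hypothetical helper: ASCII is passed through untouched, so markup
    characters such as & and < are NOT escaped here.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain ASCII, keep as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &rsquo;
        else:
            out.append(f"&#{cp};")                 # numeric fallback
    return "".join(out)


print(to_entities("It\u2019s \u00a35"))
```

The output is pure ASCII, so it displays the same whichever single-byte or UTF-8 encoding the browser picks. It works around the mismatch rather than fixing it.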

+34
Mar 19 '10 at 13:06

So what is the problem,

This is the character ’ (RIGHT SINGLE QUOTATION MARK, U+2019), which is being decoded as CP-1252 instead of UTF-8. If you check the encodings table for this character, you will see that in UTF-8 it consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP-1252 code page layout, you will see that each of these bytes stands for a separate character: â, € and ™.
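The byte-level explanation above can be reproduced in a few lines of Python. This is only an illustrative sketch (the variable names are mine); it uses Python's built-in `utf-8` and `cp1252` codecs.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK
quote = "\u2019"

# Encoding it as UTF-8 yields three bytes.
utf8_bytes = quote.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 99

# Decoding those same bytes as CP-1252 turns each byte into its own character.
misread = utf8_bytes.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00E2', 'U+20AC', 'U+2122']
```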




and how can I fix it?

Use UTF-8 instead of CP-1252 to read, write, store and display characters.




I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:

 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 

This only tells the client which encoding to use to interpret and display the characters. It does not tell your own program which encoding to use to read, write, store and display them. The exact answer depends on the server-side platform, database and programming language used. Note that the encoding set in the HTTP response header takes precedence over the HTML meta tag. The HTML meta tag is only used when the page is opened from the local disk file system instead of over HTTP.
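To illustrate the difference between declaring an encoding and actually using it, here is a minimal WSGI sketch in Python. It is not the asker's ASP.NET setup; `app` and the HTML string are made up for the example. The point is that the `charset` in the header and the codec used to produce the body bytes are two separate decisions that have to agree.

```python
def app(environ, start_response):
    """Hypothetical WSGI app: declares UTF-8 and encodes the body to match."""
    html = "<!DOCTYPE html><title>demo</title><p>It\u2019s working</p>"
    body = html.encode("utf-8")  # the bytes that are actually sent
    headers = [
        # What the client is told; this wins over any <meta> tag in the page.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library should be enough. If the body were encoded with a different codec than the header declares, the browser would show mojibake even though the header looks right.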




In addition, my browser is configured for Unicode (UTF-8) :

This only forces the client to use that encoding to interpret and display the characters. But the actual problem is that you are already sending â€™ (encoded in UTF-8) to the client instead of ’. The client correctly displays â€™ using the UTF-8 encoding. If the client were misconfigured to use, for example, ISO-8859-1, you would most likely have seen ââ¬â¢ instead.




I am using ASP.NET 2.0 with a database.

This is most likely where your problem is. You need to check with an independent database tool what the data looks like.

If the ’ character is there, then you are not connecting to the database correctly. You need to tell the database connector to use UTF-8.

If your database contains â€™, then it is your database that is messed up. Most likely the tables are not configured to use UTF-8. Instead, they use the database's default encoding, which depends on the configuration. If this is your problem, changing the table to use UTF-8 is usually enough. If your database does not support that, you will need to recreate the tables. It is good practice to set the table's encoding when you create it.

You are most likely using SQL Server, but here is some MySQL code (copied from this article):

    CREATE DATABASE db_name CHARACTER SET utf8;
    CREATE TABLE tbl_name (...) CHARACTER SET utf8;

If your table is already UTF-8, however, you need to take a step back: who or what put the data there? That is where the problem lies. One example would be HTML form submissions whose values are incorrectly encoded or decoded.
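When checking the stored data with an independent tool, it helps to look at the raw bytes rather than at rendered text, because the tool's own display encoding can mislead you. The following Python sketch assumes you can get the column value as bytes (for instance from a hex dump); `classify` and its labels are hypothetical names, and the check only looks for the right single quotation mark discussed here.

```python
QUOTE_UTF8 = "\u2019".encode("utf-8")      # e2 80 99: stored correctly as UTF-8
QUOTE_CP1252 = "\u2019".encode("cp1252")   # 92: stored as single-byte CP-1252
# UTF-8 bytes misread as CP-1252 and encoded as UTF-8 again.
QUOTE_DOUBLE = QUOTE_UTF8.decode("cp1252").encode("utf-8")


def classify(raw: bytes) -> str:
    """Rough guess at how a right single quote was stored (heuristic only)."""
    if QUOTE_DOUBLE in raw:
        return "double-encoded"
    if QUOTE_UTF8 in raw:
        return "utf-8"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all, so a lone 0x92 is probably CP-1252.
        if QUOTE_CP1252 in raw:
            return "cp1252"
    return "unknown"


print(classify(bytes.fromhex("4974 e28099 73")))              # utf-8
print(classify(bytes.fromhex("4974 c3a2e282ace284a2 73")))    # double-encoded
```

A "double-encoded" result corresponds to the corrupted-database case, while "utf-8" points at the connection or at a later step.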




Here are some more links to learn more about the problem:

+153
Mar 19 '10 at 13:08

I have some documents where … was shown as â€¦ and ê was shown as Ãª. Here is how it came about (Python code):

    # Adam edits the original file using windows-1252
    # (bytes literal: HORIZONTAL ELLIPSIS, LATIN SMALL LETTER E WITH CIRCUMFLEX)
    windows = b'\x85\xea'

    # Beth reads it correctly as windows-1252 and writes it as utf-8
    utf8 = windows.decode("windows-1252").encode("utf-8")
    print(utf8)

    # Charlie reads it *incorrectly* as windows-1252 and writes a twingled utf-8 version
    twingled = utf8.decode("windows-1252").encode("utf-8")
    print(twingled)

    # Detwingle by reading as utf-8 and writing as windows-1252 (the result is really utf-8)
    detwingled = twingled.decode("utf-8").encode("windows-1252")
    assert utf8 == detwingled

To fix the problem, I used Python code as follows:

    with open("dirty.html", "rb") as f:
        dt = f.read()
    ct = dt.decode("utf8").encode("windows-1252")
    with open("clean.html", "wb") as g:
        g.write(ct)

(Because someone had inserted the twingled version into a correct UTF-8 document, I actually had to extract only the twingled part, detwingle it and insert it back. I used BeautifulSoup for this.)

It is far more likely that you have a Charlie in content creation than a misconfigured web server. You can also force your web browser to twingle the page by selecting the windows-1252 encoding for a utf-8 document. Your web browser cannot detwingle a document that Charlie saved.

Note : the same problem can occur with any other single-byte codepage (e.g. latin-1) instead of windows-1252.
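For the mixed case, where only part of an otherwise correct UTF-8 document is twingled, here is a rough sketch of one way to detwingle just the affected runs. This is not the BeautifulSoup approach mentioned above; it is a plain regex heuristic, and `detwingle_mixed` is a hypothetical name. Each run of non-ASCII characters is repaired only if it survives the round trip through windows-1252 and utf-8; otherwise it is left untouched.

```python
import re

# Runs of non-ASCII characters; each run is tested on its own.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def _detwingle_run(match: re.Match) -> str:
    run = match.group(0)
    try:
        # Same reversal as above: back to bytes via windows-1252, then read as utf-8.
        return run.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return run  # not twingled (or not repairable), keep as-is


def detwingle_mixed(text: str) -> str:
    """Heuristically repair twingled runs inside otherwise correct text."""
    return NON_ASCII_RUN.sub(_detwingle_run, text)


mixed = "caf\u00e9 \u00e2\u20ac\u00a6 na\u00c3\u00afve"  # correct é, twingled … and ï
print(detwingle_mixed(mixed))
```

Being a heuristic, it will miss a run in which correct and twingled characters sit directly next to each other, and it could in principle "repair" text that was meant literally, so review the result before overwriting anything.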

+10
Oct 24 '13 at 18:16

If your content type is already UTF8, then most likely the data is already being sent in the wrong encoding. If you are retrieving data from a database, make sure the database connection is using UTF-8.

If this is data from a file, make sure the file is correctly encoded as UTF-8. You can usually set this in the “Save As ...” dialog box of the editor of your choice.

If the data is already broken when you view it in the source file, most likely it used to be a UTF-8 file but was saved in the wrong encoding somewhere along the way.
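A quick way to check whether a file is valid UTF-8 at all is to try decoding its bytes. The Python sketch below uses a hypothetical helper name and reports where decoding first fails.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


print(first_invalid_utf8_offset("It\u2019s".encode("utf-8")))  # None
print(first_invalid_utf8_offset(b"It\x92s"))                   # 2
```

To check a real file, read it in binary mode (`open(path, "rb")`) and pass the bytes in. Note that a file which passes this check can still contain double-encoded text, since that is valid UTF-8 too; the check only catches files saved in a single-byte encoding.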

+5
Mar 19 '10 at 13:08

’ (Unicode codepoint U+2019 RIGHT SINGLE QUOTATION MARK) is encoded in UTF-8 as bytes:

0xE2 0x80 0x99.

â€™ (Unicode codepoints U+00E2 U+20AC U+2122) is encoded in UTF-8 as bytes:

0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.

These are the bytes your browser is actually receiving in order to produce â€™ when processed as UTF-8.

This means that your source data goes through two encoding conversions before being sent to the browser:

  • The original character ’ (U+2019) is first encoded as UTF-8 bytes:

    0xE2 0x80 0x99

  • Those individual bytes were then misinterpreted and decoded to Unicode codepoints U+00E2 U+20AC U+2122 by one of the Windows-125X encodings (1252, 1254, 1256 and 1258 all map 0xE2 0x80 0x99 to U+00E2 U+20AC U+2122), and those codepoints were then encoded as UTF-8 bytes:

    0xE2 -> U+00E2 -> 0xC3 0xA2
    0x80 -> U+20AC -> 0xE2 0x82 0xAC
    0x99 -> U+2122 -> 0xE2 0x84 0xA2

You need to find where the extra conversion in step 2 is being performed and remove it.
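The two conversions can be replayed in Python to confirm the byte sequences quoted in this answer. This is only a sketch with made-up variable names; `cp1252`, `cp1254`, `cp1256` and `cp1258` are Python's names for the corresponding Windows code pages.

```python
original = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Conversion 1: the character is encoded as UTF-8.
step1 = original.encode("utf-8")
print(step1.hex(" "))     # e2 80 99

# Conversion 2: the bytes are misread as Windows-1252, then encoded as UTF-8 again.
misread = step1.decode("cp1252")
received = misread.encode("utf-8")
print(received.hex(" "))  # c3 a2 e2 82 ac e2 84 a2

# Check which of the named code pages produce the same misreading.
agree = {
    codec: step1.decode(codec) == misread
    for codec in ("cp1252", "cp1254", "cp1256", "cp1258")
}
print(agree)

# Undoing conversion 2 recovers the original character.
repaired = received.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The last step only repairs data that is already damaged. The real fix is still to remove the extra conversion at its source.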

+5
Jun 19 '15 at 0:02

You have a mismatch in your character encoding: your string is encoded in one encoding (UTF-8), and whatever is interpreting this page is using another (for example, ASCII).

Always specify your encoding in your HTTP headers and make sure it matches the encoding your application is configured to use.

Example HTTP header:

 Content-Type: text/html; charset=utf-8 

Setting the encoding in ASP.NET

    <configuration>
      <system.web>
        <globalization
          fileEncoding="utf-8"
          requestEncoding="utf-8"
          responseEncoding="utf-8"
          culture="en-US"
          uiCulture="de-DE"
        />
      </system.web>
    </configuration>

Configuring encoding in jsp

+4
Mar 19 '10 at 13:09

This sometimes happens when a string is converted from Windows-1252 to UTF-8 twice.

We had this in a Zend/PHP/MySQL application, where characters like this appeared in the database, probably because the MySQL connection did not specify the correct character set. We had to:

  • Make sure that Zend and PHP communicated with the database in UTF-8 (this was not the default)

  • Repair the broken characters with several SQL queries such as this one:

        UPDATE MyTable
        SET MyField1 = CONVERT(CAST(CONVERT(MyField1 USING latin1) AS BINARY) USING utf8),
            MyField2 = CONVERT(CAST(CONVERT(MyField2 USING latin1) AS BINARY) USING utf8);

    Do this for as many tables/columns as necessary.

You can also fix some of these strings in PHP if necessary. Note that because the characters have been encoded twice, we actually need to do the reverse conversion, from UTF-8 back to Windows-1252, which confused me at first.

 mb_convert_encoding('â€™', 'Windows-1252', 'UTF-8'); // returns ’
+3
Jul 15 '16 at 9:05

If someone gets this error on a WordPress website, you need to change the DB charset in wp-config:

 define('DB_CHARSET', 'utf8mb4_unicode_ci');

instead of:

 define('DB_CHARSET', 'utf8mb4');
+1
Mar 08 '16 at 9:13

You must have copied and pasted the text from a Word document. Word uses smart quotes. You can replace the quote with the character entity (&rsquo;) or simply retype it in your HTML editor (').

I am sure this will solve your problem.

-1
Sep 04 '15 at 10:41

The same thing happened to me with the “–” character (a long minus sign).
I used this simple replacement to resolve it:

 htmlText = htmlText.Replace('–', '-'); 
-3
Oct. 14 '13 at 8:49

Instead of the pound sign, I used &pound; (written without a space). This solved the problem for me.

For the euro: &euro; (also without a space).

-4
Feb 13 '14 at 20:08


