Decoding UTF problems?

I am working on an Android project and I have an exotic problem that is driving me crazy. I am trying to convert a String to UTF-16 or UTF-8. I use this piece of code to do it, but it gives me an array with some negative members!

Java code:

    String Tag = "سیر";
    String Value = "";
    try {
        byte[] bytes = Tag.getBytes("UTF-16");
        for (int i = 0; i < bytes.length; i++) {
            Value = Value + String.valueOf(bytes[i]) + ",";
        }
    } catch (UnsupportedEncodingException ex) {
        Log.d("Er>encoding-Problem", ex.toString());
    }

Array elements: [-1, -2, 51, 6, -52, 6, 49, 6]. I checked the UTF-16 table; it does not contain any negative numbers. I also used a website that converts words to UTF-16, and it gave me the hex values "0633 06CC 0631". If you convert these numbers to decimal, you get "1587 1740 1585". As you can see, there is no negative number! So my first question is: what are these negative numbers?!
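For reference, the same code points can also be printed directly from Java (a quick sketch; charAt returns a UTF-16 code unit):

    String tag = "سیر";
    for (int i = 0; i < tag.length(); i++) {
        // print each UTF-16 code unit as four hex digits
        System.out.printf("%04X ", (int) tag.charAt(i)); // prints: 0633 06CC 0631
    }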

Why do I want to convert a word to UTF-8 or UTF-16?

I am working on a project that has two parts. The first part is an Android application that sends keywords to the server; the words are typed by customers, and my clients use Persian (فارسی) characters. The second part is a web application written in C# that should answer my clients.

Problem: When I send these words to the server, it receives the string "????" instead of the correct word. I tried many ways to solve this problem, but none of them worked. After that, I decided to send the UTF-16 or UTF-8 representation of the string to the server and convert it back to the correct word there. That is why I chose the approach described at the top of my post.

Is my original code reliable?

Yes, it is. If I use English characters, it works very well.

What is my source code?

Java code that sends the parameter to the server:

    protected String doInBackground(String... Urls) {
        String Data = "";
        HttpURLConnection urlConnection = null;
        try {
            URL myUrl = new URL("http://10.0.2.2:80/Urgence/SearchResault.aspx?Tag=" + Tag);
            urlConnection = (HttpURLConnection) myUrl.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(urlConnection.getInputStream()));
            String temp;
            // Data is used to store the server response
            while ((temp = in.readLine()) != null) {
                Data = Data + temp;
            }
        } catch (Exception ex) {
            Log.d("Er>request-Problem", ex.toString());
        }
        return Data;
    }

C# code that answers the clients:

    string Tag = Request.QueryString["Tag"].ToString();
    SqlConnection con = new SqlConnection(
        WebConfigurationManager.ConnectionStrings["conStr"].ToString());
    SqlCommand cmd = new SqlCommand("FetchResaultByTag");
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@NewsTag", Tag);
    cmd.Connection = con;
    SqlDataReader DR;
    string Txt = "";
    try
    {
        con.Open();
        DR = cmd.ExecuteReader();
        while (DR.Read())
        {
            Txt = Txt + DR.GetString(0) + "-" + DR.GetString(1) + "-"
                + DR.GetString(2) + "-" + DR.GetString(3) + "/";
        }
        //Response.Write(Txt);
        con.Close();
    }
    catch (Exception ex)
    {
        con.Close();
        Response.Write(ex.ToString());
    }

What do you think? Do you have any ideas?

2 answers

I solved it. First I changed my Java code: I URL-encoded my string as UTF-8 using the URLEncoder class.

New Java code:

    try {
        Tag = URLEncoder.encode(Tag, "UTF-8");
    } catch (Exception ex) {
        Log.d("Er>encoding-Problem", ex.toString());
    }
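For illustration, this is what the encoder produces for the word from the question (a small sketch; the percent escapes are the UTF-8 bytes of each character):

    // URLEncoder.encode(String, String) throws the checked UnsupportedEncodingException
    String encoded = URLEncoder.encode("سیر", "UTF-8");
    System.out.println(encoded); // prints: %D8%B3%DB%8C%D8%B1 (ASCII-safe for a URL)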

After that, I sent it as a string in the request over HTTP:

    protected String doInBackground(String... Urls) {
        String Data = "";
        HttpURLConnection urlConnection = null;
        try {
            URL myUrl = new URL("http://10.0.2.2:80/Urgence/SearchResault.aspx?Tag=" + Tag);
            urlConnection = (HttpURLConnection) myUrl.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(urlConnection.getInputStream()));
            String temp;
            // Data is used to store the server response
            while ((temp = in.readLine()) != null) {
                Data = Data + temp;
            }
        } catch (Exception ex) {
            Log.d("Er>request-Problem", ex.toString());
        }
        return Data;
    }

Finally, I received it on the server and decoded it.

New C# code:

    string Tag = Request.QueryString["Tag"].ToString();
    SqlConnection con = new SqlConnection(
        WebConfigurationManager.ConnectionStrings["conStr"].ToString());
    SqlCommand cmd = new SqlCommand("FetchResaultByTag");
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@NewsTag", HttpUtility.UrlDecode(Tag));
    cmd.Connection = con;
    SqlDataReader DR;
    string Txt = "";
    try
    {
        con.Open();
        DR = cmd.ExecuteReader();
        while (DR.Read())
        {
            Txt = Txt + DR.GetString(0) + "-" + DR.GetString(1) + "-"
                + DR.GetString(2) + "-" + DR.GetString(3) + "/";
        }
        Response.Write(Txt);
        con.Close();
    }
    catch (Exception ex)
    {
        con.Close();
        Response.Write(ex.ToString());
    }

My first question is: what are these negative numbers?!

They are the signed byte representations of the individual bytes in each 16-bit value of your text. In Java, the byte type is a signed value, just like int or long, but with only 8 bits of information; it can represent values from -128 to 127. The numbers are only "negative" when you interpret them as Java byte values.
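To make that concrete, here is a minimal sketch of the same 8-bit pattern read both ways:

    byte b = (byte) 0xFF;         // the bit pattern 1111_1111
    System.out.println(b);        // prints -1  (signed interpretation)
    System.out.println(b & 0xFF); // prints 255 (unsigned interpretation, via masking)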

Of course, for the bytes of UTF-16 encoded text, this interpretation is meaningless. You should only interpret them as UTF-16 encoded text. The negative numbers are just the natural result of misinterpreting UTF-16 text as if it were a plain array of signed bytes.

It is as if you had written int i = -1; uint j = (uint)i; (in C#; Java does not have unsigned integer types per se) and then asked why j is not negative but instead has the value 4,294,967,295. Well, j is an unsigned data type; the bit pattern that means -1 as a signed int means 4,294,967,295 as an unsigned uint.
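Since Java has no uint, the closest equivalent sketch masks the same 32-bit pattern into a wider long to see the unsigned value:

    int i = -1;               // bit pattern: all 32 bits set
    long j = i & 0xFFFFFFFFL; // zero-extend the same bits into a long
    System.out.println(j);    // prints 4294967295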

If none of that makes sense to you, you will need to do some reading on your own to learn how computers store numbers and what the difference between signed and unsigned data types is.


The output array of your code, [-1, -2, 51, 6, -52, 6, 49, 6], is actually four 16-bit values, taken two bytes at a time: 0xFEFF, 0x0633, 0x06CC, and 0x0631. Each of these 16-bit values represents a Unicode code point.

The first is the byte order mark (BOM) for UTF-16 encoded text. It is a Unicode character used specifically to indicate whether UTF-16 encoded bytes are little-endian or big-endian. The other three are the characters of your actual string.
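You can watch the BOM do its job with a short round-trip sketch; decoding with "UTF-16" consumes the BOM and picks the right byte order:

    // both calls throw the checked UnsupportedEncodingException
    byte[] bytes = "سیر".getBytes("UTF-16");      // the encoder prepends a BOM
    String decoded = new String(bytes, "UTF-16"); // the decoder reads the BOM, then the text
    System.out.println(decoded.equals("سیر"));    // prints true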

But when you pull the bytes out and look at them individually as signed byte values, any byte whose unsigned value is greater than 0x7F shows up as a negative number. Thus 0xFF, 0xFE, and 0xCC (each greater than 0x7F) are displayed as negative numbers. In reality, each of them is still just one half of an actual 16-bit value.
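As an illustration, a sketch using the exact array from the question masks each signed byte back to its unsigned value and recombines the pairs into 16-bit values:

    byte[] bytes = { -1, -2, 51, 6, -52, 6, 49, 6 }; // the array from the question
    for (int i = 0; i < bytes.length; i += 2) {
        int lo = bytes[i] & 0xFF;     // low byte first: this data is little-endian
        int hi = bytes[i + 1] & 0xFF;
        System.out.printf("0x%04X%n", (hi << 8) | lo); // 0xFEFF, 0x0633, 0x06CC, 0x0631
    }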

Note that even the code point values themselves can appear negative if interpreted incorrectly. In your example only one would look negative: 0xFEFF is -257 if interpreted as a 16-bit signed value, even though the Unicode code point is actually decimal 65279. But there are many other Unicode characters with values higher than 0x7FFF (decimal 32767), and they too would display as negative if treated as signed 16-bit values.
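The same reinterpretation at 16 bits, as a sketch:

    short s = (short) 0xFEFF;       // the BOM's bit pattern
    System.out.println(s);          // prints -257  (signed 16-bit interpretation)
    System.out.println(s & 0xFFFF); // prints 65279 (the actual code point value)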

The bottom line is that computers do not know anything about numbers. They just have bits, conveniently grouped into bytes and various word sizes. When you want to know what those bits mean, you have to tell the computer the correct, useful representation to use when displaying them. If you do not, you get an interpretation of the bits that does not match their intended representation. Garbage in, garbage out.


Now, assuming that you understand all of the above, consider your broader question:

When I send these words to the server, it receives the string "????" instead of the correct word. I tried many ways to solve this problem, but none of them worked.

The first question to ask yourself is: "How do I interpret these bytes? How do I show them to the user?" You did not share any code that is actually relevant in this respect, but you did say that when you use only the Latin alphabet ("English characters") it works fine. Assuming you tested the Latin script with UTF-16, that tells me the basic I/O is working correctly; the main thing you could get wrong is the byte order, and if that were wrong, even the Latin characters would be garbled.

So the most likely explanation for the "????" output you describe is that you are simply not viewing the text in a context that can display Persian characters. For example, if you write the text to a console window using the Console class, the font used by the console window may not support the required glyphs, so it simply cannot display the Persian characters. Similar problems come up in other contexts, including Notepad (depending on which font is used) and other editors.


Sorry; all of the above is just a long way of telling you: "everything is probably in order, you are most likely just not using the right tool to check your results."

Please note that without a good, minimal, complete code example that reliably reproduces whatever problem you perceive, it is not actually possible to say exactly what is going on. If, after reading and understanding this answer, you still think something is wrong with your code, you will need to spend the time to create a good code example that clearly demonstrates the actual problem. A line of code is worth a thousand words, and a correct code example is worth its weight in gold (to mix a couple of entirely inapplicable metaphors :)).

