How to get data from a character

Question

I am working on a project in Unity that uses C#. I'm trying to get a special character such as é, but in the console it just displays an empty character. For example, translating "How are you?" should return "Cómo Estás?", but it returns "Cmo Ests". I put the returned string "Cmo Ests" into an array of characters and found that each missing letter is an empty character. I use Encoding.UTF8, and when I do this:

char ch = '\u00e9'; print (ch);

it prints "é". I tried to get the bytes from the returned string using:

 byte[] utf8bytes = System.Text.Encoding.UTF8.GetBytes(temp);

When translating "How are you?", an array of bytes is returned, but for special characters like é I get the byte sequence 239, 191, 189, which is the UTF-8 encoding of the replacement character (U+FFFD).

What information do I need from the characters to determine exactly what each character is? Do I need to do something with the data Google gives me, or is it something else? I need a general solution that I can place in my program and that will work for any input string. If anyone can help, I would be very grateful.

Here is the code referenced:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using UnityEngine;
    using System.Collections;
    using System.Net;
    using HtmlAgilityPack;

    public class Dictionary
    {
        string[] formatParams;
        HtmlDocument doc;
        string returnString;
        char[] letters;
        public char[] charString;

        public Dictionary()
        {
            formatParams = new string[2];
            doc = new HtmlDocument();
            returnString = "";
        }

        public string Translate(String input, String languagePair, Encoding encoding)
        {
            formatParams[0] = input;
            formatParams[1] = languagePair;
            string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", formatParams);
            string result = String.Empty;
            using (WebClient webClient = new WebClient())
            {
                webClient.Encoding = encoding;
                result = webClient.DownloadString(url);
            }
            doc.LoadHtml(result);
            input = alter(input);
            string temp = doc.DocumentNode.SelectSingleNode("//span[@title='" + input + "']").InnerText;
            charString = temp.ToCharArray();
            return temp;
        }

        // Use this for initialization
        void Start()
        {
        }

        string alter(string inputString)
        {
            returnString = "";
            letters = inputString.ToCharArray();
            for (int i = 0; i < inputString.Length; i++)
            {
                if (letters[i] == '\'')
                {
                    returnString = returnString + "&#39;";
                }
                else
                {
                    returnString = returnString + letters[i];
                }
            }
            return returnString;
        }
    }

c# .net-assembly utf-8 unity3d google-translate

Cameron Barge Nov 09 '12 at 15:35

5 answers

Simon mourier · Answer 1 · 2012-11-20T18:24:35+0000

Perhaps you should use a different API / URL. The function below uses a different URL that returns JSON data and seems to work better:

    // Requires: using System; using System.Collections.Generic; using System.Net;
    //           using System.Web.Script.Serialization; (System.Web.Extensions assembly)
    public static string Translate(string input, string fromLanguage, string toLanguage)
    {
        using (WebClient webClient = new WebClient())
        {
            // EscapeDataString (instead of EscapeUriString) also escapes '&', '?' and '=' in the input
            string url = string.Format(
                "http://translate.google.com/translate_a/t?client=j&text={0}&sl={1}&tl={2}",
                Uri.EscapeDataString(input), fromLanguage, toLanguage);
            string result = webClient.DownloadString(url);

            // I used JavaScriptSerializer but another JSON parser would work
            JavaScriptSerializer serializer = new JavaScriptSerializer();
            Dictionary<string, object> dic = (Dictionary<string, object>)serializer.DeserializeObject(result);
            Dictionary<string, object> sentences = (Dictionary<string, object>)((object[])dic["sentences"])[0];
            return (string)sentences["trans"];
        }
    }

If I run this in a console application:

  Console.WriteLine(Translate("How are you?", "en", "es"));

it displays

 ¿Cómo estás?

Codechops · Answer 2 · 2012-11-09T16:08:38+0000

In fact, you already have it. Just insert the encoded letter using \u and it works.

 string mystr = "C\u00f3mo Est\u00e1s?";

Neil white · Answer 3 · 2012-11-20T15:09:42+0000

I don't know much about the Google Translate API, but as I understand it, you have a problem with Unicode normalization.

Take a look at System.String.Normalize() and friends.

Unicode is very complex, so I will simplify! Many characters can be represented in more than one way in Unicode. For example, "é" can be represented as "é" (one character), as the character "e" followed by a combining accent (two characters), or, depending on what comes back from the API, as something else entirely.

The Normalize function converts your string to one with the same textual value but a potentially different binary representation, which may fix your output problem.
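
To make the normalization idea concrete, here is a minimal sketch. The class and method names (`NormalizationDemo`, `Compose`) are hypothetical and only for illustration; `String.Normalize` and `NormalizationForm.FormC` are the standard .NET members. Whether normalization actually helps in the question's case is an assumption, since it only matters if the API returns decomposed characters.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Returns the string in composed form (NFC), where a base letter
    // followed by a combining accent is merged into one precomposed char.
    public static string Compose(string text)
    {
        return text.Normalize(NormalizationForm.FormC);
    }
}
```

For example, `NormalizationDemo.Compose("Co\u0301mo")` takes a five-char string ("o" followed by U+0301 COMBINING ACUTE ACCENT) and returns the four-char string `"C\u00f3mo"`, which then compares equal to text that uses the precomposed "ó".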

Shaz · Answer 4 · 2012-11-22T11:42:23+0000

I had the same problem while working on one of my projects [translating localized language resources].

I did the same thing and used System.Text.Encoding.UTF8.GetBytes(), and because of the UTF-8 encoding I got special characters like yours, for example 239, 191, 189, in the result string.

Please take a look at my solution; I hope it helps.

Do not set an encoding at all. Google Translate returns the text correctly in the string itself. Do some string manipulation and read the string as it is.

Generic solution [works for every language that Google Translate supports]

    try
    {
        // Don't use UTF encoding
        // use default WebClient encoding
        var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}",
            "►" + txtNewResourceValue.Text.Trim() + "◄",
            "en|" + item.Text.Substring(0, 2));
        var webClient = new WebClient();
        string result = webClient.DownloadString(url); // get all data from Google Translate

        int start = result.IndexOf("id=result_box");
        int end = result.IndexOf("id=spell-place-holder");
        int length = end - start;
        result = result.Substring(start, length);

        // Search the reversed string so the markers nearest the end are found first
        result = reverseString(result);
        start = result.IndexOf(";8669#&"); // ◄ (reversed "&#9668;")
        end = result.IndexOf(";8569#&");   // ► (reversed "&#9658;")
        length = end - start;

        // hard-coded substring; finds the translation between the two markers
        result = result.Substring(start + 7, length - 8);
        objDic2.Text = reverseString(result);
        dictList.Add(objDic2);
    }
    catch (Exception ex)
    {
        lblMessages.InnerHtml = "<strong>Google translate exception occurred, no resource saved..." + ex.Message + "</strong>";
        error = true;
    }

    public static string reverseString(string s)
    {
        char[] arr = s.ToCharArray();
        Array.Reverse(arr);
        return new string(arr);
    }

As you can see from the code, no encoding is set, and I send two special characters, "►" + txtNewResourceValue.Text.Trim() + "◄", to mark the beginning and end of the translation returned by Google.

I also tested this through my language tool. I get "Cómo Estás?" when sending "How are you?" to Google Translate... :)

Regards [Shaz]

--------------------------- Edited ---------------------------

    // Uses the reverseString helper shown above.
    public string Translate(String input, String languagePair)
    {
        try
        {
            // Don't use UTF encoding
            // use default WebClient encoding
            // input: string to translate
            // languagePair: e.g. en|es
            var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}",
                "►" + input.Trim() + "◄", languagePair);
            var webClient = new WebClient();
            string result = webClient.DownloadString(url); // get all data from Google Translate

            int start = result.IndexOf("id=result_box");
            int end = result.IndexOf("id=spell-place-holder");
            int length = end - start;
            result = result.Substring(start, length);

            result = reverseString(result);
            start = result.IndexOf(";8669#&"); // ◄ (reversed "&#9668;")
            end = result.IndexOf(";8569#&");   // ► (reversed "&#9658;")
            length = end - start;
            result = result.Substring(start + 7, length - 8);

            // return translated string
            return reverseString(result);
        }
        catch (Exception ex)
        {
            return "Google translate exception occurred, no resource saved..." + ex.Message;
        }
    }

byteflux · Answer 5 · 2012-11-26T16:38:50+0000

There are several issues with your approach. First of all, UTF-8 is a multibyte encoding. This means that any non-ASCII character (code point above 127) is encoded as a sequence of two or more bytes whose high bits mark them as belonging to a multibyte character. So your sequence 239, 191, 189 actually represents a single non-ASCII character. If you use UTF-16, you get 2-byte code units, each of which fits in an unsigned short (0-65535); characters in the Basic Multilingual Plane, which includes é, take exactly one unit.

The char type in C# is a two-byte type, so it is effectively an unsigned short. This contrasts with other languages such as C / C++, where the char type is a 1-byte type.
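
As a rough illustration of the byte-level picture described above, the sketch below wraps the standard `Encoding.UTF8` and `Encoding.Unicode` (UTF-16 little-endian) encoders. The class and helper names (`EncodingDemo`, `Utf8Bytes`, `Utf16Bytes`, `ContainsReplacementChar`) are hypothetical and not taken from the question's code.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // UTF-8: one byte for ASCII, two or more bytes for anything else.
    public static byte[] Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // UTF-16 (little-endian): two bytes per C# char.
    public static byte[] Utf16Bytes(string text)
    {
        return Encoding.Unicode.GetBytes(text);
    }

    // True if the string holds U+FFFD, the character that decoders
    // substitute for bytes they could not interpret.
    public static bool ContainsReplacementChar(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }
}
```

With these helpers, `Utf8Bytes("\u00e9")` yields 195, 169 and `Utf16Bytes("\u00e9")` yields 233, 0, while `Utf8Bytes("\uFFFD")` yields exactly 239, 191, 189. Seeing that last sequence therefore suggests the string most likely already contained U+FFFD before GetBytes was called, so the original letter was probably lost earlier, when the downloaded bytes were decoded. `ContainsReplacementChar` is one way to check for that right after the download.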

So, in your case, if you don't really need byte[] arrays, you should use char[] arrays. Or, if you want to encode the characters so that they can be used in HTML, you can simply iterate over the characters, check whether each character code is above 127 and, if so, replace it with its hexadecimal numeric character reference (&#x followed by the hex code and a semicolon).
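
The HTML-encoding idea in the previous paragraph could be sketched as follows. `HtmlEntityEncoder` and `EncodeNonAscii` are hypothetical names chosen for this example, and the surrogate-pair handling is an addition the answer does not mention, included so that characters outside the Basic Multilingual Plane produce a single reference.

```csharp
using System;
using System.Text;

public static class HtmlEntityEncoder
{
    // Replaces every non-ASCII character with a hexadecimal numeric
    // character reference such as &#xE9; and leaves ASCII untouched.
    public static string EncodeNonAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c <= 127)
            {
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            // Combine a valid surrogate pair into one code point.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, text[i + 1]);
                i++;
            }
            sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
        }
        return sb.ToString();
    }
}
```

Calling `HtmlEntityEncoder.EncodeNonAscii("Cómo Estás?")` returns `C&#xF3;mo Est&#xE1;s?`, which a browser renders as the original text regardless of the page encoding.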
