How to recognize if a string contains unicode characters?

Question


I have a string and I want to know whether it contains any Unicode characters (that is, whether it consists entirely of ASCII or not).

How can I achieve this?

Thanks!

+21

c# asp.net unicode

Himberjack Dec 16 '10 at 10:13


6 answers

All C# / VB.NET string datatypes consist of Unicode characters.

+5

Mitch Wheat Dec 16 '10 at 10:15


ASCII defines only the character codes in the range 0-127. Unicode is explicitly defined to overlap with ASCII in that range. Thus, if you look at the character codes in your string and it contains anything above 127, the string contains Unicode characters that are not ASCII characters.
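A minimal sketch of that check, assuming it is enough to look at each UTF-16 code unit of the string. The class and method names are hypothetical and not taken from any of the answers.

```csharp
using System;
using System.Linq;

public static class AsciiCheck
{
    // Hypothetical helper: true when any UTF-16 code unit in the string
    // lies outside the ASCII range 0-127.
    public static bool ContainsNonAscii(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.Any(c => c > 127);
    }
}
```

With this sketch, `AsciiCheck.ContainsNonAscii("hello")` is false, while a string containing an accented letter such as "\u00E9" is flagged.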

Please note that ASCII only includes the English alphabet. Thus, if you (for some reason) need to apply the same approach to strings that may contain accented characters (for example, in Spanish text), ASCII is not enough, and you need to look for another differentiator.

The ANSI character set [*] extends ASCII with the aforementioned accented Latin letters in the range 128-255. However, Unicode does not overlap exactly with ANSI in this range, so technically a Unicode string can contain characters that are not part of ANSI but have the same character code (in particular, in the range 128-159, as you can see from the table I linked to).

As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it will not work for ANSI.

[*] Also known as Windows Latin 1 (Windows-1252)
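To make the ranges above concrete, here is a rough sketch that labels which range a character falls in. The helper name and the label strings are made up for illustration. It relies on two facts: codes 160-255 mean the same thing in Windows-1252 and in Unicode, while Unicode reserves 128-159 for control characters.

```csharp
using System;

public static class CodeRanges
{
    // Hypothetical helper: names the range a UTF-16 code unit falls in.
    public static string Describe(char c)
    {
        if (c <= 127) return "ASCII";
        if (c <= 159) return "C1 control in Unicode (differs from Windows-1252)";
        if (c <= 255) return "Latin-1 (same code in Windows-1252 and Unicode)";
        return "above 255";
    }
}
```

For example, the euro sign is byte 0x80 in Windows-1252 but U+20AC in Unicode, so `CodeRanges.Describe('\u20AC')` reports "above 255" even though the character is part of ANSI.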

+5

Franci Penov Dec 16 '10 at


As long as it contains characters, it contains Unicode characters.

From System.String:

Represents text as a series of Unicode characters.

    public static bool ContainsUnicodeChars(string text)
    {
        return !string.IsNullOrEmpty(text);
    }

Usually you have to worry about different Unicode encodings when you need to:

Encode a string into a byte stream with a specific encoding.
Decode a string from a byte stream with a specific encoding.
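The two cases above can be sketched roughly as follows. The wrapper class and method names are hypothetical; the `Encoding` members they call are the standard ones in `System.Text`.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Encodes text to bytes with the given encoding and decodes it back.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    // Number of bytes the text occupies in the given encoding.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }
}
```

The same string "\u00E9" takes two bytes in both `Encoding.UTF8` and `Encoding.Unicode` (UTF-16) and survives the round trip in either. With `Encoding.ASCII` it comes back as "?", because ASCII cannot represent it.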

Once you are working with a string, though, the encoding it was originally represented in, if any, is irrelevant.

Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
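One consequence of the UTF-16 representation is that a code point above U+FFFF occupies two Char values (a surrogate pair). A small sketch, with a hypothetical helper name, that counts code points rather than Char values:

```csharp
using System;

public static class CodePoints
{
    // Counts Unicode code points rather than UTF-16 code units (Char values).
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A high surrogate followed by a low surrogate is one code point,
            // so skip the second half of the pair.
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++;
            }
            count++;
        }
        return count;
    }
}
```

For the string "\U0001F600", `Length` is 2 while `CodePoints.CountCodePoints` returns 1. Any per-Char comparison against 127 or 255 therefore sees each half of a surrogate pair separately, and both halves are well above 255.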

You may also find the following questions useful:

How can you strip non-ASCII characters from a string? (in C#)

C# Make sure the string contains only ASCII

And this article by Jon Skeet: Unicode and .NET

+2

Ani Dec 16 '10 at 10:16


If a string contains only ASCII characters, encoding it to bytes and decoding it back using the ASCII encoding should return the same string, so a one-liner check in C# might look like this.

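    // Requires: using System.Text;
    string s1 = "testभारत";

    // Encoding.ASCII replaces every non-ASCII character with '?', so the
    // round trip changes the string exactly when it contains non-ASCII characters.
    // (GetEncoding(0) would return the system default encoding, not ASCII.)
    bool isUnicode = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s1)) != s1;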

+1

zingh Aug 22 '17 at 20:17


This is another solution that does not use lambda expressions. It is in VB.NET, but you can easily convert it to C#:

    Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
        Dim inputCharArray() As Char = inputstr.ToCharArray
        For i As Integer = 0 To inputCharArray.Length - 1
            If CInt(AscW(inputCharArray(i))) > 255 Then Return True
        Next
        Return False
    End Function
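A possible C# rendering of the VB.NET function above, offered as a sketch rather than the author's own code. It keeps the same threshold of 255; the wrapping class name is an assumption.

```csharp
public static class UnicodeCheck
{
    // C# rendering of the VB.NET function above: true if any character's
    // code is greater than 255.
    public static bool ContainsUnicode(string input)
    {
        foreach (char c in input)
        {
            if (c > 255) return true;
        }
        return false;
    }
}
```

Iterating the string directly avoids the `ToCharArray` copy that the VB.NET version makes.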

0

Yiannis Mpourkelis Oct 26 '16 at 3:01


Tim Lloyd · Accepted Answer · 2010-12-16 10:25

If my assumptions are correct, you want to know whether your string contains any non-ANSI characters. You can check that as follows.

    // Requires: using System; using System.Linq;
    public void test()
    {
        const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
        const string WithoutUnicodeCharacter = "an ANSI character:Æ";

        bool hasUnicode;

        // true
        hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
        Console.WriteLine(hasUnicode);

        // false
        hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
        Console.WriteLine(hasUnicode);
    }

    public bool ContainsUnicodeCharacter(string input)
    {
        // Highest character code in the extended (ANSI) range.
        const int MaxAnsiCode = 255;

        return input.Any(c => c > MaxAnsiCode);
    }

Update

This treats extended ASCII (codes up to 255) as non-Unicode. If you test only against the true ASCII range (up to 127), you could get false positives for extended ASCII characters, which do not denote Unicode. I have alluded to this in my sample.
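The effect of the two thresholds can be compared with a parameterized variant. This is only a sketch: the class, method, and `MaxAsciiCode` names are assumptions, and only `MaxAnsiCode` comes from the sample.

```csharp
using System.Linq;

public static class ThresholdCheck
{
    public const int MaxAsciiCode = 127; // strict ASCII
    public const int MaxAnsiCode = 255;  // extended range, as in the sample

    // True if any character's code is greater than maxCode.
    public static bool ContainsCharAbove(string input, int maxCode)
    {
        return input.Any(c => c > maxCode);
    }
}
```

For the string "\u00C6" (the character Æ, code 198), `ContainsCharAbove` returns true with `MaxAsciiCode` and false with `MaxAnsiCode`. That is the false positive described above.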
