Replace Unicode character "�" with space

I am bulk loading data from a CSV file, and I need to replace this non-ASCII character "ï¿½" with a normal space " ".

The character "�" corresponds to "\uFFFD" in C/C++/Java and is called the REPLACEMENT CHARACTER. The official C# documentation also lists other special characters such as U+FEFF, U+205F, U+200B, U+180E, and U+202F.

I am trying to replace it with this method:

  public string Errors = "";

  public void test()
  {
      string textFromCsvCell = "";
      string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
      textFromCsvCell = "This is my text from csv file"; // All the spaces aren't normal spaces " "
      string cleaned = textFromCsvCell.Replace("\uFFFD", "\"");
      if (Regex.IsMatch(cleaned, validCharacters))
      {
          // All code for insert
      }
      else
      {
          Errors = cleaned;
      }
      // print Errors
  }

The validation method shows me this text:

"This is my�text from csv file"

I have also tried several other solutions:

Attempted solution 1: Using Trim

  Regex.Replace(value.Trim(), @"[^\S\r\n]+", " "); 

Attempted solution 2: Using Replace

  System.Text.RegularExpressions.Regex.Replace(str,@"\s+"," "); 

Attempted solution 3: Using Trim

  str = str.Trim(new char[]{'\uFEFF','\u200B'});

Attempted solution 4: Add [\S\r\n] to validCharacters

  string validCharacters = @"^[\S\r\n0-9A-Za-z().:%-/ ]+$";

None of these work.

Does anyone have an idea? How can I replace it? I would be very grateful for any help, thanks.

Sources:

http://www.fileformat.info/info/unicode/char/0fffd/index.htm

Trying to replace all white space with a single space

Strip Byte Order Mark from string in C#

C# Regex - Remove extra whitespaces but keep new lines

EDITED

This is the source line:

"GLUCOSE CONTINUOUS MONITORING SYSTEM"

In 0x ... notation:

SYSTEM OF0xA0MONITORING CONTINUED GLUCOSE
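
One possible explanation, which the question does not confirm: the file is stored in a single-byte encoding such as ISO-8859-1, where the byte 0xA0 is a non-breaking space, but it is being read as UTF-8, where a lone 0xA0 is invalid and gets decoded as U+FFFD. A minimal sketch of that effect (the `EncodingDemo` class and `Decode` helper are hypothetical names used only for illustration):

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: decode raw bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding) => encoding.GetString(bytes);

    public static void Main()
    {
        // The bytes for "OF", then 0xA0, then "MON". A lone 0xA0 is not valid UTF-8.
        byte[] raw = { 0x4F, 0x46, 0xA0, 0x4D, 0x4F, 0x4E };

        string asUtf8 = Decode(raw, Encoding.UTF8);
        string asLatin1 = Decode(raw, Encoding.GetEncoding("ISO-8859-1"));

        Console.WriteLine(asUtf8[2] == '\uFFFD');   // True: the invalid byte became U+FFFD
        Console.WriteLine(asLatin1[2] == '\u00A0'); // True: decoded as a non-breaking space
    }
}
```

If that assumption holds for your file, passing the matching encoding when reading it (for example with File.ReadAllText(path, encoding)) would give you U+00A0 instead of U+FFFD, and the original character would not be lost.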

Solution

Go to the Unicode code converter at http://r12a.imtqy.com/apps/conversion/, look at the conversions, and do the replacement.

In my case, I am doing a simple replacement:

  string value = "SYSTEM OF MONITORING CONTINUES OF GLUCOSE"; // value contains a non-breaking space
  // value is "SYSTEM OF�MONITORING CONTINUES OF GLUCOSE"
  string cleaned = "";
  string pattern = @"[^\u0000-\u007F]+";
  string replacement = " ";
  Regex rgx = new Regex(pattern);
  cleaned = rgx.Replace(value, replacement);
  if (Regex.IsMatch(cleaned, "^[0-9A-Za-z().:<>%-/ ]+$"))
  {
      // all code for insert
  }
  else
  {
      // error message
  }

This character class covers the whitespace characters: space, tab, form feed, line feed, carriage return, and the other Unicode space characters:

 [ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
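
As a rough sketch of how the character class listed above could be used from C#, the following replaces each matched character with a regular space. The `WhitespaceDemo` class and `NormalizeSpaces` method are hypothetical names, not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // The character class from the list above, split across two verbatim strings for readability.
    private const string SpaceClass =
        @"[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006" +
        @"\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]";

    // Replaces every character in the class with a regular space (one for one).
    public static string NormalizeSpaces(string input) =>
        Regex.Replace(input, SpaceClass, " ");

    public static void Main()
    {
        string cleaned = NormalizeSpaces("SYSTEM OF\u00A0MONITORING");
        Console.WriteLine(cleaned == "SYSTEM OF MONITORING"); // True
    }
}
```

Note that U+FFFD is not a whitespace character, so this class does not match it. If the string already contains "�", it has to be added to the class or handled separately. Line breaks and tabs are also turned into spaces here, which is probably fine for a single CSV cell but may not be for multi-line text.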

Reference: https://developer.mozilla.org/en/docs/Web/JavaScript/Guide/Regular_Expressions

2 answers

Using String.Replace:

How about a simple String.Replace()?

I assumed that the only characters you want to remove are the ones you mentioned in the question, ï¿½, and that you want to replace them with a regular space.

  string text = "impï¿½ortant";
  string cleaned = text.Replace('\u00ef', ' ')
                       .Replace('\u00bf', ' ')
                       .Replace('\u00bd', ' ');
  // Returns 'imp   ortant' (one space per replaced character)

Or using Regex.Replace:

  string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
  // Returns 'imp   ortant' (one space per replaced character)



Define the range of ASCII characters and replace anything outside that range.

We only want to find the non-ASCII characters, so we match any character that is not ASCII and replace it.

 Regex.Replace("This is my te\uFFFDxt from csv file", @"[^\u0000-\u007F]+", " ") 

The pattern above matches any run of characters that is not (^) in the set [ ] covering the range \u0000-\u007F (the ASCII characters; everything past \u007F is non-ASCII Unicode) and replaces it with a space.

Result

 This is my te xt from csv file 

You can customize the range of \u0000-\u007F as needed to expand the range of allowed characters to suit your needs.
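
For example, here is a sketch that keeps ASCII plus the Latin-1 range U+00C0-U+00FF (mostly accented letters) and replaces everything else. The extra range is only an illustrative assumption, and `RangeDemo` and `ReplaceOutsideAllowed` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // Allowed: ASCII (U+0000-U+007F) plus U+00C0-U+00FF; the second range is an assumption.
    public static string ReplaceOutsideAllowed(string input) =>
        Regex.Replace(input, @"[^\u0000-\u007F\u00C0-\u00FF]+", " ");

    public static void Main()
    {
        // "café" followed by U+FFFD and "bar": é (U+00E9) is kept, U+FFFD is replaced.
        string cleaned = ReplaceOutsideAllowed("caf\u00E9\uFFFDbar");
        Console.WriteLine(cleaned == "caf\u00E9 bar"); // True
    }
}
```

Because of the + quantifier, a run of several disallowed characters collapses into a single space.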
