Regex Whitespace Matching - Java

Question

The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two whitespace characters.

    Pattern whitespace = Pattern.compile("\\s\\s");
    matcher = whitespace.matcher(modLine);
    while (matcher.find()) matcher.replaceAll(" ");

The aim is to replace every instance of two consecutive whitespace characters with a single space. However, this does not actually work.

Do I have a serious misunderstanding of regexes or of the term "whitespace"?

+93

java regex whitespace

Glenn Nelson Jan 19 '11

8 answers

You can't use \s in Java to match whitespace in its own native character set, because Java doesn't support the Unicode White_Space property, even though doing so is strictly required to meet RL1.2 of UTS #18! What it does have is, unfortunately, not standards-conforming.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ (GeneralCategory=Separator), and the remaining 6 are \p{Cc} (GeneralCategory=Control).

White space is a pretty stable property, and those same code points have been around virtually forever. Even so, Java has no property that conforms to the Unicode Standard for these, so you instead have to use code like this:

    String whitespace_chars = ""        /* dummy empty string for homogeneity */
                            + "\\u0009" // CHARACTER TABULATION
                            + "\\u000A" // LINE FEED (LF)
                            + "\\u000B" // LINE TABULATION
                            + "\\u000C" // FORM FEED (FF)
                            + "\\u000D" // CARRIAGE RETURN (CR)
                            + "\\u0020" // SPACE
                            + "\\u0085" // NEXT LINE (NEL)
                            + "\\u00A0" // NO-BREAK SPACE
                            + "\\u1680" // OGHAM SPACE MARK
                            + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                            + "\\u2000" // EN QUAD
                            + "\\u2001" // EM QUAD
                            + "\\u2002" // EN SPACE
                            + "\\u2003" // EM SPACE
                            + "\\u2004" // THREE-PER-EM SPACE
                            + "\\u2005" // FOUR-PER-EM SPACE
                            + "\\u2006" // SIX-PER-EM SPACE
                            + "\\u2007" // FIGURE SPACE
                            + "\\u2008" // PUNCTUATION SPACE
                            + "\\u2009" // THIN SPACE
                            + "\\u200A" // HAIR SPACE
                            + "\\u2028" // LINE SEPARATOR
                            + "\\u2029" // PARAGRAPH SEPARATOR
                            + "\\u202F" // NARROW NO-BREAK SPACE
                            + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                            + "\\u3000" // IDEOGRAPHIC SPACE
                            ;
    /* A \s that actually works for Java's native character set: Unicode */
    String     whitespace_charclass = "["  + whitespace_chars + "]";
    /* A \S that actually works for Java's native character set: Unicode */
    String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
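To make that last step concrete, here is a minimal sketch of the character class being passed to replaceAll. It assumes a shortened list (only four of the 26 code points, to keep it brief), and the class name CollapseUnicodeWhitespace and the helper collapse are made up for illustration.

```java
public class CollapseUnicodeWhitespace {
    // Abbreviated for illustration only; a real list would carry all 26 entries.
    public static final String WHITESPACE_CHARS =
          "\\u0009"  // CHARACTER TABULATION
        + "\\u0020"  // SPACE
        + "\\u00A0"  // NO-BREAK SPACE
        + "\\u3000"; // IDEOGRAPHIC SPACE

    public static final String WHITESPACE_CHARCLASS = "[" + WHITESPACE_CHARS + "]";

    // Hypothetical helper: replaces each run of listed characters with one ASCII space.
    public static String collapse(String input) {
        return input.replaceAll(WHITESPACE_CHARCLASS + "+", " ");
    }

    public static void main(String[] args) {
        String modLine = "a\u00A0\u00A0b \t c\u3000d";
        // Strings are immutable, so the collapsed text is the return value.
        System.out.println("\"" + collapse(modLine) + "\""); // prints "a b c d"
    }
}
```

Note that a single listed character (such as the lone ideographic space before d) is also rewritten to a plain space, because the quantifier is + rather than {2,}.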

Sorry about all that. Java's regexes just don't work very well on its own native character set, so you really do have to jump through exotic hoops to make them work.

And if you think whitespace is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it is possible, and yes, it is a mind-numbing mess. That's being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU. That's what Google does for Android, because OraSun's doesn't measure up.

If you don't want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that "fixes" Java's patterns, at least to get them to conform to the requirements of RL1.2a in UTS #18, Unicode Regular Expressions.
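For readers on Java 7 or later (released after this answer was written), the Pattern documentation describes a UNICODE_CHARACTER_CLASS flag, also available inline as (?U), under which \s is defined as \p{IsWhite_Space}. The following is a small sketch of the difference, not a replacement for the approaches above; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

public class UnicodeClassFlag {
    // Default mode: \s covers only the ASCII set [ \t\n\x0B\f\r].
    public static final Pattern ASCII_WS = Pattern.compile("\\s+");

    // Java 7+: with this flag, \s follows the Unicode White_Space property.
    public static final Pattern UNICODE_WS =
        Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: collapses every match of p into a single space.
    public static String collapse(Pattern p, String input) {
        return p.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "a\u00A0\u00A0b"; // two NO-BREAK SPACE characters
        System.out.println(collapse(ASCII_WS, s).equals(s)); // true: nothing matched
        System.out.println(collapse(UNICODE_WS, s));         // a b
    }
}
```

The flag is documented as carrying a possible performance penalty, so it may be worth enabling only on the patterns that need it.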

+173

tchrist Jan 19 '11 at 2:16

For Java (not PHP, not JavaScript, not any other language):

 txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")
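A quick sketch of what that one-liner does and does not match. \p{javaSpaceChar} is backed by Character.isSpaceChar, which is documented to cover the Unicode space, line and paragraph separator categories, so a no-break space is matched but a tab is not. The class and method names here are made up for the example.

```java
public class JavaSpaceCharDemo {
    // Hypothetical helper wrapping the one-liner above.
    public static String collapse(String txt) {
        return txt.replaceAll("\\p{javaSpaceChar}{2,}", " ");
    }

    public static void main(String[] args) {
        // Two ordinary spaces collapse to one.
        System.out.println(collapse("a  b"));                       // a b
        // Two NO-BREAK SPACE characters collapse as well.
        System.out.println(collapse("a\u00A0\u00A0b"));             // a b
        // Tabs are control characters, not separators, so they are left alone.
        System.out.println(collapse("a\t\tb").equals("a\t\tb"));    // true
    }
}
```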

+12

surfealokesea Jun 11 '13 at 10:27

When I posted a question on the RegexBuddy forum (a regex development tool), I got a more precise answer to my Java question:

"Posted by Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but just one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the same characters as [\s\p{Z}]).

... \s\s will match two spaces if the input is ASCII only. The real problem is with the OP's code, as pointed out by the accepted answer to this question."
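As a rough illustration of the [\s\p{Z}] suggestion, the sketch below collapses mixed runs of ASCII whitespace and Unicode separators. The names are hypothetical. One caveat worth checking against the Character.isSpaceChar documentation: that method covers only the separator categories, so in this example the tab is matched by the \s half of the class, not by \p{Z}.

```java
public class UnicodeSpaceClassDemo {
    // Hypothetical helper: \s supplies the ASCII whitespace (tab, newline, ...),
    // \p{Z} supplies the Unicode separator categories (Zs, Zl, Zp).
    public static String collapse(String input) {
        return input.replaceAll("[\\s\\p{Z}]{2,}", " ");
    }

    public static void main(String[] args) {
        // Space + tab + NO-BREAK SPACE form one run and become one space.
        System.out.println(collapse("a \t\u00A0b"));     // a b
        // Two EM SPACE characters also collapse.
        System.out.println(collapse("a\u2003\u2003b"));  // a b
    }
}
```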

+5

Tuomas Nov 03 '14 at 12:01

Seems to work for me:

    String s = "  abc"; // two leading spaces
    System.out.println("\"" + s.replaceAll("\\s\\s", " ") + "\"");

will print:

 " abc"

I think you intended to do this instead of your code:

    Pattern whitespace = Pattern.compile("\\s\\s");
    Matcher matcher = whitespace.matcher(s);
    String result = s; // fall back to the input when nothing matches
    if (matcher.find()) {
        result = matcher.replaceAll(" ");
    }
    System.out.println(result);

+4

Mihai Toader Jan 19 '11 at 2:01

    Pattern whitespace = Pattern.compile("\\s\\s");
    Matcher matcher = whitespace.matcher(modLine);
    boolean flag = true;
    while (flag) {
        // Update your original search text with the result of the replace
        modLine = matcher.replaceAll(" ");
        // Reset the matcher to look at this "new" text
        matcher = whitespace.matcher(modLine);
        // Search again; if there is no match, set flag to false to exit, else run again
        if (!matcher.find()) {
            flag = false;
        }
    }

+1

Mike Sep 15 '11 at 12:51

For your purpose you can use this snippet:

    import org.apache.commons.lang3.StringUtils;

    StringUtils.normalizeSpace(string);

This collapses each run of whitespace into a single space and also removes leading and trailing whitespace.
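If pulling in Commons Lang is not an option, something close to that behaviour can be approximated with the standard library alone. This is only a stand-in under the assumption that the input uses ASCII whitespace; it is not how normalizeSpace is implemented, and the class and method names are invented for the example.

```java
public class NormalizeSpaceSketch {
    // Hypothetical stand-in, not the Commons Lang implementation:
    // trims both ends, then collapses each run of \s characters into one space.
    public static String normalize(String input) {
        return input.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("\"" + normalize("  Hello \t  world! \n") + "\""); // prints "Hello world!"
    }
}
```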

    String sampleString = "Hello world!";
    // Replace exactly two consecutive whitespace characters with one space
    String exactlyTwo = sampleString.replaceAll("\\s{2}", " ");
    // Replace two or more consecutive whitespace characters with one space
    String twoOrMore = sampleString.replaceAll("\\s{2,}", " ");

+1

Rashid Mv May 18 '18 at 19:42

Using whitespace in REs is a pain, but I believe they work. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use an RE (uncomment the println() to see how the matcher breaks up the String), here is some sample code:

    import java.util.regex.*;

    public class Two21WS {
        private String str = "";
        private Pattern pattern = Pattern.compile ("\\s{2,}"); // multiple spaces

        public Two21WS (String s) {
            StringBuffer sb = new StringBuffer();
            Matcher matcher = pattern.matcher (s);
            int startNext = 0;
            while (matcher.find (startNext)) {
                if (startNext == 0)
                    sb.append (s.substring (0, matcher.start()));
                else
                    sb.append (s.substring (startNext, matcher.start()));
                sb.append (" ");
                startNext = matcher.end();
                //System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
                //                    ", sb: \"" + sb.toString() + "\"");
            }
            sb.append (s.substring (startNext));
            str = sb.toString();
        }

        public String toString () {
            return str;
        }

        public static void main (String[] args) {
            String tester = " ab cdef gh ij kl";
            System.out.println ("Initial: \"" + tester + "\"");
            System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
        }
    }

It produces the following (compiled with javac and run from the command line):

    % java Two21WS
    Initial: "ab cdef gh ij kl"
    Two21WS: "ab cdef gh ij kl"

-3

Manidip Sengupta Jan 19 '11 at 4:10

Raph Levien · Accepted Answer · 2011-01-19 02:02

Yes, you need to take the result of matcher.replaceAll():

    String result = matcher.replaceAll(" ");
    System.out.println(result);
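Put together as a complete program, the fix might look like the sketch below. Java strings are immutable, so replaceAll cannot change modLine in place; it returns a new String, and according to the Matcher documentation it resets the matcher and scans the whole input itself, so no find() loop is needed. The class name, method names and sample input are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceAllResult {
    private static final Pattern PAIR = Pattern.compile("\\s\\s");
    private static final Pattern RUN = Pattern.compile("\\s{2,}");

    // Replaces each non-overlapping pair of whitespace characters with one space.
    public static String collapsePairs(String modLine) {
        Matcher matcher = PAIR.matcher(modLine);
        return matcher.replaceAll(" "); // the result must be captured by the caller
    }

    // Replaces each run of two or more whitespace characters with one space.
    public static String collapseRuns(String modLine) {
        return RUN.matcher(modLine).replaceAll(" ");
    }

    public static void main(String[] args) {
        String modLine = "a  b    c"; // two spaces, then four spaces
        System.out.println("\"" + collapsePairs(modLine) + "\""); // prints "a b  c"
        System.out.println("\"" + collapseRuns(modLine) + "\"");  // prints "a b c"
        System.out.println("\"" + modLine + "\"");                // original is unchanged
    }
}
```

The first line of output shows why the pair pattern alone leaves double spaces behind when a run is longer than two characters; the {2,} quantifier handles that in a single pass.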
