Regular expression for splitting on a forward slash

I have a parse tree that contains some information. To extract the information I need, I use code that splits a line on a slash (/), but it is not perfect. The details follow.

I used this code in another project before and it worked perfectly. But now the syntax trees of my new dataset are more complex, and the code sometimes makes the wrong decisions.

The parse tree looks something like this:

 (TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) ) 

As you can see, the tree leaves are the words immediately before the slash. To get these words, I used this code before:

 parse_tree.split("/"); 

But in my new data I see examples like these:

1) (TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )

where there are multiple slashes due to website addresses (in this case, only the last slash is a word delimiter).

2) (NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )

where a slash is itself the word.

Could you help me replace the current simple regular expression with an expression that can handle these cases?

To summarize, I need a regular expression that splits on a slash but handles two exceptions: 1) if there is a website address, the split should happen only on the last slash; 2) if there are two consecutive slashes, the split should happen on the second slash (the first slash is the word itself and should NOT be treated as a separator).

java regex
3 answers

I achieved what you asked for using the technique from this article:

http://www.rexegg.com/regex-best-trick.html

To summarize, here is a strategy:

First, you need to build a regex in this form:

 NotThis | NeitherThis | (IWantThis) 

After that, capture group 1 ($1) will contain only the slashes you actually want to split on.

Then you can replace those slashes with a placeholder that is unlikely to occur in the data, and split on that placeholder.

So, bearing in mind this strategy, here is the code:

Regex:

 \\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/) 

Explanation:

The NotThis term is a slash followed by another slash, written with a lookahead so that only the first slash is matched:

 \\/(?=\\/) 

The NeitherThis term is a basic URL match, with a lookahead so that the last / is not consumed:

 (?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/) 

The IWantThis term is just a slash:

 (\\/) 
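To see which alternative handles each slash, a small sketch like the following may help. It uses the same pattern as above; the class name AlternationDemo and the helper describe are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // Same pattern as above; only the last alternative has a capture group.
    static final Pattern P = Pattern.compile(
        "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Hypothetical helper: labels every match as a delimiter (group 1 set)
    // or as skipped text (matched by one of the first two alternatives).
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            String kind = (m.group(1) != null) ? "delimiter" : "skipped";
            out.add(kind + "@" + m.start() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        describe("sister/NN //PUNC:").forEach(System.out::println);
    }
}
```

For the input in main, the slash at index 10 is reported as skipped because another slash follows it, while the slashes at indices 6 and 11 are reported as delimiters.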

In Java code, you can do all this as follows:

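    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SlashSplit {
        public static void main(String[] args) {
            // Only the last alternative captures; the first two consume text we want to keep.
            Pattern p = Pattern.compile("\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");
            Matcher m = p.matcher(
                "(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) )\n"
                + "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )\n"
                + "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )");

            StringBuffer b = new StringBuffer();
            while (m.find()) {
                if (m.group(1) != null) {
                    // A real delimiter: swap it for the placeholder.
                    m.appendReplacement(b, "Superman");
                } else {
                    // A slash or URL to keep: put it back unchanged.
                    // quoteReplacement guards against '$' or '\' in the matched text.
                    m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
                }
            }
            m.appendTail(b);
            String replaced = b.toString();

            System.out.println("\n" + "*** Replacements ***");
            System.out.println(replaced);

            String[] splits = replaced.split("Superman");
            System.out.println("\n" + "*** Splits ***");
            for (String split : splits) {
                System.out.println(split);
            }
        }
    }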

Output:

    *** Replacements ***
    (TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 ISupermanPRP ) (VP~did~3~1 didSupermanVBD notSupermanRB (VP~read~2~1 readSupermanVB (NPB~article~2~2 theSupermanDT articleSupermanNN .SupermanPUNC. ) ) ) ) )
    (TOP SourceSupermanNN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htmSupermanX .Superman. )
    (NPB~sister~2~2 YourSupermanPRP$ sisterSupermanNN /SupermanPUNC: )

    *** Splits ***
    (TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I
    PRP ) (VP~did~3~1 did
    VBD not
    RB (VP~read~2~1 read
    VB (NPB~article~2~2 the
    DT article
    NN .
    PUNC. ) ) ) ) )
    (TOP Source
    NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm
    X .
    . )
    (NPB~sister~2~2 Your
    PRP$ sister
    NN /
    PUNC: )
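One caveat with a placeholder is that it breaks if the placeholder text ever appears in the data. A possible variation, sketched here and not part of the answer above, records where group 1 matched and cuts the string at those positions directly. The names SlashSplitter and splitOnDelimiters are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlashSplitter {
    // Same pattern as in the answer; group 1 marks a real delimiter.
    static final Pattern P = Pattern.compile(
        "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Hypothetical helper: cuts the input at every position where group 1 matched.
    static List<String> splitOnDelimiters(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(input);
        int last = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                parts.add(input.substring(last, m.start(1)));
                last = m.end(1);
            }
        }
        // Whatever follows the final delimiter (possibly an empty string).
        parts.add(input.substring(last));
        return parts;
    }

    public static void main(String[] args) {
        String line = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";
        splitOnDelimiters(line).forEach(System.out::println);
    }
}
```

Unlike String.split, this sketch keeps a trailing empty string when the input ends with a delimiter, so callers may want to filter that out.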

You should be able to use a negative lookbehind in the regular expression. It needs testing against a larger sample of your data. For your two cases it leaves the // after http: unsplit, but as the output below shows, it still splits on the slashes inside the URL path:

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class LookbehindSplit {
        public static void main(String[] args) {
            // Split on a slash unless it is preceded by ':' or another '/'.
            String pattern = "(?<![\\:\\/])\\/";

            String s1 = "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )";
            List<String> a = Arrays.asList(s1.split(pattern));
            System.out.println("first case:");
            System.out.println(a.stream().collect(Collectors.joining(",\n")));
            System.out.println("\n");

            String s2 = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";
            a = Arrays.asList(s2.split(pattern));
            System.out.println("second case");
            System.out.println(a.stream().collect(Collectors.joining(",\n")));
        }
    }

It outputs:

    first case:
    (TOP Source,
    NN http://www.alwatan.com.sa,
    daily,
    2007-01-31,
    first_page,
    first_page01.htm,
    X .,
    . )


    second case
    (NPB~sister~2~2 Your,
    PRP$ sister,
    NN ,
    /PUNC: )
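A possible alternative, not taken from the answer above, is to anchor on what follows the slash rather than what precedes it: treat a slash as a delimiter only when no other slash appears before the next whitespace. This sketch assumes that word/tag tokens are separated by whitespace and that tags never contain a slash; the class and constant names are illustrative.

```java
import java.util.Arrays;
import java.util.List;

public class LastSlashSplit {
    // A slash is a delimiter only if no further slash follows
    // before the next whitespace (or the end of the input).
    static final String LAST_SLASH_IN_TOKEN = "/(?=[^/\\s]*(?:\\s|$))";

    // Hypothetical helper wrapping String.split with the pattern above.
    static List<String> split(String line) {
        return Arrays.asList(line.split(LAST_SLASH_IN_TOKEN));
    }

    public static void main(String[] args) {
        String s1 = "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )";
        String s2 = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";
        split(s1).forEach(System.out::println);
        split(s2).forEach(System.out::println);
    }
}
```

Under those assumptions the URL stays in one piece, and in //PUNC: the first slash is kept as the word while the second acts as the separator.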

Filter your matches so that they exclude anything matched by the regex below, which matches any http, https, or ftp URL. You can add as many protocols as you want:

    (?<protocol>http(s)?|ftp)://(?<server>([A-Za-z0-9-]+\.)*(?<basedomain>[A-Za-z0-9-]+\.[A-Za-z0-9]+))+((/?)(?<path>(?<dir>[A-Za-z0-9\._\-]+)))*

Then match runs of consecutive slashes with (/)+. The + here is a greedy quantifier, which means it will match as many consecutive slashes as it can, whether that is // or a longer run.

hope this helps
