I got what you asked for using the technique described in this article:
http://www.rexegg.com/regex-best-trick.html
To summarize, here is the strategy:
First, create a regex in this format:
NotThis | NeitherThis | (IWantThis)
With this pattern, capture group 1 ($1) is set only for the slashes you actually want to split on. The other alternatives still match, but they leave group 1 empty.
You can then replace the captured slashes with a placeholder that is unlikely to occur in the text, and split on that placeholder.
With this strategy in mind, here is the code:
Regex (written as a Java string literal, hence the doubled backslashes):
\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)
Explanation:
The NotThis term is a slash followed by another slash. The lookahead means only the first slash is consumed, so the second one is still examined on its own:
\\/(?=\\/)
The NeitherThis term is a basic URL match, with a lookahead so that it stops before the last / and does not consume it:
(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)
The IWantThis term is just a slash, captured in group 1:
(\\/)
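To see which alternative fires on each slash, here is a small sketch that lists every match of the same regex on a shortened sample and marks whether group 1 took part. The class name `MatchTrace`, the method `trace`, and the sample URL are made up for illustration; only the regex comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: MatchTrace and trace are hypothetical names.
class MatchTrace {
    static final Pattern P = Pattern.compile(
        "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Returns one entry per match: "SPLIT" when group 1 captured the slash,
    // otherwise "skip:" followed by the text that the match consumed.
    static List<String> trace(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add(m.group(1) != null ? "SPLIT" : "skip:" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the same shape as the real input.
        for (String entry : trace("Source/NN http://www.a.com/b.htm/X //PUNC:")) {
            System.out.println(entry);
        }
    }
}
```

On this sample the URL up to `b.htm` and the first slash of `//` should show up as `skip:` entries, while the three tag-separating slashes show up as `SPLIT`.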
In Java code, you can do all this as follows:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SlashSplit {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");
            Matcher m = p.matcher(
                "(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) )\n"
                + "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )\n"
                + "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )");

            StringBuffer b = new StringBuffer();
            while (m.find()) {
                if (m.group(1) != null) {
                    // A slash we want to split on: swap in the placeholder.
                    m.appendReplacement(b, "Superman");
                } else {
                    // A slash or URL we want to keep: put the match back unchanged.
                    // quoteReplacement stops '$' and '\' from being read as replacement syntax.
                    m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
                }
            }
            m.appendTail(b);
            String replaced = b.toString();

            System.out.println("\n" + "*** Replacements ***");
            System.out.println(replaced);

            String[] splits = replaced.split("Superman");
            System.out.println("\n" + "*** Splits ***");
            for (String split : splits) {
                System.out.println(split);
            }
        }
    }
Output:
    *** Replacements ***
    (TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 ISupermanPRP ) (VP~did~3~1 didSupermanVBD notSupermanRB (VP~read~2~1 readSupermanVB (NPB~article~2~2 theSupermanDT articleSupermanNN .SupermanPUNC. ) ) ) ) )
    (TOP SourceSupermanNN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htmSupermanX .Superman. )
    (NPB~sister~2~2 YourSupermanPRP$ sisterSupermanNN /SupermanPUNC: )

    *** Splits ***
    (TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I
    PRP ) (VP~did~3~1 did
    VBD not
    RB (VP~read~2~1 read
    VB (NPB~article~2~2 the
    DT article
    NN .
    PUNC. ) ) ) ) )
    (TOP Source
    NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm
    X .
    . )
    (NPB~sister~2~2 Your
    PRP$ sister
    NN /
    PUNC: )
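If you would rather not depend on a placeholder word that might also appear in the text, one possible variation is to cut the input directly at the positions where group 1 matched. This is a sketch of that idea, not part of the approach above; `DirectSplit` and `splitOnGroup1` are hypothetical names, and the regex is the same one used in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: DirectSplit and splitOnGroup1 are hypothetical names.
class DirectSplit {
    static final Pattern P = Pattern.compile(
        "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Cuts the input at every match where group 1 participated;
    // all other matches are left in place as ordinary text.
    static List<String> splitOnGroup1(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(input);
        int last = 0; // start of the piece currently being collected
        while (m.find()) {
            if (m.group(1) != null) {
                parts.add(input.substring(last, m.start(1)));
                last = m.end(1);
            }
        }
        parts.add(input.substring(last)); // trailing piece after the final cut
        return parts;
    }

    public static void main(String[] args) {
        for (String part : splitOnGroup1("Your/PRP$ sister/NN //PUNC: )")) {
            System.out.println(part);
        }
    }
}
```

Unlike `String.split`, this version keeps a trailing empty piece when the input ends with a captured slash, so filter those out if you do not want them.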