Jsoup getElementsByAttributeValueMatching

[The jsoup discussion page suggested asking this question here.]

I'm not a regular expression expert, and I'm puzzled by the results I get from jsoup's getElementsByAttributeValueMatching() method.

If I have an HTML page that contains (among others) the following links:

 <a href="/tweb/tiles/twr/EIDS_AT_20130108T134335/01/">Parent Directory</a>
 <a href="1357681618315/">1357681618315/</a>
 <a href="1357681649996/">1357681649996/</a>

And I query it with:

 Elements dirs = baseDir.getElementsByAttributeValueMatching("href", Pattern.compile("[0-9]+/")); 

I expect to get only the 2 links whose href consists solely of digits (followed by a trailing slash).

However, I get all 3 links back.

I wrote a quick test program to check how Java's Matcher handles these three href values with the same regular expression, and it matches only the two numeric ones, as I would expect:

 String a = "/tweb/tiles/twr/EIDS_AT_20130108T134335/01/";
 String b = "1357681618315/";
 String c = "1357681649996/";
 Pattern p = Pattern.compile("[0-9]+/");
 System.out.println("a:" + p.matcher(a).matches());
 System.out.println("b:" + p.matcher(b).matches());
 System.out.println("c:" + p.matcher(c).matches());

This prints: a:false, b:true, c:true

So my question is: what am I missing?

Thanks, Linus

2 answers

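jsoup uses Matcher#find(), not Matcher#matches(), so the pattern matches if it is found anywhere in the attribute value. To require that the whole value match, you need to add the ^ and $ anchors yourself.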

 Elements dirs = baseDir.getElementsByAttributeValueMatching( "href", Pattern.compile("^[0-9]+/$")); 

Here are excerpts from the relevant Javadoc explaining the difference (emphasis mine):

find()

...

Returns:

true if, and only if, a subsequence of the input sequence matches this matcher's pattern

matches()

...

Returns:

true if, and only if, the entire region sequence matches this matcher's pattern

As for why jsoup uses find() instead of matches(), that is a question for its author.
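To make the difference concrete, here is a minimal sketch that uses only java.util.regex (no jsoup) and runs both patterns against the three href values from the question. The class name FindVsMatches and the helpers findsAnywhere and matchesWhole are made up for illustration; they are not part of jsoup or the JDK.

```java
import java.util.regex.Pattern;

class FindVsMatches {
    // Same semantics as Matcher#find(): true if any substring of input matches.
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Same semantics as Matcher#matches(): true only if the whole input matches.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] hrefs = {
            "/tweb/tiles/twr/EIDS_AT_20130108T134335/01/",
            "1357681618315/",
            "1357681649996/"
        };
        for (String href : hrefs) {
            System.out.println(href
                + " | unanchored find: " + findsAnywhere("[0-9]+/", href)
                + " | unanchored matches: " + matchesWhole("[0-9]+/", href)
                + " | anchored find: " + findsAnywhere("^[0-9]+/$", href));
        }
    }
}
```

The unanchored pattern is found inside the Parent Directory href (for example in the trailing "01/"), which is why a find()-based lookup returns all three links. With ^ and $ added, find() accepts only the two all-digit values.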


You can also use the [attr*=valContaining] and [attr~=regex] selectors with select() in jsoup.

 Elements dirs = baseDir.select("[attr~=regex]");

Here attr is the attribute name and regex is the regular expression applied to that attribute's value.

Refer to the docs here: https://jsoup.org/apidocs/org/jsoup/select/Selector.html
