Extract all images from HTML using Java

I want to get a list of all image URLs from the HTML source of a webpage (both absolute and relative URLs). I used Jsoup to parse the HTML, but it does not give me all the images. For example, when I parse the google.com HTML source, it finds no images. In the google.com HTML source, image links appear in this form:

"background:url(/intl/en_com/images/srpr/logo1w.png)

And at rediff.com, the image links look like this:

videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bappi-da-the-first-indian-in-grammy-jury/2684982","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/v3np2zgbla4vdccf.D.0.bappi.jpg","Bappi Da - the first Indian In Grammy jury","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:33)");
j = 1
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bebo-shahid-jab-they-met-again-/2681664","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/ra8p9eeig8zy5qvd.D.0.They-Met-Again.jpg","Bebo-Shahid : Jab they met again!","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:17)");

Not all images appear in "img" tags. I also want to extract images that are not in "img" tags, as in the HTML source shown above.

How can I do this? Please help. Thanks.

+6
java
2 answers

It will be a little complicated, I think. You basically need a library that loads the web page, builds the DOM of the page, and runs any JavaScript that can change the DOM. After all this, you extract every possible image from the DOM. Another option is to intercept every call the library makes to load a resource, examine the URL, and record it if it points to an image.

My suggestion would be to start by playing with HtmlUnit (http://htmlunit.sourceforge.net/gettingStarted.html). It does a good job of building the DOM. I'm not sure what kinds of hooks it has for intercepting the methods that load resources. Of course, if it does not provide hooks, you can always use AspectJ or just modify the HtmlUnit source code. Good luck, this sounds like a pretty interesting problem. You should post your solution when you figure it out.

+1

If you just want every image mentioned on the page, can't you scan the HTML and any associated JavaScript or CSS with a simple regular expression? How likely is it that [-:_./%a-zA-Z0-9]*\.(jpg|png|gif) matches something in the HTML / JS / CSS that is not an image? Not very likely, I think. And in any case, you have to allow for broken links.
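
A minimal sketch of that scan using only the standard library, assuming the page source (and any linked JS or CSS) has already been fetched into a string. The class name ImageUrlScanner and the method findImageUrls are hypothetical, and adding "jpeg" and case-insensitive matching is an assumption beyond the pattern above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Jsoup, HtmlUnit, or any other library.
class ImageUrlScanner {

    // Character class taken from the answer, with the dot before the
    // extension escaped. The "jpeg" extension and the CASE_INSENSITIVE
    // flag are assumptions added for illustration.
    private static final Pattern IMAGE_URL = Pattern.compile(
            "[-:_./%a-zA-Z0-9]*\\.(?:jpg|jpeg|png|gif)",
            Pattern.CASE_INSENSITIVE);

    // Returns each distinct image-looking URL in the text,
    // in order of first appearance.
    static List<String> findImageUrls(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = IMAGE_URL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        // Fragments in the two shapes quoted in the question.
        String sample = "style=\"background:url(/intl/en_com/images/srpr/logo1w.png)\" "
                + "videoArr[j]=new Array(\"http://datastore.rediff.com/h86-w116/thumb/"
                + "5E5669666658606D6A6B6272/v3np2zgbla4vdccf.D.0.bappi.jpg\");";
        for (String url : findImageUrls(sample)) {
            System.out.println(url);
        }
    }
}
```

Because the pattern only looks at characters, it finds URLs inside CSS url(...) values and JavaScript string literals alike, but it can also match text that merely ends in an image extension, so the results still need checking.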

Karthik's suggestion would be more correct, but I think it's more important for you to get absolutely everything and filter out uninteresting images.
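
Since the question asks for both absolute and relative URLs, one possible first filtering step is to resolve every candidate against the page URL and drop duplicates and strings that are not valid URIs. This is only a sketch under that assumption; ImageUrlResolver and toAbsolute are hypothetical names, and only java.net.URI from the standard library is used.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for post-processing the collected candidates.
class ImageUrlResolver {

    // Resolves each candidate against the page URL, skipping candidates
    // that are not syntactically valid URIs and removing duplicates
    // while keeping first-seen order.
    static List<String> toAbsolute(String pageUrl, List<String> candidates) {
        URI base = URI.create(pageUrl);
        Set<String> resolved = new LinkedHashSet<>();
        for (String candidate : candidates) {
            try {
                // Absolute candidates come back unchanged; relative ones
                // are interpreted relative to the page URL.
                resolved.add(base.resolve(candidate).toString());
            } catch (IllegalArgumentException e) {
                // Invalid URI syntax: treat as an uninteresting match.
            }
        }
        return new ArrayList<>(resolved);
    }
}
```

For example, calling toAbsolute with the page URL "http://www.google.com/" and the candidate "/intl/en_com/images/srpr/logo1w.png" yields "http://www.google.com/intl/en_com/images/srpr/logo1w.png". Whether a resolved URL actually points to a live image still has to be checked separately, since broken links are expected.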

0