Extract HTML from URLs

Question

I am using Boilerpipe to extract text from a URL using this code:

    URL url = new URL("http://www.example.com/some-location/index.html");
    String text = ArticleExtractor.INSTANCE.getText(url);

The text string contains only the text of the HTML page, but I need to extract all of its HTML.

Is there anyone who has used this library and knows how to extract HTML code?

You can learn more about the library on its demo page.

+7

java string html url extract

Wassim AZIRAR Mar 6 '11 at 21:32

3 answers

Just use KeepEverythingExtractor instead of ArticleExtractor.

But this is using the wrong tool for the job. You just want to download the HTML content of a URL (right?), rather than extract content from it. So why use a content extractor?

+1

Konrad Rudolph Mar 6 '11 at 21:50

With Java 7 and the Scanner trick, you can do the following:

    public static String toHtmlString(URL url) throws IOException {
        Objects.requireNonNull(url, "The url cannot be null.");
        try (InputStream is = url.openStream(); Scanner sc = new Scanner(is)) {
            // "\\A" matches the start of input, so the single token is the whole stream
            sc.useDelimiter("\\A");
            if (sc.hasNext()) {
                return sc.next();
            } else {
                return null; // or empty
            }
        }
    }
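One caveat with the Scanner approach: `new Scanner(is)` decodes the bytes with the platform default charset. Here is a minimal sketch of the same trick with an explicit charset. It assumes the page is UTF-8, the class name `ScannerHtmlReader` is made up for illustration, and it returns an empty string rather than null for an empty response.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Objects;
import java.util.Scanner;

// Hypothetical helper class, named only for this example.
final class ScannerHtmlReader {

    // Reads the whole resource behind the URL as one string.
    // Assumption: the content is encoded as UTF-8.
    static String toHtmlString(URL url) throws IOException {
        Objects.requireNonNull(url, "The url cannot be null.");
        try (InputStream is = url.openStream();
             Scanner sc = new Scanner(is, "UTF-8")) {
            // "\\A" anchors at the start of input, so the first token is everything.
            sc.useDelimiter("\\A");
            // An empty stream has no token at all, so fall back to "".
            return sc.hasNext() ? sc.next() : "";
        }
    }
}
```

If the server declares a different encoding, the charset name would have to be changed to match.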

+1

Paul Vargas Apr 25 '15 at 21:42

Goran Jovic · Accepted Answer · 2011-03-06T21:49:40+0000

For something this simple, you really don't need an external library:

    URL url = new URL("http://www.google.com");
    InputStream is = (InputStream) url.getContent();
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line = null;
    StringBuilder sb = new StringBuilder();
    while ((line = br.readLine()) != null) {
        // readLine() strips the line terminator, so add it back
        sb.append(line).append('\n');
    }
    br.close();
    String htmlContent = sb.toString();
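On Java 9 or later, the read loop can probably be replaced by `InputStream.readAllBytes()`. The following is only a sketch under some assumptions: the class name `HtmlFetcher` is hypothetical, the content is assumed to be UTF-8, and the 10-second timeouts are arbitrary values picked for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this example.
final class HtmlFetcher {

    // Downloads the raw content behind the URL and decodes it as UTF-8 (an assumption).
    static String fetchHtml(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // Arbitrary illustrative timeouts, so a stalled server cannot block forever.
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        // try-with-resources closes the stream even if reading fails.
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

It would be called as `String htmlContent = HtmlFetcher.fetchHtml(new URL("http://www.google.com"));`. Unlike a line-by-line loop, this keeps the original line breaks exactly as the server sent them.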