Extract HTML from URLs

I am using Boilerpipe to extract text from a URL using this code:

    URL url = new URL("http://www.example.com/some-location/index.html");
    String text = ArticleExtractor.INSTANCE.getText(url);

The resulting string contains only the text of the HTML page, but I need the full HTML markup.

Is there anyone who has used this library and knows how to extract HTML code?

You can learn more about the library on its demo page.

3 answers

For something this simple, you really don't need an external library:

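    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    URL url = new URL("http://www.google.com");
    StringBuilder sb = new StringBuilder();
    // try-with-resources (Java 7+) closes the reader and the underlying stream
    try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = br.readLine()) != null) {
            // readLine() strips the line terminator, so add it back
            sb.append(line).append('\n');
        }
    }
    String htmlContent = sb.toString();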

Just use KeepEverythingExtractor instead of ArticleExtractor.

But this is the wrong tool for the job. You just want to download the HTML of the URL (right?), rather than extract its content. So why use a content extractor?


With Java 7 and the Scanner trick, you can do the following:

    public static String toHtmlString(URL url) throws IOException {
        Objects.requireNonNull(url, "The url cannot be null.");
        try (InputStream is = url.openStream();
             Scanner sc = new Scanner(is)) {
            sc.useDelimiter("\\A");
            if (sc.hasNext()) {
                return sc.next();
            } else {
                return null; // or empty
            }
        }
    }
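
One caveat with the plain-JDK approaches in this thread: an InputStreamReader or Scanner created without a charset decodes the bytes with the JVM's default charset, which may not match the page's encoding. Here is a minimal sketch of the same idea with an explicit charset. The `HtmlFetcher` class and its method names are hypothetical (not part of Boilerpipe or the JDK), and it assumes the page is UTF-8; a more careful version would read the charset from the response's Content-Type header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Boilerpipe or the JDK.
final class HtmlFetcher {

    // Reads the whole stream and decodes it with the given charset.
    // Characters are copied in chunks, so line breaks are preserved as sent.
    static String readHtml(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    // Downloads the raw HTML of a URL.
    // Assumption: the page is encoded as UTF-8.
    static String fetchHtml(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return readHtml(in, StandardCharsets.UTF_8);
        }
    }
}
```

With this in place, something like `String html = HtmlFetcher.fetchHtml(new URL("http://www.example.com/some-location/index.html"));` should give you the page source as a string. Splitting out `readHtml` also lets you try the decoding on an in-memory stream without touching the network.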