How to read text from a web page using Java?

I want to read text from a web page. I do not want to receive the HTML code of the web page. I found this code:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");
        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s)
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
        e.printStackTrace(); // the URL string is not valid
    } catch (IOException e) {
        e.printStackTrace(); // the page could not be fetched or read
    }

But this code gives me the HTML source of the page. I want only the text that appears on the page. How can I do this in Java?

+8
java
4 answers

You can look at jsoup for this:

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is taken from the jsoup site.

+13

Use jsoup.

You can query the parsed document using CSS-style selectors.

For the page in the question, you can try:

    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    String textContents = doc.select(".newsText").first().text();
+4

You will need to take the content that you get with your current code, then parse it and find the tags that contain the text you want. A SAX parser is well suited to this job, provided the markup is well-formed.
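As a rough illustration of the SAX idea, here is a minimal sketch that uses the parser bundled with the JDK. The class name `SaxTextExtractor`, the helper `extract`, and the sample markup are made up for this example. It also assumes the input is well-formed XHTML; the JDK's SAX parser rejects ordinary, sloppy HTML, so real pages usually need to be tidied first.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text found inside elements with a given tag name.
public class SaxTextExtractor {

    static class TextHandler extends DefaultHandler {
        private final String wantedTag;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // greater than 0 while inside a wanted element

        TextHandler(String wantedTag) {
            this.wantedTag = wantedTag;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals(wantedTag)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals(wantedTag)) {
                depth--;
                text.append(' '); // keep text of adjacent elements apart
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            // Collapse whitespace runs so the result reads as plain text.
            return text.toString().replaceAll("\\s+", " ").trim();
        }
    }

    // Assumption: xhtml is well-formed XML; anything else is reported as an error.
    public static String extract(String xhtml, String tag) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextHandler handler = new TextHandler(tag);
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1>"
                + "<p>First <b>bold</b> text.</p><p>Second.</p></body></html>";
        System.out.println(extract(xhtml, "p"));
    }
}
```

Here the handler only keeps character data while it is inside a `p` element, so the heading text is skipped and nested tags such as `b` contribute their text without the markup.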

Or, if you are not after one specific piece of text, you can simply remove all the tags so that only the text remains. I think you could use a regular expression for this.
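For the regular-expression route, something along these lines might do as a starting point. It is a naive sketch, not a real HTML parser: the class name `TagStripper` and the method `stripTags` are invented for this example, entities such as `&amp;` are left undecoded, and unusual markup (for example a `>` inside an attribute value) will confuse it.

```java
import java.util.regex.Pattern;

// Hypothetical helper: naive regex-based tag removal.
public class TagStripper {

    // Matches <script>...</script> and <style>...</style> blocks, bodies included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Matches any remaining tag such as <p>, </a> or <br/>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    public static String stripTags(String html) {
        // Drop script and style blocks first so their contents do not end up in the text.
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        // Replace each tag with a space so neighbouring blocks do not run together.
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        // Collapse the whitespace left behind by the removed tags.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(stripTags(html)); // prints: An example link.
    }
}
```

Replacing tags with a space keeps `<p>a</p><p>b</p>` from collapsing into `ab`, at the cost of splitting a word that has a tag in the middle of it. For anything beyond quick experiments, a proper HTML parser is the safer choice.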

0

You can also use the HtmlCleaner library. Here is the code:

    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode node = cleaner.clean(url);
    System.out.println(node.getText().toString());
0