How to extract text content of a web page in java?

Question

I am looking for a way to extract the text from a web page (originally HTML) using the JDK or another library. Please help.

Thanks.

+5

java

Radi Jun 14 '10 at 10:59

3 answers

Use jsoup, an HTML parser library for Java:

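import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3*1000);   // fetch and parse, 3-second timeout
String title = doc.title();                // text of the <title> element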

It also supports CSS selectors for finding elements.

+12

Pascal Thivent Jun 14 '10 at 11:12

With the JDK alone, you can read the page with java.util.Scanner (this returns the raw HTML, tags included):

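import java.net.URL;
import java.util.Scanner;

public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the scanner and the underlying stream
   try (Scanner sc = new Scanner(new URL(url).openStream())) {
      while (sc.hasNextLine())   // pair hasNextLine() with nextLine()
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}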

Usage:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}

+2

Itay Maman Jun 14 '10 at 11:13

polygenelubricants · Accepted Answer · 2010-06-14T11:04:08+0000

Use an HTML parser if at all possible; there are many available for Java.
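
As one illustration of the parser route, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html), so no extra library is needed. The class name JdkHtmlText and the method extract are made up for this example, and the Swing parser targets older HTML, so a dedicated parser library will likely cope better with modern pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named only for this sketch.
final class JdkHtmlText {

    // Collects every text chunk the parser reports, separated by spaces.
    static String extract(Reader html) throws IOException {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');
            }
        };
        // The third argument tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(html, collector, true);
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extract(new StringReader(html)));
    }
}
```

For a live page, the Reader could be an InputStreamReader wrapped around new URL(...).openStream(), with the charset chosen to match the page.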

Alternatively, you can use regular expressions, as many people do. However, this is not recommended unless you are doing very simple processing.
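
To show what "very simple processing" might look like, here is a rough regex sketch. The class name CrudeTagStripper and the method strip are hypothetical. It drops script and style blocks, removes anything that looks like a tag, and collapses whitespace. It does not decode entities such as &amp;amp; and it will break on malformed markup or on a ">" inside an attribute value, which is why a parser is the safer choice.

```java
// Hypothetical helper class, named only for this sketch.
final class CrudeTagStripper {

    static String strip(String html) {
        return html
            // remove <script>...</script> and <style>...</style> together with their contents
            .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ")
            // remove every remaining tag
            .replaceAll("(?s)<[^>]*>", " ")
            // collapse runs of whitespace into single spaces
            .replaceAll("\\s+", " ")
            .trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p><script>var x = 1;</script>";
        System.out.println(strip(html)); // prints "Hello world"
    }
}
```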
