How to extract the text content of a web page in Java?

I am looking for a way to extract the text from a web page (originally HTML) using the JDK or another library. Please help.

Thanks.

+5
3 answers

Use an HTML parser if at all possible; there are many available for Java.
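As a minimal sketch of the parser approach using only the JDK, the HTML parser that ships with Swing (javax.swing.text.html) can report each run of text it encounters. The class name HtmlTextExtractor and the method extractText are hypothetical names chosen for this illustration, and the sample page is made up:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
public class HtmlTextExtractor {

    // Collects the text runs reported by the JDK's built-in HTML parser,
    // one per line.
    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        try {
            // true = ignore any charset declaration, since the input is already a String
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head>"
                    + "<body><h1>Hello</h1><p>World</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

This parser is fairly old and may struggle with modern markup, so a dedicated library is probably the better choice for real pages.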

Alternatively, you can use regular expressions, as many people do. However, this is not recommended unless you are doing very simple processing.
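To illustrate what "very simple processing" with regular expressions might look like, here is a rough sketch that strips tags from a string. TagStripper and stripTags are hypothetical names, and the patterns are deliberately crude: they do not decode entities such as &amp;amp; and can be fooled by comments, attribute values containing ">", or malformed markup.

```java
// Hypothetical helper class, for illustration only.
public class TagStripper {

    // Crude tag stripping: fine for quick one-off jobs, unreliable on arbitrary HTML.
    public static String stripTags(String html) {
        return html
            // drop <script> and <style> blocks together with their contents
            .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ")
            // drop every remaining tag
            .replaceAll("(?s)<[^>]*>", " ")
            // collapse runs of whitespace into single spaces
            .replaceAll("\\s+", " ")
            .trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>World</b></p><script>var x = 1;</script>";
        System.out.println(stripTags(html)); // prints "Hello World"
    }
}
```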

+12

Use jsoup, an HTML parser library for Java:

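// Requires the jsoup library on the classpath.
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000); // fetch and parse, 3-second timeout
String title = doc.title();                // contents of the <title> element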

jsoup also supports CSS selectors.

+12

You can also fetch the raw page with the plain JDK (using java.util.Scanner):

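import java.net.URL;
import java.util.Scanner;

// Returns the raw HTML of the page (markup included), one line at a time.
public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the scanner and the underlying stream
   try (Scanner sc = new Scanner(new URL(url).openStream())) {
      while (sc.hasNextLine())
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}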

Usage:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}
+2
