How to use HTML Parser to get full information about all tags on an HTML page

Question

I am using an HTML parser to develop an application. The code below does not retrieve every tag on the page: some tags are missing, along with their attributes and text. Can someone explain why this happens, or suggest another approach?

 URL url = new URL("...");
 PrintWriter pw=new PrintWriter(new FileWriter("HTMLElements.txt"));

 URLConnection connection = url.openConnection();
 InputStream is = connection.getInputStream();
 InputStreamReader isr = new InputStreamReader(is);
 BufferedReader br = new BufferedReader(isr);

 HTMLEditorKit htmlKit = new HTMLEditorKit();
 HTMLDocument htmlDoc = (HTMLDocument)htmlKit.createDefaultDocument();
 HTMLEditorKit.Parser parser = new ParserDelegator();
 HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
 parser.parse(br, callback, true);

 ElementIterator iterator = new ElementIterator(htmlDoc);
 Element element;
 while ((element = iterator.next()) != null)
 {
     AttributeSet attributes = element.getAttributes();
     Enumeration e = attributes.getAttributeNames();

     pw.println("Element Name :" + element.getName());
     while (e.hasMoreElements())
     {
         Object key = e.nextElement();
         Object val = attributes.getAttribute(key);
         int startOffset = element.getStartOffset();
         int endOffset = element.getEndOffset();
         int length = endOffset - startOffset;
         String text = htmlDoc.getText(startOffset, length);

         pw.println("Key :" + key.toString() + " Value :" + val.toString() + "\r\n" + "Text :" + text + "\r\n");
     }
 }

java screen-scraping

user275965 Feb 18 '10 at 10:32

5 answers

bakkal · Answer 1 · 2010-07-07T21:56:00+0000

I do this fairly reliably with HTML Parser (assuming the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we don’t have one.

General idea:

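You need to know which tag (div, meta, span ..) holds the information you want, and which attributes identify that tag. For example:

 <span class="price"> $7.95</span>

To extract "$7.95", look for the span whose class is "price".

HTML Parser supports filtering by attribute:

filter = new HasAttributeFilter("class", "price");

Parsing with a filter returns a list of Nodes. Use instanceof to check that a node is of the type you expect, for example a span:

if (node instanceof Span) // or any other supported element.

Then cast the node to that class to read its contents.

The HTML Parser documentation lists the supported tags and filters. A complete example follows.

Target tag:

<meta name="description" content="Amazon.com: frankenstein: Books"/>

Code: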

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}

BalusC · Answer 2 · 2010-02-18T16:43:57+0000

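Taking amazon.com as the example site, I would work through three steps:

Step 1: Check the site's robots file, for example http://amazon.com/robots.txt. If the URL you want to fetch is covered by a Disallow rule under User-Agent *, stop there, because the site does not want that URL crawled. Otherwise, continue with step 2.

Step 2: Check whether the site offers a public web service API, which is much easier and more reliable than parsing HTML. Such services usually return the data in a structured format (JSON or XML). If there is none, continue with step 3.

Step 3: Learn how HTML/CSS/JS fit together, and inspect the page with a tool such as the webdeveloper toolbar or Firebug, or with rightclick > View Page Source. Be aware that some content may be loaded or generated by JS/Ajax. An HTML parser does not execute JS, so that content will not be in the HTML it sees.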
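The robots.txt check from step 1 can be sketched in plain Java. This is a deliberately simplified illustration, not a complete robots.txt implementation: it only collects `Disallow` prefixes from groups that name `User-agent: *`, and it ignores `Allow` rules and wildcard patterns. The class name `RobotsCheck` and both method names are hypothetical, and the sample rules are made up rather than taken from any real site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsCheck {

    /** Collects the Disallow path prefixes listed for "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Strip comments, then split "field: value".
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group of rules.
                if (!lastWasUserAgent) {
                    inWildcardGroup = false;
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    rules.add(value);
                }
            }
        }
        return rules;
    }

    /** True if the path starts with any Disallow prefix for "User-agent: *". */
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample rules, for illustration only.
        String sample = "User-agent: *\nDisallow: /private/\nDisallow: /cart\n\n"
                + "User-agent: ExampleBot\nDisallow: /\n";
        System.out.println(isDisallowed(sample, "/private/page.html"));
        System.out.println(isDisallowed(sample, "/books/frankenstein"));
    }
}
```

In practice you would download the site's robots.txt first and pass its text to `isDisallowed` together with the path you intend to request.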

Riduidel · Answer 3 · 2010-02-18T16:12:49+0000

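As far as I can tell, you are using the Swing HtmlDocument. It is not a particularly robust HTML parser. If you need better results, take a look at a dedicated parser such as NekoHtml.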
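If you want to stay with the JDK's Swing parser, one option is to skip the HTMLDocument model and receive parser events directly through an `HTMLEditorKit.ParserCallback`. The sketch below records every start tag, simple tag, end tag, and text run together with its attributes. The class name `TagDump` and its methods are hypothetical, and the sample markup is invented. As far as I know, tags the parser's DTD does not recognise are reported through `handleSimpleTag`, so both callbacks are handled here. This is not guaranteed to match what a browser would see on real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagDump {

    /** Parses the markup and returns one line per parser event. */
    static List<String> dump(String html) {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("START " + tag + describe(attrs));
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("SIMPLE " + tag + describe(attrs));
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                events.add("END " + tag);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("TEXT " + new String(data));
            }
        };
        try {
            // true = ignore any charset declaration inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    /** Formats all attributes as " key=value" pairs. */
    static String describe(MutableAttributeSet attrs) {
        StringBuilder sb = new StringBuilder();
        Enumeration<?> names = attrs.getAttributeNames();
        while (names.hasMoreElements()) {
            Object key = names.nextElement();
            sb.append(' ').append(key).append('=').append(attrs.getAttribute(key));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head>"
                + "<body><span class=\"price\">$7.95</span></body></html>";
        for (String event : dump(page)) {
            System.out.println(event);
        }
    }
}
```

Unlike the loop in the question, this records text independently of attributes, so a tag without attributes still shows up along with its content.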

gicappa · Answer 4 · 2010-02-18T16:33:59+0000

Another simple library you can use is JTidy, which can clean up your HTML before you parse it. Hope this helps.

http://sourceforge.net/projects/jtidy/

Ciao!

Anish · Answer 5 · 2013-06-05T06:44:48+0000

The Google page contains the tag <title>Google</title>, and I am trying to get the text content of the title tag, but I do not get the expected result. The build succeeds, yet the output is "TITLE". I need the output to be "GOOGLE".

import org.htmlparser.Parser;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.ParserException;

public class MM {
    public static void main(String[] args) {
        Parser parser = new Parser();
        try {
            parser.setResource("http://www.google.com");
            TitleTag title = new TitleTag();
            String tagtext = title.getTitle();
            System.out.println(tagtext);
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
