How are DOM parsers implemented?

Question

How are DOM parsers implemented?

Experience tells me that you should not use RegExp to parse HTML/XML, and I completely agree! It is:

Dirty
Not robust and easily broken
Pure evil

Everyone says "use a DOM parser" of some kind, which is fine by me. But now I am curious: how do those work?

I searched for the source of the DOMDocument class and could not find it.

This question comes from the fact that filter_var(), for example, is considered a good alternative to validating email addresses with RegExp, but when you look at its source, you will see that it actually uses RegExp itself!

So, if you were to build a DOM parser in PHP, how would you go about parsing the HTML? How did they do it?

+7

dom php

Madara uchiha May 05 '12 at 8:29

2 answers

The good news is you don't need to reinvent the wheel. The PHP DOMDocument extension is built on the libxml library, and its source code is available. I suggest looking there.
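
libxml is written in C and is far too large to summarize here, so what follows is only a rough Python sketch of the two stages most DOM parsers share: a tokenizer that emits start-tag, end-tag, and text events, and a tree builder that tracks the innermost open element. The tokenizer is borrowed from Python's standard html.parser module; Node, TreeBuilder, and parse are names made up for this illustration, and none of this claims to mirror how libxml is organized internally.

```python
from html.parser import HTMLParser


class Node:
    """A minimal DOM-like node: a tag name, attributes, children, and text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Turns the tokenizer's start/end/text events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root  # the innermost element that is still open

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name and close it.
        # An end tag with no matching open element is silently ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Text is attached to whichever element is currently open.
        self.current.text += data


def parse(markup):
    """Feed markup through the tokenizer and return the root of the tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root
```

Calling parse("<p>Hello <b>world</b></p>") returns a #document node whose only child is p, which in turn holds one b child. A real parser also has to deal with void elements such as <br>, entities, namespaces, and error recovery, all of which this sketch ignores.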

And by the way, regular expressions are not always wrong, but you need to use them correctly; otherwise you go straight to hell's kitchen, become a serial killer, or get a visit from Cthulhu, or whatever that guy is called. So I suggest the following read: REX: XML Shallow Parsing with Regular Expressions.

If you do it right, regular expressions will help you with parsing. You just have to know what you are doing.
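
To give a feel for what "shallow parsing" means, here is a small Python sketch that uses one regular expression to cut markup into comments, tags, and text runs without trying to build a tree. The pattern is a deliberately simplified stand-in, not the expressions from the REX paper, and TOKEN and shallow_tokens are names invented for this example.

```python
import re

# Simplified token pattern (an assumption for this sketch, NOT the REX grammar):
# a comment, or anything between "<" and ">", or a run of text up to the next "<".
TOKEN = re.compile(r"<!--.*?-->|<[^<>]+>|[^<]+", re.DOTALL)


def shallow_tokens(markup):
    """Split markup into (kind, text) pairs without building a tree."""
    tokens = []
    for match in TOKEN.finditer(markup):
        piece = match.group(0)
        if piece.startswith("<!--"):
            kind = "comment"
        elif piece.startswith("</"):
            kind = "end"
        elif piece.startswith("<"):
            # Self-closing tags and declarations also land here in this sketch.
            kind = "start"
        else:
            kind = "text"
        tokens.append((kind, piece))
    return tokens
```

shallow_tokens("<p>Hi</p>") yields a start tag, a text run, and an end tag. The regular expression only finds token boundaries; matching start tags to end tags still needs a stack, and this simple pattern is fooled by a ">" inside an attribute value.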

+1

hakre May 05 '12 at 17:14

Sampson · Accepted Answer · 2012-05-05T18:25:58+0000

I think you should check out the article How Browsers Work: Behind the Scenes of Modern Web Browsers. It is a long read, but well worth your time. See the HTML Parser section in particular.

While I can't do the article justice, perhaps a brief summary will tide you over until you have time to read and digest that masterpiece. I must admit that I am a novice in this area, with very little experience. Although I have developed for the web professionally for about 10 years, the way the browser processes and interprets my code has long been a black box.

HTML, XHTML, CSS, or JavaScript: take your pick. Each has a grammar as well as a vocabulary. Another great example is English. We have grammar rules that we expect people, books, and so on to follow. We also have a vocabulary made up of nouns, verbs, adjectives, etc.

Browsers interpret a document by examining its grammar as well as its vocabulary. When a browser encounters items that it ultimately does not understand, it will let you know (raising exceptions, etc.). We do the same thing when we read.

I like StackOverflow, but if I could changed one thing, it be would absolutamente ...

Notice in the example above how you immediately start parsing the words and the relationships between them. The beginning makes sense: "I like StackOverflow." Then we come to "... if I could changed," and we immediately stop. "Changed" does not belong here. Surely the author meant "change." The vocabulary is right, but the grammar is wrong. A little later we come across "be," which may also violate a grammar rule, and a little further on we encounter the word "absolutamente", which is not part of the English vocabulary: another error.

Think of all this in terms of a DOCTYPE. Right now I have the source of the XHTML 1.0 Strict Doctype open on my second monitor. Among its internals are lines like the following:

 <!ENTITY % heading "h1|h2|h3|h4|h5|h6">

This defines the heading entities. As long as I stick to XHTML, I can use any of them in my document ( <h1>Hello World</h1> ). But if I try to make one up, say h7 , the browser will stumble over the vocabulary as "foreign" and inform me:

"Line 7, Column 8: element "h7" undefined"

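The vocabulary check described above can be sketched roughly as follows. The heading names come straight from the entity declaration quoted earlier; the rest of KNOWN_ELEMENTS is a tiny hypothetical vocabulary chosen for the example, and the 1-based column pointing at the "<" is my own convention, not necessarily the one a real validator uses.

```python
import re

# The entity value quoted above, split into a set of known heading names.
HEADING = set("h1|h2|h3|h4|h5|h6".split("|"))

# Hypothetical, deliberately tiny vocabulary; a real DTD declares many more elements.
KNOWN_ELEMENTS = HEADING | {"html", "head", "title", "body", "p"}

# Matches the name of a start tag; end tags ("</...") are skipped on purpose.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")


def undefined_elements(markup):
    """Return (line, column, name) for each start tag not in the vocabulary."""
    problems = []
    for line_number, line in enumerate(markup.splitlines(), start=1):
        for match in START_TAG.finditer(line):
            name = match.group(1)
            if name not in KNOWN_ELEMENTS:
                # Column is 1-based and points at the "<" of the offending tag.
                problems.append((line_number, match.start() + 1, name))
    return problems
```

Fed a document that contains an h7 start tag, the function reports its line, column, and name, which is all the information the message above needs.
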
Perhaps while parsing a document we come across <table . We know that we are now dealing with a table element, which has its own vocabulary, such as tbody , tr , etc. As long as we know the language, its grammar rules, and so on, we know when something is wrong. Returning to the XHTML 1.0 Strict Doctype, we find the following:

 <!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
 <!ELEMENT caption %Inline;>
 <!ELEMENT thead (tr)+>
 <!ELEMENT tfoot (tr)+>
 <!ELEMENT tbody (tr)+>
 <!ELEMENT colgroup (col)*>
 <!ELEMENT col EMPTY>
 <!ELEMENT tr (th|td)+>
 <!ELEMENT th %Flow;>
 <!ELEMENT td %Flow;>

Given this reference, we can keep a running check against whatever source we are parsing. If the author writes tread instead of thead , we have a standard by which we can determine that it is an error. When problems cannot be resolved, and we cannot find rules that match certain uses of grammar and vocabulary, we inform the author that their document is invalid.
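
A DTD content model is essentially a regular expression over child element names, so the running check can be sketched with Python's re module. The patterns below are hand-translated from the <!ELEMENT ...> declarations quoted above; the "name," encoding, CONTENT_MODELS, and children_are_valid are assumptions made for this illustration, and the mixed-content models (%Inline; and %Flow;) are left out.

```python
import re

# Hand-translated from the declarations above. Each child name is written as
# "name," so that a whole sequence of children becomes one string to match.
CONTENT_MODELS = {
    "table": r"(caption,)?((col,)*|(colgroup,)*)(thead,)?(tfoot,)?((tbody,)+|(tr,)+)",
    "thead": r"(tr,)+",
    "tfoot": r"(tr,)+",
    "tbody": r"(tr,)+",
    "colgroup": r"(col,)*",
    "col": r"",  # EMPTY: no children allowed
    "tr": r"((th|td),)+",
}


def children_are_valid(element, child_names):
    """Check a list of child element names against the element's content model."""
    model = CONTENT_MODELS[element]
    sequence = "".join(name + "," for name in child_names)
    # fullmatch: the entire sequence of children must fit the model.
    return re.fullmatch(model, sequence) is not None
```

With this, children_are_valid("table", ["thead", "tbody"]) is accepted, while ["tread", "tbody"] is rejected because tread appears nowhere in the model, and ["tbody", "thead"] is rejected because the order is wrong.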

I am by no means doing this science justice, but I hope this serves, if nothing else, to move you to sit down and read the article referenced at the beginning of this answer, and perhaps to study the various DTDs we encounter every day.
