How to parse paired HTML tags correctly?

The question is about parsing an HTML stream obtained with load/markup so that you can get at the constituent parts of the HTML tags, i.e. when you find

<div id="one">my text</div> 

you should end up with something like <div id="one">, {my text} and </div> in the same container, for example

[<div id="one"> {my text} </div>] 

or even better

[<div> [id {one}] {my text} </div>]

The difficulty in parsing is matching up paired HTML tags. In HTML, a tag can be an empty tag, possibly with attributes but with no content and therefore no end tag, or a regular tag, possibly with attributes and content and therefore with an end tag. Both kinds are still just a tag.

I mean, when you find a sequence such as <p>a few words</p> you have a P tag, and in the same way, when you find a sequence such as <p/> you have just a P tag. In the first case you have associated text and an end tag, and in the second you don't. That's all.

In other words, attributes and content are properties of the tag element in HTML, so representing this in JSON gives you something like:

tag: { name: "div", attributes: { id: "one" }, content: "my text" }

This means you need to identify the content of a tag in order to assign it to the correct tag, which in parsing terms means identifying matching tags (opening tag and end tag).

In Rebol, you can easily parse an HTML sequence, for example:

<div id="yo">yeah!</div><br/>

with the rule:

[ some [ tag! string! tag! | tag! ]]

But this rule will match the HTML

<div id="yo">yeah!</div><br/> 

as well as

<div id="yo">yeah!</p><br/> 

as if they were the same.
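A quick way to see this for yourself, as a rough sketch assuming Rebol 2 (where load/markup is available). The words rule, good and bad are just illustrative names, not part of the question:

```rebol
REBOL []

; Naive rule from above: a tag/string/tag triple, or a lone tag.
rule: [some [tag! string! tag! | tag!]]

; load/markup splits the markup into a block of tag! and string! values.
good: load/markup {<div id="yo">yeah!</div><br/>}
bad:  load/markup {<div id="yo">yeah!</p><br/>}

; The rule only checks datatypes, never the tag names,
; so it has no way to tell the two inputs apart.
print parse good rule
print parse bad rule
```

Both calls should print the same result, since nothing in the rule compares the opening tag with the end tag.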

Unfortunately, Rebol (AFAIK) cannot parameterize a tag match by tag name, so you cannot write something like this:

[ some [ set t1 tag! set s string! set t2 tag!#t1/1 | tag! ] ]

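The t1/1 notation is there because Rebol keeps a tag's attributes at the same level as the tag name (and it does not treat matching opening and end tags as the same tag).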

Of course, you can get there with ordinary code, for example:

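tags: copy []
html: {<div id="yo">yeah!</p><br/>}
; load/markup turns the string into a block of tag! and string! values
parse load/markup html [
    some [
        set t1 tag! set s string! set t2 tag! (
            tag: first make block! t1    ; intended to pick out the tag name
            ; keep the pair only when the end tag contains that name
            if none <> find t2 tag [append/only tags reduce [t1 s]]
        )
    |
        tag! (append/only tags reduce [t1])
    ]
]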

but ideally this would be done more elegantly, with the parse dialect alone.


In Rebol's parse, you can match pairs of items simply by using a word:

parse ["a" "a"] [some [set s string! s ]]
parse ["a" "a" "b" "b"] [some [set s string! s ]]
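For contrast, here is a small sketch of the failing case, again assuming Rebol 2. The word pair-rule is an illustrative name:

```rebol
REBOL []

; Same rule as above: remember a string in s, then require an equal value next.
pair-rule: [some [set s string! s]]

print parse ["a" "a"] pair-rule   ; the string is followed by an equal one
print parse ["a" "b"] pair-rule   ; "b" is not equal to the remembered "a"
```

The second call should not succeed, because the word s stands for the literal value captured just before it.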

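But this does not work well with tags, because an opening tag (possibly with attributes) and an end tag (/) do not match each other: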

parse [<p> "some text" </p>] [some [ set t tag! set s string! t ]]
parse [<div id="d1"> "some text" </div>] [some [ set t tag! set s string! t ]]

These fail because </p> is not equal to <p>, and </div> is even less equal to <div id="d1">.

Again, you can work around it with code:

parse load/markup "<p>preug</p>something<br />" [
    some [
        set t tag! (
            b: copy t remove/part find b " " tail b
            insert b "/"
        )
        set s string!
        b (print [t s b])
    |
        tag!
    |
        string!
    ]
]

But this is no longer simple, zen-like code ;-)
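If the goal is to get each matched pair into its own container, as the question asks, the same workaround can be adapted roughly as follows. This is only a sketch assuming Rebol 2. The helper end-tag-of and the words pairs, pair-rule, e and pos are hypothetical names introduced here, and nested tags are not handled:

```rebol
REBOL []

; end-tag-of is a hypothetical helper: <div id="d1"> becomes </div>.
end-tag-of: func [open [tag!] /local name pos] [
    name: copy open
    if pos: find name " " [clear pos]   ; drop attributes, if there are any
    head insert name "/"                ; prefix the slash, return the whole tag
]

pairs: copy []
pair-rule: [
    some [
        set t tag! (e: end-tag-of t)
        set s string!
        e (append/only pairs reduce [t s e])   ; reached only when the end tag matches
    |
        tag!
    |
        string!
    ]
]

parse load/markup {<div id="d1">some text</div><br/>} pair-rule
probe pairs
```

Only sequences whose end tag matches the computed one are collected. Everything else falls through to the tag! and string! alternatives and is skipped.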
