Word count with embedded HTML in PHP

I have quite large paragraphs (5,000-6,000 words) containing text and inline HTML tags. I want to break each paragraph into pieces of 1,500 words, counting only the actual words and ignoring the HTML markup. Using the strip_tags function I can count the words without the markup, but I cannot figure out how to split the text into 1,500-word pieces that still include the markup. For instance, given:

 This is <b> a </b> paragraph which <a href="#"> has some </a> some text to be broken in <h1> 5 words </h1>. 

The result should be

 1 = This is <b> a </b> paragraph which
 2 = <a href="#"> has some </a> some text to
 3 = be broken in <h1> 5 words </h1>.
3 answers

Consider using the explode() function. Better, though more work, is a regular expression that matches either a single word or a tag together with all the text inside it. Treat each HTML element as an indivisible unit. For example, you can write a function that breaks your large paragraph into an array of entities like this:

 $data = array(
     array("count" => 2, "text" => "This is "),
     array("count" => 1, "text" => "<b> a </b>"),
     array("count" => 2, "text" => " paragraph which"),
     // ... etc.
 );

Then write a loop that builds the smaller paragraphs from the $data array.

In addition, you may not always be able to make a piece exactly 1,500 words long. It can come out slightly longer or shorter, because an HTML element must not be split.
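The idea above can be sketched in a single pass, without building the intermediate array. This is only a minimal sketch under some assumptions: the function name splitHtmlByWords is hypothetical, the input is UTF-8 with literal "<" characters escaped, the list of void elements is a short illustrative one, and a "word" is any run of non-space characters containing a letter or digit. A new piece is started only when no element is open, so a piece can exceed the limit.

```php
<?php
// Hypothetical helper (not from the answer above): split $html into pieces of
// roughly $limit words, never cutting inside an open element.
function splitHtmlByWords(string $html, int $limit): array
{
    // Elements that have no closing tag; this short list is an assumption.
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link', 'wbr'];

    // Tokenize into tags, runs of non-space text, and runs of whitespace.
    preg_match_all('/<[^>]+>|[^<\s]+|\s+/u', $html, $m);

    $chunks = [];
    $current = '';
    $words = 0; // words counted in the current piece
    $depth = 0; // number of elements currently open

    foreach ($m[0] as $token) {
        $isTag = $token[0] === '<';
        $isClosing = $isTag && strncmp($token, '</', 2) === 0;
        // Punctuation on its own (for example a lone ".") is not a word.
        $isWord = !$isTag && preg_match('/[\p{L}\p{N}]/u', $token) === 1;

        // Start a new piece only at top level, just before new content.
        if ($depth === 0 && $words >= $limit && ($isWord || ($isTag && !$isClosing))) {
            $chunks[] = trim($current);
            $current = '';
            $words = 0;
        }

        $current .= $token;

        if ($isWord) {
            $words++;
        } elseif ($isClosing) {
            $depth = max(0, $depth - 1);
        } elseif ($isTag) {
            // Opening tag: track it unless it is void or self-closing.
            preg_match('/^<\s*([a-z0-9]+)/i', $token, $name);
            $selfClosing = substr($token, -2) === '/>';
            if (!$selfClosing && isset($name[1]) && !in_array(strtolower($name[1]), $void, true)) {
                $depth++;
            }
        }
    }

    if (trim($current) !== '') {
        $chunks[] = trim($current);
    }
    return $chunks;
}
```

Calling splitHtmlByWords() on the example paragraph from the question with a limit of 5 yields the three pieces listed there. With real markup that nests deeply, pieces can grow well past the limit, since this sketch waits for every open element to close before it cuts.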


I think you will need to parse your html if you want to guarantee valid markup. In this case, this question should be a really useful starting point.


Use an XML DOM parser or an HTML DOM parser.

  • Iterate over all nodes
  • Count the words in each node
  • If the word count exceeds N:
    • create a new node of the same type as the parent
    • insert it as a sibling after the parent
    • move the current node and all of its following siblings into it
  • Go to the next element
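
A simplified variant of these steps can be sketched with PHP's DOMDocument (this assumes the DOM extension is available). Instead of creating new parent nodes, this sketch only groups the top-level nodes of the fragment into pieces: elements are kept whole and top-level text is cut between words. The names chunkHtmlWithDom and countWords are hypothetical, and the wrapping div is just a device to give the fragment a single root.

```php
<?php
// Hypothetical helper: count runs of non-space characters that contain
// at least one letter or digit.
function countWords(string $text): int
{
    return (int) preg_match_all('/\S*[\p{L}\p{N}]\S*/u', $text);
}

// Hypothetical helper: group the top-level nodes of an HTML fragment into
// pieces of roughly $limit words.
function chunkHtmlWithDom(string $html, int $limit): array
{
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    // The XML encoding hint makes libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    $root = $doc->getElementsByTagName('div')->item(0); // the wrapper

    $chunks = [];
    $current = '';
    $words = 0;

    // Append markup, starting a new piece first if the limit was reached.
    $add = function (string $markup, int $count) use (&$chunks, &$current, &$words, $limit) {
        if ($count > 0 && $words >= $limit) {
            $chunks[] = trim($current);
            $current = '';
            $words = 0;
        }
        $current .= $markup;
        $words += $count;
    };

    foreach ($root->childNodes as $node) {
        if ($node instanceof DOMText) {
            // Top-level text can be cut anywhere between words.
            $parts = preg_split('/(\s+)/u', $node->nodeValue, -1,
                PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
            foreach ($parts as $part) {
                $add(htmlspecialchars($part), countWords($part));
            }
        } else {
            // Elements stay whole; other node types add no words.
            $count = $node instanceof DOMElement ? countWords($node->textContent) : 0;
            $add($doc->saveHTML($node), $count);
        }
    }

    if (trim($current) !== '') {
        $chunks[] = trim($current);
    }
    return $chunks;
}
```

Because the parser rebuilds the markup, each piece is serialized from well-formed nodes, though the output may differ from the input in details such as attribute quoting or entity escaping. Splitting inside a large element would need the node-moving steps from the list above, which this sketch leaves out.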