PHP substr() function that lets you set a start and end point and preserves HTML formatting?

Question

With the normal substr() function in PHP, you can choose where to start cutting the string and also set a length. The length is probably used most often, but in this case I need to trim about 120 characters from the start. The problem is that I need to keep the HTML in the string intact and cut only the actual text inside the tags.

I found several user-defined functions for this, but none that lets you set the starting point, i.e. where you want to start cutting the string.

Here is what I found: Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

So I basically need a substr() function that works exactly like the original, except that it preserves the formatting.

Any suggestions?

Example content to change:

 <p>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>

After cutting 5 characters from the start:

 <p>ary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>

And with 5 characters cut from both the beginning and the end:

 <p>ary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.1</p>

Do you catch my drift?

I would prefer that it cut off the whole word when the cut lands in the middle of one, but this is not very important.

Edit: Fixed quotes.

+4

html split php formatting substr

qwerty Jan 03 '13 at 14:14

3 answers

Here's a start using DOMDocument (an XML/HTML parser), RecursiveIteratorIterator (to make it easy to traverse recursive structures), and custom Iterator implementations for DOMNodeList that play nicely with RecursiveIteratorIterator.

All of this is pretty sloppy (it doesn't return a copy, but acts on the DOMNode/DOMDocument in place), and it lacks the fancier features of the ordinary substr(), such as negative values for $start and/or $length, but it seems to do the job. I am sure there are bugs. Still, this should give you an idea of how to do this with DOMDocument.
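Since negative $start and $length are not handled here, one option is to convert them to non-negative values before calling a DOM-based function. Below is a minimal sketch of that conversion, assuming PHP 7.1+ and that the text length is measured on the text content only (markup excluded). The helper name normalize_substr_range is hypothetical and not part of this answer.

```php
<?php
// Hypothetical helper: converts substr()-style $start/$length, which may be
// negative, into a non-negative [start, length] pair for a text that is
// $textLength characters long.
function normalize_substr_range(int $textLength, int $start, ?int $length = null): array
{
    // A negative start counts back from the end of the text.
    if ($start < 0) {
        $start = max($textLength + $start, 0);
    }
    $start = min($start, $textLength);

    if ($length === null) {
        // No length given: keep everything up to the end.
        $length = $textLength - $start;
    } elseif ($length < 0) {
        // A negative length drops that many characters from the end.
        $length = max($textLength + $length - $start, 0);
    } else {
        // A positive length cannot run past the end of the text.
        $length = min($length, $textLength - $start);
    }

    return [$start, $length];
}

// "Cut 5 from the beginning and 5 from the end" of a 20-character text:
[$start, $length] = normalize_substr_range(20, 5, -5);
echo "$start, $length\n"; // prints "5, 10"
```

The resulting pair could then be passed as the $start and $length arguments of a function like DOMSubstr().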

Custom Iterators:

    class DOMNodeListIterator implements Iterator
    {
        protected $domNodeList;
        protected $position;

        public function __construct( DOMNodeList $domNodeList )
        {
            $this->domNodeList = $domNodeList;
            $this->rewind();
        }

        public function valid()
        {
            return $this->position < $this->domNodeList->length;
        }

        public function next()
        {
            $this->position++;
        }

        public function key()
        {
            return $this->position;
        }

        public function rewind()
        {
            $this->position = 0;
        }

        public function current()
        {
            return $this->domNodeList->item( $this->position );
        }
    }

    class RecursiveDOMNodeListIterator extends DOMNodeListIterator implements RecursiveIterator
    {
        public function hasChildren()
        {
            return $this->current()->hasChildNodes();
        }

        public function getChildren()
        {
            return new self( $this->current()->childNodes );
        }
    }

Actual function:

    function DOMSubstr( DOMNode $domNode, $start = 0, $length = null )
    {
        // nothing to do if the requested range covers the whole text
        // (strict comparison, so a $length of 0 is not mistaken for "no length")
        if( $start == 0 && ( $length === null || $length >= strlen( $domNode->nodeValue ) ) )
        {
            return;
        }

        $nodesToRemove = array();

        $rii = new RecursiveIteratorIterator(
            new RecursiveDOMNodeListIterator( $domNode->childNodes ),
            RecursiveIteratorIterator::SELF_FIRST
        );

        foreach( $rii as $node )
        {
            if( $start <= 0 && $length !== null && $length <= 0 )
            {
                /* can't remove immediately
                 * because this will mess with
                 * iterating over RecursiveIteratorIterator
                 * so remember for removal, later on
                 */
                $nodesToRemove[] = $node;
                continue;
            }

            if( $node->nodeType == XML_TEXT_NODE )
            {
                if( $start > 0 )
                {
                    // still skipping: drop leading characters from this text node
                    $count = min( $node->length, $start );
                    $node->deleteData( 0, $count );
                    $start -= $count;
                }

                if( $start <= 0 )
                {
                    if( $length === null )
                    {
                        // no length limit: keep everything that follows
                        break;
                    }
                    else if( $length <= 0 )
                    {
                        continue;
                    }
                    else if( $length >= $node->length )
                    {
                        // whole node fits in the remaining length
                        $length -= $node->length;
                        continue;
                    }
                    else
                    {
                        // the range ends inside this node: trim its tail
                        $node->deleteData( $length, $node->length - $length );
                        $length = 0;
                    }
                }
            }
        }

        foreach( $nodesToRemove as $node )
        {
            $node->parentNode->removeChild( $node );
        }
    }

Usage:

    $html = <<<HTML
    <p>Just a short text sample with <a href="#">a link</a> and some trailing elements such as <strong>strong text</strong>, <em>emphasized text</em>, <del>deleted text</del> and <ins>inserted text</ins></p>
    HTML;

    $dom = new DOMDocument();
    $dom->loadHTML( $html );

    /*
     * this is particularly sloppy:
     * I pass $dom->firstChild->nextSibling->firstChild (ie <body>)
     * because the function uses strlen( $domNode->nodeValue )
     * which will be 0 for DOMDocument itself
     * and I didn't want to utilize DOMXPath in the function
     * but perhaps I should have
     */
    DOMSubstr( $dom->firstChild->nextSibling->firstChild, 8, 25 );

    /*
     * passing a specific node to DOMDocument::saveHTML()
     * only works with PHP >= 5.3.6
     */
    echo $dom->saveHTML( $dom->firstChild->nextSibling->firstChild->firstChild );

+1

Decent dabbler Jan 03 '13 at 16:59

You can try this, though its runtime makes it a poor fit for longer text.

but in this case I need to trim about 120 characters from the start.

Exactly that. Enter the text, or take it from somewhere, and set the number of characters it should erase from the very beginning.

And please, I can't stress this enough: it's a solution for short strings and it's not the best way to do it, but it's fully working code!

    <?php
    $text = "<a href='blablabla'>m</a>ylinks...<b>not this code is working</b>......";
    $newtext = "";
    $delete = 13;
    $tagopen = false;

    while ($text != "") {
        // take the next character off the front of the string
        $checktag = $text[0];
        $text = substr($text, 1);

        if ($checktag == "<" || $tagopen == TRUE) {
            // inside a tag: always copy, never count
            $newtext .= $checktag;
            if ($checktag == ">") {
                $tagopen = FALSE;
            } else {
                $tagopen = TRUE;
            }
        } elseif ($delete > 0) {
            // text character that still has to be dropped
            $delete = $delete - 1;
        } else {
            $newtext .= $checktag;
        }
    }

    echo $newtext;
    ?>

It returns:

 <a href='blablabla'></a><b> this code is working</b>......

0

Yunalescar Jan 03 '13 at 14:53

Francis avila · Accepted Answer · 2013-01-04T00:31:52+0000

What you are asking for (essentially, generating a valid HTML subset from a string offset) involves so many complications that it would be better to reformulate the problem in terms of the number of text characters you want to keep, rather than as cutting an arbitrary string that happens to contain HTML. If you do, the problem becomes much simpler because you can use a real HTML parser. You then don't have to worry about:

Accidentally cutting elements in half.
Accidentally cutting entities in half.
Not counting the text inside elements.
Ensuring that a character entity is counted as a single character.
Making sure all elements are properly closed.
Making sure you don't destroy the string by using substr() on a UTF-8 string.

This can be done with regular expressions (using the u flag), mb_substr(), and a tag stack (I have done it before), but there are many edge cases and it is generally tedious.
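To make a few of those pitfalls concrete, here is a small illustrative sketch (the sample string is made up, and the mbstring extension is assumed to be available). It shows a plain substr() splitting a tag, an entity that is 8 characters of markup but 1 character of text, and a byte-based cut landing inside a multibyte character.

```php
<?php
// Made-up sample: one entity, one nested element.
$html = '<p>Caf&eacute; <strong>menu</strong></p>';

// 1. Cutting the raw markup by offset can split a tag in half.
$broken = substr($html, 0, 20);
echo $broken, "\n"; // prints "<p>Caf&eacute; <stro"

// 2. Decoded text is what should be counted: "&eacute;" becomes one character.
$text  = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
$chars = mb_strlen($text, 'UTF-8'); // 9 characters
$bytes = strlen($text);             // 10 bytes, because "é" takes 2 bytes in UTF-8
echo "$chars characters, $bytes bytes\n";

// 3. substr() counts bytes, so it can cut "é" in half; mb_substr() counts characters.
$cutBytes = substr($text, 0, 4);
$cutChars = mb_substr($text, 0, 4, 'UTF-8');
var_dump(mb_check_encoding($cutBytes, 'UTF-8')); // bool(false): invalid UTF-8
var_dump(mb_check_encoding($cutChars, 'UTF-8')); // bool(true)
```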

However, the DOM solution is quite simple: walk all the text nodes, count their string lengths, and remove or trim their text content as necessary. The code below does this:

    $html = <<<'EOT'
    <p>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>
    EOT;

    function substr_html($html, $start, $length=null, $removeemptyelements=true) {
        if (is_int($length)) {
            if ($length===0) return '';
            $end = $start + $length;
        } else {
            $end = null;
        }
        $d = new DOMDocument();
        $d->loadHTML('<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><title></title></head><body>'.$html.'</body>');
        $body = $d->getElementsByTagName('body')->item(0);
        $dxp = new DOMXPath($d);
        $t_start = 0; // text node start pos relative to all text
        $t_end = null; // text node end pos relative to all text

        // copy because we may modify result of $textnodes
        $textnodes = iterator_to_array($dxp->query('/descendant::*/text()', $body));

        // PHP 5.2 doesn't seem to implement Traversable on DOMNodeList,
        // so `iterator_to_array()` won't work. Use this instead:
        // $textnodelist = $dxp->query('/descendant::*/text()', $body);
        // $textnodes = array();
        // for ($i = 0; $i < $textnodelist->length; $i++) {
        //     $textnodes[] = $textnodelist->item($i);
        // }
        // unset($textnodelist);

        foreach($textnodes as $text) {
            $t_end = $t_start + $text->length;
            $parent = $text->parentNode;
            if ($start >= $t_end || ($end!==null && $end < $t_start)) {
                // node lies entirely outside the requested range
                $parent->removeChild($text);
            } else {
                $n_offset = max($start - $t_start, 0);
                // substringData() takes a count, not an end position,
                // so subtract the offset that was already skipped
                $n_length = ($end===null) ? $text->length : $end - $t_start - $n_offset;
                if (!($n_offset===0 && $n_length >= $text->length)) {
                    $substr = $text->substringData($n_offset, $n_length);
                    if (strlen($substr)) {
                        $text->deleteData(0, $text->length);
                        $text->appendData($substr);
                    } else {
                        $parent->removeChild($text);
                    }
                }
            }

            // if removing this text emptied the parent of nodes, remove the node!
            if ($removeemptyelements && !$parent->hasChildNodes()) {
                $parent->parentNode->removeChild($parent);
            }

            $t_start = $t_end;
        }
        unset($textnodes);
        $newstr = $d->saveHTML($body);

        // mb_substr() is to remove <body></body> tags
        return mb_substr($newstr, 6, -7, 'utf-8');
    }

    echo substr_html($html, 480, 30);

This will output:

 <p> of "de Finibus</p> <p>Bonorum et Mal</p>

Note that it is not thrown off by the fact that your "substring" spans multiple p elements.
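The question also mentions a preference for dropping a whole word when the cut lands in the middle of one. A rough sketch of one way to do that, assuming the offset is adjusted on the plain text (for example the body's text content) before it is handed to a DOM-based substring function. The helper name snap_start_to_word is hypothetical, and whitespace is the only word separator it recognizes.

```php
<?php
// Hypothetical helper: if $start falls inside a word, move it forward to the
// end of that word so the partial word is dropped entirely.
function snap_start_to_word(string $text, int $start): int
{
    $len = mb_strlen($text, 'UTF-8');
    if ($start <= 0 || $start >= $len) {
        // Clamp out-of-range offsets; there is nothing to snap.
        return max(0, min($start, $len));
    }

    // True when the character at index $i is whitespace.
    $isSpace = function (int $i) use ($text): bool {
        return preg_match('/^\s$/u', mb_substr($text, $i, 1, 'UTF-8')) === 1;
    };

    // The cut is only mid-word when both neighbouring characters are non-whitespace.
    if ($isSpace($start - 1) || $isSpace($start)) {
        return $start;
    }

    // Advance to the whitespace that ends the current word (or to the end of the text).
    while ($start < $len && !$isSpace($start)) {
        $start++;
    }
    return $start;
}

$text = 'Contrary to popular belief';
echo snap_start_to_word($text, 5), "\n"; // prints 8: the rest of "Contrary" is dropped too
```

The same idea works in reverse for the end offset, by stepping backward to the previous whitespace.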
