Process HTML block ignoring content in specific tags

Question

On a blog, I want to pass all the text of a blog entry through a PHP script that converts quotes and some other elements into proper typographic characters.

The blog text contains HTML; in particular, code snippets are wrapped in <pre><code> ... </code></pre> blocks so they can be highlighted. Code blocks can appear anywhere in the text, and more than once (like on Stack Overflow!).

I do not want these code blocks to be processed by the typographic scripts I will be using. The processing itself is not the issue; what matters is being able to apply it selectively.

I was able to write a regex to find these blocks:

preg_match_all('/(<pre><code>(.*?)<\/code><\/pre>)/s', $text, $matches);

But I'm not sure of the best way to process the rest of the text and then put these blocks back in their correct places.

Thank you for your help!

php regex

Darren Newton Jul 20 '09 at 19:25

4 answers

You can get the code blocks with preg_match_all() and split the text around them with preg_split():

$pattern = '/(<pre><code>(.*?)<\/code><\/pre>)/s';

// get the code blocks
preg_match_all($pattern, $text, $matches);
$code_blocks = $matches[0];

// split up the text around the code blocks into an array
$unprocessed = preg_split($pattern, $text);
$processed_text = '';
foreach($unprocessed as $block) {

    // process the text here
    $processed_text .= process($block); 

    // add the next code block
    if(!empty($code_blocks)) $processed_text .= array_shift($code_blocks);
}

// any remaining
$processed_text .= implode('', $code_blocks);

Here process() stands for whatever processing you want to apply to the text outside the code blocks.
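The same split-and-reassemble idea can also be written with a single preg_split() call. This is only a sketch, not part of the original answer: the function name process_around_code_blocks() and the callable parameter are made up for illustration. It relies on PREG_SPLIT_DELIM_CAPTURE, which makes preg_split() return the captured delimiters as well, so text and code blocks alternate in the result.

```php
<?php
// Hypothetical helper (name and signature are assumptions for this sketch).
function process_around_code_blocks(string $text, callable $process): string
{
    // A single capturing group wraps the whole code block. With
    // PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // text, code, text, code, ..., text (text parts may be empty strings).
    $parts = preg_split(
        '#(<pre><code>.*?</code></pre>)#s',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $out = '';
    foreach ($parts as $i => $part) {
        // Even indexes are ordinary text, odd indexes are code blocks.
        $out .= ($i % 2 === 0) ? $process($part) : $part;
    }
    return $out;
}

// Usage: strtoupper stands in for the real typographic processing.
echo process_around_code_blocks('a "b" <pre><code>"c"</code></pre> d', 'strtoupper');
```

Because the order of the parts is preserved by the split itself, there is no second array of code blocks to keep in sync.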

You could also look at HTMLPurifier for working with the HTML.

rojoca 21 . '09 0:26

You could look at something like Textile for producing the HTML.

Jesse Kochis 20 . '09 19:55

If you just want to convert quotes or a small list of elements, I would simply use str_replace.

$text = <<<HEREDOC
Some code here
HEREDOC;

// Example mapping only (straight quotes to HTML entities); substitute your own pairs
$search_and_replace = array('"' => '&quot;', "'" => '&#39;');
$newtest = str_replace(array_keys($search_and_replace), $search_and_replace, $text);

Unless you're looking for something like strip_tags that lets you specify the HTML tags you want to keep.

Brent Baisley Jul 20 '09 at 19:56

Pascal MARTIN · Accepted Answer · 2009-07-20T19:43:41+0000

The first solution that comes to my mind is as follows:

extract all the code blocks
remove the code blocks, replacing each one with a special marker that your string manipulation will not affect; this marker should be really distinctive (and you can check that it is absent from the input string, by the way)
do your string manipulation
put the code blocks back where the markers are.

In code, it could look something like this (sorry, it is quite long, and I did not include any checks; it is up to you to add them):

$str = <<<A
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec sodales lacus et erat accumsan consectetur. Sed lacinia enim vitae erat suscipit fermentum. Quisque lobortis nisi et lacus imperdiet ac malesuada dui imperdiet. <pre><code>ThIs Is 
CoDe 1</code></pre>Donec vestibulum commodo quam rhoncus luctus. Nam vitae ipsum sed nibh dignissim condimentum. Sed ultrices fermentum dapibus. Vivamus mattis nisi nec enim convallis quis aliquet arcu accumsan. Suspendisse potenti. Nullam eget fringilla nunc. Nulla porta justo justo. Nunc consectetur egestas malesuada. Mauris ac nisi ipsum, et accumsan lorem. Quisque interdum accumsan pellentesque. Sed at felis metus. Nulla gravida tincidunt tortor, <pre><code>AnD cOdE 2</code></pre>nec aliquam tortor ultricies vel. Integer semper libero eu magna congue eget lacinia purus auctor. Nunc volutpat ultricies feugiat. Nullam id mauris eget ipsum ultricies ullamcorper non vel risus. Proin volutpat volutpat interdum. Nulla orci odio, ornare sit amet ullamcorper non, condimentum sagittis libero. <pre><code>aNd
CoDe
NuMbEr 3
</code></pre>Ut non justo at neque convallis luctus ultricies amet. 
A;
var_dump($str);

// Extract the codes
$matches = array();
preg_match_all('#<pre><code>(.*?)</code></pre>#s', $str, $matches);
var_dump($matches);

// Remove the codes
$str_nocode = preg_replace('#<pre><code>.*?</code></pre>#s', 'THIS_IS_A_NOCODE_MARKER', $str);
var_dump($str_nocode);

// Do whatever you want with $str_nocode
$str_nocode = strtoupper($str_nocode);
var_dump($str_nocode);

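// And put back the codes, one marker at a time.
// A callback is used so that "$1" or "\1" inside a code block is not
// interpreted as a backreference, as it would be in a replacement string.
$codes = $matches[0];
$str_codes = preg_replace_callback('#THIS_IS_A_NOCODE_MARKER#', function ($m) use (&$codes) {
    return array_shift($codes);
}, $str_nocode);
var_dump($str_codes);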
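The steps mention checking that the marker is absent from the input; the code above leaves that check out. A minimal sketch of one way to add it follows. The function name process_with_placeholders() and the marker format are assumptions made for illustration, and the sketch assumes the processing step leaves digits and "@" characters untouched.

```php
<?php
// Hypothetical helper (name, signature and marker format are assumptions).
function process_with_placeholders(string $text, callable $process): string
{
    // Pick a marker prefix made of digits and "@" only, and retry until it
    // does not occur anywhere in the input.
    do {
        $prefix = '@@' . random_int(100000, 999999) . '@@';
    } while (strpos($text, $prefix) !== false);

    // Replace each code block with its own numbered marker and remember it.
    $codes = [];
    $stripped = preg_replace_callback(
        '#<pre><code>.*?</code></pre>#s',
        function (array $m) use (&$codes, $prefix) {
            $key = $prefix . count($codes) . '@@';
            $codes[$key] = $m[0];
            return $key;
        },
        $text
    );

    $processed = $process($stripped);

    // strtr() does plain string replacement, so "$1" or backslashes inside
    // a code block are restored literally.
    return $codes ? strtr($processed, $codes) : $processed;
}

// Usage: strtoupper stands in for the real typographic processing.
echo process_with_placeholders('say "hi" <pre><code>say $1 "x"</code></pre> end', 'strtoupper');
```

Giving each block its own numbered marker means the blocks are restored by key rather than by position, so the result does not depend on the markers keeping their original order.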

I tried with:

code on one line,
code on two lines,
code on more lines.

As a side note: when working with HTML in general, maybe something based on DOMDocument::loadHTML would be worth considering?
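For what a DOMDocument::loadHTML approach might look like, here is a rough sketch; it is not from the original answer. The function name process_text_nodes() and the wrapper id "fragment-root" are invented for illustration. It assumes the PHP DOM extension is available, that the input is a reasonably well-formed fragment, and it ignores character-encoding handling (loadHTML does not assume UTF-8 by default). Note that the callback receives decoded text, and that saveHTML() re-escapes characters such as & and < on output.

```php
<?php
// Hypothetical helper (name, signature and wrapper id are assumptions).
function process_text_nodes(string $html, callable $process): string
{
    $doc = new DOMDocument();

    // Collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a container so it can be found again after parsing.
    $doc->loadHTML('<div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Every text node under the wrapper that is not inside <pre> or <code>.
    $query = '//div[@id="fragment-root"]'
           . '//text()[not(ancestor::pre) and not(ancestor::code)]';
    foreach ($xpath->query($query) as $textNode) {
        $textNode->nodeValue = $process($textNode->nodeValue);
    }

    // Serialize the wrapper's children, leaving the wrapper itself out.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: strtoupper stands in for the real typographic processing.
echo process_text_nodes('hello <pre><code>keep me</code></pre> world', 'strtoupper');
```

Compared with the regex versions, this also skips code that is nested differently (for example a bare <code> element), at the cost of the parser normalizing the markup it returns.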
