Replace any "\" characters that are *not* inside "<code>" tags
First of all: this, this, this, and this did not answer my question. Therefore, I am opening a new one.
Please read
Good. I know that regular expressions are not the way to parse arbitrary HTML. Note that the documents in question are written using a limited, controlled subset of HTML, and the people who write them know what they are doing. They are all IT professionals!
Given the controlled syntax, you can parse the documents that I have here using regular expressions.
I am not trying to download arbitrary documents from the Internet and parse them!
And if the parsing fails, the document is edited until it parses. The problem that I am addressing here is more general than this (i.e. "do not replace a pattern when it lies between two other patterns").
A bit of background (you can skip this ...)
At our office, we are supposed to "print" our documentation. That is why some people came up with the idea of putting it all into Word documents. Fortunately, we are not quite there yet. And if I get this done, we may not need to go there.
Current state (... and this)
Most documents are stored in the TikiWiki database. I created a quick-and-dirty PHP script that converts documents from HTML (via LaTeX) to PDF. One of the required features of the selected wiki system was a WYSIWYG editor, which, as expected, leaves us with documents that have a less-than-formal DOM.
Therefore, I transliterate the document using "simple" regular expressions. So far, everything is working (mostly), but I ran into one problem that I haven't figured out yet myself.
Problem
Some special characters need to be replaced with LaTeX markup. For example, the \ character should be replaced with $\backslash$ (unless someone knows a better solution?).
Except inside verbatim blocks!
I am replacing <code> tags with verbatim tags. But if this code block contains a backslash (as is the case with Windows folder names), the script still replaces these backslashes.
I believe I can solve this using negative LookBehinds and / or LookAheads. But my attempts did not work.
Of course, I would be better off with a real parser. In fact, that is on my mind map, but for now it is out of scope. The script works well enough for our limited area of expertise. Creating a parser would require me to start largely from scratch.
My attempt
Input example
    The Hello \ World document is located in: <code>C:\documents\hello_world.txt</code>

Expected Result

    The Hello $\backslash$ World document is located in: \begin{verbatim}C:\documents\hello_world.txt\end{verbatim}

This is the best I could come up with so far:

    <?php
    $patterns = array(
        "special_chars2" => array(
            '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U',
            '$\\backslash$'),
    );
    foreach( $patterns as $name => $p ){
        $tex_input = preg_replace( $p[0], $p[1], $tex_input );
    }
    ?>

Note that this is just an excerpt, and [^$] is another LaTeX requirement.
Another attempt that seemed to work:

    <?php
    $patterns = array(
        "special_chars2" => array(
            '/\\\\[^$](?!.*<\/code>)/U',
            '$\\backslash$'),
    );
    foreach( $patterns as $name => $p ){
        $tex_input = preg_replace( $p[0], $p[1], $tex_input );
    }
    ?>

In other words: leaving out the negative lookbehind.
But this looks more error-prone than the version with both lookbehind and lookahead.
Related question
As you may have noticed, the pattern is ungreedy ( /.../U ). So will it match as little as possible inside a <code> block, given the lookarounds?
If it were me, I would try to find an HTML parser and do it with that.
Another option is to split the string into <code>.*?</code> sections and the other parts, update only the other parts, and then recombine everything.
    $x="The Hello \ World document is located in:\n<br> <code>C:\documents\hello_world.txt</code>";
    // With PREG_SPLIT_DELIM_CAPTURE the <code> sections land at the odd indexes.
    $r=preg_split("/(<code>.*?<\/code>)/", $x,-1,PREG_SPLIT_DELIM_CAPTURE);
    // Only touch the even indexes, i.e. the text outside <code>.
    for($i=0;$i<count($r);$i+=2)
        $r[$i]=str_replace("\\","$\\backslash$",$r[$i]);
    $x=implode($r);
    echo $x;

Here are the results.

    The Hello $\backslash$ World document is located in: C:\documents\hello_world.txt

Sorry if my approach doesn't suit you.
"I suppose I could solve this using negative LookBehinds and / or LookAheads."
You're wrong. Regular expressions are not a replacement for a parser.
I would suggest you clean up the HTML with htmltidy, then read it with a DOM parser, and then convert the DOM to your target output format. Is there anything preventing you from taking this route?
Parser FTW, OK. But if you cannot use the parser, and you can be sure that the <code> tags are never nested, you can try the following:
1. Locate the <code>.*?</code> sections of your file (you may need to enable dot-matches-newlines mode).
2. Replace all backslashes inside each such section with something unique, such as #?#?#?#
3. Replace the section found in 1 with this new section.
4. Replace all backslashes with $\backslash$
5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
6. Replace #?#?#?# with \
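The steps above can be sketched in PHP roughly as follows. This is only a minimal sketch: the function name is made up for illustration, and it assumes that the placeholder string never occurs in the documents and that <code> tags carry no attributes and are never nested.

```php
<?php
// Hypothetical helper; the name and the placeholder are illustrative only.
function html_to_tex_backslashes(string $html): string
{
    $placeholder = '#?#?#?#'; // assumed never to occur in a document

    // Steps 1-3: protect the backslashes inside each <code>...</code> section.
    // The "s" modifier lets "." match newlines.
    $out = preg_replace_callback(
        '~<code>.*?</code>~s',
        function (array $m) use ($placeholder): string {
            return str_replace('\\', $placeholder, $m[0]);
        },
        $html
    );

    // Step 4: every backslash still present lies outside a code section.
    $out = str_replace('\\', '$\\backslash$', $out);

    // Step 5: swap the tags for the verbatim environment.
    $out = str_replace(
        ['<code>', '</code>'],
        ['\\begin{verbatim}', '\\end{verbatim}'],
        $out
    );

    // Step 6: restore the protected backslashes.
    return str_replace($placeholder, '\\', $out);
}

$input = 'The Hello \\ World document is located in: '
       . '<code>C:\\documents\\hello_world.txt</code>';
echo html_to_tex_backslashes($input), "\n";
```

The order matters: the tags are swapped only after step 4, so the backslashes of \begin and \end are never escaped themselves.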
FYI, regular expressions in PHP do not support variable-length lookbehind. This makes conditional matching between two boundaries difficult.
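One possible way around the missing variable-length lookbehind is to avoid lookarounds altogether: match whole code sections and stray backslashes in a single alternation, and let a callback decide which of the two was hit. The sketch below is not from the question; the function name is hypothetical, and it assumes that <code> tags have no attributes and are never nested.

```php
<?php
// Hypothetical helper name; a sketch, not a drop-in for the original script.
function escape_backslashes_outside_code(string $html): string
{
    return preg_replace_callback(
        // Either a complete code section (group 1) or a single backslash.
        '~(<code>.*?</code>)|\\\\~s',
        function (array $m): string {
            // Group 1 is only present when a whole code section matched;
            // return it untouched.
            if (!empty($m[1])) {
                return $m[0];
            }
            // Otherwise the match is a backslash outside of <code>.
            return '$\\backslash$';
        },
        $html
    );
}
```

Because the regex engine consumes each code section as one match, the backslashes inside it are never offered to the second branch.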
Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your scraping PHP scripts?
With your "expected input" and the command pandoc -o text.tex test.html, the output is:

    The Hello \textbackslash{} World document is located in: \verb!C:\documents\hello_world.txt!

pandoc can read from stdin, write to stdout, or write directly to a file.
If your <code> blocks are not nested, this regular expression will find a backslash after the start of the string (^) or a </code>, with no <code> in between:

    ((?:^|</code>)(?:(?!<code>).)+?)\\
     |            |                 |
     |            |                 \-- backslash
     |            \-- least amount of anything not followed by <code>
     \-- start-of-string or </code>

And replace it with:

    $1$\backslash$

You need to run this regular expression in single-line mode, so that . matches newlines. You will also have to run it several times; specifying a global replacement is not enough. Each pass replaces only the first valid backslash after the start of the string or after a </code>.
Write a parser based on an HTML or XML parser, such as DOMDocument. Traverse the parsed DOM and replace \ with $\backslash$ in every text node that is not a descendant of a code node, and replace every code node with \begin{verbatim} … \end{verbatim}.
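A minimal sketch of that idea with DOMDocument and DOMXPath might look like the following. The function name and the wrapping div are assumptions made for illustration; the sketch also assumes the DOM extension is installed, that <code> elements are not nested, and that the input is plain ASCII (loadHTML needs extra care for other encodings). All other tags are simply dropped, because only the text content is returned.

```php
<?php
// Hypothetical helper; a sketch under the assumptions stated above.
function html_fragment_to_tex(string $html): string
{
    // Collect libxml warnings about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment so that there is one known element to read back from.
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Escape backslashes in text nodes that are not inside <code>.
    foreach ($xpath->query('//text()[not(ancestor::code)]') as $text) {
        $text->nodeValue = str_replace('\\', '$\\backslash$', $text->nodeValue);
    }

    // Replace each <code> element with a verbatim environment.
    // The node list is copied first because the tree is modified in the loop.
    foreach (iterator_to_array($xpath->query('//code')) as $code) {
        $tex = '\\begin{verbatim}' . $code->textContent . '\\end{verbatim}';
        $code->parentNode->replaceChild($doc->createTextNode($tex), $code);
    }

    return $xpath->query('//div[@id="root"]')->item(0)->textContent;
}
```

The text nodes are escaped before the code elements are replaced, so the backslashes of the generated \begin and \end are never touched.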