How to find / replace text in html while maintaining html tags / structure

Question

I use regular expressions to transform text the way I want, but I want to preserve the HTML tags. For example, if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag>, I should get stack <sometag>underflow</sometag> (i.e. the string substitution is done, but the tags are still there ...

+6

python html html-parsing

vbfoobar Dec 6 '09 at 17:44

6 answers

meder omuraliev · Answer 1 · 2009-12-06T17:46:11+0000

Use a DOM library, not regular expressions, when working with HTML:

lxml: a parser, document, and HTML serializer. It can also use BeautifulSoup and html5lib for parsing.
BeautifulSoup: a parser, document, and HTML serializer.
html5lib: a parser. It has a serializer.
ElementTree: a document object and XML serializer.
cElementTree: a document object implemented as a C extension.
HTMLParser: a parser.
Genshi: includes a parser, document, and HTML serializer.
xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ .

Of these, I would recommend lxml, html5lib, and BeautifulSoup.
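
As a rough illustration of the parser-based approach, here is a minimal sketch that uses only the standard library's HTMLParser (html.parser in Python 3). The names TextReplacer and replace_text are made up for this example. It re-emits tags unchanged and applies the regular expression to text nodes only, so a pattern that spans a tag boundary will not match.

```python
import re
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Hypothetical helper: copy markup through, substitute only in text."""

    def __init__(self, pattern, replacement):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.replacement = replacement
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Only text between tags is subject to the substitution.
        self.out.append(self.pattern.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_text(html, pattern, replacement):
    parser = TextReplacer(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_text('<p class="x">stack overflow</p>', r"\boverflow\b", "underflow"))
```

Note that replace_text('stack <sometag>overflow</sometag>', 'stack overflow', 'stack underflow') returns its input unchanged, because the phrase is split across two text nodes.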

duffymo · Answer 2 · 2009-12-06T17:46:16+0000

Beautiful Soup or HTMLParser is your answer.

akaihola · Answer 3 · 2009-12-06T21:14:12+0000

Please note that arbitrary replacements cannot be made unambiguously. Consider the following examples:

1)

HTML:

 A<tag>B</tag>

Pattern → Replacement:

 AB -> AXB

Possible results:

 AX<tag>B</tag>
 A<tag>XB</tag>

2)

HTML:

 A<tag>A</tag>A

Pattern → Replacement:

 A+ -> WXYZ

Possible results:

 W<tag />XYZ
 W<tag>X</tag>YZ
 W<tag>XY</tag>Z
 W<tag>XYZ</tag>
 WX<tag />YZ
 WX<tag>Y</tag>Z
 WX<tag>YZ</tag>
 WXY<tag />Z
 WXY<tag>Z</tag>
 WXYZ

Which algorithms work for your case greatly depends on the nature of the possible search patterns and the desired rules for handling ambiguity.
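
To get a feel for how quickly the ambiguity grows, the following sketch enumerates the ways a replacement string can be distributed over the three text nodes of example 2. The function name distributions is hypothetical, and the rule that the first text node keeps at least one character is an assumption chosen here because it reproduces the ten results listed above; the answer itself does not state such a rule.

```python
def distributions(replacement, first_nonempty=True):
    """Cut `replacement` into three consecutive pieces (before, inside, after the tag)."""
    n = len(replacement)
    start = 1 if first_nonempty else 0  # assumption: first node is not left empty
    results = []
    for i in range(start, n + 1):       # end of the first piece
        for j in range(i, n + 1):       # end of the second piece
            results.append((replacement[:i], replacement[i:j], replacement[j:]))
    return results


for before, inside, after in distributions("WXYZ"):
    print(repr(before), repr(inside), repr(after))
```

Dropping the assumption (first_nonempty=False) gives even more candidates, which is why a rule for resolving the ambiguity has to be chosen up front.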

jfs · Answer 4 · 2009-12-06T17:56:00+0000

Use an HTML parser such as lxml or BeautifulSoup. Another option is to use XSLT transformations (XSLT in Jython).

akaihola · Answer 5 · 2009-12-06T20:52:46+0000

I don't think the DOM / HTML parser library recommendations posted so far address the specific problem in the given example: overflow should be replaced with underflow only when it is preceded by stack in the rendered document, regardless of whether there are tags between them. Such a library is a necessary part of the solution, though.

Assuming tags never appear in the middle of words, one solution could be to

1. process the DOM, tokenize all text nodes, and insert a unique identifier at the beginning of each token (for example, each word)
2. render the document as plain text
3. search and replace in the plain text with regular expressions that use groups to match, preserve, and mark the unique identifiers at the beginning of each token
4. extract all tokens with marked unique identifiers from the plain text
5. process the DOM by removing the unique identifiers and replacing the tokens that match marked unique identifiers with the corresponding changed tokens
6. render the processed DOM back to HTML

Example:

In 1. the HTML DOM of

 stack <sometag>overflow</sometag>

becomes the DOM of

 #1;stack <sometag>#2;overflow</sometag>

and in 2. this plain text is generated:

 #1;stack #2;overflow

The regular expression needed in 3. is #(\d+);stack\s+#(\d+);overflow\b and the replacement is #\1;stack %\2;underflow . Note that only the second word is marked, by changing # to % in its unique identifier, since the first word is not altered.

In 4. the word underflow with unique identifier number 2 is extracted from the resulting plain text, since it was marked by changing # to % .

In 5. all #(\d+); identifiers are removed from the text nodes of the DOM while looking up their numbers among the extracted words. Number 1 is not found, so #1;stack is replaced with just stack . Number 2 is found with the changed word underflow , so #2;overflow is replaced with underflow .

Finally, in 6. the DOM is rendered back to the HTML document stack <sometag>underflow</sometag> .
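
The six steps could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the input is a well-formed fragment that xml.dom.minidom can parse once it is wrapped in a made-up <root> element, tokens are plain \w+ words, and the text does not already contain sequences that look like #1; or %2; . The function names are hypothetical.

```python
import re
from xml.dom import minidom


def text_nodes(node):
    """Yield every text node below `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def replace_across_tags(fragment, pattern, replacement):
    # Wrap the fragment in a hypothetical <root> so minidom accepts it.
    root = minidom.parseString("<root>%s</root>" % fragment).documentElement

    # 1. Prefix every word in every text node with a unique "#n;" identifier.
    counter = 0

    def tag_word(match):
        nonlocal counter
        counter += 1
        return "#%d;%s" % (counter, match.group(0))

    for node in text_nodes(root):
        node.data = re.sub(r"\w+", tag_word, node.data)

    # 2. Render the document as plain text.
    plain = "".join(node.data for node in text_nodes(root))

    # 3. Search and replace; the replacement marks changed words with "%n;".
    plain = re.sub(pattern, replacement, plain)

    # 4. Extract the marked (changed) tokens, keyed by identifier number.
    changed = {int(n): word for n, word in re.findall(r"%(\d+);(\w+)", plain)}

    # 5. Strip identifiers in the DOM, swapping in changed tokens where marked.
    def restore(match):
        return changed.get(int(match.group(1)), match.group(2))

    for node in text_nodes(root):
        node.data = re.sub(r"#(\d+);(\w+)", restore, node.data)

    # 6. Render the processed DOM back to markup (without the wrapper).
    return "".join(child.toxml() for child in root.childNodes)


print(replace_across_tags(
    "stack <sometag>overflow</sometag>",
    r"#(\d+);stack\s+#(\d+);overflow\b",
    r"#\1;stack %\2;underflow",
))
```

The pattern and replacement passed in are the ones from step 3 of the example, so the caller is responsible for writing them against the identifier-prefixed text.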

Kastor · Answer 6 · 2010-05-15T08:55:58+0000

Fun stuff to try. It sort of works. My friends like it when I attach this script to a text area and let them "translate" things. I guess you could use it for something real. Check the code over a few times if you intend to use it; it works, but I'm new to all of this. I think it has been 2 or 3 weeks since I started learning PHP.

 <?php
 $html = ('<div style="border: groove 2px;"><p>Dear so and so, after reviewing your application I. . .</p><p>More of the same...</p><p>sincerely,</p><p>Important Dude</p></div>');
 $oldWords = array('important', 'sincerely');
 $newWords = array('arrogant', 'ya sure');

 // function for oldWords: build a regex that matches the word (plus common suffixes) in text following a ">"
 function regex_oldWords_word_list(&$item1, $key) {
     $item1 = "/>([^<>]+)?\b$item1(tionally|istic|tion|ance|ence|less|ally|able|ness|ing|ity|ful|ant|est|ist|ic|al|ed|er|et|ly|y|s|d|'s|'d|'ve|'ll)?\b([^<>]+)?/";
 }

 // function for newWords: wrap the new word (and the captured suffix) in a red <span>
 function format_newWords_results(&$item1, $key) {
     $item1 = ">$1<span style=\"color: red;\"><em> $item1$2</em></span>$3";
 }

 // apply regex to oldWords
 array_walk($oldWords, 'regex_oldWords_word_list');
 // apply formatting to newWords
 array_walk($newWords, 'format_newWords_results');

 //HTML is not always as perfect as we want it
 // (the first and third patterns were probably HTML entities for a space and an apostrophe
 // before this page was copied; they are kept here as the literal characters shown)
 $poo = array('/ /', '/>([a-zA-Z\']+)/', "/'/", '/;([a-zA-Z\']+)/', '/"([a-zA-Z\']+)/', '/([a-zA-Z\']+)</', '/\.\.+/', '/\. \.+/');
 $unpoo = array(' ', '> $1', '\'', '; $1', '" $1', '$1 <', '. crap taco.', '. crap taco with cheese.');

 //and maybe things will go back to normal sort of
 $repoo = array('/> /', '/; /', '/" /', '/ </');
 $muck = array('> ', ';', '"',' <');

 //before
 echo ($html);

 //I don't know what was happening on the free host but I had to keep stripping slashes
 //This is where the work is done anyway.
 $html = stripslashes(preg_replace($repoo , $muck , (ucwords(preg_replace($oldWords , $newWords , (preg_replace($poo , $unpoo , (stripslashes(strtolower(stripslashes($html)))))))))));

 //after
 echo ('<hr/> ' . $html);

 //now if only there were a way to keep it out of the area between
 //<style>here</style> and <script>here</script> and tell it that english isn't math.
 ?>
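
On the closing wish to keep replacements out of <style> and <script>: one possible approach, sketched here in Python rather than PHP, is to split the markup into alternating text and tag pieces and stop substituting while inside one of those elements. This is a regex-based approximation in the same spirit as the script above, not a real parser, and the function name replace_outside_script_style is made up for the example.

```python
import re

TAG = re.compile(r"(<[^>]+>)")  # capture group keeps the tags in the split result
SKIP = re.compile(r"<\s*(/?)\s*(script|style)\b", re.IGNORECASE)


def replace_outside_script_style(html, pattern, replacement):
    out = []
    skipping = False
    # With one capture group, re.split alternates: text, tag, text, tag, ...
    for index, piece in enumerate(TAG.split(html)):
        if index % 2 == 1:
            match = SKIP.match(piece)
            if match:
                # Opening <script>/<style> starts skipping, the closing tag ends it.
                skipping = match.group(1) == ""
            out.append(piece)
        elif skipping:
            out.append(piece)
        else:
            out.append(re.sub(pattern, replacement, piece))
    return "".join(out)


word = re.compile(r"\bimportant\b", re.IGNORECASE)
print(replace_outside_script_style(
    "<style>p { important: 1 }</style><p>Important Dude</p>", word, "arrogant"))
```

Because tags are passed through untouched, attribute values are left alone as well.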
