How to find / replace text in html while maintaining html tags / structure

I use regular expressions to transform text as I want, but I want to preserve the HTML tags. For example, if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag>, I should get stack <sometag>underflow</sometag> (i.e. the string substitution is performed, but the tags still exist ...

+6
python html html-parsing
6 answers

Use a DOM library, not regular expressions, when working with HTML:

  • lxml: parser, document and HTML serializer. You can also use BeautifulSoup and html5lib for parsing.
  • BeautifulSoup: a parser, document, and HTML serializer.
  • html5lib: a parser. It has a serializer.
  • ElementTree: document object and XML serializer
  • cElementTree: document object implemented as a C extension.
  • HTMLParser: a parser.
  • Genshi: Includes a parser, document, and HTML serializer.
  • xml.dom.minidom: a document model built into the standard library, which html5lib can parse into.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ .

Of these, I would recommend lxml, html5lib, and BeautifulSoup.
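As a minimal sketch of the DOM approach, the snippet below uses ElementTree from the list above, only because it ships with Python; it assumes the markup is well-formed (XHTML-like), whereas real-world HTML would call for one of the recommended parsers. The function name replace_in_text_nodes is made up for illustration. It applies the regular expression to each text node separately, so tags are never touched, but a phrase that is split across tags will not match.

```python
import re
import xml.etree.ElementTree as ET


def replace_in_text_nodes(markup, pattern, replacement):
    """Apply a regex substitution to text nodes only, leaving tags intact.

    Assumes `markup` is well-formed XML with a single root element.
    """
    root = ET.fromstring(markup)
    for element in root.iter():
        # ElementTree stores text in two places: `text` (before the first
        # child) and `tail` (after the element's closing tag).
        if element.text:
            element.text = re.sub(pattern, replacement, element.text)
        if element.tail:
            element.tail = re.sub(pattern, replacement, element.tail)
    return ET.tostring(root, encoding="unicode")


print(replace_in_text_nodes(
    "<p>stack <b>overflow</b> overflow</p>", r"\boverflow\b", "underflow"))
```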

+9

Beautiful Soup or HTMLParser is your answer.
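For the HTMLParser route, a rough sketch (Python 3, where the module is html.parser) could look like the following. The class name TextRewriter and the helper rewrite are hypothetical. The idea is to echo every tag back unchanged and run the substitution only on the text passed to handle_data. Note that the contents of script and style elements also arrive through handle_data, so a real implementation would probably want to skip those.

```python
import re
from html.parser import HTMLParser


class TextRewriter(HTMLParser):
    """Re-emit the document, applying a regex substitution to text only."""

    def __init__(self, pattern, replacement):
        # Keep entities such as &amp; as separate events so they pass through.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.replacement = replacement
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(self.pattern.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite(html, pattern, replacement):
    parser = TextRewriter(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rewrite('<p class="x">stack <b>overflow</b> &amp; overflow</p>',
              r"overflow", "underflow"))
```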

+3

Please note that arbitrary replacements cannot be made unambiguously. Consider the following examples:

1)

HTML:

 A<tag>B</tag> 

Pattern → Replacement:

 AB -> AXB 

Possible results:

 AX<tag>B</tag>
 A<tag>XB</tag>

2)

HTML:

 A<tag>A</tag>A 

Pattern → Replacement:

 A+ -> WXYZ 

Possible results:

 W<tag />XYZ
 W<tag>X</tag>YZ
 W<tag>XY</tag>Z
 W<tag>XYZ</tag>
 WX<tag />YZ
 WX<tag>Y</tag>Z
 WX<tag>YZ</tag>
 WXY<tag />Z
 WXY<tag>Z</tag>
 WXYZ

Which algorithms work for your case greatly depends on the nature of the possible search patterns and the desired rules for handling ambiguity.
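To make the size of the ambiguity concrete, here is a small sketch that enumerates the candidate placements for example 2. It assumes, as the list of results suggests, that the replacement is spread over the text before, inside, and after the tag, and that the part before the tag is never empty. The function name candidate_splits is made up for illustration.

```python
def candidate_splits(replacement):
    """List the ways to spread `replacement` over three text positions:
    before the tag, inside the tag, and after the tag.

    The first part is kept non-empty; the other two may be empty.
    """
    n = len(replacement)
    return [
        (replacement[:i], replacement[i:j], replacement[j:])
        for i in range(1, n + 1)       # end of the part before the tag
        for j in range(i, n + 1)       # end of the part inside the tag
    ]


splits = candidate_splits("WXYZ")
print(len(splits))
for before, inside, after in splits:
    print(repr(before), repr(inside), repr(after))
```

For a four-character replacement this yields ten candidates, and the count grows quadratically with the length of the replacement, which is why the rules for resolving the ambiguity have to come from the application.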

+3

Use an HTML parser like lxml or BeautifulSoup. Another option is to use XSLT transformations (XSLT in Jython).

+1

I don't think the DOM / HTML parser library recommendations posted so far address the specific problem in this example: overflow should be replaced with underflow only when it is preceded by stack in the rendered document, regardless of whether or not there are tags between them. However, such a library is a necessary part of the solution.

Assuming tags never appear in the middle of words, one solution would be to:

  1. process the DOM, tokenize all text nodes, and insert a unique identifier at the beginning of each token (for example, each word)
  2. render the document as plain text
  3. search and replace in the plain text with regular expressions that use groups to match, keep, and mark the unique identifiers at the beginning of each token
  4. extract all tokens with marked unique identifiers from the plain text
  5. process the DOM by removing the unique identifiers and replacing the tokens whose identifiers were marked with the corresponding changed tokens
  6. serialize the processed DOM back to HTML

Example:

In 1., the HTML DOM

 stack <sometag>overflow</sometag> 

becomes the DOM

 #1;stack <sometag>#2;overflow</sometag> 

and in 2. the plain text is created:

 #1;stack #2;overflow 

The regular expression needed in 3. is #(\d+);stack\s+#(\d+);overflow\b and the replacement is #\1;stack %\2;underflow. Note that only the second word is marked by changing # to % in its unique identifier, since the first word is not changed.

In 4., the word underflow with unique identifier number 2 is extracted from the resulting plain text, since it was marked by changing # to %.

In 5., all identifiers #(\d+); are removed from the text nodes of the DOM, looking up their numbers among the extracted words. The number 1 is not found, so #1;stack is replaced with just stack. The number 2 is found with the changed word underflow, so #2;overflow is replaced with underflow.

Finally, in 6., the DOM is serialized back to the HTML document stack <sometag>underflow</sometag>.
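The six steps could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the input is a well-formed fragment (it is parsed with xml.dom.minidom from the standard library, where a real HTML parser would be needed for arbitrary pages), tokens are plain \w+ words, the text does not already contain #n; or %n; sequences, and the function names text_nodes and replace_across_tags are hypothetical.

```python
import re
from xml.dom import minidom


def text_nodes(node):
    """Yield every text node under `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def replace_across_tags(fragment, pattern, replacement):
    # Wrap the fragment in a dummy root so minidom can parse it.
    root = minidom.parseString(f"<root>{fragment}</root>").documentElement
    nodes = list(text_nodes(root))
    counter = 0

    def add_id(match):
        nonlocal counter
        counter += 1
        return f"#{counter};{match.group(0)}"

    # Step 1: prefix every word in every text node with a unique identifier.
    for node in nodes:
        node.data = re.sub(r"\w+", add_id, node.data)

    # Step 2: render the document as plain text.
    plain = "".join(node.data for node in nodes)

    # Step 3: search and replace; changed words get % instead of #.
    plain = re.sub(pattern, replacement, plain)

    # Step 4: collect the marked (changed) words by identifier number.
    changed = dict(re.findall(r"%(\d+);(\w+)", plain))

    # Step 5: strip identifiers in the DOM, swapping in changed words.
    def restore(match):
        return changed.get(match.group(1), match.group(2))

    for node in nodes:
        node.data = re.sub(r"#(\d+);(\w+)", restore, node.data)

    # Step 6: serialize the children of the dummy root back to markup.
    return "".join(child.toxml() for child in root.childNodes)


print(replace_across_tags(
    "stack <sometag>overflow</sometag>",
    r"#(\d+);stack\s+#(\d+);overflow\b",
    r"#\1;stack %\2;underflow",
))
```

With the pattern and replacement from the example, the fragment stack <sometag>overflow</sometag> comes back as stack <sometag>underflow</sometag>, while text that does not match is returned unchanged.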

0

Fun stuff to try. This sort of works. My friends like it when I attach this script to a text area and let them "translate" things. I suppose you could use it for something real. Check the code a few times if you intend to use it; it works, but I'm new to all of this. I think it has been 2 or 3 weeks since I started learning PHP.

 <?php
 $html = ('<div style="border: groove 2px;"><p>Dear so and so, after reviewing your application I. . .</p><p>More of the same...</p><p>sincerely,</p><p>Important Dude</p></div>');

 $oldWords = array('important', 'sincerely');
 $newWords = array('arrogant', 'ya sure');

 // function for oldWords: turn each word into a regex that only matches in text after a ">"
 function regex_oldWords_word_list(&$item1, $key) {
     $item1 = "/>([^<>]+)?\b$item1(tionally|istic|tion|ance|ence|less|ally|able|ness|ing|ity|ful|ant|est|ist|ic|al|ed|er|et|ly|y|s|d|'s|'d|'ve|'ll)?\b([^<>]+)?/";
 }

 // function for newWords: wrap each replacement in highlighting markup
 function format_newWords_results(&$item1, $key) {
     $item1 = ">$1<span style=\"color: red;\"><em> $item1$2</em></span>$3";
 }

 // apply regex to oldWords
 array_walk($oldWords, 'regex_oldWords_word_list');

 // apply formatting to newWords
 array_walk($newWords, 'format_newWords_results');

 // HTML is not always as perfect as we want it
 // NOTE: the first and third patterns appear to have been HTML entities that were
 // decoded during extraction; &nbsp; and &#39; are assumptions.
 $poo = array('/&nbsp;/', '/>([a-zA-Z\']+)/', '/&#39;/', '/;([a-zA-Z\']+)/', '/"([a-zA-Z\']+)/', '/([a-zA-Z\']+)</', '/\.\.+/', '/\. \.+/');
 $unpoo = array(' ', '> $1', '\'', '; $1', '" $1', '$1 <', '. crap taco.', '. crap taco with cheese.');

 // and maybe things will go back to normal sort of
 $repoo = array('/> /', '/; /', '/" /', '/ </');
 $muck = array('> ', ';', '"', ' <');

 // before
 echo ($html);

 // I don't know what was happening on the free host but I had to keep stripping slashes
 // This is where the work is done anyway.
 $html = stripslashes(strtolower(stripslashes($html)));
 $html = preg_replace($poo, $unpoo, $html);
 $html = ucwords(preg_replace($oldWords, $newWords, $html));
 $html = stripslashes(preg_replace($repoo, $muck, $html));

 // after
 echo ('<hr/> ' . $html);

 // now if only there were a way to keep it out of the area between
 // <style>here</style> and <script>here</script> and tell it that english isn't math.
 ?>
-1
