What would I use to remove escaped html from large datasets

Question

Our database is filled with articles retrieved from RSS feeds. I was unsure what data I would be getting and how much filtering was already set up (WP-O-Matic WordPress plugin using the SimplePie library). The plugin does some basic encoding before inserting, using WordPress's built-in post insert function, which also does some filtering. Between the RSS feed's encoding, the plugin's encoding in PHP, WordPress's encoding, and SQL escaping, I'm not sure where to start.

The junk data is usually at the end of the field, after the content I want to keep. It is all on one line, but broken up here for readability:

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk"

Note that some of the img tags are escaped and some are not. I believe this is because the last one was cut off so that it was no longer recognizable as an html tag, which caused it to be html-encoded, while the complete img tags were left alone.

Another entry has only this in one of the fields, which means the RSS feed gave me nothing for that item (such items are filtered out now, but I still have entries like this):

<img src="http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg" alt="post_img" width="80"

All of the samples above are on a single line in the database, but broken up here for readability. Otherwise, they are exact copies from the database, taken with the mysql command-line client.

Question: What is the best way to deal with the escaped html above (or partial html tags), so that I can remove it without affecting the content?

I want to remove it because the images at the end of the field are usually images that have nothing to do with the content. In the case of feedburner, feedburner adds them to every single article in a feed. In other cases, they are broken links to broken images. The point is that these are not valid html img tags that can be easily removed. They are mangled tags that, if decoded, are not valid html and will not parse with standard html parsers.

[EDIT] If this were just a question of which html to strip, where I could run strip_tags and re-insert the data, I would not be asking.

The part I am having trouble with is that the img tag was html-encoded and its end was cut off. If it is decoded, it will not be an html tag, so I cannot parse it in the usual way.

With all the &lt;img src=&quot; crap, I can't wrap my head around searching for it, other than SELECT ID, post_content FROM table WHERE post_content LIKE '%&lt;img%', which at least gets me those posts. But once I have the data, I need to find the junk and remove it while keeping the rest of the content.

[/ EDIT]

[EDIT 2]

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs">&lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot;

The part I want to keep:

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.

To repeat: this is not about removing valid html img tags. That is easy. I need to be able to specifically find &lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot; when it is part of a pattern such as "img tag, img tag, mangled img tag" or "anchor img, anchor img, mangled img", and so on, but not remove &lt;img if it is actually part of the article. In the dozens of samples I have looked at, this mangled img tag has consistently been at the end of the field.

The other case is a single mangled img tag. It is consistently a mangled flickr img tag, but as above, I can't just search for &lt;img, as it may be a genuine part of the content.

The problem is that I cannot just decode it and parse it as HTML, because it will not be valid html. [/ EDIT 2]

+6

mysql perl

Elizabeth Buckwalter Apr 13 '10 at 17:09

6 answers

The best way:

Install HTML::Entities from CPAN and use it to unescape the URIs.
Install HTML::Parser from CPAN and use it to parse and remove the URIs once they are unescaped.

Regular expressions are not a suitable tool for this task.

+3

Dave Sherohman Apr 13 '10 at 19:02

I would not strip it out. It is far from unsalvageable garbage.

First, use HTML::Entities::decode_entities conditionally (as a heuristic, check whether the value starts with &lt;), then let HTML::Tidy::libXML->clean(…, 'UTF-8', 1) repair the markup. clean returns a whole document, but it is trivial to extract just the needed img element from it.

+2

daxim Apr 13 '10 at 20:28

How about a dumb, simple Perl find-and-replace on a variable containing your data ...

 foreach my $line (@lines) {
     $line =~ s/&lt;/</gi;    # unescape opening angle brackets
     $line =~ s/&gt;/>/gi;    # unescape closing angle brackets
 }
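A slightly fuller version of the same idea, as a minimal sketch in core Perl only. It also handles &quot; and &amp;, and it decodes &amp; last so that text such as &amp;lt; is not decoded twice. The helper name decode_basic_entities and the example URL are made up for illustration; where HTML::Entities is available, its decode_entities covers the full entity set and is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: decode only the four entities seen in the samples.
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&quot;/"/g;
    $text =~ s/&amp;/&/g;    # last, so "&amp;lt;" becomes "&lt;" and not "<"
    return $text;
}

# Illustrative input; the URL is a placeholder, not real feed data.
my @lines = ('&lt;img src=&quot;http://example.com/a.jpg&quot; border=&quot;0&quot;&gt;');
$_ = decode_basic_entities($_) for @lines;
print "$_\n" for @lines;
```

Note that decoding alone does not repair a tag that was cut off; it only makes the fragment easier to match in later steps.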

0

onethreefour Apr 13 '10 at 18:44

The best option is to re-fetch all the articles into the database so that they are not truncated and corrupted. If that is not an option, then ...

Based on your examples above, it looks like you want to delete everything that follows the text content of each article. In your example, the text content is followed by a DIV tag and a number of IMG tags that may or may not have been truncated and/or converted to HTML entities.

If all your posts are similar, you can cut everything after the text content by removing the trailing div tag and everything that follows it using perl, like this:

 my $article = magic_to_get_an_article();
 $article =~ s/<div>.*//s;    # drop the first <div> and everything after it
 magic_to_store_article($article);

If your posts include something more complex, you'd better use the HTML parsing module and carefully read the documentation to see how it handles invalid HTML.
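If some articles might contain a legitimate div in the body, one way to make the cut more conservative is to remove a trailing div block only when it contains the FeedBurner marker seen in the samples. The following is a minimal sketch under that assumption; the helper name strip_feed_footer and the shortened sample string are hypothetical, and the marker would need adjusting for other feeds.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a trailing <div> block only when it holds
# FeedBurner links, so a <div> inside the article body is left alone.
sub strip_feed_footer {
    my ($article) = @_;
    $article =~ s{
        <div>                          # start of the footer block
        (?: (?! <div> ) . )*?          # up to the marker, without crossing another <div>
        feeds\.feedburner\.com/~ff/    # marker seen in the samples
        .*                             # everything through the end of the field
    }{}xs;
    return $article;
}

# Shortened, made-up sample in the shape of the data from the question.
my $sample = 'Body <div>kept</div> text.<div> <a href="http://feeds.bizjournals.com/~ff/x?a=1">'
           . '<img src="http://feeds.feedburner.com/~ff/x?d=1" border="0"></img></a> '
           . '&lt;img src=&quot;http://feeds.feedburner.com/~ff/x?d=2&quot;';
print strip_feed_footer($sample), "\n";
```

Because the match starts at the div, it does not matter whether the tags after it are complete, truncated, or escaped.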

0

benrifkah Apr 14 '10 at 0:17

Given the sample input and output you give at the end of your post, the following script will get you the desired output:

 #!/usr/bin/perl
 use strict; use warnings;
 use HTML::TokeParser::Simple;
 my $parser = HTML::TokeParser::Simple->new( \*DATA );
 # print the first img tag, then the text up to the next div tag
 if ( my $tag = $parser->get_tag('img') ) {
     print $tag->as_is;
     print $parser->get_text('div');
 }
 __DATA__
 <img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs">&lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot;

Output:

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.

However, I am puzzled by the size and volume of each fragment that you have to process.

0

Sinan Ünür Apr 14 '10 at 16:20

Eric Strom · Accepted Answer · 2010-04-13T20:51:31+0000

Question updated ...

To extract the data you need, you can use this approach:

 use HTML::Entities qw/decode_entities/;
 my $decoded = decode_entities $raw;
 if ($decoded =~ s{ (<img .+? (?:>.+?</img>|/>)) } {}x) {  # grab the image
     my $img = $1;
     $decoded =~ s{<.+?>} {}xg;       # strip complete tags
     $decoded =~ s{< [^>]+? $} {}x;   # strip trailing noise
     print $img.$decoded;
 }

Using a regular expression for HTML parsing is generally frowned upon, but in this case it is more about deleting segments that match a pattern. After testing the regexes on a larger data set, you should have a sense of what might need to change.
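One adjustment that may be worth testing is a narrower pattern that removes only an unterminated img tag at the very end of the field, whether its opening bracket is escaped or not, and leaves complete img tags in the article alone. This is a minimal sketch in core Perl; the helper name strip_truncated_img and the shortened sample strings are hypothetical, and it assumes the mangled tag is always the last thing in the field, as the question describes.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one unterminated img tag at the very end of
# the field, whether its opening bracket is escaped (&lt;) or literal (<).
sub strip_truncated_img {
    my ($field) = @_;
    $field =~ s{
        (?: &lt; | < ) img \b      # escaped or literal start of an img tag
        (?: (?! &gt; | > ) . )*    # no closing bracket, escaped or literal, after it
        \z                         # runs to the end of the field
    }{}xs;
    return $field;
}

# Made-up samples in the shape of the data from the question.
my $with_footer = '<img src="a.jpg" alt="post_img" width="80" />Text.'
                . '<a href="u">&lt;img src=&quot;http://feeds.feedburner.com/~ff/x?d=1&quot;';
my $only_junk   = '&lt;img src=&quot;http://farm3.static.flickr.com/x.jpg&quot; width=&quot;80&quot;';

print strip_truncated_img($with_footer), "\n";
print strip_truncated_img($only_junk), "\n";
```

A complete tag always contains a closing bracket before the end of the field, so it cannot match. Any complete tags left dangling before the removed fragment, such as an unclosed anchor, would still need a separate cleanup pass.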

Hope this helps.
