Our database is filled with articles retrieved from RSS feeds. I was not sure what data I would be getting, or how much filtering was already in place (the WP-O-Matic WordPress plugin, which uses the SimplePie library). The plugin does some basic encoding before inserting, and it inserts through WordPress's built-in post insert function, which also does some filtering. Between the encoding of the RSS feed, the encoding done by the plugin in PHP, the encoding done by WordPress, and the SQL escaping, I'm not sure where to start.
The unwanted data is usually at the end of the field, after the content I want to keep. It is all on one line, but split up here for readability:
<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>
<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>
<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk"
Note that some of the img tags are escaped and some are not. I believe this is because the last one was cut off, which made it unrecognizable as an HTML tag, so it got HTML-encoded, while the complete img tags were left alone.
Another record has only this in one of the fields, which means the RSS feed gave me nothing for that element (this is filtered out now, but I still have records like it):
<img src="http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg" alt="post_img" width="80"
All of the samples above are on a single line in the database, but are broken up here for readability. Otherwise, they are exact copies from the database, taken from the mysql command-line client.
Question: What is the best way to deal with the escaped HTML above (or the partial HTML tag), so that I can remove it without affecting the content?
I want to remove it because the images at the end of the field are usually images that have nothing to do with the content. In the case of FeedBurner, FeedBurner adds them to every single article in the feed. In other cases, they are broken links pointing to broken images. The point is that these are not valid HTML img tags that can be removed easily. They are mangled tags that, once decoded, are not valid HTML and will not parse with the standard HTML parsers.
[EDIT] If this were just a matter of fetching the HTML, running strip_tags on it, and re-inserting the data, I would not be asking this question.
The part I am having trouble with is that the img tag was HTML-encoded and its end was cut off. If it is decoded, it is still not a complete HTML tag, so I cannot parse it in the usual way.
With all the <img src=" junk, I cannot get my head around how to search for it other than SELECT ID, post_content FROM table WHERE post_content LIKE '%<img%', which at least gets me the affected posts. But once I have the data, I need to find the fragment and remove it while keeping the rest of the content.
[/EDIT]
[EDIT 2]
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs"
The part I want to keep:
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.
To repeat: this is not about removing valid HTML img tags. That part is easy. I need to be able to specifically find <img src="http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs" when it is part of a pattern such as "img tag, img tag, mangled img tag" or "anchor img, anchor img, mangled img", etc., but not remove <img when it is really part of the article. In the dozens of samples I have looked at, the mangled img tag has quite consistently been at the end of the field.
The other case is the single mangled img tag. It is consistently a mangled Flickr img tag, but as above, I cannot just search for <img, since it may be a legitimate part of the content.
The problem is that I cannot just decode it and parse it as HTML, because it will not be valid HTML. [/EDIT 2]
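To make the pattern described above concrete, here is a rough sketch of one way it might be matched with a regular expression instead of an HTML parser. It is written in Python purely for illustration; the same pattern could probably be carried over to PHP's preg_replace, but that is untested. It rests on assumptions taken from the samples: the junk always runs to the very end of the field, the complete FeedBurner tags contain /~ff/ in their URL, and the last img tag is cut off before its closing bracket (raw or HTML-encoded). The names TRAILING_JUNK and strip_trailing_junk are made up for this example.

```python
import re

# Assumption: the junk always sits at the very end of the field and its last
# img tag is cut off before the closing ">" (or "&gt;" when HTML-encoded).
TRAILING_JUNK = re.compile(
    r"""
    (?:<div>\s*)?                       # optional wrapper around the FeedBurner links
    (?:                                 # zero or more complete FeedBurner items
        (?:<a\b[^>]*/~ff/[^>]*>\s*)?    #   optional anchor pointing at /~ff/
        <img\b[^>]*/~ff/[^>]*>\s*       #   complete img tag pointing at /~ff/
        (?:</img>\s*)?
        (?:</a>\s*)?
    )*
    (?:<a\b[^>]*/~ff/[^>]*>\s*)?        # optional anchor before the cut-off tag
    (?:<|&lt;)img\b                     # cut-off img tag, raw or HTML-encoded
    (?:(?!>|&gt;).)*                    # anything except a closing bracket
    \Z                                  # must run to the end of the field
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def strip_trailing_junk(post_content: str) -> str:
    """Drop the trailing FeedBurner / cut-off img block, keep what precedes it."""
    return TRAILING_JUNK.sub("", post_content).rstrip()
```

Because the pattern only matches when the final img tag never closes, a complete img tag inside the article (such as the Flickr image at the start of the second sample) is left alone. A field that consists of nothing but the mangled tag comes back as an empty string. It would be worth running this against a copy of the table first and reviewing what it removes before updating any rows.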