I have a large set of podcast URLs that I periodically poll to check for updates. I am trying to find a reliable way, with no false positives, to determine whether a feed has changed. I would like to detect not only new episodes, but also updates to existing episodes.
RSS and Atom feeds provide the elements pubDate, lastBuildDate, or updated. However, I find they are often misused: some feeds insert the current date and time into these fields on every request. This makes it difficult to detect real changes.
My next thought was to strip all date information from the feed and then MD5 hash the remaining contents. I can then compare the hash against the one from the previous poll to detect changes in the feed.
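For reference, here is a minimal sketch of that approach in Python. The function name `feed_fingerprint` and the exact set of tag names in `DATE_TAGS` are my own assumptions for illustration, not anything mandated by RSS or Atom; MD5 is used only as a cheap change fingerprint, not for security.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed list of date-bearing elements to ignore (RSS, Atom, Dublin Core).
DATE_TAGS = {"pubDate", "lastBuildDate", "updated", "published", "date"}


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def feed_fingerprint(xml_text):
    """Return an MD5 hex digest of the feed with date elements removed."""
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) in DATE_TAGS:
                parent.remove(child)
    # Re-serialize the stripped tree and hash the resulting bytes.
    canonical = ET.tostring(root, encoding="utf-8")
    return hashlib.md5(canonical).hexdigest()
```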
This works in about 90% of cases. However, there are still hundreds of podcasts that insert other dynamic data into their feeds.
One podcast has the following as its podcast cover:
http:
where 1439649026 is what I assume to be a timestamp. This number changes with every request to their feed.
It is starting to feel like a losing battle. If I cannot trust the date fields of a podcast feed, and a certain percentage of podcasts insert dynamic data into the feed text, how can I reliably detect changes in a feed?