Retrieving New Items from an RSS Feed

I am writing an application that pulls data from a set of arbitrary RSS feeds. The feeds are polled asynchronously in the background, and a method is called every time a new item is added to a feed.

My problem is identifying new items in the feed. What is the best way to do this? I came up with some ideas, but they all have flaws.

Suggestion: every time you poll, return all items newer than the pubDate of the latest item from the previous poll. Problem: pubDate is an optional field.

Suggestion: keep a hash of the content of each item you return, and do not return content with the same hash again. Problem: this quickly gets out of control in terms of memory usage.

+6
language-agnostic c# rss
2 answers

How about both?

Use pubDate for those feeds that provide it and store hashes for the others. If most feeds provide pubDate, and the number of feeds does not run into the millions, you should be fine on both performance and memory.
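A minimal sketch of this combined approach in Python, under a few assumptions: each item has already been parsed into a dict with optional "title", "link", "description" and "pubDate" keys, and pubDate values use the RFC 822 date format. The names FeedState, item_hash and new_items are made up for the illustration.

```python
import hashlib
from datetime import timezone
from email.utils import parsedate_to_datetime


class FeedState:
    """Per-feed state kept between polls (hypothetical helper)."""

    def __init__(self):
        self.last_pub_date = None  # newest pubDate seen so far
        self.seen_hashes = set()   # only used for items without a usable pubDate


def item_hash(item):
    # Hash the fields that describe the item; the field choice is an assumption.
    text = "\n".join(item.get(key) or "" for key in ("title", "link", "description"))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def parse_pub_date(raw):
    # Returns an aware datetime, or None if the value is missing or unparseable.
    if not raw:
        return None
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def new_items(state, items):
    """Return the items not reported by earlier calls with the same state."""
    fresh = []
    newest = state.last_pub_date
    for item in items:
        published = parse_pub_date(item.get("pubDate"))
        if published is not None:
            # Dated item: new only if strictly newer than the previous poll.
            if state.last_pub_date is None or published > state.last_pub_date:
                fresh.append(item)
                if newest is None or published > newest:
                    newest = published
        else:
            # Undated item: fall back to remembering its hash.
            digest = item_hash(item)
            if digest not in state.seen_hashes:
                state.seen_hashes.add(digest)
                fresh.append(item)
    state.last_pub_date = newest
    return fresh
```

One limitation of the strict comparison: an item whose pubDate exactly equals the newest date from the previous poll is treated as already seen.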

+4

You can use pubDate for the RSS feeds that provide it. When pubDate is not provided and the items have no other field that distinguishes them, calculate an MD5 checksum of each item and store that for comparison; see http://sharpertutorials.com/calculate-md5-checksum-file/ . This way you avoid storing the full content and comparing it. In practice, you can often purge old checksums, at a rate based on how frequently new content appears, to avoid the memory problem. If possible, maintain separate sets of hashes for different sources. If you post actual figures, we may be able to suggest a more realistic solution.
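One way to keep the checksum data bounded, sketched in Python. The class ChecksumCache, the function is_new_item and the max_entries limit are assumptions made for the illustration; evicting the least recently seen checksum is just one possible purge policy.

```python
import hashlib
from collections import OrderedDict


class ChecksumCache:
    """Remembers at most max_entries MD5 digests for one feed (hypothetical helper)."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._seen = OrderedDict()  # digest -> None, least recently seen first

    def is_new(self, content):
        digest = hashlib.md5(content.encode("utf-8")).digest()
        if digest in self._seen:
            # Still present in the feed: mark it as recently seen.
            self._seen.move_to_end(digest)
            return False
        self._seen[digest] = None
        # Purge the least recently seen digests once the limit is exceeded.
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)
        return True


# One cache per source, keyed by feed URL.
caches = {}


def is_new_item(feed_url, content, max_entries=1000):
    cache = caches.setdefault(feed_url, ChecksumCache(max_entries))
    return cache.is_new(content)
```

The limit should be comfortably larger than the number of items a feed lists at once; otherwise a checksum can be purged while its item is still in the feed, and the item would be reported as new again.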

+2
