How do I skip known posts when syncing with Google Reader?

I'm writing a standalone client for the Google Reader service, and I'd like to know how best to synchronize with it.

There is no official documentation yet, and the best source I've found so far is: http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI

Now, consider the following: with the information above I can load all unread items, I can specify how many items to load, and, using the Atom entry IDs, I can detect duplicate entries I have already downloaded.

What I am missing is a way to request only the updates since my last synchronization. I can say: give me the 10 (parameter n=10) most recent (parameter r=d) entries. If I specify r=o (date ascending) instead, I can also pass ot=[time of last synchronization], but then the ascending order makes no sense when I only want to read some of the items rather than all of them.
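As a sketch of the two query modes just described (the parameter names n, r and ot come from the pyrfeed wiki; the reading-list URL is the one used in the answers below, and the helper function name is my own invention):

```python
from urllib.parse import urlencode

BASE = "http://www.google.com/reader/atom/user/-/state/com.google/reading-list"

def reading_list_url(count=10, order="d", since=None):
    """Build a reading-list request URL.

    count -> n (number of entries to fetch)
    order -> r ('d' = date descending, 'o' = date ascending)
    since -> ot (Unix timestamp; per the pyrfeed wiki notes it is only
                 meaningful together with ascending order, r=o)
    """
    params = {"n": count, "r": order}
    if since is not None:
        params["ot"] = since
    return BASE + "?" + urlencode(params)

# The two modes described above:
newest_first = reading_list_url(count=10, order="d")
since_last_sync = reading_list_url(count=10, order="o", since=1230000000)
```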

Any idea how to solve this without loading all the items again and just discarding the duplicates? That is not a very economical way of polling.

Someone suggested I request only unread entries. But for that to work in such a way that Google Reader stops offering those entries again, I would have to mark them as read. That in turn means I would need to keep my own read/unread state on the client, and items would already be marked as read when the user logs into the online version of Google Reader. That does not work for me.

Cheers, Mariano

+7
synchronization api google-reader
2 answers

To get the latest entries, use the standard fetch in descending date order, starting from the most recent entries. The XML result will contain a continuation token, looking something like this:

<gr:continuation>CArhxxjRmNsC</gr:continuation>

Scan through the results, pulling out anything that is new to you. You should find that either all the results are new, or everything up to a certain point is new and beyond that point you already have everything.

In the latter case you're done; in the former you need to find the new items older than the ones you've already retrieved. Do that by using the continuation token to get results starting just after the last result in the set you just received, passing it in the GET request as the c parameter, for example:

 http://www.google.com/reader/atom/user/-/state/com.google/reading-list?c=CArhxxjRmNsC 

Continue this way until you have everything.

The n parameter, a count of the number of items to fetch, works well with this, and you can change it as you go. If the check frequency is set by the user, and can therefore be very frequent or very rare, you can use an adaptive algorithm to reduce network traffic and processing load. First request a small number of recent entries, say five (add n=5 to the URL of your GET request). If they are all new, then in the next request, where you use the continuation token, ask for a larger number, say 20. If those are still all new, either the feed has a lot of updates or it's been a while since the last check, so continue in batches of 100 or so.
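The paging-plus-adaptive-batch idea above could be sketched like this (a sketch, not a complete client: fetch_page is a hypothetical stand-in for the HTTP GET against the reading-list URL with n=&lt;count&gt; and c=&lt;continuation&gt;, and parse_continuation just extracts the &lt;gr:continuation&gt; element shown earlier):

```python
import re

def parse_continuation(xml_text):
    """Pull the continuation token out of the Atom result, if present."""
    m = re.search(r"<gr:continuation>([^<]+)</gr:continuation>", xml_text)
    return m.group(1) if m else None

def sync_new_items(fetch_page, known_ids):
    """Fetch newest-first pages until we hit an item we already know.

    fetch_page(count, continuation) must return (list_of_entry_ids,
    xml_text).  Batch size grows adaptively: 5, then 20, then 100,
    as suggested above.
    """
    new_ids, continuation, batch = [], None, 5
    while True:
        ids, xml_text = fetch_page(batch, continuation)
        for entry_id in ids:
            if entry_id in known_ids:
                return new_ids          # everything older is already known
            new_ids.append(entry_id)
        continuation = parse_continuation(xml_text)
        if not ids or continuation is None:
            return new_ids              # ran out of entries
        batch = {5: 20, 20: 100}.get(batch, 100)
```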


However, and correct me if I'm wrong here, you also want to know, after you've downloaded an item, whether its state changes from "unread" to "read" because the person has read it in the Google Reader interface.

One approach to this:

  • Update the Google state of any items that have been read locally.
  • Check and save the unread count of the feed. (You want to do this before the next step so you're guaranteed that no new items arrive between downloading the newest items and checking the read counts.)
  • Download the newest items.
  • Calculate your local unread count and compare it with Google's. If the feed shows more items read than you calculated, you know something was read in Google.
  • If something was read in Google, start downloading read items and comparing them with your database of unread items. You will find some items that Google says are read but your database says are unread; update them. Continue doing this until you've found a number of such items equal to the difference between the two counts, or until the downloads become unreasonable.
  • If you haven't found all the read items, c'est la vie; record the remaining number as an "unfound unread" total, which you also need to include in your next calculation of the local count of what you believe is unread.
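A minimal sketch of the reconciliation steps above (all names here are my own, not from the API; fetch_read_ids stands in for paging through the items Google marks as read):

```python
def reconcile_read_state(google_unread_count, local_unread_ids,
                         fetch_read_ids, fetch_limit=500):
    """Reconcile local unread items against Google's read state.

    google_unread_count: unread count reported by Google for the feed.
    local_unread_ids:    set of item IDs our database thinks are unread.
    fetch_read_ids:      callable yielding IDs of items Google says are
                         read, newest first (a stand-in for the API call).
    fetch_limit:         give up after scanning this many read items
                         ("until the downloads become unreasonable").

    Returns (ids_to_mark_read_locally, unfound_count).
    """
    # How many items Google thinks were read that we still have as unread.
    missing = len(local_unread_ids) - google_unread_count
    to_mark, scanned = set(), 0
    for read_id in fetch_read_ids():
        if len(to_mark) >= missing or scanned >= fetch_limit:
            break
        scanned += 1
        if read_id in local_unread_ids:
            to_mark.add(read_id)
    # Whatever we could not find becomes the "unfound unread" total.
    unfound = missing - len(to_mark)
    return to_mark, unfound
```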

If a user subscribes to a lot of different blogs, it's likely he's also labelled them liberally, so you can do all of this on a per-label basis rather than for the entire feed list. That should help keep the amount of data down, since you won't need to do any transfers for labels where the user hasn't read anything new in Google Reader.

This whole scheme can be applied to other states as well, such as starred or unstarred.

Now, as you say, this

... would mean that I need to save my own read / unread state on the client and that the records are already marked as read when the user enters the online version of Google Reader. This does not work for me.

Right. But keeping your own local read/unread state is fairly easy (since you're keeping a database of all the items anyway), and so is marking items read in Google (which the API supports), so why doesn't this work for you?


However, there is another wrinkle: the user may mark something he has already read as unread in Google. This throws a bit of a wrench into the system. My suggestion there, if you really want to handle it, is to assume that the user will generally touch only more recent items, and to download the last few hundred items every time, checking the state on all of them. (This isn't so bad: downloading 100 items took me anywhere from 0.3 s for 300 KB to 2.5 s for 2.5 MB, albeit on a very fast broadband connection.)

Again, if the user has a large number of subscriptions, he's probably also got a reasonably large number of labels, so doing this per label will speed it up. In fact, I'd suggest that you not only check per label, but also spread the checks out, checking one label every minute rather than all of them every twenty minutes. You can also do this "big check" for state changes on older items less often than the "new stuff" check, perhaps every few hours, if you want to keep bandwidth down.
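Spreading the checks out could be sketched as a simple schedule (all names and interval values here are illustrative, not part of any API):

```python
import itertools

def build_check_plan(labels, ticks, big_check_every=180):
    """Plan which label to check at each tick (e.g. one tick per minute).

    One cheap 'new stuff' check per tick, rotating round-robin through
    the labels, plus an expensive 'big check' for state changes on older
    items every big_check_every ticks (~3 hours at one tick per minute).
    Returns a list of (tick, action, label) tuples.
    """
    rr = itertools.cycle(labels)
    plan = []
    for t in range(ticks):
        action = ("big_check" if t and t % big_check_every == 0
                  else "new_check")
        plan.append((t, action, next(rr)))
    return plan
```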

This is a bit of a bandwidth hog, mainly because you need to download the full article from Google just to check its state. Unfortunately, I don't see any way around that in the API docs available to us. My only real advice is to minimize state checks on items that aren't new.

+6

The Google Reader API has not been officially released yet, and at this point this answer may change over time.

Currently, you would have to call the API and ignore the items you've already downloaded which, as you said, isn't very efficient, since you'll be re-downloading items every time even if you already have them.

+1
