Richard Jones' Log: New pyblagg generator running...

Fri, 04 Feb 2005

The new, improved pyblagg is up and running. If this post appears on it, in a bit over an hour, then it's working :)

The new parser:

handles many more feeds,
has better handling of broken feeds (data doesn't leak, script doesn't die, feed is marked broken),
uses the latest feedparser which supports many more feed formats, and automatically handles if-modified-since,
has an automated scraper to handle new / modified / deleted feeds listed on the wiki (runs once a week and tells me what it's done, so that people spamming the wiki page will be ineffective),
uses an sqlite database to store state, and
is just generally much neater code than my old massively-hacked script :)
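
As a rough illustration of the "script doesn't die, feed is marked broken" idea combined with sqlite state, here is a minimal sketch. It is not the actual pyblagg code: the `feeds` table layout and the `open_state` and `update_feed` names are made up for the example, and `fetch` stands in for whatever function downloads and parses a feed.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open the state database (hypothetical schema, in-memory by default)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS feeds ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'ok',"
        " error TEXT)"
    )
    return db


def update_feed(db, url, fetch):
    """Call fetch(url); on any failure, record the feed as broken and carry on."""
    db.execute("INSERT OR IGNORE INTO feeds (url) VALUES (?)", (url,))
    try:
        entries = fetch(url)
    except Exception as exc:
        # One bad feed must not stop the whole run: store the error and move on.
        db.execute(
            "UPDATE feeds SET status = 'broken', error = ? WHERE url = ?",
            (str(exc), url),
        )
        db.commit()
        return []
    db.execute(
        "UPDATE feeds SET status = 'ok', error = NULL WHERE url = ?", (url,)
    )
    db.commit()
    return entries
```

A feed that recovers on a later run is flipped back to 'ok', so the broken marker reflects only the most recent attempt.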

Feeds that don't support if-modified-since are listed under "no-update*" since they're fetched but we got no new entries.

Comment by Fredrik on Sat, 05 Feb 2005

"Feeds that don't support if-modified-since are listed under "no-update*" since they're fetched but we got no new entries"

Hmm. Shouldn't you use GUID:s (and if necessary, publication dates and links) to figure out if new items have appeared?

Comment by Richard on Sat, 05 Feb 2005

Oh, I do actually look at the entries received to figure out which ones are new (and I just use the link for uniqueness, as it's the only thing I can rely on).
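
Something along these lines would do the link-based check. This is only a sketch under assumptions: the `seen` table and the `new_entries` function are invented for illustration, and entries are assumed to be dict-like with a `link` key.

```python
import sqlite3


def new_entries(db, feed_url, entries):
    """Return the entries whose link has not been stored before."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen (link TEXT PRIMARY KEY, feed TEXT)"
    )
    fresh = []
    for entry in entries:
        link = entry.get("link")
        if not link:
            # Without a link there is nothing to key on, so skip the entry.
            continue
        cur = db.execute(
            "INSERT OR IGNORE INTO seen (link, feed) VALUES (?, ?)",
            (link, feed_url),
        )
        # rowcount is 1 when the row was inserted, 0 when the link already existed.
        if cur.rowcount == 1:
            fresh.append(entry)
    db.commit()
    return fresh
```

Letting the primary key do the work means there is no separate "have I seen this?" query per entry.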

It's just that if the feed supports the HTTP ETag or if-modified-since mechanisms, then I don't bother fetching the entire feed content (hence saving everyone bandwidth). The feedparser code handles all that for me, which is nice.
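
For anyone curious what that mechanism looks like on the wire, here is a minimal sketch using only the Python standard library. It is not feedparser's code; `conditional_headers` and `fetch_if_changed` are names made up for the example. The idea is to send back the validators from the previous fetch and treat a 304 response as "nothing new".

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers for a conditional GET."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the feed is unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: the server sent no body, so keep the old validators.
            return (None, etag, last_modified)
        raise
```

A server that ignores both headers simply answers 200 with the full body every time, which is the case where the entries still have to be compared one by one.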

What's interesting is that a large proportion of the feeds (probably about the same as don't support if-modified-since) also don't supply a Content-Type header. Not just not a useful one, but none at all.
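
Checking for that is slightly fiddly in Python, so here is a small sketch (the `has_content_type` helper is hypothetical). Header objects based on `email.message.Message`, which is what `urllib` responses expose, report a default type of text/plain from `get_content_type()` even when the header is absent, so the raw header has to be looked up instead.

```python
from email.message import Message


def has_content_type(headers):
    """True only if the server actually sent a Content-Type header."""
    return headers.get("Content-Type") is not None
```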