Richard Jones' Log: New pyblagg generator running...

Fri, 04 Feb 2005

The new, improved pyblagg is up and running. If this post appears on it, in a bit over an hour, then it's working :)

The new parser:

  • handles many more feeds,
  • has better handling of broken feeds (data doesn't leak, script doesn't die, feed is marked broken),
  • uses the latest feedparser which supports many more feed formats, and automatically handles if-modified-since,
  • has an automated scraper to handle new / modified / deleted feeds listed on the wiki (runs once a week and tells me what it's done so that people spamming the wiki page will be ineffective),
  • uses an sqlite database to store state, and
  • is just generally much neater code than my old massively-hacked script :)
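
To illustrate the "script doesn't die, feed is marked broken" idea together with the sqlite state store, here is a minimal sketch. It is not pyblagg's actual code: the `feeds` table, its columns, and the `open_state` / `poll_feed` helpers are all made up for this example, and `fetch` stands in for whatever function really downloads and parses a feed.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) a hypothetical state database for the aggregator."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS feeds ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'ok',"
        " error TEXT)"
    )
    return db


def poll_feed(db, url, fetch):
    """Call fetch(url); on any failure, record the feed as broken and carry on.

    `fetch` is an assumed callable that returns a list of entries or raises.
    """
    try:
        entries = fetch(url)
    except Exception as exc:  # one bad feed must not take the whole run down
        status, error, entries = "broken", str(exc), []
    else:
        status, error = "ok", None
    with db:  # commits on success, rolls back if the write itself fails
        db.execute(
            "INSERT OR REPLACE INTO feeds (url, status, error) VALUES (?, ?, ?)",
            (url, status, error),
        )
    return entries
```

A run would loop over every known feed URL and call `poll_feed` for each, so that a single malformed feed only changes its own row.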

Feeds that don't support if-modified-since are listed under "no-update*" since they're fetched but we got no new entries.

Comment by Fredrik on Sat, 05 Feb 2005

"Feeds that don't support if-modified-since are listed under "no-update*" since they're fetched but we got no new entries"

Hmm. Shouldn't you use GUIDs (and if necessary, publication dates and links) to figure out if new items have appeared?

Comment by Richard on Sat, 05 Feb 2005

Oh, I do actually look at the entries received to figure out which ones are new (and I just use the link for uniqueness, as it's the only thing I can rely on).
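
Roughly, using the link as the uniqueness key could look like the sketch below. The `seen` table and the `new_entries` helper are hypothetical names for this example, and entries are assumed to be dict-like objects with a `link` key; the real script may well do it differently.

```python
import sqlite3


def new_entries(db, feed_url, entries):
    """Return only the entries whose link has not been stored before."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen (link TEXT PRIMARY KEY, feed TEXT)"
    )
    fresh = []
    with db:
        for entry in entries:
            link = entry.get("link")
            if not link:
                continue  # nothing reliable to key on, so skip the entry
            cur = db.execute(
                "INSERT OR IGNORE INTO seen (link, feed) VALUES (?, ?)",
                (link, feed_url),
            )
            if cur.rowcount == 1:  # the row was inserted, so the link is new
                fresh.append(entry)
    return fresh
```

The primary key on `link` does the real work here: a repeated link is silently ignored by the insert, so no separate lookup is needed.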

It's just that if the feed supports the HTTP ETag or if-modified-since mechanisms, then I don't bother fetching the actual entire feed content (hence saving everyone bandwidth). The feedparser code handles all that for me, which is nice.
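
For anyone curious what that mechanism looks like on the wire, here is a small standard-library sketch of a conditional GET. It is not feedparser's code, just the same HTTP idea; `conditional_headers` and `fetch_if_changed` are names invented for this example, and the caller is assumed to have saved the ETag and Last-Modified values from the previous fetch.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers that let a server answer 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged: no body was sent, keep the validators
            return None, etag, last_modified
        raise
```

A server that ignores both headers simply sends the full feed every time, which is the case where the entries still have to be compared to find out that nothing is new.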

What's interesting is that a large proportion of the feeds (probably about the same proportion as don't support if-modified-since) also don't supply a Content-Type header. Not just not a useful one, but none at all.