Richard Jones' Log: New pyblagg generator running...
The new, improved pyblagg is up and running. If this post appears on it, in a bit over an hour, then it's working :)
The new parser:
- handles many more feeds,
- has better handling of broken feeds (data doesn't leak, script doesn't die, feed is marked broken),
- uses the latest feedparser which supports many more feed formats, and automatically handles if-modified-since,
- has an automated scraper to handle new / modified / deleted feeds listed on the wiki (runs once a week and tells me what it's done so that people spamming the wiki page will be ineffective),
- uses an sqlite database to store state, and
- is just generally much neater code than my old massively-hacked script :)
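For the curious, here's roughly what "sqlite for state, and a broken feed doesn't kill the script" could look like. This is only a sketch, not the actual pyblagg code: the table layout, the open_state / update_feed names and the fetch callable are all made up for illustration.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the state database. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS feeds (
               url        TEXT PRIMARY KEY,
               broken     INTEGER NOT NULL DEFAULT 0,
               last_error TEXT
           )"""
    )
    return db


def update_feed(db, url, fetch):
    """Call fetch(url); if it fails, mark the feed broken and carry on."""
    db.execute("INSERT OR IGNORE INTO feeds (url) VALUES (?)", (url,))
    try:
        entries = fetch(url)
    except Exception as exc:
        # One bad feed must not take the whole run down with it.
        db.execute(
            "UPDATE feeds SET broken = 1, last_error = ? WHERE url = ?",
            (str(exc), url),
        )
        db.commit()
        return []
    # A successful fetch clears any earlier broken mark.
    db.execute(
        "UPDATE feeds SET broken = 0, last_error = NULL WHERE url = ?", (url,)
    )
    db.commit()
    return entries
```

Here fetch is whatever actually retrieves and parses the feed; passing it in keeps the error handling separate from the fetching.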
Feeds that don't support if-modified-since are listed under "no-update*" since they're fetched but we got no new entries.
Oh, I do actually look at the entries received to figure out which ones are new (and I just use the link for uniqueness, as it's the only thing I can rely on).
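The link-based check is about as simple as it sounds. A minimal sketch, assuming entries are dict-like with a "link" key (as feedparser entries are) and that the already-seen links are held in a set; the function name new_entries is hypothetical:

```python
def new_entries(entries, seen_links):
    """Return the entries whose link has not been seen before.

    seen_links is updated in place, so calling this again with the same
    entries returns nothing.
    """
    fresh = []
    for entry in entries:
        link = entry.get("link")
        if not link or link in seen_links:
            # No link means nothing to key on; a known link means old news.
            continue
        seen_links.add(link)
        fresh.append(entry)
    return fresh
```

In practice the set of seen links would live in the database rather than in memory, but the logic is the same.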
It's just that if the feed supports the HTTP e-tag or if-modified-since mechanisms, then I don't bother fetching the actual entire feed content (hence saving everyone bandwidth). The feedparser code handles all that for me, which is nice.
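In case anyone wants to do the same, the conditional fetch looks something like this. It's a sketch rather than my actual code: feedparser.parse really does accept etag and modified arguments and reports the HTTP status on the result, but the fetch_if_changed name, the state dict and the parse parameter (there so the logic can be exercised without a network) are made up for illustration.

```python
def fetch_if_changed(url, state, parse=None):
    """Fetch a feed, letting the server skip the body if nothing changed.

    state is a plain dict holding the "etag" and "modified" values from
    the previous fetch. Returns None when the server answered 304 Not
    Modified, otherwise the list of entries.
    """
    if parse is None:
        import feedparser  # third-party; imported lazily on purpose

        parse = feedparser.parse

    result = parse(url, etag=state.get("etag"), modified=state.get("modified"))

    if result.get("status") == 304:
        # Nothing new, and the feed body was never sent.
        return None

    # Remember whichever validators the server supplied for next time.
    for key in ("etag", "modified"):
        if result.get(key):
            state[key] = result[key]
    return result.get("entries", [])
```

A feed that supports neither mechanism never answers 304, so it is downloaded in full every time and the entries have to be compared to spot new ones.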
What's interesting is that a large proportion of the feeds (probably about the same as don't support if-modified-since) also don't supply a Content-Type header. Not just not a useful one, but none at all.
"Feeds that don't support if-modified-since are listed under "no-update*" since they're fetched but we got no new entries"
Hmm. Shouldn't you use GUIDs (and if necessary, publication dates and links) to figure out if new items have appeared?