On rFeedParser – Something Similar

This post is huge but I have not the time to make it smaller. I’m so very tired.

A Quick Introduction

rFeedParser is an RSS/Atom feed parser. It is a translation of Mark Pilgrim’s feedparser from Python to Ruby. It behaves almost exactly the same and passes somewhere near 99% of the tests on an Ubuntu machine. Other platforms have lower success rates due to differing Iconv installations. The feedparser documentation applies to this work, and almost any deviation from it should be considered a bug. Please file any bugs you find.

This project was inspired by Sam Ruby’s pirate testing idea, one that I hope catches on beyond these feed parsers.

The Basics

require 'rubygems'
require 'rfeedparser'

# parse takes either a feed URL or a path to a local file
feed = FeedParser.parse('somefeedurlorfilepath')

# Results can be read with method calls or with Hash-style keys
first = feed.entries.collect{|e| e['title'] }
second = feed['entries'].collect{|e| e.title }
if first == second
  puts "This is handy when dealing with e['id'], the guid of an item/entry"
end

Installation

Agh. rFeedParser is a monster. Tons of dependencies, some overlapping in areas, and one “not nice” dependency. The “not nice” dependency is on Yoshida Masato’s xmlparser.

You can install it by hand (be sure to add return in front of stream in saxdriver.rb, line 171), through “sudo apt-get install libxml-parser-ruby1.8” if you’re on Ubuntu or another Debian-based Linux, or through the xmlparser gem that I put together, which seems to work on only “some” Mac machines but on all Linux boxes. xmlparser, of course, depends on the Expat XML parsing library. If you install through MacPorts or by hand, be sure to install the -dev, -devel or whatever version of Expat has the full headers and libraries available for linking against.

The Latest and Greatest

The latest version is 0.9.93… Okay, really, the latest version is 0.9.931. There was a minor bug that, if it hadn’t been for the guilt of having put off the user who had brought it to me, I wouldn’t have worried about forgetting in 0.9.93. He/she (no name, just an email address) had been so nice about it. So, future users, take note: if you see a bug I haven’t fixed yet, guilt seems to work. Also, bribery. Patches certainly don’t hurt.

The 0.9.93 and 0.9.931 updates do a number of things:

Fixed a horrendous error when handling content:encoded, body, xhtml:body, prodlink and fullitem.
Added some further support for Yahoo Media RSS. I’ve added support for media:thumbnail and media:content (the latter, only in its “two tag” form). This came directly from a requirement in our project at work. Mark, you should admire my ability to get paid for this.
Fixed up the lame ass headers code I had going. I don’t remember what I was on when I wrote it, but it must have been fantastic.
Fixed some major bugs in py2rtime. I can’t understand how they passed the tests. I will give a dollar to anyone who figures it out, mainly because I don’t want to deal with it. See revision 57, and compare to both revision 58 and the current code in the repository.
Switched to rchardet 1.1. There was a rather serious bug in 1.0. Never use gsub! ever, ever, ever, ever. Maybe sometimes.
Fixed some messed-up indentation. Neither vim nor TextMate can indent Ruby code well, it seems. Or maybe I write weird looking code. Luckily, I’m reading the Dragon book and learning things and I may decide to tackle it.
ForgivingURI, something I stole from Bob Aman’s FeedTools, continues to be something I desperately want to see in the Ruby core libraries. URI.parse shouldn’t puke every time some loser fucks up his syntax. At least, give me something more than “bad URI(is not URI?)” no matter what the problem is.
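
To make the URI.parse complaint concrete, here is a minimal sketch of a more forgiving wrapper. This is not the ForgivingURI code from rFeedParser or FeedTools; forgiving_parse is a hypothetical helper that only trims whitespace and turns the exception into nil, on the assumption that a feed parser would rather skip a bad link than die on it.

```ruby
require 'uri'

# Hypothetical helper, NOT rFeedParser's ForgivingURI.
# Returns a URI object, or nil when URI.parse rejects the string.
def forgiving_parse(str)
  URI.parse(str.to_s.strip)
rescue URI::InvalidURIError
  # URI.parse raises this for malformed input such as embedded spaces
  nil
end

good = forgiving_parse('  http://example.com/feed.xml ')
bad  = forgiving_parse('http://example.com/a b')

puts good.host
puts bad.inspect
```

A real replacement would try to repair the input (escaping spaces, for example) instead of giving up, which is where most of the work hides.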

Speaking of patches, those interested in helping development can find a bzr repository for rfeedparser on this very site. This is probably dumb, and a bandwidth hog, but I’m too lazy to either a) go to my workplace and log into my Ubuntu box with bzr-svn or b) patch svn on the Mac laptop I’m currently writing on to put it up on rubyforge.

Gotchas, Monkey Patches and Other Disgusting Things

Now, on to the ugly.

As Sam points out (pointing at me pointing it out), the original feedparser tests require the parsed times to be stored in Python’s 9-tuple format. For those of you who aren’t jargon whores, that’s basically a list of 9 integers specifying the date. Unfortunately, Ruby doesn’t have a method in Time that can take that format. The solution, for our purposes, is to use the py2rtime top-level method I wrote that does the (very easy) task of putting the 9-tuple in a form Time.utc can understand. (Also, Sam’s suggestion of naming it feeddate sounds pretty damn good.)
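
For anyone who hasn’t met the 9-tuple, the conversion can be sketched roughly like this. It is an illustration, not the py2rtime source: nine_tuple_to_time is a made-up name, and the field order assumed here is the one Python’s time.struct_time uses (year, month, day, hour, minute, second, weekday, day of year, DST flag).

```ruby
# Hypothetical helper, NOT the py2rtime method from rFeedParser.
# Takes a Python-style 9-tuple as a Ruby Array and returns a UTC Time.
def nine_tuple_to_time(tuple)
  unless tuple.length == 9
    raise ArgumentError, "expected 9 fields, got #{tuple.length}"
  end
  # Only the first six fields matter to Time.utc; weekday, day of year
  # and the DST flag can all be derived from the date itself.
  year, month, day, hour, min, sec = tuple[0, 6]
  Time.utc(year, month, day, hour, min, sec)
end

stamp = nine_tuple_to_time([2007, 3, 4, 12, 30, 45, 6, 63, 0])
puts stamp.year
```

The last three fields are simply dropped, which is why the task is “very easy”.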

Also, the SGMLParser in HTMLTools is kind of broken. The Regexps don’t really work as intended (which I really need to send in patches for) and it’s really, really not UTF-8 safe. Oh, god. Making it UTF-8 safe involved code so ugly, so treacherous, that I will probably get cancer from it.

The UTF-8 stuff, of course, isn’t the developers’ fault. Ruby’s encoding support sucks so much that it seems quite a few people thought it would make writing a decent feed parser nearly impossible.

So, how did I do it? Through beta software, overlapping dependencies, relying on iconv (which is always terribly configured in any operating system) and a total disregard for passing the encoding tests. That’s right, rfp uses both the character-encodings gem and ActiveSupport and we still have dozens of failures and errors, the number of each depending on what OS we’re on!

So, most of the former Eastern Bloc just won’t get to use rFeedParser for a while. Sorry. (Hey, Hungary, it supports your datetimes! Does that make you feel better?)

If someone could magic up some sort of iconv-encodings gem or tarball that can give us a standard iconv install to work with, we might be able to make the encoding situation better. I would do it, but I have got shit to do that doesn’t make me want to shove ballpoint pens into my brainstem. Or slit my wrists with codepoints. (I’m pretty sure I could come up with a physically realizable way to approximate the latter.) Sigh, maybe I’ll get to it later, but I’d love to have some help.

On to the straight-up monkey patches.

There are a few on Hpricot, but they have very little impact. Maybe they make Hpricot load a bit slower on boot due to the huge element lists I put in there. Also, there is a method called Hpricot.scrub, but it is no longer the Hpricot.scrub that you know so well. It originally was, but I needed to do some extra things that added a couple of scans on top of the two already in there and, suddenly, it was a bottleneck. So, apologies for the confusing name.

(Jeff Hodges’ Trivia Time: The guy who wrote Hpricot#scrub, Michael Moen, is the guy who “officially” put Jeff’s name in for the position at ICTV. He and Jeff work together on the same Ruby on Rails application as members of the ActiveMedia Group. When discussing new problems with Michael, Jeff is often boggled by Michael’s clarity of thought.)

Oh, and one more monkey patch. xmlparser doesn’t return the attributes of the XML tags as a Hash, but SGMLParser does, and it would have been pretty damn handy if xmlparser did too, so I made it do that. The code is in better_attributelist.rb (my filenames are full of ego), and it could be done better, but it suits my purpose.
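
The general shape of that kind of conversion looks something like the sketch below. Everything in it is an assumption made for illustration: IndexedAttributes is a stand-in for an index-based, SAX-style attribute list, not xmlparser’s real class, and none of this is the code in better_attributelist.rb.

```ruby
# Hypothetical stand-in for an index-based attribute list.
# This is NOT xmlparser's actual interface.
class IndexedAttributes
  def initialize(pairs)
    @pairs = pairs
  end

  def length
    @pairs.length
  end

  def name(i)
    @pairs[i][0]
  end

  def value(i)
    @pairs[i][1]
  end
end

# Walk the list by index and collect name => value pairs into a Hash,
# which is the shape the SGML side already hands back.
def attributes_to_hash(attrs)
  hash = {}
  attrs.length.times { |i| hash[attrs.name(i)] = attrs.value(i) }
  hash
end

attrs = IndexedAttributes.new([['href', 'http://example.com/'], ['rel', 'alternate']])
puts attributes_to_hash(attrs)['rel']
```

Once both parsers hand back a Hash, the tag handlers don’t need to care which one is driving them.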

Other ugly things: ForgivingURI (as mentioned above) and the inconsistent naming of methods that came about after a few bad nights of hacking through Ruby’s inheritance problems. I fixed the actual architectural problem long ago, but left the terrible names in there. So, the self.fooThing and _hasDumbPrefix stuff is my bad. Except for the methods in FeedParserMixin that are named after XML tags. Those names are prefixed with ‘_’ (as they are even in the original Python code) in order to work around the differences between the XML parser and SGML parser.

I should also mention the metric ass load of datetime parsing regular expressions I had to write. Another set of patches I need to write, this time to Ruby core. I don’t even want to discuss them. Go look at time_helpers.rb and see how many times I made one problem into two. My code is grody.
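
To give a flavor of what one of those looks like, here is a small sketch that handles a single format, the full W3C date-time form (for example 2003-12-31T10:14:55-08:00). It is not lifted from time_helpers.rb; the W3C_DATETIME constant and the parse_w3c_datetime name are invented for this example, and the real thing has to cope with far sloppier input.

```ruby
# Hypothetical example, NOT code from time_helpers.rb.
# Matches only the complete W3C date-time form with seconds and a zone.
W3C_DATETIME = /\A(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(Z|[+-]\d{2}:\d{2})\z/

# Returns a UTC Time, or nil if the string does not match.
def parse_w3c_datetime(str)
  m = W3C_DATETIME.match(str)
  return nil unless m

  year, mon, day, hour, min, sec = m.captures[0, 6].map { |s| s.to_i }
  wall_clock = Time.utc(year, mon, day, hour, min, sec)

  zone = m[7]
  return wall_clock if zone == 'Z'

  # An offset like -08:00 means local time is 8 hours behind UTC,
  # so subtracting the signed offset converts back to UTC.
  sign = zone[0, 1] == '-' ? -1 : 1
  zone_hours, zone_mins = zone[1..-1].split(':').map { |s| s.to_i }
  wall_clock - sign * (zone_hours * 3600 + zone_mins * 60)
end

puts parse_w3c_datetime('2003-12-31T10:14:55-08:00').hour
```

Now multiply that by every date format found in the wild and you have the general idea.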

The Future of the Tests

Sam brings up the idea of making the tests from the Python feedparser less, er, Pythonic. We could speed up response time if we change the expectations for dates to some method calling a 9-tuple (or rather, a 9-list or 9-Array, or 9-some-datastructure-with-brackets-not-parentheses), and we could get an instant win. I have no idea what I was trying to say here.

Also, the use of u'', u"" and the \unn or \unnnn format for non-ASCII characters in Python had to be hacked around with regular expressions. While the character-encodings gem provides something like the u'' syntax, the \u characters are completely unsupported. It’s really ugly, and kind of painful, especially if a developer has never had much experience with Python. Fortunately, I had a good deal, but probably not enough considering the amount of time it took to write those Regexps.
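
As a rough idea of the kind of regular expression hack involved, the sketch below turns a Python unicode literal into a plain Ruby string. It is not the code rFeedParser uses: python_unicode_literal_to_ruby is a hypothetical name, it assumes the expectation arrives as a String containing the literal, and it only handles the four-digit \unnnn escape.

```ruby
# Hypothetical helper, NOT the regular expressions from rFeedParser.
# Strips a leading u'...' or u"..." wrapper and expands \uXXXX escapes.
def python_unicode_literal_to_ruby(literal)
  # Keep only what sits between the quotes, if the wrapper is present
  body = literal[/\Au(['"])(.*)\1\z/m, 2] || literal
  # Replace each escape with the UTF-8 encoding of its codepoint
  body.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

# The doubled backslash puts a literal \u00e9 sequence into the input
puts python_unicode_literal_to_ruby("u'caf\\u00e9'")
```

Escaped quotes, \x escapes and raw strings are all ignored here, and each of those is another Regexp waiting to happen.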

The XML test files are a huge boon, and making them more general would make it easier to maintain code equivalence across languages and allow those who are more comfortable in one language to help outside of that language’s project. But, this is all just blue sky stuff for the moment.

And Spent

This post is huge and I need to stop writing. I don’t think I’ve talked about everything I wanted to, but I’m shot. rFeedParser is nice and you should use it and tell other people to use it. Questions and comments are welcome.

Update: A few grammar and spelling clean ups. Sucktasia on ice.