Archive for the ‘ruby’ Category

rFeedParser on GitHub

Friday, May 2nd, 2008

Alright, it’s done. I’ve moved rFeedParser and rchardet to GitHub. Check out the rFeedParser and rchardet pages at GitHub and clone them with these URLs:

git://github.com/jmhodges/rfeedparser.git
git://github.com/jmhodges/rchardet.git

rFeedParser, of course, is a Ruby translation of the Universal Feed Parser in Python and passes 98.8% of its 3000+ unit tests. rchardet is a Ruby translation of chardet in Python and is used quite a bit in rFeedParser.

There are, of course, some things left to be done in both of these projects.

Off the top of my head, rFeedParser needs:

  • to be able to use libxml if the user prefers, instead of the Expat binding
  • to use version 0.4.1 of the character-encodings gem
  • someone to ask People Who Know if the way rfp strips out the bad stuff in the *\_crazy.xml tests is acceptable
  • to set up a git submodule for the tests in order to ease the merging in of tests from the feedparser repository
  • a fix-up for some of the regexes and lame matching code in it, especially the time parsing code
  • a re-sorting of the incredibly ugly object hierarchy
  • other things I’ve forgotten and am too lazy to look up

rchardet needs:

  • some information on whether using some gem-provided Tuple object instead of the giant Arrays would help the memory usage
  • fixes for the other encoding bugs that Mark fixed when he released the version of rchardet that cleared up the little endian UTF-16 bug I reported

There’s still a lot of work to do, and I’m listening to your concerns and taking your patches. Hit the mailing list and we can all make this better.

Special Note for People Who Want to Help: Run rake setup in your branch to install all the gems you need to run it.

Ruby and Rails Compete for Love

Wednesday, January 2nd, 2008

A thought: In the beginning, I wrote in Ruby because I liked using Ruby on Rails. But recently, I’m using Ruby on Rails because I like writing in Ruby.

I think it’s time to start looking at the options again.

Building CouchDb on Mac OS X

Friday, September 21st, 2007

I, like Sam, *really* want to play with CouchDb. But I’m on a Mac OS X box that I barely understand after 3 months of ownership.

Install MacPorts and run:

sudo port install erlang icu subversion

Add these two lines to your .bash_profile (or to .profile, if that’s the file your shell reads; tcsh users need the setenv equivalents in .tcshrc instead).

export ERLANG_BIN_DIR=/opt/local/bin/
export ERLANG_INCLUDE_DIR=/opt/local/lib/erlang/usr/include/

Run those two commands in your current shell or open a new one. Now, back to the install.

cd ~/projects
svn co http://couchdb.googlecode.com/svn/trunk couchdb
cd couchdb
./build.sh | tee couchdb_svn_build.log
./build.sh --install=$HOME/sys | tee couchdb_svn_install.log
mkdir -p ~/sys/log
mv couchdb_svn_*.log ~/sys/log

Now, for convenience, we set up an easy way to start the CouchDb server. This assumes that $HOME/sys/bin is in your $PATH. Make a file called couchdb in $HOME/sys/bin containing:

#!/bin/bash
cd $HOME/sys/couchdb && ./bin/startCouchDb.sh

Next, fix its permissions:

chmod +x $HOME/sys/bin/couchdb

Then, start the server:

couchdb

(I follow this up with a ln $HOME/sys/bin/couchdb $HOME/sys/bin/db but that might not be best for you.) Finally, follow the rest of Sam’s post to get a quick introduction to CouchDb.

Bonus Round: Ruby on top of CouchDb.

There are two Ruby gems for working on top of CouchDb, couchobject and CouchDb-Ruby, but couchobject seems the most promising. Why? Well, for one, its repository doesn’t include tests with syntax errors. And, two, it lets you write CouchDb views in Ruby, which is fantastic.

I haven’t gotten a chance to find its limitations, yet, but considering the deep magic involved and the 0.5.0 version number, I’m sure it has a few.

To get it, hit the site for the link to the tarball or grab the repository with:

git clone git://repo.or.cz/couchobject.git

I’m terribly excited. Enjoy!

Update: Now leaving Typo City.

OpenURI, Exceptions and HTTP Status Codes

Tuesday, August 7th, 2007

If you’ve needed the numeric HTTP status code from a connection created with either open-uri’s or rest-open-uri’s open method, you’ve probably noticed that OpenURI::HTTPError is raised on anything other than a 2xx or 1xx status code and that the docs don’t really lay out how to get to the status code in that error. Some of you may have hacked up a the_error.to_s[0..2] solution, but that is bad and terrible. Don’t do it. Here’s the right way. (Good luck remembering it after a few weeks away, however.)

require 'open-uri' # or 'rest-open-uri'
begin
  # On Ruby 2.5 and later, call URI.open here instead of the bare open.
  io_thing = open(some_http_uri)

  # The text of the status code is in [1]
  the_status = io_thing.status[0]

rescue OpenURI::HTTPError => the_error
  # some clean up work goes here and then..

  the_status = the_error.io.status[0] # => 3xx, 4xx, or 5xx

  # the_error.message is the numeric code and text in a string
  puts "Whoops got a bad status code #{the_error.message}"
end
do_something_with_status(the_status)

There you go. You’ll notice that neither open-uri nor rest-open-uri uses the Net::HTTP response classes the way Net::HTTP claims you should in these cases, but you can map to them with the numeric status codes. All you need are the CODE_CLASS_TO_OBJ and CODE_TO_OBJ hashes defined in Net::HTTPResponse. The latter hash is probably preferable.
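
As a rough sketch of that mapping (the helper name response_class_for is made up for this example and is not part of open-uri or Net::HTTP), you can look the code up in CODE_TO_OBJ first and fall back to the broader CODE_CLASS_TO_OBJ when the exact code isn’t listed:

```ruby
require 'net/http'

# Hypothetical helper: turn a status code such as "404" (or 404) into the
# matching Net::HTTP response class.
def response_class_for(status)
  code = status.to_s

  # CODE_TO_OBJ is keyed by the full three-digit code as a String.
  # CODE_CLASS_TO_OBJ is keyed by the first digit only ("1" through "5").
  Net::HTTPResponse::CODE_TO_OBJ[code] ||
    Net::HTTPResponse::CODE_CLASS_TO_OBJ[code[0, 1]] ||
    Net::HTTPUnknownResponse
end
```

Once you have the class, a check like response_class_for(the_status) <= Net::HTTPClientError reads a lot better than comparing number ranges by hand.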

Update: Edited for stupidity.

On rFeedParser

Sunday, July 22nd, 2007

This post is huge but I have not the time to make it smaller. I’m so very tired.

A Quick Introduction

rFeedParser is an RSS/Atom feed parser. It is a translation of Mark Pilgrim’s feedparser from Python to Ruby. It behaves almost exactly the same and passes somewhere near 99% of the tests on an Ubuntu machine. Other platforms suffer from lower success rates due to differing Iconv installations. The feedparser documentation applies to this work, and almost any deviation from it should be considered a bug. Please file any bugs you find.

This project was inspired by Sam Ruby’s pirate testing idea, one that I hope catches on beyond these feed parsers.

The Basics

require 'rubygems'
require 'rfeedparser'

feed = FeedParser.parse('somefeedurlorfilepath')

first = feed.entries.collect{|e| e['title'] }
second = feed['entries'].collect{|e| e.title }
if first == second
  puts "This is handy when dealing with e['id'], the guid of an item/entry"
end
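
If you’re wondering how both access styles can work on the same object, here is a minimal sketch of one way to do it. This is an illustration only, not rFeedParser’s actual code; the DualAccessHash class is hypothetical.

```ruby
# Hypothetical Hash subclass: string keys can also be read as methods.
class DualAccessHash < Hash
  # Called only when no real method exists with this name.
  def method_missing(name, *args)
    key = name.to_s
    return self[key] if args.empty? && key?(key)

    super # unknown key: raise the usual NoMethodError
  end

  # Keep respond_to? honest about the key-backed readers.
  def respond_to_missing?(name, include_private = false)
    key?(name.to_s) || super
  end
end

entry = DualAccessHash.new
entry['title'] = 'rFeedParser on GitHub'
puts entry.title == entry['title']
```

Keys that collide with real Hash methods (size, for instance) still need the bracket form in a scheme like this, because method_missing never gets a chance to run for them.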

Installation

Agh. rFeedParser is a monster. Tons of dependencies, some overlapping in areas, and one “not nice” dependency. The “not nice” dependency is on Yoshida Masato’s xmlparser.

You can install it by hand (be sure to add return in front of stream in saxdriver.rb, line 171), through “sudo apt-get install libxml-parser-ruby1.8” if you’re on Ubuntu or another Debian-based Linux, or through the xmlparser gem that I put together, which seems to work on only “some” Mac machines but on all Linux boxes. xmlparser, of course, depends on the Expat XML parsing library. If you install through MacPorts or by hand, be sure to install the -dev, -devel or whatever version has the full headers and libraries available for linking against.

The Latest and Greatest

The latest version is 0.9.93… Okay, really, the latest version is 0.9.931. There was a minor bug that I forgot about in 0.9.93 and wouldn’t have worried about, if it hadn’t been for the guilt of having put off the user who had brought it to me. He/she (no name, just an email address) had been so nice about it. So, future users, take note: if you see a bug I haven’t fixed yet, guilt seems to work. Also, bribery. Patches certainly don’t hurt.

The 0.9.93 and 0.9.931 updates do a number of things:

  • Fixed a horrendous error when handling content:encoded, body, xhtml:body, prodlink and fullitem
  • Added some further support for Yahoo Media RSS: media:thumbnail and media:content (the latter, only in its “two tag” form). This came directly from a requirement in our project at work. Mark, you should admire my ability to get paid for this.
  • Fixed up the lame ass headers code I had going. I don’t remember what I was on when I wrote it, but it must have been fantastic.
  • Fixed some major bugs in py2rtime. I can’t understand how they passed the tests. I will give a dollar to anyone who figures it out, mainly because I don’t want to deal with it. See revision 57, and compare to both revision 58 and the current code in the repository.
  • Switched to rchardet 1.1. There was a rather serious bug in 1.0. Never use gsub! ever, ever, ever, ever. Maybe sometimes.
  • Fixed some messed up indentation. Neither vim nor Textmate can indent ruby code well, it seems. Or maybe I write weird looking code. Luckily, I’m reading the Dragon book and learning things and I may decide to tackle it.
  • ForgivingURI continues to be something I desperately want to see in the Ruby core libraries. URI.parse shouldn’t puke every time some loser fucks up his syntax. At least, give me something more than “bad URI(is not URI?)” no matter what the problem is. Something I stole from Bob Aman’s FeedTools.
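
To show the kind of behavior I mean, here’s a tiny sketch. It is not the ForgivingURI code that ships with rFeedParser; the forgiving_parse name is made up, and all it assumes is that URI.parse raises URI::InvalidURIError on input it can’t handle.

```ruby
require 'uri'

# Hypothetical helper: parse a URI string, but hand back nil instead of
# blowing up when the input is malformed.
def forgiving_parse(text)
  URI.parse(text.to_s.strip)
rescue URI::InvalidURIError
  nil
end

uri = forgiving_parse('  http://example.com/feed.xml ')
puts uri.host if uri
```

Callers then get to decide what a bad URI means for them, rather than wrapping every call in a begin/rescue.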

Speaking of patches, those interested in helping development can find a bzr repository for rfeedparser on this very site. This is probably dumb, and a bandwidth hog, but I’m too lazy to either a) go to my workplace and log into my Ubuntu box with bzr-svn or b) patch svn on the Mac laptop I’m currently writing on to put it up on rubyforge.

Gotchas, Monkey Patches and Other Disgusting Things

Now, on to the ugly.

As Sam points out me pointing out, the original feedparser tests require the parsed times to be stored in Python’s 9-tuple format. For those of you who aren’t jargon whores, that’s basically a list of 9 integers specifying the date. Unfortunately, Ruby doesn’t have a method in Time that can take that format. The solution, for our purposes, is to use the py2rtime top-level method I wrote that does the (very easy) task of putting the 9-tuple in a form Time.utc can understand. (Also, Sam’s suggestion of naming it feeddate sounds pretty damn good).
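
For the curious, the conversion amounts to something like the sketch below. This is not the real py2rtime, just an illustration under a made-up name; the only thing it leans on is the field order of Python’s time tuple (year, month, day, hour, minute, second, weekday, day of year, DST flag).

```ruby
# Hypothetical stand-in for the conversion: take a Python-style 9-tuple
# (as a Ruby Array) and build a UTC Time from its first six fields.
def nine_tuple_to_time(tuple)
  unless tuple.respond_to?(:to_ary) && tuple.to_ary.length == 9
    raise ArgumentError, 'expected a 9-element list'
  end

  # The weekday, day-of-year and DST fields are ignored here, since
  # Time.utc works them out from the date itself.
  year, month, day, hour, min, sec = tuple.to_ary
  Time.utc(year, month, day, hour, min, sec)
end

puts nine_tuple_to_time([2007, 7, 22, 13, 5, 9, 6, 203, 0]).year
```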

Also, the SGMLParser in HTMLTools is kind of broken. The Regexps don’t really work as intended (which I really need to send in patches for) and it’s really, really not UTF-8 safe. Oh, god. Making it UTF-8 safe involved code so ugly, so treacherous, that I will probably get cancer from it.

The UTF-8 stuff, of course, isn’t the developers’ fault. Ruby’s encoding support sucks so much that it seems quite a few people thought it would make writing a decent feed parser nearly impossible.

So, how did I do it? Through beta software, overlapping dependencies, relying on iconv (which is always terribly configured in any operating system) and a total disregard for passing the encoding tests. That’s right, rfp uses both the character-encodings gem and ActiveSupport and we still have dozens of failures and errors, the number of each depending on what OS we’re on!

So, most of the former Eastern Bloc just won’t get to use rFeedParser for a while. Sorry. (Hey, Hungary, it supports your datetimes! Does that make you feel better?)

If someone could magic up some sort of iconv-encodings gem or tarball that can give us a standard iconv install to work with, we might be able to make the encoding situation better. I would do it; however, I have got shit to do that doesn’t make me want to gather up ballpoint pens and shove them into my brainstem. Or slit my wrists with codepoints. (I’m pretty sure I could come up with a physically realizable way to approximate the latter.) Sigh, maybe I’ll get to it later, but I’d love to have some help.

On to the straight-up monkey patches.

There are a few on Hpricot, but they have very little impact. Maybe making Hpricot load a bit slower on boot due to the huge element lists I put in there. Also, there is a method called Hpricot.scrub, but it is no longer the Hpricot.scrub that you know so well. It originally was, but I needed to do some extra things that added a couple of scans on top of the two already in there and, suddenly, it was a bottleneck. So, apologies for the confusing name.

(Jeff Hodges’ Trivia Time: The guy who wrote Hpricot#scrub, Michael Moen, is the guy who “officially” put Jeff’s name in for the position at ICTV. He and Jeff work together on the same Ruby on Rails application as members of the ActiveMedia Group. When discussing new problems with Michael, Jeff is often boggled by Michael’s clarity of thought.)

Oh, and one more monkey patch. xmlparser doesn’t return the attributes of the XML tags as a Hash, but SGMLParser does and it would have been pretty damn handy if it did, so I made it do that. The code is in better_attributelist.rb (my filenames are full of ego), and it could be done better, but it suits my purpose.

Other ugly things: ForgivingURI (as mentioned above) and the inconsistent naming of methods that came about after a few bad nights of hacking through Ruby’s inheritance problems. I fixed the actual architectural problem long ago, but left the terrible names in there. So, the self.fooThing and _hasDumbPrefix stuff is my bad. Except for the methods in FeedParserMixin that are named after XML tags. Those names are prefixed with ‘_’ (as they are even in the original Python code) in order to work around the differences between the XML parser and SGML parser.

I should also mention the metric ass load of datetime parsing regular expressions I had to write. Another set of patches I need to write, this time to Ruby core. I don’t even want to discuss them. Go look at time_helpers.rb and see how many times I made one problem into two. My code is grody.
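
To give a taste of what that kind of code looks like, here’s a small sketch for just one format, W3C-style timestamps like 2007-07-22T13:05:09+02:00. It’s not lifted from time_helpers.rb; the pattern and the parse_w3cdtf name are made up for illustration, and it only handles the full date-plus-time form.

```ruby
# Hypothetical pattern: date, time and either "Z" or a +hh:mm / -hh:mm offset.
W3CDTF_PATTERN =
  /\A(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(Z|[+-]\d{2}:\d{2})\z/

# Returns a UTC Time, or nil when the string doesn't match the pattern
# or names an impossible date (month 13, say).
def parse_w3cdtf(text)
  match = W3CDTF_PATTERN.match(text.to_s.strip)
  return nil unless match

  year, month, day, hour, min, sec = match.captures[0, 6].map { |s| s.to_i }
  wall_clock = Time.utc(year, month, day, hour, min, sec)
  zone = match[7]
  return wall_clock if zone == 'Z'

  # An offset of +02:00 means the wall clock runs two hours ahead of UTC,
  # so the offset is subtracted to get back to UTC.
  sign = zone[0, 1] == '-' ? -1 : 1
  zone_hours, zone_minutes = zone[1..-1].split(':').map { |s| s.to_i }
  wall_clock - sign * (zone_hours * 3600 + zone_minutes * 60)
rescue ArgumentError
  nil
end
```

Now picture one of these for every date format that turns up in feeds in the wild, and you have a rough idea of the file.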

The Future of the Tests

Sam brings up the idea of making the tests from the Python feedparser less, er, Pythonic. We could speed up response time. If we change the expectations for dates to some method call taking a 9-tuple (or rather, a 9-list or 9-Array, or 9-some-datastructure-with-brackets-not-parentheses), we could get an instant win. I have no idea what I was trying to say here.

Also, the use of u'', u"" and the \unn or \unnnn format for non-ASCII characters in Python had to be hacked around with regular expressions. While the character-encodings gem provides something like the u'' syntax, the \u characters are completely unsupported. It’s really ugly, and kind of painful, esp. if a developer never had much experience with Python. Fortunately, I had a good deal but probably not enough considering the amount of time it took to write those Regexps.
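
The general shape of that hack looks something like the sketch below. It’s a simplified illustration rather than the Regexps actually used in rFeedParser, and the decode_python_escapes name is made up. It assumes four hex digits after \u and two after \x, which is how Python writes those escapes.

```ruby
# Hypothetical helper: replace Python-style \uNNNN and \xNN escapes in a
# string with the UTF-8 characters they stand for.
def decode_python_escapes(text)
  text.gsub(/\\u([0-9a-fA-F]{4})|\\x([0-9a-fA-F]{2})/) do
    # Whichever group matched holds the code point in hex;
    # pack('U') turns that code point into its UTF-8 bytes.
    [($1 || $2).hex].pack('U')
  end
end

# Single quotes keep the backslash literal, just like the raw test files.
puts decode_python_escapes('caf\u00e9')
```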

The XML test files are a huge boon, and making them more general would make it easier to maintain code equivalence across languages and allow those who are more comfortable in one language to help outside of that language’s project. But, this is all just blue sky stuff for the moment.

And Spent

This post is huge and I need to stop writing. I don’t think I’ve talked about everything I wanted to, but I’m shot. rFeedParser is nice and you should use it and tell other people to use it. Questions and comments are welcome.

Update: A few grammar and spelling clean ups. Sucktasia on ice.

Friggin’ Module Bundles

Tuesday, June 26th, 2007

What was one of the things I wanted the most when I started writing rFeedParser?

This.