Something Similar

About Jeff Hodges
Atom Feed

Notes on Distributed Systems for Young Bloods

I’ve been thinking about the lessons distributed systems engineers learn on the job. A great deal of our instruction is through scars made by mistakes made in production traffic. These scars are useful reminders, sure, but it’d be better to have more engineers with the full count of their fingers.

New systems engineers will find the Fallacies of Distributed Computing and the CAP theorem as part of their self-education. But these are abstract pieces without the direct, actionable advice the inexperienced engineer needs to start moving[1]. It’s surprising how little context new engineers are given when they start out.

Below is a list of some lessons I’ve learned as a distributed systems engineer that are worth being told to a new engineer. Some are subtle, and some are surprising, but none are controversial. This list is for the new distributed systems engineer to guide their thinking about the field they are taking on. It’s not comprehensive, but it’s a good beginning.

The worst characteristic of this list is that it focuses on technical problems with little discussion of social problems an engineer may run into. Since distributed systems require more machines and more capital, their engineers tend to work with more teams and larger organizations. The social stuff is usually the hardest part of any software developer’s job, and, perhaps, especially so with distributed systems development.

Our background, education, and experience bias us towards a technical solution even when a social solution would be more efficient, and more pleasing. Let’s try to correct for that. People are less finicky than computers, even if their interface is a little less standardized.

Alright, here we go.

Distributed systems are different because they fail often. When asked what separates distributed systems from other fields of software engineering, the new engineer often cites latency, believing that’s what makes distributed computation hard.

But they’re wrong. What sets distributed systems engineering apart is the probability of failure and, worse, the probability of partial failure. If a well-formed mutex unlock fails with an error, we can assume the process is unstable and crash it. But the failure of a distributed mutex’s unlock must be built into the lock protocol.

Systems engineers that haven’t worked in distributed computation will come up with ideas like “well, it’ll just send the write to both machines” or “it’ll just keep retrying the write until it succeeds”. These engineers haven’t completely accepted (though they usually intellectually recognize) that networked systems fail more than systems that exist on only a single machine and that failures tend to be partial instead of total. One of the writes may succeed while the other fails, and so now how do we get a consistent view of the data? These partial failures are much harder to reason about.

Switches go down, garbage collection pauses make masters “disappear”, socket writes seem to succeed but have actually failed on the other machine, a slow disk drive on one machines causes a communication protocol in the whole cluster to crawl, and so on. Reading from local memory is simply more stable than reading across a few switches.

Design for failure.

Writing robust distributed systems costs more than writing robust single-machine systems. Creating a robust distributed solution requires more money than a single-machine solution because there are failures that only occur with many machines. Virtual machine and cloud technology make distributed systems engineering cheaper but not as cheap as being able to design, implement, and test on a computer you already own. And there are failure conditions that are difficult to replicate on a single machine. Whether it’s because they only occur on dataset sizes much larger than can be fit on a shared machine, or in the network conditions found in datacenters, distributed systems tend to need actual, not simulated, distribution to flush out their bugs. Simulation is, of course, very useful.

Robust, open source distributed systems are much less common than robust, single-machine systems. The cost of running on many machines for long periods of time is a burden on open source communities. Hobbyists and dilettantes are the engines of open source software and they do not have the financial resources available to explore or fix many of the problems a distributed system will have. Hobbyists write open source code for fun in their free time and with machines they already own. It’s much harder to find open source developers who are willing to spin up, maintain, and pay for a bunch of machines.

Some of this slack has been taken up by engineers working for corporate entities. However, the priorities of their organization may not be in line with the priorities of your organization.

While some in the open source community are aware of this problem, it’s not yet solved. This is hard.

Coordination is very hard. Avoid coordinating machines wherever possible. This is often described as “horizontal scalability”. The real trick of horizontal scalability is independence – being able to get data to machines such that communication and consensus between those machines is kept to a minimum. Every time two machines have to agree on something, the service is harder to implement. Information has an upper limit to the speed it can travel, and networked communication is flakier than you think, and your idea of what constitutes consensus is probably wrong. Learning about the Two Generals and Byzantine Generals problems are useful here. (Oh, and Paxos really is very hard to implement; that’s not grumpy old engineers thinking they know better than you.)

If you can fit your problem in memory, it’s probably trivial. To a distributed systems engineer, problems that are local to one machine are easy. Figuring out how to process data quickly is harder when the data is a few switches away instead of a few pointer dereferences away. In a distributed system, the well-worn efficiency tricks documented since the beginning of computer science no longer apply. Plenty of literature and implementations are available for algorithms that run on a single machine because the majority of computation has been done on singular, uncoordinated machines. Significantly fewer exist for distributed systems.

“It’s slow” is the hardest problem you’ll ever debug. “It’s slow” might mean one or more of the number of systems involved in performing a user request is slow. It might mean one or more of the parts of a pipeline of transformations across many machines is slow. “It’s slow” is hard, in part, because the problem statement doesn’t provide many clues to location of the flaw. Partial failures, ones that don’t show up on the graphs you usually look up, are lurking in a dark corner. And, until the degradation becomes very obvious, you won’t receive as many resources (time, money, and tooling) to solve it. Dapper and Zipkin were built for a reason.

Implement backpressure throughout your system. Backpressure is the signaling of failure from a serving system to the requesting system and how the requesting system handles those failures to prevent overloading itself and the serving system. Designing for backpressure means bounding resource utilization during times of overload and times of system failure. This is one of the basic building blocks of creating a robust distributed system.

Common versions include dropping new messages on the floor (and incrementing a metric) if the system’s resources are already over-scheduled, and shipping errors back to users when the system determines it will be unable to finish the request in a given amount of time. Timeouts and exponential back-offs on connections and requests to other systems are also useful.

Without backpressure mechanisms in place, cascading failure or unintentional message loss become likely. When a system is not able to handle the failures of another, it tends to emit failures to another system that depends on it.

Find ways to be partially available. Partial availability is being able to return some results even when parts of your system is failing.

Search is an ideal case to explore here. Search systems trade-off between how good their results are and how long they will keep a user waiting. A typical search system sets a time limit on how long it will search its documents, and, if that time limit expires before the all its documents are searched, it will return whatever results it has gathered. This makes search easier to scale in the face of intermittent slowdowns, and failures because those failures are treated the same as not being able to search all of their documents. The system allows for partial results to be returned to the user and its resilience is increased.

And consider a private messaging feature in a web application. We easily believe that if private messaging is down, the image upload feature should probably keep working. So, consider designing for partial failure in the private messaging service itself. This takes some thought, of course. People are generally more okay with private messaging being down for them (and maybe some other users) than they are with all users having some of their messages go missing. If the service is overloaded or one of its machines are down, failing out just a small fraction of the userbase is preferable to missing data for a larger fraction. Being able to recognize these kinds of trade-offs in partial availability is good to have in your toolbox.

Metrics are the only way to get your job done. Exposing metrics (such as latency percentiles, increasing counters on certain actions, rates of change) is the only way to cross the gap from what you believe your system does in production and what it actually is doing. Knowing how the system’s behavior on day 20 is different from its behavior on day 15 is the difference between successful engineering and failed shamanism. Of course, metrics are necessary to understand problems and behavior, but they are not sufficient to know what to do next.

A diversion into logging. Log files are good to have, but they tend to lie. For example, it’s very common for the logging of a few error classes to take up a large proportion of a space in a log file but, in actuality, occur in a very low proportion of requests. Because logging successes is redundant in most cases (and would blow out the disk in most cases) and because engineers often guess wrong on which kinds of error classes are useful to see, log files get filled up with all sorts of odd bits and bobs. Prefer logging as if someone who has not seen the code will be reading the logs.

I’ve seen a good number of outages extended by another engineer (or myself) over-emphasizing something odd we saw in the log without first checking it against the metrics. I’ve also seen another engineer (or myself) Sherlock-Holmes’ing an entire set of failed behaviors from a handful of log lines. But note: a) we remember those successes because they are so very rare and b) you’re not Sherlock unless the metrics or the experiments back up the story.

Use percentiles, not averages. Percentiles (50th, 99th, 99.9th, 99.99th) are more accurate and informative than averages in the vast majority of distributed systems. Using a mean assumes that the metric under evaluation follows a bell curve but, in practice, this describes very few metrics an engineer cares about. “Average latency” is a commonly reported metric, but I’ve never once seen a distributed system whose latency followed a bell curve. If the metric doesn’t follow a bell curve, the average is meaningless and leads to incorrect decisions and understanding. Avoid the trap by talking in percentiles. Default to percentiles, and you’ll better understand how users really see your system.

Learn to estimate your capacity. You’ll learn how many seconds are in a day because of this. Knowing how many machines you need to perform a task is the difference between a long-lasting system, and one that needs to be replaced 3 months into its job. Or, worse, needs to be replaced before you finish productionizing it.

Consider tweets. How many tweet ids can you fit in memory on a common machine? Well, a typical machine at the end of 2012 has 24 GB of memory, you’ll need an overhead of 4-5 GB for the OS, another couple, at least, to handle requests, and a tweet id is 8 bytes. This is the kind of back of the envelope calculation you’ll find yourself doing. Jeff Dean’s Numbers Everyone Should Know slide is a good expectation-setter.

Feature flags are how infrastructure is rolled out. “Feature flags” are a common way product engineers roll out new features in a system. Feature flags are typically associated with frontend A/B testing where they are used to show a new design or feature to only some of the userbase. But they are a powerful way of replacing infrastructure as well.

Suppose you’re going from a single database to a service that hides the details of a new storage solution. Have the service wrap around the legacy storage, and ramp up writes to it slowly. With backfilling, comparison checks on read (another feature flag), and then slow ramp up of reads (yet another flag), you will have much more confidence and fewer disasters. Too many projects have failed because they went for the “big cutover” or a series of “big cutovers” that were then forced into rollbacks by bugs found too late.

Feature flags sound like a terrible mess of conditionals to a classically trained object-oriented developer or a new engineer with well-intentioned training. And the use of feature flags means accepting that having multiple versions of infrastructure and data is a norm, not an rarity. This is a deep lesson. What works well for single-machine systems sometimes falters in the face of distributed problems.

Feature flags are best understood as a trade-off, trading local complexity (in the code, in one system) for global simplicity and resilience.

Choose id spaces wisely. The space of ids you choose in your system will shape your system.

The more ids required to get to a piece of data, the more options you have in partitioning the data. The fewer ids required to get a piece of data, the easier it is to consume your system’s output.

Consider version 1 of the Twitter API. All operations to get, create, and delete tweets were done with respect to a single numeric id for each tweet. The tweet id is a simple 64-bit number that is not connected to any other piece of data. As the number of tweets goes up, it becomes clear that creating user tweet timelines and the timeline of other user’s subscriptions may be efficiently constructed if all of the tweets by the same user were stored on the same machine.

But the public API requires every tweet be addressable by just the tweet id. To partition tweets by user, a lookup service would have to be constructed. One that knows what user owns which tweet id. Doable, if necessary, but with a non-trivial cost.

An alternative API could have required the user id in any tweet look up and, initially, simply used the tweet id for storage until user-partitioned storage came online. Another alternative would have included the user id in the tweet id itself at the cost of tweet ids no longer being k-sortable and numeric.

Watch out for what kind of information you encode in your ids, explicitly and implicitly. Clients may use the structure of your ids to de-anonymize private data, crawl your system in ways you didn’t expect (auto-incrementing ids are typical source), or a host of other things you won’t expect.

Exploit data-locality. The closer the processing and caching of your data is kept to its persistent storage, the more efficient your processing, and the easier it will be to keep your caching consistent and fast. Networks have more failures and more latency than pointer dereferences and fread(3).

Of course, data-locality implies locality in space, but also locality in time. If multiple users are making the same expensive request at nearly the same time, perhaps their requests can be joined into one. If multiple instances of requests for the same kind of data are made near to one another, they could be joined into one larger request. Doing so often affords lower communication overheard and easier fault management.

Writing cached data back to storage is bad. This happens in more systems than you’d think. Especially ones originally designed by people less experienced in distributed systems. Many systems you’ll inherit will have this flaw. If the implementers talk about “Russian-doll caching”, you have a large chance of hitting highly visible bugs. This entry could have been left out of the list, but I have a special hate in my heart for it. A common presentation of this flaw is user information (e.g. screennames, emails, and hashed passwords) mysteriously reverting to a previous value.

Computers can do more than you think they can. In the field today, there’s plenty of misinformation about what a machine is capable of from practitioners that do not have a great deal of experience.

At the end of 2012, a light web server had 6 or more processors, 24 GB of memory and more disk space than you can use. A relatively complex CRUD application in a modern language runtime on a single machine is trivially capable of doing thousands of requests per second within a few hundred milliseconds. And that’s a deep lower bound. In terms of operational ability, hundreds of requests per second per machine is not something to brag about in most cases.

Greater performance is not hard to come by, especially if you are willing to profile your application and introduce efficiencies based on measurement.

Use the CAP theorem to critique systems. The CAP theorem isn’t something you can build a system out of. It’s not a theorem you can take as a first principle and derive a working system from. It’s much too general in its purview, and the space of possible solutions too broad.

However, it is well-suited for critiquing a distributed system design, and understanding what trade-offs need to be made. Taking a system design and iterating through the constraints CAP puts on its subsystems will leave you with a better design at the end. For homework, apply the CAP theorem’s constraints to a real world implementation of Russian-doll caching.

One last note: Out of C, A, and P, you can’t choose CA.

Extract services. “Service” here means “a distributed system that incorporates higher-level logic than a storage system and typically has a request-response style API”. Be on the lookout for code changes that would be easier to do if the code existed in a separate service instead of in your system.

An extracted service provides the benefits of encapsulation typically associated with creating libraries. However, extracting out a service improves on creating libraries by allowing for changes to be deployed faster and easier than upgrading the libraries in its client systems. (Of course, if the extracted service is hard to deploy, the client systems are the ones that become easier to deploy.) This ease is owed to the fewer code and operational dependencies in the smaller, extracted service and the strict boundary it creates makes it harder to “take shortcuts” that a library allows for. These shortcuts almost always make it harder to migrate the internals or the client systems to new versions.

The coordination costs of using a service is also much lower than a shared library when there are multiple client systems. Upgrading a library, even with no API changes needed, requires coordinating deploys of each client system. This gets harder when data corruption is possible if the deploys are out of order (and it’s harder to predict that it will happen). Upgrading a library also has a higher social coordination cost than deploying a service if the client systems have different maintainers. Getting others aware of and willing to upgrade is surprisingly difficult because their priorities may not align with yours.

The canonical service use case is to hide a storage layer that will be undergoing changes. The extracted service has an API that is more convenient, and reduced in surface area compared to the storage layer it fronts. By extracting a service, the client systems don’t have to know about the complexities of the slow migration to a new storage system or format and only the new service has to be evaluated for bugs that will certainly be found with the new storage layout.

There are a great deal of operational and social issues to consider when doing this. I cannot do them justice here. Another article will have to be written.

[1] Of course, Rotem-Gal-Oz’s take on the fallacies is very good.

Much love to my reviewers Bill de hÓra, Coda Hale, JD Maturen, Micaela McDonald, and Ted Nyman. Your insight and care was invaluable.


Monorail (n., jargon)

monorail (n., jargon) - A Big Ball of Mud codebase that began as a web application (esp. one written in a dynamic language) but grew into other responsibilities beyond serving HTTP traffic to users. A monorail is created when a webapp does not have its responsibilities moved to other services as the the demands of the product grow (e.g. offline processing, managing complex data storage, user authentication). Coined by Jean-Paul Cozzatti in 2010, “monorail” is a portmanteau of the words “monolithic” and “Ruby on Rails”.

While it’s difficult to determine the exact point that a webapp becomes a “monorail”, a webapp is certainly a monorail when it begins to write to and read from a distributed queue.

“Don’t put that in the monorail.”

“It’ll save us weeks of developer time if we do it in the monorail.”

“Nobody owned anything in the monorail because everyone’s code touched everyone else’s.”


Finding go.crypto and go.net

It’s kind of a pain in the ass to find the go.crypto and go.net packages and there’s wonderful goodies in both of them. To ease that, I’m writing this up with some pointers to them and some small discussion on a few of my favorite libraries within them.

go.crypto

The go.crypto source code is available in the go project. Note the drop down that let’s you pick it or the other “subrepos” of the Go project proper.

To install one of the libraries (let’s call it $LIB) in go.crypto, run:

go get code.google.com/p/go.crypto/$LIB

All of the libraries in go.crypto and go.net are written entirely in Go. The documentation for all of the go.crypto libraries is available on gopkgdoc.

crypto/bcrypt

I’m going down my favorites alphabetically and that just so happens to mean that the library I wrote is first. Fancy that.

crypto/bcrypt is an implementation of the bcrypt algorithm. bcrypt is an easy-to-use, and very secure means of hashing passwords (and other secrets) such that they cannot be reversed and are very difficult to brute force. Additionally, bcrypt allows you to specify how difficult the hashed password should be to brute force (and, therefore, how difficult it is to hash later when, say, a user logs in). It also provides a means of migrating your data to a more difficult cost as Moore’s law takes hold by embedding the cost you specified as part of the generated hash.

Using crypto/bcrypt is straight-forward. To generate a new hash from a users password to be stored in a database:

import (
    "code.google.com/p/go.crypto/bcrypt"
)

func hashMyPassword(password []byte) []byte {
    // bcrypt.DefaultCost can be substituted for any number between
    // bcrypt.MinimumCost and bcrypt.MaximumCost, inclusively.
    return bcrypt.GenerateFromPassword(password, bcrypt.DefaultCost)
}

When checking the whether a bcrypt hash matches a password given by a user, you MUST use bcrypt.CompareHashAndPassword, which is cryptographically secure (using the lovely crypto/subtle package in the standard library).

bcrypt.CompareHashAndPassword returns nil when the passwords match, and an error otherwise. This is a little odd, but you’ll only use it one or two places in your system.

You MUST NOT use bytes.Equal to compare the returned []byte with what is in your database. Using the naive equality check will make your service susceptible to timing attacks.

Example usage:

import (
    "code.google.com/p/go.crypto/bcrypt"
)

func isCorrectPassword(user *User, password []byte) boolean {
    hashedPassword := user.HashedPassword()
    return bcrypt.CompareHashAndPassword(hashedPassword, password) == nil
}

Installation of crypto/bcrypt:

go get code.google.com/p/go.crypto/bcrypt

The API documentation of crypto/bcrypt is available on gopkgdoc.

crypto/ssh

crypto/ssh is a SSH client and server library. This API is too large to give a great set of examples for, but I’ll give the basics of its networking code.

The crypto/ssh package makes great use of the net.Dial, net.Conn, and net.Listener patterns of building up network connections.

Making an SSH connection to a server and using it is easy:

import (
    "code.google.com/p/go.crypto/ssh"
    "net"
    "net/http"
)

func tunnelToPrivateServer() (*http.Response, error) {
    config := &ssh.ClientConfig{...}
    client, err := ssh.Dial("tcp", "example.com:22", config)
    if err != nil {
        return nil, err
    }
    defer client.Close()
    // client is a *ssh.ClientConn that satisfies the net.Conn interface
    // but it can also be used to tunnel to private resources.
    tr := &http.Transport{
        Dial: func(network, addr string) (net.Conn, error) {
            return client.Dial(network, addr)
        },
    }
    httpClient := &http.Client{Transport: tr}
    return httpClient.Get("http://private.example.com/secrets.txt")
}

Setting up an SSH client terminal or server is slightly more complicated and I defer to the helpful examples in the documentation for ssh.Dial and ssh.Listen.

Installation of crypto/ssh:

go get code.google.com/p/go.crypto/ssh

The API documentation of crypto/ssh is available on gopkgdoc.

go.net

The go.net source code is available in the go project.

To install one of the libraries (let’s call it $LIB) in go.net, run:

go get code.google.com/p/go.net/$LIB

The documentation for all of the go.net libraries is available on gopkgdoc.

net/spdy

net/spdy is a library implementing the SPDY protocol. As of this writing, it implements version 2 of the protocol. Unfortunately, I’ve not had the chance to play with this library, yet, so I’m going to skip making an example. I’d recommend watching that space.

Installation of net/spdy:

go get code.google.com/p/go.net/spdy

The API documentation of net/spdy is available on gopkgdoc.


The Opposite of a Bloom Filter

A Bloom filter is a data structure that may report it contains an item that it does not (a false positive), but is guaranteed to report correctly if it contains the item (“no false negatives”). The opposite of a Bloom filter is a data structure that may report a false negative, but can never report a false positive. That is, it may claim that it has not seen an item when it has, but will never claim to have seen an item it has not.

A colleague, Jeff Smick, had need of “the opposite of a Bloom filter”, a while back, when dealing with a data stream that was pumping hundreds of thousands of events per second. First, I’ll discuss the motivation, then we’ll dig into the implementation, and, finally, talk about alternatives.

Motivation

Along this data stream comes items with a user id, and a few other longs and ints that identify the event completely. These items can be repeated many times and their repetitions are, generally, close together in time. For each unique item, at least one copy must be written to a datastore per hour but multiple duplicate writes are allowed.

This once-per-hour-per-item constraint happens to be very important but the writes were so many that filtering out duplicates became an important scaling concern.

In order to reduce the number of writes, Smick placed a Bloom filter in the stream processors, and was able to get the large reduction expected. However, the code was complicated by the need to switch it out every hour[1], and, more importantly, the Bloom filter would drop every instance of a unique item if it collided with another, causing the switch to never be flipped during the course of an hour.

Given these constraints, and the inability to hold the entire corpus of items in memory over the whole hour, I pitched him on an alternative design using the opposite of a Bloom filter.

Implementation

The simplest data structure that fits the description of a “opposite of a Bloom filter” is a hash map. With a large corpus like the one we were dealing with, the memory consumed by a hash map is far too large and its growth is bound only by corpus size. However, from this simple idea a more fruitful one comes out.

The design I pitched uses an array, a hash function, and the ability to swap ids in and out of the array concurrently. The hash function and the array are used in the same way they are inside of a hash map implementation. Unlike a typical hash map implementation, the array size is bounded and hash collisions are not avoided.

The implementation I described used an AtomicReferenceArray internally. When an item’s byte array id is passed to containsAndAdd, it is hashed to an index in the AtomicReferenceArray. Then using the atomic getAndSet it swaps the byte array currently at that index with the one being queried for.

The old id at that index is compared to the new id just added. If they differ, containsAndAdd returns false and the process should write the event. If the ids are the same, containsAndAdd returns true and the process can safely drop the event on the ground. Occasionally, two byte arrays may hash to the same index. When this happens, containsAndAdd will return false and a duplicate write will be emitted. However, since duplicate items come close together in time, this is relatively rare.

Note that there is no way for a false positive to be emitted from the filter. (This can take a second to see because the phrases “false positive”, “false negative”, “contains” and “filter” have a bunch of negations in their definitions that collide with the others.)

This filter implementation is a thread-safe “opposite of a Bloom filter” with a nice bound on its memory usage.

For reference, I’ve posted an implementation of this filter to github that is similar to what is being used in production that consumes byte arrays[2] in Java and Go. I’ve discussed the Java code in this post, but the design in each is the same. Smick used a different implementation with a lock per index, but the design remained the same.

Alternatives

An alternative that some might propose is to use a cache object with expiration. Guava has a nice way of creating those, but the underlying hash map implementation will use 2.5x or more memory. And the maximum size setting means that items can be thrown out before we’d like them to be. All that said, using a expiring cache implementation is perfectly suitable for smaller datasets.

Further, other implementations could take advantage of loss-less compression to reduce garbage collection pressure. If your distribution of the ids is compact and orderable, tricks could be played with keeping only the intervals between ids and mapping them to the array with a simpler hash function that maintains their order.

And we’re done

Again, the implementation in Java and Go is available on github.

[1] Complicated further by the timestamp in the event item itself that was the source of truth for when the event had occurred, instead of the time of arrival in the server processing it.

[2] Reproducing this code for another object type is trivial. Same if you don’t want to the guava dependency for the murmur hash function.


Tricky Things in Scala-land

I’ve got an old document here describing some of the tricky things and gotchas I found while learning Scala (and some things I had to be reminded of about the JVM) and trying to build on top of Hadoop. I’ve turned that document into this blog post. A few things only apply to Scala 2.7, but thems the breaks.

  • Inner classes (as in the Foo class has an inner Bar class) are accessed as Foo#Bar. You can avoid writing those every with a type MyBar = Foo#Bar in the class or object that’s using the inner class.

  • Getting the class while in the class definition portion of your code is not via Foo.class but classOf[Foo].

  • <%, the view bound operator, lets you accept a container with objects inside that may obey a another trait. The not-exactly-correct meaning of its use is “only allow types that can be turned into this another type”. For instance, I once wanted a min method to implicitly exist on Array. For the method to compile, it had to only accept Arrays that contained types that could be wrapped in Orderable. Using <%in the type parameter of the method made this possible. Of course, this turned out to be a moot point when I found the static (that is, object) method Iterable.min in Scala 2.7. In Scala 2.8, min is nicely defined on types that are Iterables. Also, please ignore how terrible that one-liner min method is.

  • Oh, yeah, by the way, Iterable.min exists in Scala 2.7. Would have saved me learning more about the Scala type system than I really cared to at that point.

  • If you get an “integer too large” when putting in a raw number (like when trying val l = 0xc6a4a7935bd1e995), you forgot the ‘L’. As in, val l = 0xc6a4a7935bd1e995L. This is hopefully the dumbest thing I had to figure out on this list.

  • To define a method or constructor that takes a subclass as it’s argument, you have to use a view bound, <%, to ensure the type of the class. Say, for instance, you want to construct a subclass of Hadoop’s ArrayWritable which requires a class to be passed to it that is a subclass of Writable. To do that you would write:

    class MyArrayWritable(klass: java.lang.Class[_ <: Writable])
      extends ArrayWritable(klass) {
    }
  • Because of the difference between how Scala and Java look up inner classes, when attempting to work with a Java classes that passes its type parameters to its inner classes, you’ll have to use the type trick and a shim class to work with it properly. For instance, Hadoop’s Mapper class required me to do this:

    class SMapper[A,B,C,D] extends Mapper[A,B,C,D] {
      type Context = Mapper[A,B,C,D]#Context
    }
  • The default constuctors syntax is odd. The error message (“error: wrong number of arguments for constructor”) is clearer to to someone who knows Scala because of the little arrow pointing at the supertype in question but it caused me some consternation and javadoc hunting before realizing, no, I wasn’t wrong about the type of my superclass, and, instead, just forgot the syntax for default constructors. You just have to pass your constructor args to the superclass in the extends line. So, given the code below, the important part is to pass bar in to Mine as Mine(bar) in that extends line.

    class Ours(bar: Int) extends Mine(bar) { }
  • Here’s a small list of methods that are useful when trying to explore your way around a library in the Scala REPL.

    val c = new C
    val clazz = c.getClass
    val clazz2 = classOf[C]
    val methods = clazz.getMethods
    val ctors = clazz.getConstructors
    val fields = clazz.getFields
    val annos = clazz.getAnnotations
    val name  = clazz.getName
    val parentInterfaces = clazz.getInterfaces
    val superClass = clazz.getSuperclass
    val typeParams = clazz.getTypeParameters
  • Use javap to figure out what methods are actually on an instance of your class. This is often useful to figure out why your method isn’t overriding a method in its superclass.

  • Also, when using javap, don’t forget that Scala objects have a $ added to the end of their name, so double check the names of the actual class files.

  • When using javap, your shell will probably require the $ to be escaped (as \$) or have single quotes around it to parse your input properly. e.g. javap -c 'Outer$Inner'.

  • Oh, and adding -private and -verbose to your javap arguments is how you find out anything really useful.


Archive