In 2003, before I went to work for Google and began using thousands of distributed machines to process data, before I had abstractions like MapReduce and GFS, and before terabyte data sizes were commonplace, I was working on statistical machine translation [1] at the University of Southern California. At the time, we had a few machines (maybe 5 or 6) with just a few gigabytes of memory each. We processed hundreds of gigabytes of data, and a full pass through all the steps in our system took about a week; that was the best balance we could find between using as much data as possible and keeping a rapid experimentation cycle. Finding this sweet spot allowed us to consistently achieve top results in translation quality evaluations.

The machines had 2-4GB of memory, which wasn't enough to hold all of the data, so a lot of the flows we built operated by processing streams from disk. Leaving out some of the details, a typical flow for part of the system would collect counts over text. For example, to build an English language model that could tell you whether one sequence of words was more likely than another, you might:

  1. Preprocess raw text files into tokenized text
  2. Collect multi-word phrase counts from tokenized text
  3. Remove phrases with low counts
I'm leaving a lot out for purposes of discussion, but this flow would basically be encoded as a series of Unix commands, maybe something like this (splitting the command up to mirror the steps):
  1. cat source.txt | tokenize.pl |
  2. dump-phrases.pl | sort | uniq -c |
  3. egrep -v "^ *1 " > phrase-counts.txt
Here's that broken down a little if you're not familiar with the commands and flags: cat streams the raw file and tokenize.pl splits it into tokens; dump-phrases.pl emits one multi-word phrase per line, sort groups identical phrases together, and uniq -c collapses each group into a single line prefixed by its count; finally, egrep -v drops every line matching the pattern, here the phrases that occurred only once. This is a simple example; the real flows would be much longer and involve more complicated manipulations. One common theme was that we always used sort as a poor man's MapReduce, since its implementation uses merge sort and is very memory efficient (and disk intensive). I think this technique was common amongst folks doing data-heavy processing at the time.

One natural development practice that emerged was to begin with the output of a step, which was contained in a file on disk, then build the next step in the flow as a separate command. You would process a few lines of the file with your new step and interactively develop and debug it. This incremental build-out was pretty much the only way you could construct these complex flows. It seemed natural.

More recently, since adopting Clojure as my native language, I've discovered that the same phenomenon occurs. This is true of the language generally, in that you build code up incrementally, but I'm going to highlight the thrush operator specifically. This operator is written ->> (which makes it hard to google for), and it's the sister of the threading (->) operator. Here are a few examples if you haven't seen it before [2]:
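Roughly, ->> takes its first argument and threads it through each of the remaining forms as their last argument, so the two expressions below are equivalent (the numbers are arbitrary, just to show the mechanics):

  ;; nested calls, read inside-out
  (reduce + (filter odd? (map inc (range 10))))

  ;; the same thing with ->>, read top-to-bottom
  (->> (range 10)
       (map inc)
       (filter odd?)
       (reduce +))
  ;; => 25

  ;; its sister -> threads the value into the *first* argument slot instead
  (-> 5 (+ 3) (* 2))   ; (* (+ 5 3) 2) => 16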

What this allows you to do then is take a sequence, and build flows up in the same exact way that we used to! Except with much much faster iteration cycles, since you can use the REPL, and arbitrary clojure functions. Here's an example of how the first step from above would look now if it were written in clojure, along with a few examples of how you can easily modify the expression so that you can inspect what's going on. In the REPL, this is as simple as using up-arrow to edit the last command.
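Something like this, where tokenize is just a stand-in for the real preprocessing (here it lower-cases a line and splits on whitespace) and source.txt is whatever raw text file you're starting from:

  (require '[clojure.string :as str])

  ;; stand-in tokenizer: lower-case and split on whitespace
  (defn tokenize [line]
    (str/split (str/lower-case line) #"\s+"))

  ;; step 1: raw text -> tokenized text
  (->> (str/split-lines (slurp "source.txt"))
       (map tokenize))

  ;; while developing, tweak the tail of the expression to peek at results
  (->> (str/split-lines (slurp "source.txt"))
       (map tokenize)
       (take 5))                 ; just the first few lines

  (->> (str/split-lines (slurp "source.txt"))
       (map tokenize)
       (map count)
       (take 5))                 ; how many tokens per line?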

Finally, here's what the entire thing might look like.
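A sketch of the whole flow, with tokenize as above and a phrases helper that pulls out every 2- and 3-word phrase (the phrase lengths are an arbitrary choice for illustration):

  (require '[clojure.string :as str])

  (defn tokenize [line]
    (str/split (str/lower-case line) #"\s+"))

  ;; every overlapping 2- and 3-word phrase in a line of tokens
  (defn phrases [tokens]
    (mapcat #(partition % 1 tokens) [2 3]))

  (->> (str/split-lines (slurp "source.txt"))
       (map tokenize)              ; 1. preprocess into tokenized text
       (mapcat phrases)            ; 2. pull out multi-word phrases...
       frequencies                 ;    ...and count them
       (remove #(= 1 (val %)))     ; 3. drop phrases with low counts
       (into {}))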

The flow as described here happens to be carried out in memory, but it would be trivial to have the steps in the sequence write to disk instead of holding contents in memory. The key similarity is the ability to build up flows piecemeal, inspecting results and debugging code as you develop. This technique strikes me as particularly useful. Hopefully this helps some people "get it".


  1. You can see an example of statistical machine translation live at translate.google.com
  2. Try this at tryclj.com