Wednesday, November 3, 2010

MapReduce -> GFS -> Bigtable

A yellow elephant, a perfect mascot
for anything geeky.
A friend was recently interviewing for a full-time position at Google, so we had a short conversation over email about some of the infrastructure that engineers there have to master to build the magic that lets Google scale to humanity-sized orders of magnitude. None of this is a big secret: there is a lot of information on the Internet by now, and Google itself has published papers about these technologies. Still, I would like to share the things I learned some time back during my stay at Google. I will try to explain what MapReduce, GFS and Bigtable are from the developer's perspective.

MapReduce: What do you usually do when you want to run your code over a large set of data in parallel? Naturally, you would execute several instances of your program, each with a different part of the data as input. Even better, you could run those instances on different computers. There are several issues to take care of when you do this: writing extra code that splits the data and sends the appropriate pieces to each program instance, distributing the program itself to each computer, and then collecting the results in a single place. You might also want to think about some basic synchronization (counters) and fault tolerance. MapReduce takes care of most of those basic things and some more advanced ones: you write a piece of code that takes data as input, and you send your code and an initialization file to the MapReduce system, which is in charge of distributing your program and executing instances with the specified inputs.

Why the name MapReduce? Because the framework also addresses another issue: what is a way of describing a distributed process that is general enough to implement most kinds of distributed processing? The answer is mapping and reducing. The processes that perform mapping are called mappers and the processes that perform reducing are called reducers. Mappers take one piece of the data as input and process it to generate some output. Reducers take the outputs of several mappers (or of other reducers) as input and produce a single output. Interestingly, by implementing our own mappers and reducers we can express most of the processing functions one might need. I will give two examples below:
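Before the examples, here is a rough sketch of the "do it by hand" approach described at the start of this section: split the data, run one instance per piece, and collect the results in one place. Everything in it is an assumption made for illustration: the function names are made up, threads on one machine stand in for separate computers, and summing numbers stands in for the real work. It is not how MapReduce itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut the data into at most `parts` contiguous pieces (hypothetical helper)."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """The 'program instance': here it only sums its piece of the data."""
    return sum(chunk)


def run_in_parallel(data, parts):
    """Split, run one worker per piece, then collect the partial results."""
    chunks = split(data, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(run_in_parallel(list(range(10)), parts=3))
```

Even in this toy version the splitting and collecting code is longer than the actual work, and it has no fault tolerance at all. That is the part a framework like MapReduce takes off your hands.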

Let's say you are in charge of calculating the average age of people on Facebook. The data is obviously very big, so you will need to compute this average in a distributed way. In this case you would implement mappers that take a (date of birth) as input and output (key, (age, 1)), where key is a constant. Your reducers would take as input a key and a list of (age, count) pairs and output a pair (key, (agesum, countsum)) that adds up the ages and keeps count of how many people that age sum corresponds to. At the end there is a single output, because the reducers keep reducing until there is one output per key, and the average is agesum divided by countsum. Now let's say you want the average age of people on Facebook per country instead of the overall average: you would just use the country as the key, so the reducers receive the (age, count) pairs for one country at a time and keep reducing until there is one output for every country.
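The average-age job above can be sketched in a few lines of Python. This is only a single-process imitation that I wrote to show the shape of the mapper and the reducer; `run_job`, `age_on` and the sample records are hypothetical names and data, not part of Google's MapReduce or of Hadoop.

```python
from collections import defaultdict
from datetime import date


def age_on(birth, today):
    """Age in whole years on the given day."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1  # birthday has not happened yet this year
    return years


def mapper(record, today):
    """Takes (country, date of birth) and emits (key, (age, 1))."""
    country, birth = record
    yield country, (age_on(birth, today), 1)


def reducer(key, values):
    """Adds up (age, count) pairs; its output can be reduced again."""
    age_sum = sum(age for age, _ in values)
    count_sum = sum(count for _, count in values)
    return key, (age_sum, count_sum)


def run_job(records, today):
    """Toy driver: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record, today):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    people = [("PE", date(1980, 1, 1)), ("PE", date(1990, 12, 31)),
              ("JP", date(2000, 6, 15))]
    totals = run_job(people, today=date(2010, 11, 3))
    for country, (age_sum, count_sum) in totals.items():
        print(country, age_sum / count_sum)
```

The reducer emits a (sum, count) pair rather than an average on purpose: sums and counts from different reducers can be added together again, while averages cannot be combined without knowing how many people each one covers.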

Sometimes you don't need reducers at all. Let's say that, in the above example, you need to store a pre-calculated value of each person's age using East Asian age reckoning, in which newborns are considered one year old at birth and the age increases at each Lunar New Year. In this case the mappers take (date of birth) as input and output (East Asian age); there is no need for reducers, and the process is purely mapping. You can watch how other algorithms can be implemented in MapReduce in the next video. (They explain sorting.)
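For the East Asian age example, a map-only job could look roughly like the sketch below. The names are made up, and the short table of Lunar New Year dates is only there for illustration: it is enough for people born in 2008 or later and for dates before the 2011 Lunar New Year. A real job would need a proper lunar calendar.

```python
from datetime import date

# Illustrative table only; a real job needs a full lunar calendar.
LUNAR_NEW_YEARS = [date(2008, 2, 7), date(2009, 1, 26), date(2010, 2, 14)]


def east_asian_age(birth, today):
    """One year old at birth, plus one for every Lunar New Year since then."""
    passed = sum(1 for day in LUNAR_NEW_YEARS if birth < day <= today)
    return 1 + passed


def mapper(record, today):
    """Takes (user id, date of birth) and emits (user id, East Asian age)."""
    user_id, birth = record
    yield user_id, east_asian_age(birth, today)


def run_map_only(records, today):
    """No grouping and no reducers: the mapper output is the final output."""
    return [out for record in records for out in mapper(record, today)]


if __name__ == "__main__":
    users = [("u1", date(2008, 3, 1)), ("u2", date(2010, 1, 10)),
             ("u3", date(2010, 6, 1))]
    print(run_map_only(users, today=date(2010, 11, 3)))
```

Since each output depends on a single input record, the work can be spread over any number of machines without any coordination between them.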

GFS: There is one problem with the above approach that GFS solves. With MapReduce we can now distribute our processing across hundreds of computers, but imagine all those processes trying to read from the same physical location. This will overwhelm most hard drives, and processes will block while waiting to read the data. So if you have a distributed processing system you also need a distributed file system, and that is GFS. A good MapReduce framework will not only use a distributed file system but will also try to assign data to each mapper based on the physical proximity of the data.
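The idea of assigning data by proximity can be sketched as follows. The GFS paper describes files as being cut into chunks that are replicated on several machines; the rest of this sketch is my own assumption: the function name, the machine names and the simple "least loaded machine that already holds a copy" rule are hypothetical, and a real scheduler is far more involved.

```python
def assign_chunks(chunk_replicas, workers):
    """Give each chunk to a worker that stores a replica, if there is one.

    chunk_replicas maps a chunk name to the machines holding a copy of it.
    Falls back to the least loaded worker when no copy is local.
    """
    load = {worker: 0 for worker in workers}
    assignment = {}
    for chunk, hosts in sorted(chunk_replicas.items()):
        local = [host for host in hosts if host in load]
        candidates = local or list(workers)
        # Least loaded candidate first, name as a tie-breaker.
        chosen = min(candidates, key=lambda w: (load[w], w))
        assignment[chunk] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    replicas = {
        "chunk-0": ["m1", "m2"],
        "chunk-1": ["m1", "m3"],
        "chunk-2": ["m4"],  # m4 is not running a mapper
    }
    print(assign_chunks(replicas, workers=["m1", "m2", "m3"]))
```

In this example the first two chunks are read from local disk, and only chunk-2 has to travel over the network.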

Bigtable: Instead of reading and writing files, it would be better to have a more structured way of storing your data, like a database. Bigtable is a database that runs on GFS, but it is also a different kind of database. Among the differences, in Bigtable each row can have a different number of columns without any performance penalty, and you don't need to define columns when creating a table. The original paper says about Bigtable: 'A Bigtable is a sparse, distributed, persistent multidimensional sorted map'. I also found this nice article explaining each of those properties in more detail.
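To make the 'sparse ... multidimensional sorted map' part of that quote more concrete, here is a tiny in-memory sketch. The paper describes the map as indexed by a row key, a column key and a timestamp; the class and method names below are hypothetical and are not the Bigtable or HBase API, and the 'distributed' and 'persistent' parts are left out entirely.

```python
class SparseTable:
    """Toy map from (row, column, timestamp) to value."""

    def __init__(self):
        self._rows = {}  # row -> column -> {timestamp: value}

    def put(self, row, column, value, timestamp):
        """Store one cell version; columns are never declared in advance."""
        versions = self._rows.setdefault(row, {}).setdefault(column, {})
        versions[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, end_row):
        """Yield rows in sorted key order with start_row <= row < end_row."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                cells = self._rows[row]
                yield row, {col: v[max(v)] for col, v in cells.items()}


if __name__ == "__main__":
    table = SparseTable()
    table.put("user:ana", "info:country", "PE", timestamp=1)
    table.put("user:ana", "info:country", "JP", timestamp=2)
    table.put("user:bob", "info:email", "bob@example.com", timestamp=1)
    print(table.get("user:ana", "info:country"))
    print(list(table.scan("user:a", "user:z")))
```

Each row only stores the columns it actually has, which is what 'sparse' means here, and scanning a range of row keys is cheap because rows are kept in sorted order.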

Finally, I will translate the proper title of this post,
'Google MapReduce -> Google File System -> Google Bigtable',
into the open-source Apache implementations of these technologies:
'Apache Hadoop -> Hadoop Distributed File System -> HBase'.