Wednesday, November 3, 2010

MapReduce -> GFS -> Bigtable

A yellow elephant, a perfect mascot
for anything geeky.
Some friend was recently interviewing for a full time position at Google. So we had this small conversation over email about some of the infrastructure that engineers have to master inside to build the magic that makes Google scale to humanity orders of magnitude. This is not a big secret, there is a lot information on the Internet by now and Google itself has published papers about these technologies. But I would like to share the things that I learned some time back during my stay at Google. I will try to explain what are MapReduce, GFS and Bigtable from the developer's perspective.

MapReduce: What do you usually do when you want to run your code over a large set of data in parallel? Naturally you would execute several instances of your program with different parts of your big set of data as input. Even better you could run those several instances of your program on different computers. There are several issues you have to take care of when you do this: Writing some extra code that splits the data and sends the appropriate pieces to each of your program instances, distributing your program itself to each computer and then collecting the data in a single place. And also you might want to think about some basic synchronization (counters) and fault tolerance. MapReduce takes care of most of those basic and more advanced things, you write your piece of code that will take data as input and you send your code and an initialization file to the MapReduce system that will be in charge of distributing your program and executing instances with the specified inputs. Why the name MapReduce? This is because the framework also addresses another issue. What is a general way of describing a distributed process that is so general that can be used to implement most kinds of distributed processing?  The answer is mapping and reducing. The processes that perform mapping are called mappers and the processes that perform reducing are called reducers. Mappers take an instance of the data as input and process it to generate some output. Reducers take the outputs of several mappers (or other reducers) as inputs and output a single output. It is interesting that by implementing our own mappers and reducers we can implement most processing functions that one might need. I will give two examples below:

Let's say you're in charge of calculating the average age of people on Facebook, your data is very big obviously and you will need to compute this average in a distributed way. In this case you would implement mappers that take a (date of birth) as input and output (key, (age, 1)), where key is a constant in this case. Your reducers would take an input of the form list of (age, count) and they would output a pair (key, (agesum, countsum)) that adds the ages and keeps the count of to how many people these age sum corresponds. At the end we will have a single output because the reducers will continue reducing until we have one output per key. So let's say you want to calculate the average age of people on Facebook by each country instead of average of all, then you would just need to use the country as the key and the reducers will receive a list of age and count only for the same key and will keep reducing until you get one output for every country.

Sometimes you don't need the reducers at all, let's say in the above example that you need to store a pre-calculated value of the age of the person using the East Asian age reckoning in which new borns are considered to have 1 year old at the time of birth and the age changes each Lunar Year. In this case the mappers take (date of birth) as input and output (East Asian age), and there is no need to use reducers, the process is purely mapping. You can watch how other algorithms can be implemented in MapReduce in the next video. (They explain sorting)

GFS: There is one problem with the above approach that GFS solves. By using MapReduce we are now able to distribute our processing to hundreds of computers but imagine all those processes trying to read from the same physical location. This will slow down most hard drives and also processes will block while trying to read the data. So if you have a distributed processing system you will also need a distributed file system and that is GFS. A good MapReduce framework will not only use a distributed file system but will also try to assign data to each mapper based on the physical proximity of data.

Bigtable: Instead of reading/writing to files it would be better if you had some more structured way of storing your data, like a database. Bigtable is a database that runs on GFS. But Bigtable is not only a database that runs in GFS, it is also a different kind of database. Some differences are that in Bigtable you can have varying amount of columns for each row without any penalty in performance and you don't need to define columns when creating the tables. The original paper says about Bigtable: 'A Bigtable is a sparse, distributed, persistent multidimensional sorted map'. And I found this nice article explaining each of those properties in more detail

Finally I will translate the proper title of this post
'Google MapReduce -> Google File System -> Google Bigtable'
to the opensource Apache implementation of these technologies
'Apache Hadoop -> Hadoop Distributed File System-> HBase'.

Tuesday, October 26, 2010

Machine Learning and The New York Academy of Sciences

Last Friday some friends and I attended the 5th Machine Learning Symposium organized by The New York Academy of Sciences in Manhattan. People from institutions nearby presented some of their research going in Machine Learning and Learning Theory. Even when I have been mostly working on Computer Vision and Images, the kind of projects that I am working on have a lot to do with Machine Learning so I'm very excited that there exists such a large local community interested in those topics in which I mostly rely upon and might potentially contribute. I would like to comment some of my impressions about the Machine Learning and Computer Vision community ecosystem and how there is an EM-like process going between these two.

Sometimes people in Computer Vision contribute to Machine Learning and sometimes people in Machine Learning contribute to Computer Vision as well. Computer Vision scientists get better results as they get better understanding of the human visual system and also when they get better ways to learn from data. So they often find themselves coming up with new techniques to discover patterns on image data. On the other hand Machine Learning researchers will often try
Charles Darwin at The New York Academy of Sciences
to prove their theories on interesting datasets. Interesting could mean complicated but could also mean datasets that are likely to be reproduced by a natural or artificial process. In this case image statistics are a very interesting case because they come with their own encoded statistics, the so-called natural image statistics. Another application that comes with natural statistics is NLP or Natural Language Processing. So very often Machine Learning researchers find themselves trying to discover patterns on either images or language. And maybe I should also mention that there are those who these days are trying to do them all three and I happen to be a member of such groups which is good because it lets me see closer what's going on in each of those fields. The symposium was very well balanced in this regard, it presented projects dealing with images, text and pure learning theory. I'm looking forward to attend the meeting next year and perhaps presenting some of my own this time.

I also want to stress how impressive New York looks from the New York Academy of Sciences auditorium. The whole magnificence of the giant skyscrapers feels like it can almost be touched and sensation of flying is astounding. You can see in the picture here the shadow projected by the building I was in onto the other buildings. Also I was very happy to find the bust of Charles Darwin in the inside. The eminent scientist that one day decided to pursue his research in the Galapagos Islands in Ecuador in his quest for uncovering the mysteries of evolution.

View from The New York Academy of Sciences

Sunday, September 12, 2010

Texture Synthesis, Computer Graphics and Computer Vision

Choosing some problem to work on is a very difficult task, regardless of the area that you work on. Sometimes we choose problems based solely on our personal curiosity, other times due to funding commitments. This time I was trying to decide about a small project assignment using computer graphics based solely on personal curiosity. One insight I had some time back before starting graduate school was the importance of image processing on computer graphics, especially when it comes to rendering and synthesizing textures. I thought this problem was interesting for several reasons: It was a relatively simple but well defined problem, it required knowledge about image processing, graphics, data structures, statistics, visual perception and optimization. There was already a body of knowledge from where I could learn. And there have been a lot of successful applications and future works based on those methods. I can mention now image inpainting, non photo realistic rendering and some popular seam carving methods on image processing.

The problem of texture synthesis can be described as follows: Let's say we need to put a texture on a large floor for a video game, but we only have a texture image of a small size. How do we texturize the entire floor? One thing we can do is to repeat our small image over and over until the whole surface is covered. We will obtain easily noticeable boundaries, and even if we create special textures that match in the boundaries cylindrically we will still notice the same pattern repeating over and over at a fixed rate. The ideal solution would be to create a method that 'learns' the texture in order to create a larger one that can give you the same appearance locally and globally as the original texture.

I looked at several publications on this and invariably the use of image samples to synthesize textures was the one that had established the most profound impact in the area. The idea of these methods is to take small image patches from the image pattern that you want to reproduce and place them randomly in the output texture but taking care of merging the boundaries without noticeable effects. I can't really survey those methods in a blog post but I actually did a small survey on them in this technical report

I chose to implement the method described in the SIGGRAPH 2001 paper named 'Image Quilting for Texture Synthesis and Transfer' by AA Efros (Berkeley) and WT Freeman (MERL).  Now 9 years after the paper was originally published and 1 year after I decided to take a look and implement this method, Prof. Efros was awarded the Significant New Researcher Award at this year SIGGRAPH 2010.   An excerpt of the award recognition reads:

'...Efros has published in a variety of areas in computer graphics
and computer vision, but his work can be broadly characterized
as employing data-driven approaches to solve problems that are
difficult to model using parametric methods. For example his work
in texture synthesis by example revolutionized an area in which
previous researchers had largely employed parametric approaches
with moderate success ...'

Which also summarizes what I had in mind before, this method and problem was one that was significant, has a proven track of applications derived from it, I can learn from and is definitely worth looking at. I created a video of my program synthesizing several textures. I include here the video and the slides that I presented in my class of Computer Graphics under the kind supervision of Prof. Hong Qin.

Tuesday, August 31, 2010

Niagara Falls

With summer about coming to an end and research starting to get interesting I embarked on a trip to the north-est part of New York to meet the sublime Niagara Falls and the Maid of the Mist experience. I include in this post one picture of Niagara Falls at night with Canada on the other side. 

If you're thinking about going to Niagara Falls from New York City you will probably search for tours that take you there on Google. Independently of what website you find, you will probably end up taking a bus in China Town and paying some extra money for going to either Corning Museum of Glass or Thousand Islands or both. We chose the Thousand Islands combo. Thousand Islands is an archipelago in the border between Canada and US, which for interesting that it might be you will hardly find it very interesting after visiting the majestic Niagara Falls with the Maid of the Mist experience. That said, the Maid of the Mist experience is a short boat trip that takes you as close as you can get to the Falls, so that you can immerse yourself in the mist created by the drops of water in the air. The trip was an experience on itself also, people from China, India, Latin America, Europe made part of this multi-lingual speaking bus. And we quickly passed through a number of cities in north New York, namely Rochester, Buffalo and Syracuse. 
Summer time is over and my first year of graduate school is coming to an end. Looking forward to what comes next.

Tuesday, August 3, 2010

Manhattan Pier 17

The night before taking my flight to California I spend some time with friends at Pier 17 in Manhattan, a place from where you can have a view of the main bridges of New York, more prominently Brooklyn Bridge. I include here a picture we took that night.

We ended up in one of the student housing apartments of Columbia University where one of our lab mates is spending the summer. Finally, I would also like to use this short blog post to recommend the blogs of my friends. I include here the links:  Life Is But A Dream and Oh, by the Way.
Which reminds me that I might need to choose a fancier name for my own blog. 

Saturday, July 31, 2010

The American Museum of Natural History

One of the largest museums in the world, the American Museum of Natural History brings a world of knowledge and visual information like no other place. The feeling that accompanied me along my visit to this vast sanctuary of science was that of a grown up who would have wished to visit this place when he was a little kid. The ancient civilizations, the jungle animals, the stars, the dinosaurs!

This is the way I started 2010, by visiting this iconic place in New York City which is located besides Central Park. I went there with a friend and it took us an entire day to walk around the whole place and read some of the captions about species, places and points in natural history. This is one of the first museums I have visited in the East Coast of the United States and the largest I have ever visited. I can't wait to see the other large museums of New York, my next planned destination is the Museum of Modern Art (MoMA).

Wednesday, July 28, 2010

The New York Aquarium and Coney Island

Coney Island in the south coast of Brooklyn is an interesting place. According to Wikipedia, it used to be an island and there's a lot of history behind the name But what I could notice when I was there about what is said in Wikipedia is the fact that this used to be a place for amusement parks. They say that before World War II this was the main destination in the US if you wanted to go to amusement parks with roller coasters and similar kind of rides. After some years of neglect it lost its popularity and even though when there are still some parks and rides there, it's not the top place anymore. I could personally see one of the old wooden roller coasters (look at the first picture) beside the New York Aquarium, which is the actual place that moved me and a friend to visit Coney Island in southern Brooklyn last year.

The New York Aquarium is yet another of those places you wish you had visited when you were of a younger age. Lots of big aquatic animals that you can see as close to you as some few centimeters. I could see for the first time many animals that I had only seen before through the eyes of Jacques Costeau when I was a child and used to watch his journeys on TV. At the end of the day we walked through the empty beach of this in another time acclaimed iconic place named Coney Island.

Update (Apr 19, 2011): I visited Coney Island again last Sunday and this time under some peer pressure I had a ride in the Cyclone, the wooden roller coaster that can be seen in the first image in this post. I can say it is a great experience to ride this engineering monument that dates back to 1927!

Saturday, July 10, 2010

Summer Time - California and Silicon Valley

Arriving to San Francisco International Airport doesn't really show San Francisco, from there you have to take a BART train in order to get to the downtown. The feeling that you get after getting off from one of the subway stations and you first see the real San Francisco can hardly be described, so I took this picture right after getting there:

Yes, at the beginning of this summer I traveled to California. I attended CVPR 2010, one of the main conferences in Computer Vision and Pattern Recognition. The world of datasets keeps growing and getting richer, most notably imagenet and the scene understanding database.  I could also see the demos and the insightful research that's going on in every major research center. This was a really insightful experience that I will try to use in the best way I can for my own research projects.

Going back to the San Francisco Bay Area and Silicon Valley brought back many good memories. I visited the headquarters of AMD in Sunnyvale where one of my classmates is working now for the summer. Since they bought ATI, they have been heavily involved with graphics and this is reflected in many of the details of their facilities although overall still keeps the feeling of a traditional software company. In contrast, I then visited Google and for my surprise lots of people outside of my old group could still recognize me. I had lunch there and then spent some time playing games. By night I and some friends visited Castro Street which looked more crowded than it used to be, night life in Mountain View is apparently getting better.

Right after my trip to California I shortly visited my home country. This was probably the last time I could see my sister in the following two years since she's planning to move to France to get a masters' degree in Biology. I had a very good time there although I didn't have the chance to visit many places. But as I have always thought, I prefer the people over the places. Now I'm back to research and work at Stony Brook, which is also good.

Saturday, June 12, 2010

Impressions After my Second Semester

I stopped from blogging for quite long. Last semester I was quite busy with Lambda Calculus for my Programming Languages class, where I had the opportunity to struggle and win. I understand now a lot of the principles behind type checking and have a more formal understanding of the properties that can be proven about languages and in particular about programming languages. But still, struggling and winning is probably the best valuable lesson I got from this class, one to be proud about.

The project around which my time and efforts revolved around last semester was an image analysis project related to popularizing the use of visual attributes for general purpose image understanding. My Computational Vision and Learning courses were synergistically helpful on this purpose throughout last semester as well. With nice results at hand, now I'm looking forward to new challenges.

This far I've concentrated on the things that got me interested on the first place: Image Understanding, Digital Media, Computational Learning and Visual Computing in general. But aside of that I've been concentrating on a set of graduate courses where the focus lies on the theory side: This far I've been in Theory of Computation, where in addition we also had a focus on Computational Complexity Theory. I also took the Logic class where although we dove mostly on the mathematical side I could also get the insights from the philosophical view. And I'm looking forward to attend the graduate class on Algorithms. After all "Whether you can observe a thing or not depends on the theory which you use. It is the theory which decides what can be observed."
-Albert Einstein to Werner Heisenberg.