“Living with Big Data: Challenges and Opportunities”, Jeffrey Dean and Sanjay Ghemawat, Google Inc.

As part of the Big Data Lecture Series (Fall 2012), Google's Jeff Dean gave a talk on how Google delivers services that manage huge amounts of data. To make things work over the distributed infrastructure of Google's many data centers, they compose services out of sub-services, and each service communicates with the others through a language-independent protocol. Dean gave the example of a simple spell-correction service that takes a request such as correction{query:"……"}. The advantage of this model is that a service is independent of its clients, so changes can be made without ripple effects. For instance, to add language support to the spell-correction service, they only need to add an optional request parameter: correction{query:"…….", lang:"en_US"}. The model also lets them build and evolve each service independently.
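The backward-compatibility property Dean describes can be sketched in a few lines. This is an illustration only: the function name, request shape, and typo table below are invented, not Google's actual API.

```python
# Sketch of the extensibility Dean describes: an optional field can be
# added to a request without breaking existing clients. The service name
# and request format are illustrative, not Google's actual protocol.

def correct(request):
    """Toy spell-correction service that fixes one known typo.

    `lang` is an optional parameter; older clients that omit it still
    work unchanged, mirroring how optional protocol fields allow
    changes with no ripple effect.
    """
    lang = request.get("lang", "en_US")   # default when the field is absent
    fixes = {"en_US": {"recieve": "receive"}}
    table = fixes.get(lang, {})
    return " ".join(table.get(w, w) for w in request["query"].split())

# Old-style client: no `lang` field.
print(correct({"query": "recieve data"}))                    # receive data
# New-style client: supplies the optional field.
print(correct({"query": "recieve data", "lang": "en_US"}))   # receive data
```

Because the new parameter has a default, the old and new request formats coexist, which is what lets each service evolve independently.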

Since Google has many clusters across different data centers, the list of potential failures is long: rack failures, router failures, hard-drive failures, machine failures, and link failures (especially on long-distance links, which are susceptible to external hazards such as wild dogs and drunken hunters) are just a few. The software itself must therefore provide reliability and availability. Replication lets them cope with hardware failures and with issues such as data loss, slow machines, excessive load, and bad latency. To tolerate latency variability, they primarily use two techniques: cross-request adaptation and within-request adaptation. Cross-request adaptation examines recent behaviour and uses it to make decisions about future requests. Within-request adaptation copes with slow subsystems within the context of a single request; it uses "tied requests", in which each request is sent to two servers (the second copy after a delay of 2 ms). As soon as one of the two starts processing the request, it notifies the other to cancel. Google's experiments showed that latency improves hugely at the cost of only a few extra disk reads.
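The tied-request idea can be sketched with two threads standing in for two replicas. The replica delays and the in-process cancellation event below are simulation devices; in a real system the tie lives in the RPC layer and the cancellation is a message between servers.

```python
# Minimal sketch of "tied requests": the same request goes to two
# replicas, and whichever starts serving it first tells the other to
# drop its copy. Delays are simulated; names are illustrative.
import threading
import time

def replica(name, startup_delay, cancel, results):
    time.sleep(startup_delay)   # time until this replica dequeues the request
    if cancel.is_set():         # its twin already started: drop this copy
        return
    cancel.set()                # tell the twin request to stand down
    time.sleep(0.01)            # simulated work (disk read, lookup, ...)
    results.append(name)

def tied_request():
    cancel = threading.Event()
    results = []
    # The second copy is dispatched slightly later (the talk mentions ~2 ms).
    t1 = threading.Thread(target=replica, args=("fast", 0.0, cancel, results))
    t2 = threading.Thread(target=replica, args=("slow", 0.05, cancel, results))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(tied_request())   # only one replica ends up doing the work
```

The overhead is bounded: at worst both replicas begin the cheap first step before one cancels, which matches the observation that the cost is a few extra disk reads.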

To manage huge amounts of data over distributed infrastructure, Google runs several cluster-level services, such as GFS/Colossus, MapReduce, the cluster scheduling system, and Bigtable. Although these services solve many problems, they also introduce cross-cluster issues. To address them, Google built Spanner, a large-scale storage system that manages data across all of Google's data centers. Spanner has a single global namespace for data. It supports consistent replication across data centers and automatic migration to meet various constraints, whether a resource constraint ("file system is getting full") or an app-level hint ("place this data in Europe"). The key idea is to build high-level systems that provide a high level of abstraction; this black box is incredibly valuable because applications don't need to deal with low-level issues.
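A toy placement policy conveys the flavour of the constraints described above. Everything here, including the datacenter names, the capacity model, and the hint format, is invented for illustration; it is not Spanner's actual interface.

```python
# Illustrative only: a toy placement policy in the spirit of Spanner's
# automatic migration, combining a resource constraint (skip nearly-full
# file systems) with an app-level hint (a preferred region).

def place(region_hint, datacenters):
    """Pick a datacenter for a piece of data."""
    # Resource constraint: skip datacenters whose file system is >90% full.
    candidates = [dc for dc in datacenters if dc["used"] / dc["capacity"] < 0.9]
    # App-level hint: prefer the requested region when it has capacity.
    if region_hint:
        preferred = [dc for dc in candidates if dc["region"] == region_hint]
        if preferred:
            candidates = preferred
    # Otherwise fall back to the emptiest eligible datacenter.
    return min(candidates, key=lambda dc: dc["used"] / dc["capacity"])["name"]

dcs = [
    {"name": "us-1", "region": "us", "used": 80, "capacity": 100},
    {"name": "eu-1", "region": "eu", "used": 95, "capacity": 100},  # nearly full
    {"name": "eu-2", "region": "eu", "used": 40, "capacity": 100},
]
print(place("eu", dcs))   # eu-2: honours the hint, skips the full eu-1
```

The point of the abstraction is that the application only states the hint; where the data actually lives, and when it migrates, is the system's problem.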

Monitoring and debugging are crucial in a distributed environment. Every server at Google supports request tracing (call graphs), online profiling, debugging variables, and monitoring. Google has a tool called Dapper that lets them monitor and debug across their infrastructure.
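The core mechanism behind this kind of request tracing is propagating a trace id through every downstream call so the call graph can be reassembled later. The sketch below is a toy in that spirit, with invented service names and an in-memory log standing in for a real trace collector; it is not Dapper's actual design.

```python
# Toy request tracing: each request carries a trace id through every
# downstream call, and each hop records a span, so the call graph can
# be rebuilt afterwards. Names and the log are illustrative.
import uuid

TRACE_LOG = []   # stand-in for a collector that spans are shipped to

def traced(service):
    def decorator(fn):
        def wrapper(trace_id, *args):
            TRACE_LOG.append((trace_id, service))   # record one span
            return fn(trace_id, *args)
        return wrapper
    return decorator

@traced("spellcheck")
def spellcheck(trace_id, query):
    return dictionary_lookup(trace_id, query)       # trace id flows downstream

@traced("dictionary")
def dictionary_lookup(trace_id, query):
    return query

tid = uuid.uuid4().hex
spellcheck(tid, "hello")
print([svc for t, svc in TRACE_LOG if t == tid])    # ['spellcheck', 'dictionary']
```

Filtering the log by one trace id recovers the path a single request took through the services, which is exactly the call-graph view such tracing enables.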

Much of Google's work amounts to approximating AI. Recently, they have been working on infrastructure for deep learning, an algorithmic approach that automatically learns high-level representations from raw data. It can learn from both labelled and unlabelled data (the latter via unsupervised learning). A model can have billions of parameters and requires many machines to train. To handle this scale, Google partitions the model across machines, and adds another dimension of parallelism by running multiple model instances that communicate with each other. Google has built a deep network for machine learning (learning image representations, and natural language processing for both speech and text) with a significant reduction in training time; in fact, they trained an acoustic model for speech recognition in approximately five days using 800 machines in parallel.
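The two dimensions of parallelism mentioned above can be sketched very roughly: parameters are partitioned into shards (model parallelism), and multiple replicas each train on their own data slice and push updates to the shared shards (data parallelism). The shard layout and update rule below are deliberately simplified illustrations, not the actual design of Google's training system.

```python
# Rough sketch of partitioned parameters plus multiple model replicas.
# Two "shards" stand in for parameter partitions living on different
# machines; each replica pushes gradient updates to the relevant shard.

SHARDS = [{"w": 0.0}, {"w": 0.0}]   # the model, partitioned across machines

def apply_gradient(shard_id, grad, lr=0.1):
    # A replica communicates only with the shard holding the parameter.
    SHARDS[shard_id]["w"] -= lr * grad

def replica_step(gradients):
    # Each replica computes toy "gradients" from its own data slice and
    # pushes them; in the real system these pushes are asynchronous.
    for shard_id, grad in gradients:
        apply_gradient(shard_id, grad)

# Two replicas, each training on its own slice of the data.
replica_step([(0, 1.0), (1, -2.0)])
replica_step([(0, 3.0), (1, 2.0)])
print([s["w"] for s in SHARDS])
```

Partitioning lets a model too large for one machine be trained at all, while the replicas multiply the rate at which training data is consumed, which is how training times like the five-day acoustic-model run become possible.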
