In June 2009 we announced the launch of Yahoo!’s distributed Cloud storage platform, called Sherpa. Sherpa is the next-generation structured-record distributed storage service in Yahoo! that addresses scalability, availability and latency needs of Yahoo! websites.
Sherpa is horizontally scalable, and supports this scale elastically based on capacity demands. Sherpa is globally deployed within Yahoo! and data is replicated asynchronously (typically under a second) across data centers. Sherpa provides different consistency models, and applications can use consistency knobs and APIs to deal with data that is replicated asynchronously.
In the last year, we have added a number of interesting (and useful) features to Sherpa that has helped us serve the needs of Yahoo! better.
Contents
Many Web applications deploy MySQL (or some relational database) and tend to use it as a key-value store. As new NoSQL data-stores are becoming widespread in the open-source community, developers are moving their existing or new applications to these NoSQL data-stores. In Yahoo! as we help migrate such applications, the typical question is: Can Sherpa replace MySQL?
Yes, in certain cases. When data access is based on primary key (PK) alone, it is very easy to map SQL queries to Sherpa Web service requests. For example, a simple SQL statement for record retrieval would be “SELECT (*|col1,col2,…) FROM table WHERE pkcol = pkval;” The corresponding request in Sherpa is “$curl http://…/V1/get/table/pkval[field=col1&field=col2]”.
However, applications typically have more than just single key access patterns. The application may desire to access a set of records by the PK prefix. Or even have the results sorted on the PK. For example, retrieve the latest 10 status updates by user “ppsn”. Such queries are not (easily) possible using a distributed hash table. Hence, recently, we introduced distributed ordered tables (DOT) in Sherpa.
Using DOT, applications can access data clustered on the PK. For example, using the “$curl http://.../V1/ordered_scan/otable?start-key=value1&end-key=value2&order=desc” request, an application can get records in the PK range from “value1” to “value2”. In SQL this would be expressed as “SELECT * FROM otable WHERE pkcol BETWEEN “value1” AND “value2” ORDER BY pkcol DESC“. Similarly, “$curl http://.../V1/ordered_scan/otable?prefix=value-prefix&order=desc” can be used to in place of a LIKE query in SQL.
Under the covers, Sherpa ordered tables are sharded by the PK ranges rather than the hash ranges of the PK (in DHT). In the case of ordered tables, the distribution of primary keys will not be uniform; the sizes and load on the shards could vary dramatically. This makes sharding and load balancing a key challenge for DOT. So, we thought the best way to address this is by having an automated way of balancing the load via YAK.
Sherpa now has an automated load balancer, called YAK, which is responsible for detecting hotspots (highly loaded storage servers) and moving shards to lightly loaded servers. YAK does both load balancing and re-partitioning automatically, while tracking load across the hundreds of servers in multiple data centers. Each storage server keeps history of load/capacity metrics such as latencies, number of requests, error rate, number of records, and size; and keeps these across dimensions such as shard and table. Based on these metrics, YAK calculates the “heat” metric of the server as well as the heat of the shards on the server.
YAK has a rules-engine that uses the metrics to make load balancing decisions. The rules-engine picks servers that are hot and attempts to shift loads to “cold” servers by moving shards on-the-fly. Thus, it aims to keep the heat metric of the whole cluster near the average. In some cases, moving shards from one server to another may make the destination server too hot. So, YAK may decide to re-partition a shard into smaller units of movable load
Both load shifting and re-partitioning is done with minimal impact to ongoing application traffic. Our shard move protocol ensures that the new server has the latest copy (both from local updates and those via replication) before live traffic is routed from the old server to the new one.

As seen in the preceding figure, a typical Web application architecture in a data center would store data in a key-value system (such as Sherpa). In addition, the architecture could include a caching layer (e.g., memcached), and an external index (e.g., Lucene) for full text search.
Application logic gets complex. Queries can be answered from one or more of the data stores. However, application code would need to make three update calls in the code flow to maintain three disparate data repositories — and, of course, deal with failure conditions of one or more of the stores. This problem gets more complicated in a globally replicated deployment. Applications need to maintain consistency across data centers for the cache and text index.
Sherpa makes life easy for application developers with Real-time Notification Streams. Applications can listen to the notification events on a table, and use it to maintain caches and external indexes. The Notification Stream is reliable, but delivered asynchronously (typically under one second) from the associated update. Asynchronous delivery allows low latency commits of updates, but a small gap (typically under a second in practice) for the update appearing on the cache or index. A key simplifying feature is local delivery: updates to a record in any data center is delivered to a local notification client in each data center. (We have implemented strict security controls to ensure only authorized clients can receive notification events of a table.)
Each notification event includes a payload indicating the values that have changed — both old and new value. Another useful feature is mandatory fields — fields that may not have changed — with each notification event. For example, consider the use case of maintaining a secondary index called “soccer_moms”. When fields of a table are updated, the application will be notified with the old value and new value of the respective field. However, to maintain the “soccer_moms” index, it is important that the “gender” field be “female” and “num_children” be greater than zero. Hence, receiving notification with just the updates on a single field cannot be used for index maintenance. Applications can thus request that both “gender” and “num_children” be included (even if one of them changes) on each notification event, in order to maintain the “soccer_moms” index.
The first deployment of Sherpa supported the timeline-consistency model — namely, all replicas of a record apply all updates in the same order — and has API-level features to enable applications to cope with asynchronous replication. Strict adherence leads to difficult situations under network partitioning or server failures. These can be partially addressed with override procedures and local data replication, but in many circumstances, applications need a relaxed approach.
In late 2009, we introduced support for Eventual Consistency. Under this configuration, applications can perform inserts and updates on a table in the local replica with low latency. Conflicting updates from different replicas are resolved using a “last writer wins” policy at the field level (we call it merging) to ensure that replicas eventually converge to the same state. In the future, we are planning to introduce conflict notifications, to provide applications with the power to resolve such conflicts. Also, with eventual consistency, we maintain high availability in the event of server failures and network partitions. (A nice blog by Daniel Abadi talks more about this.)
In the months ahead, we plan to introduce several key features into Sherpa. A few that come to mind are secondary indexes, record-level adaptive replication, and very low latency access. It has been an extremely exciting time for Sherpa at Yahoo!. In the last year, we have onboarded numerous Yahoo! properties — such as YQL, Mobile, Yahoo! Social Platform (YOS), Advertising, MyYahoo!, Video, Sports, Shopping — into a multi-tenant, multi-datacenter system. In addition to bringing more users on to Sherpa, in the coming months Sherpa will grow to running on thousands of servers within Yahoo!.
The Oscars show on Sunday, March 7, was a big night for Yahoo!. As millions of people worldwide turned to Yahoo! for detailed coverage of the event, Yahoo!’s social features allowed visitors to join the discussion around breaking news of the awards by reading and posting comments.
Though not in the limelight like this year’s blockbuster Avatar movie, Yahoo! Avatars worked behind the scenes to make Oscar night entertaining for tens of millions of online viewers – and to do so, it made extensive use of the Yahoo! cloud. Yahoo! Avatars are used by tens of millions of Yahoo! users to represent their online identity when they interact through the social features built into Yahoo! properties.
Yahoo! Avatars migrated to Yahoo!’s unstructured storage cloud, MObStor, last year, in order to leverage the power of the cloud to scale seamlessly and handle peak traffic on Oscar night (over ten times our average property traffic).


Scaling traffic for user-generated content (e.g., Avatars) is a particularly difficult problem. On many content-driven sites, most of the traffic is driven by a small amount of content. Caching this content intelligently using web front-ends, proxies, and/or edge caches can allow you to scale your website considerably, when designed with that goal in mind. However, with user-generated content, the data sets are frequently large, and the accesses are spread randomly across the entire data set, making most caching strategies ineffective.
Before we implemented the cloud, every property would need to be prepared for such events by provisioning enough capacity to handle peak anticipated load, no matter how infrequent. As you’d imagine, this gets expensive quickly – particularly if storage is the bottleneck, as it can be with large amounts of static user-generated content.
Using the cloud, we can be smarter – we can intelligently use all the spindles at our disposal, caching more popular content higher up in the cloud stack, while only going to the spindles to serve cold reads. We overprovision smartly, sharing the expense of over-provisioning across a broad portfolio of properties. Important data is replicated to multiple datacenters worldwide for performance and availability. Taking storage scalability out of the equation also allows properties to focus on their main objective – building great products.
The cloud also integrates well with complementary cloud services. For example, properties can leverage technologies like edge caching (provided by the Yahoo! Cache System) to provide lower latencies to customers worldwide, without deploying any new hardware or software of their own. Yahoo!’s portfolio of cloud products solves these difficult problems once, so that every product engineering team doesn’t need to re-invent the wheel.
In other words, the cloud helps Yahoo! scale better, smarter, cheaper and faster … And the proof? On Oscar night, Yahoo! Avatars fared significantly better than its celluloid namesake — and didn’t end up in the hurt locker.
I presented a talk on cloud computing at an event organized by the AIM Institute at Gallup University in Omaha, Nebraska, on March 26. The phrase “cloud computing” is a bit vague, so I focused on describing a few different types of cloud-like services (e.g., app hosting, virtual private servers, file & data storage, grid computing, etc.) and Yahoo!’s role in the cloud computing community.
Omaha has a thriving tech scene. My host, Jeff Slobotski — Director of Innovation & New Media at the AIM Institute, co-organizer of the Big Omaha conference, and cofounder of Silicon Prairie News — is in the middle of it. Yahoo! opened its newest data center there in February, and there are numerous other tech and Fortune 500 companies in residence. Micah Laaker, Director, Yahoo! Open Strategy UED, described his experience at Big Omaha last year in his YDN blog post “Big Lights, Big Omaha”.
If you’re in Omaha, or are planning to visit, and would like to integrate with the tech community there, keep the following organizations and events in mind:
- www.aiminstitute.org
- Big Omaha
- Infotec
- AIM Institute’s technology breakfast seminars
Big thanks to Jeff and Silicon Prairie News cofounder/Big Omaha co-organizer Dusty Davidson for being awesome hosts, and Seraphim Mullin’s team at the Yahoo! office in Omaha for receiving us on (very) short notice.
Erik Eldridge
Yahoo! Developer Network (@ydn)
Thanks to the more than 200 developers who came to Yahoo! Wednesday night for our monthly Hadoop User Group meeting. It was our largest turnout ever and people stuck around until 9 pm.
The discussion centered around Yahoo!’s Hadoop Security release, a presentation of Hadoop research by a Berkeley PhD student and an overview of the Flightcaster flight delay prediction service built on Hadoop.
At the beginning of the meeting we announced Hadoop Summit 2010 – June 29th, Santa Clara, a one day event for technology leaders and application developers featuring keynotes from Yahoo!, Amazon Web Services, Cloudera and Facebook. We are introducing multiple track sessions around Hadoop programming, case studies and cutting-edge research. Registration and paper submission are available here.
Hadoop User Group March Meeting Recap:
It was great to see many new faces. The interesting mix of experienced developers and Hadoop “newbies” lead to many productive discussions:
A few interesting comments from attendees, posted at the event meetup page:
“Today’s meetup was really informative and very good presentation on Hadoop Online Project.. keep it coming!!”. Dave Jespersen, VP Engineering at MapR Technologies
“It was good to see security enhancements. Streaming and flight delay predictor were very cool and interesting”. Arul Ganesh, Java Developer.
“The security and performance presentations were excellent. I enjoyed the climate and the chance to meet and talk with other Hadoop enthusiasts. Great job organizing and putting these events together. I sincerely appreciate it”. Sreeni Jaladanki, Engineering Manager
For those of you who were unable to attend in person, the session’s details and slides are posted below, we will publish the video recording soon. Stay tuned!
Owen O’Malley from the Yahoo! Hadoop Team provided an overview of the upcoming Hadoop Security release. Owen described the features and capabilities included as well as operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as major milestone in continuing to make Hadoop an enterprise-grade platform. Stay tuned for a detailed post on security coming to this blog soon.
Tyson Condie a Ph.D. student at the University of California, Berkeley, presented the innovative research around Hadoop Online efforts lead by Prof. Joseph M. Hellerstein . Tyson described a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, can reduce completion times and improve system utilization. Tyson included examples from the HOP – Hadoop Online Prototype project.
Bradford Cross from Flightcaster provided an exciting overview on the FlightCaster flight delays prediction service and some cool insights into the airline industry. Bradford described how they built a scalable machine learning and data analysis platform using Clojure dynamic programming language wrapping Cascading and Hadoop. Bradford demonstrated how the use of Hadoop makes building scalable systems much simpler.
We at Yahoo!, definitely see the importance of Hadoop – the ability to process massive data sets is core to our business – and we are continuing to invest heavily in the technology and the community to make it even better. We love being at the center of discussion and debate around Hadoop.
Please join us at the Hadoop Summit to continue the conversation.
As always, we are looking for exciting technologies and experiences you want to share.
Please email presentation requests at the Hadoop Bay Area User Group Meetup page.
See you all on April 21st, 2010. Registration is available here, agenda will be published soon
Dekel Tankel
Director, Product Management
Cloud Computing at Yahoo!
The first India Hadoop Summit was held on Feb 28th, 2010 at DAYANANDA SAGAR EDUCATIONAL INSTITUTIONS, Bangalore, in partnership with CloudCamp event. It was organized by Yahoo! Research and Development India in association with CloudCamp committee.
This was the first Hadoop event, of this scale, in India, and brought many Hadoop enthusiasts together. Speakers and audiences from Hadoop development community, Hadoop user-groups, Industry evangelists, university researchers and college students came together to learn more about using Hadoop in various environments.
The morning began with a keynote session “Hadoop and its impact in Computing” (slides) by Hemanth Yamijala from Yahoo!. Hari Vasudev, VP Platform Technology Group at Yahoo!, spoke on Yahoo!’s commitment to Hadoop and Open-Source (slides).
In the afternoon, a Hadoop panel followed several fascinating sessions. Some of the Hadoop sessions:
The ending panel was moderated by Basant Verma, Yahoo! The participants were Chidambaran V. Kollengode from Yahoo!, Dr. T.S. Mohan from Infosys, Jothi Padmanabhan from Yahoo!, and Dr. G. Sudha Sadhasivam from PSG Tech Coimbatore. They discussed a range of topics pertaining to Hadoop’s adoption, usage in research, and the future of Hadoop & Cloud Computing.
Yahoo! India also hosted a booth during the event, including a Hadoop cross-word puzzle and some popular Hadoop swag. Stay tuned to hear more about our continued commitment to Hadoop and to Open Source Technologies.
Preeti Priyadarshini
Program Manager, CCDI Grid Computing
Web development is a challenging job, so you need the very best web owner tools to get it done right. Whether it's SEO, programming, utilities, software or just keeping up on the latest trends; web owner tools are what you need to succeed. Give yourself a headstart on the competition, and bookmark Web Owner Tools today.