PrimeFaces Cache

The Cache component is used to reduce page load time by caching content in a global cache after the initial rendering. Various cache providers are supported, such as Ehcache and Hazelcast. In this example, a toolbar component is cached and its output is retrieved from the cache.

http://www.primefaces.org/showcase/ui/misc/cache.xhtml

Tumblr Architecture – 15 Billion Page Views a Month and Harder to Scale than Twitter

With over 15 billion page views a month Tumblr has become an insanely popular blogging platform. Users may like Tumblr for its simplicity, its beauty, its strong focus on user experience, or its friendly and engaged community, but like it they do.

Growing at over 30% a month has not been without challenges, reliability problems among them. It helps to realize that Tumblr operates at a surprisingly huge scale: 500 million page views a day, a peak rate of ~40k requests per second, and ~3TB of new data to store a day, all running on 1000+ servers.

One of the common patterns across successful startups is the perilous chasm crossing from startup to wildly successful startup. Finding people, evolving infrastructures, servicing old infrastructures, while handling huge month over month increases in traffic, all with only four engineers, means you have to make difficult choices about what to work on. This was Tumblr’s situation. Now with twenty engineers there’s enough energy to work on issues and develop some very interesting solutions.

Tumblr started as a fairly typical large LAMP application. The direction they are moving in now is towards a distributed services model built around Scala, HBase, Redis, Kafka, Finagle,  and an intriguing cell based architecture for powering their Dashboard. Effort is now going into fixing short term problems in their PHP application, pulling things out, and doing it right using services.

The theme at Tumblr is transition at massive scale. Transition from a LAMP stack to a somewhat bleeding edge stack. Transition from a small startup team to a fully armed and ready development team churning out new features and infrastructure. To help us understand how Tumblr is living this theme is startup veteran Blake Matheny, Distributed Systems Engineer at Tumblr. Here’s what Blake has to say about the House of Tumblr:

Site:  http://www.tumblr.com/

Stats

  • 500 million page views a day
  • 15B+ page views a month
  • ~20 engineers
  • Peak rate of ~40k requests per second
  • 1+ TB/day into Hadoop cluster
  • Many TB/day into MySQL/HBase/Redis/Memcache
  • Growing at 30% a month
  • ~1000 hardware nodes in production
  • Billions of page visits per month per engineer
  • Posts are about 50GB a day. Follower list updates are about 2.7TB a day.
  • Dashboard runs at a million writes a second, 50K reads a second, and it is growing.

Software

  • OS X for development, Linux (CentOS, Scientific) in production
  • Apache
  • PHP, Scala, Ruby
  • Redis, HBase, MySQL
  • Varnish, HAProxy, nginx
  • Memcache, Gearman, Kafka, Kestrel, Finagle
  • Thrift, HTTP
  • Func – a secure, scriptable remote control framework and API
  • Git, Capistrano, Puppet, Jenkins

Hardware

  • 500 web servers
  • 200 database servers (many of these are part of a spare pool we pulled from for failures)
    • 47 pools
    • 30 shards
  • 30 memcache servers
  • 22 redis servers
  • 15 varnish servers
  • 25 haproxy nodes
  • 8 nginx servers
  • 14 job queue servers (kestrel + gearman)

Architecture

  • Tumblr has a different usage pattern than other social networks.
    • With 50+ million posts a day, an average post goes to many hundreds of people. It’s not just one or two users that have millions of followers; the typical Tumblr user has hundreds of followers. This is different from any other social network and is what makes Tumblr so challenging to scale.
    • #2 social network in terms of time spent by users. The content is engaging. It’s images and videos. The posts aren’t bite-sized. They aren’t all long form, but they can be. People write in-depth content that’s worth reading, so people stay for hours.
    • Users form a connection with other users so they will go hundreds of pages back into the dashboard to read content. Other social networks are just a stream that you sample.
    • Implication is that given the number of users, the average reach of the users, and the high posting activity of the users, there is a huge amount of updates to handle.
  • Tumblr runs in one colocation site. Designs are keeping geographical distribution in mind for the future.
  • Two components to Tumblr as a platform: public Tumblelogs and Dashboard
    • Public Tumblelog is what the public deals with in terms of a blog. Easy to cache as it’s not that dynamic.
    • Dashboard is similar to the Twitter timeline. Users follow real-time updates from all the users they follow.
      • Very different scaling characteristics than the blogs. Caching isn’t as useful because every request is different, especially with active followers.
      • Needs to be real-time and consistent. Should not show stale data. And it’s a lot of data to deal with. Posts are only about 50GB a day. Follower list updates are 2.7TB a day. Media is all stored on S3.
    • Most users leverage Tumblr as a tool for consuming content. Of the 500+ million page views a day, 70% of that is for the Dashboard.
    • Dashboard availability has been quite good. Tumblelog hasn’t been as good because they have a legacy infrastructure that has been hard to migrate away from. With a small team they had to pick and choose what they addressed for scaling issues.

Old Tumblr

  • When the company started on Rackspace it gave each custom domain blog an A record. When they outgrew Rackspace there were too many users to migrate. This is 2007. They still have custom domains on Rackspace. They route through Rackspace back to their colo space using HAProxy and Varnish. Lots of legacy issues like this.
  • A traditional LAMP progression.
    • Historically developed with PHP. Nearly every engineer programs in PHP.
    • Started with a web server, database server and a PHP application and started growing from there.
    • To scale they started using memcache, then put in front-end caching, then HAProxy in front of the caches, then MySQL sharding. MySQL sharding has been hugely helpful.
    • They use a squeeze-everything-out-of-a-single-server approach. In the past year they’ve developed a couple of backend services in C: an ID generator and Staircar, which uses Redis to power Dashboard notifications.
  • The Dashboard uses a scatter-gather approach (a minimal sketch follows this list). Events are displayed when a user accesses their Dashboard. Events for the users you follow are pulled and displayed. This will scale for another 6 months. Since the data is time ordered, sharding schemes don’t work particularly well.
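
As a rough illustration of that scatter-gather read path, here is a minimal Scala sketch; the names (Event, recentEventsFor) are illustrative stand-ins, not Tumblr’s actual PHP implementation.

object OldDashboardSketch {
  case class Event(authorId: Long, timestamp: Long, payload: String)

  // Stand-in for a per-user lookup; in reality this is a query against that user's shard.
  def recentEventsFor(userId: Long): Seq[Event] = Seq.empty

  // Scatter: one lookup per followed user. Gather: merge into reverse time order.
  def dashboard(followedIds: Seq[Long], limit: Int = 20): Seq[Event] =
    followedIds
      .flatMap(recentEventsFor)
      .sortBy(e => -e.timestamp)
      .take(limit)
}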

New Tumblr

  • Changed to a JVM centric approach for hiring and speed of development reasons.
  • Goal is to move everything out of the PHP app into services and make the app a thin layer over services that does request authentication, presentation, etc.
  • Scala and Finagle Selection
    • Internally they had a lot of people with Ruby and PHP experience, so Scala was appealing.
    • Finagle was a compelling factor in choosing Scala. It is a library from Twitter. It handles most of the distributed issues like distributed tracing, service discovery, and service registration. You don’t have to implement all this stuff. It just comes for free.
    • Once on the JVM Finagle provided all the primitives they needed (Thrift, ZooKeeper, etc).
    • Finagle is being used by Foursquare and Twitter. Scala is also being used by Meetup.
    • Like the Thrift application interface. It has really good performance.
    • Liked Netty, but wanted out of Java, so Scala was a good choice.
    • Picked Finagle because it was cool, knew some of the guys, it worked without a lot of networking code and did all the work needed in a distributed system.
    • Node.js wasn’t selected because it is easier to scale a team with a JVM base. Node.js isn’t developed enough to have standards, best practices, and a large volume of well-tested code, and there’s not a lot of knowledge of how to use it in a scalable way when you target 5ms response times, 4 9s of HA, 40K requests per second, and some services at 400K requests per second. With Scala you can use all the Java code, so there’s a lot in the Java ecosystem they can leverage.
  • Internal services are being shifted from being C/libevent based to being Scala/Finagle based.
  • Newer, non-relational data stores like HBase and Redis are being used, but the bulk of their data is currently stored in a heavily partitioned MySQL architecture. Not replacing MySQL with HBase.
  • HBase backs their URL shortener with billions of URLs and all the historical data and analytics. It has been rock solid. HBase is used in situations with high write requirements, like a million writes a second for the Dashboard replacement. HBase wasn’t deployed instead of MySQL because they couldn’t bet the business on HBase with the people that they had, so they started using it with smaller, less critical path projects to gain experience.
  • Problem with MySQL and sharding for time series data is one shard is always really hot. Also ran into read replication lag due to insert concurrency on the slaves.
  • Created a common services framework.
    • Spent a lot of time upfront solving operations problem of how to manage a distributed system.
    • Built a kind of Rails scaffolding, but for services. A template is used to bootstrap services internally.
    • All services look identical from an operations perspective. Checking statistics, monitoring, starting and stopping all work the same way for all services.
    • Tooling is put around the build process in SBT (a Scala build tool) using plugins and helpers to take care of common activities like tagging things in git, publishing to the repository, etc. Most developers don’t have to get in the guts of the build system.
  • Front-end layer uses HAProxy. Varnish might be hit for public blogs. 40 machines.
  • 500 web servers running Apache and their PHP application.
  • 200 database servers. Many database servers are used for high availability reasons. Commodity hardware is used and the MTBF is surprisingly low. Much more hardware than expected is lost, so there are many spares in case of failure.
  • 6 backend services to support the PHP application. A team is dedicated to develop the backend services. A new service is rolled out every 2-3 weeks. Includes dashboard notifications, dashboard secondary index, URL shortener, and a memcache proxy to handle transparent sharding.
  • Put a lot of time and effort and tooling into MySQL sharding. MongoDB is not used even though it is popular in NY (their location). MySQL can scale just fine.
  • Gearman, a job queue system, is used for long running fire and forget type work.
  • Availability is measured in terms of reach. Can a user reach custom domains or the dashboard? Also in terms of error rate.
  • Historically the highest priority item was fixed first. Now failure modes are analyzed and addressed systematically. The intention is to measure success from a user perspective and an application perspective. If part of a request can’t be fulfilled, that is accounted for.
  • Initially an Actor model was used with Finagle, but that was dropped. For fire and forget work a job queue is used. In addition, Twitter’s utility library contains a Futures implementation and services are implemented in terms of futures. In situations where a thread pool is needed, futures are passed into a future pool. Everything is submitted to the future pool for asynchronous execution (a minimal sketch follows this list).
  • Scala encourages no shared state. Finagle is assumed correct because it’s tested by Twitter in production. Mutable state is avoided using constructs in Scala or Finagle. No long running state machines are used. State is pulled from the database, used, and written back to the database. The advantage is developers don’t need to worry about threads or locks.
  • 22 Redis servers. Each server has 8 – 32 instances so 100s of Redis instances are used in production.
    • Used for backend storage for dashboard notifications.
    • A notification is something  like a user liked your post. Notifications show up in a user’s dashboard to indicate actions other users have taken on their content.
    • High write ratio made MySQL a poor fit.
    • Notifications are ephemeral so it wouldn’t be horrible if they were dropped, so Redis was an acceptable choice for this function.
    • Gave them a chance to learn about Redis and get familiar with how it works.
    • Redis has been completely problem free and the community is great.
    • A Scala futures based interface for Redis was created. This functionality is now moving into their Cell Architecture.
    • URL shortener uses Redis as the first level cache and HBase as permanent storage.
    • Dashboard’s secondary index is built around Redis.
    • Redis is used as Gearman’s persistence layer using a memcache proxy built using Finagle.
    • Slowly moving from memcache to Redis. Would like to eventually settle on just one caching service. Performance is on par with memcache.
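
A minimal sketch of the futures-over-a-future-pool pattern mentioned above, assuming Twitter’s com.twitter.util library; the notification-loading function is a made-up placeholder, not Tumblr’s service code.

import java.util.concurrent.Executors
import com.twitter.util.{Await, Future, FuturePool}

object NotificationSketch {
  // Wrap a fixed thread pool so blocking work can be exposed as a Future.
  private val pool = FuturePool(Executors.newFixedThreadPool(8))

  // Placeholder for a blocking read (e.g. Redis or MySQL); it runs on the pool.
  def loadNotifications(userId: Long): Future[Seq[String]] =
    pool { Seq(s"user $userId: someone liked your post") }

  def main(args: Array[String]): Unit = {
    // Compose with map/flatMap instead of blocking threads...
    val rendered: Future[String] = loadNotifications(42L).map(_.mkString("\n"))
    println(Await.result(rendered)) // ...and block only at the very edge.
  }
}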

Internal Firehose

  • Internally, applications need access to the activity stream. An activity stream is information about users creating/deleting posts, liking/unliking posts, etc. A challenge is to distribute so much data in real-time. They wanted something that would scale internally and that an application ecosystem could reliably grow around. A central point of distribution was needed.
  • Previously this information was distributed using Scribe/Hadoop. Services would log into Scribe and begin tailing and then pipe that data into an app. This model stopped scaling almost immediately, especially at peak where people are creating 1000s of posts a second. Didn’t want people tailing files and piping to grep.
  • An internal firehose was created as a message bus. Services and applications talk to the firehose via Thrift.
  • LinkedIn’s Kafka is used to store messages. Internally consumers use an HTTP stream to read from the firehose. MySQL wasn’t used because the sharding implementation is changing frequently so hitting it with a huge data stream is not a good idea.
  • The firehose model is very flexible, not like Twitter’s firehose in which data is assumed to be lost.
    • The firehose stream can be rewound in time. It retains a week of data. On connection it’s possible to specify the point in time to start reading.
    • Multiple clients can connect and each client won’t see duplicate data. Each client has a client ID. Kafka supports the idea of a consumer group: each consumer in a consumer group gets its own messages and won’t see duplicates. Multiple clients can be created using the same consumer ID and those clients won’t see duplicate data. This allows data to be processed independently and in parallel. Kafka uses ZooKeeper to periodically checkpoint how far a consumer has read (a conceptual sketch follows this list).
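
A conceptual sketch in plain Scala (not Kafka’s actual API) of the two properties above: the stream can be rewound within its retention window, and clients sharing a consumer group ID share one offset, so the group never sees a message twice.

class Firehose[A] {
  private val log          = scala.collection.mutable.ArrayBuffer.empty[A]  // retained window (a week, in Tumblr's case)
  private val groupOffsets = scala.collection.mutable.Map.empty[String, Int] // one offset per consumer group

  def publish(msg: A): Unit = synchronized { log += msg }

  // Rewind a group to an earlier point in the retained stream.
  def rewind(groupId: String, toOffset: Int): Unit = synchronized { groupOffsets(groupId) = toOffset }

  // Clients sharing a group ID advance a shared offset, so the group never sees a message twice.
  def poll(groupId: String): Option[A] = synchronized {
    val offset = groupOffsets.getOrElse(groupId, 0)
    if (offset < log.length) { groupOffsets(groupId) = offset + 1; Some(log(offset)) } else None
  }
}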

Cell Design for Dashboard Inbox

  • The current scatter-gather model for providing Dashboard functionality has very limited runway. It won’t last much longer.
    • The solution is to move to an inbox model implemented using a Cell Based Architecture that is similar to Facebook Messages.
    • An inbox is the opposite of scatter-gather. A user’s dashboard, which is made up of posts from followed users and actions taken by other users, is logically stored together in time order.
    • Solves the scatter-gather problem because it’s an inbox. You just ask what is in the inbox, so it’s less expensive than going to each user a user follows. This will scale for a very long time.
  • Rewriting the Dashboard is difficult. The data has a distributed nature, but it also has a transactional quality: it’s not OK for users to get partial updates.
    • The amount of data is incredible. Messages must be delivered to hundreds of different users on average, which is a very different problem than Facebook faces. Large data + high distribution rate + multiple datacenters.
    • Spec’ed at a million writes a second and 50K reads a second. The data set size is 2.7TB of data growth with no replication or compression turned on. The million writes a second is from the 24 byte row key that indicates what content is in the inbox.
    • Doing this on an already popular application that has to be kept running.
  • Cells
    • A cell is a self-contained installation that has all the data for a range of users. All the data necessary to render a user’s Dashboard is in the cell.
    • Users are mapped into cells. Many cells exist per data center.
    • Each cell has an HBase cluster, service cluster, and Redis caching cluster.
    • Users are homed to a cell and all cells consume all posts via firehose updates.
    • Each cell is Finagle based and populates HBase via the firehose and service requests over Thrift.
    • A user comes into the Dashboard, the user is homed to a particular cell, a service node reads their dashboard via HBase, and passes the data back.
    • Background tasks consume from the firehose to populate tables and process requests.
    • A Redis caching layer is used for posts inside a cell.
  • Request flow: a user publishes a post; the post is written to the firehose; all of the cells consume the post and write the post content to the post database; each cell looks up whether any followers of the post creator are homed in that cell, and if so, those follower inboxes are updated with the post ID (see the sketch after this list).
  • Advantages of cell design:
    • Massive scale requires parallelization and parallelization requires components be isolated from each other so there is no interaction. Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
    • Cells isolate failures. One cell failure does not impact other cells.
    • Cells enable nice things like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • The key idea that is easy to miss is: all posts are replicated to all cells.
    • Each cell stores a single copy of all posts. Each cell can completely satisfy a Dashboard rendering request. Applications don’t ask for all the post IDs and then ask for the posts for those IDs. It can return the dashboard content for the user. Every cell has all the data needed to fulfill a Dashboard request without doing any cross cell communication.
    • Two HBase tables are used: one stores a copy of each post. That data is small compared to the other table, which stores every post ID for every user within that cell. The second table tells what the user’s dashboard looks like, which means they don’t have to go fetch all the users a user is following. It also means read state is shared across clients, so viewing a post on a different device won’t present the same content as unread twice. With the inbox model, state can be kept on what you’ve read.
    • Posts are not put directly in the inbox because the size is too great. So the ID is put in the inbox and the post content is put in the cell just once. This model greatly reduces the storage needed while making it simple to return a time ordered view of a user’s inbox. The downside is each cell contains a complete copy of all posts. Surprisingly, posts are smaller than the inbox mappings. Post growth per day is 50GB per cell; the inbox grows at 2.7TB a day. Users consume more than they produce.
    • A user’s dashboard doesn’t contain the text of a post, just post IDs, and the majority of the growth is in the IDs.
    • As followers change the design is safe because all posts are already in the cell. If only followed users’ posts were stored in a cell, then the cell would be out of date as the followers changed and some sort of backfill process would be needed.
    • An alternative design is to use a separate post cluster to store post text. The downside of this design is that if the cluster goes down it impacts the entire site.  Using the cell design and post replication to all cells creates a very robust architecture.
  • A user having millions of followers who are really active is handled by selectively materializing user feeds by their access model (see Feeding Frenzy).
    • Different users have different access models and distribution models that are appropriate. Two different distribution modes: one for popular users and one for everyone else.
    • Data is handled differently depending on the user type. Posts from active users wouldn’t actually be published; posts would be selectively materialized.
    • Users who follow millions of users are treated similarly to users who have millions of followers.
  • Cell size is hard to determine. The size of a cell is the blast radius of a failure; the number of users homed to a cell is the impact. There’s a tradeoff to make between what they are willing to accept for the user experience and how much it will cost.
  • Reading from the firehose is the biggest network issue. Within a cell the network traffic is manageable.
  • As more cells are added cells can be placed into a cell group that reads from the firehose and then replicates to all cells within the group. A hierarchical replication scheme. This will also aid in moving to multiple datacenters.
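
A sketch of that request flow in plain Scala; the types and in-memory maps are illustrative stand-ins for the HBase post and inbox tables, not Tumblr’s actual code.

case class Post(id: Long, authorId: Long, body: String)

// homedUsers: the range of users this cell owns; followersOf: the social graph lookup.
class Cell(homedUsers: Set[Long], followersOf: Long => Set[Long]) {
  private val posts   = scala.collection.mutable.Map.empty[Long, Post]       // stand-in for the HBase post table
  private val inboxes = scala.collection.mutable.Map.empty[Long, List[Long]] // stand-in for the HBase inbox table

  // Every cell consumes every post from the firehose.
  def consumeFromFirehose(post: Post): Unit = {
    posts(post.id) = post                                  // each cell keeps a copy of every post
    for (f <- followersOf(post.authorId) if homedUsers(f)) // only followers homed to this cell
      inboxes(f) = post.id :: inboxes.getOrElse(f, Nil)    // append just the post ID to the inbox
  }

  // A Dashboard render needs no cross-cell calls: IDs come from the inbox, content from the local post copy.
  def renderDashboard(userId: Long, limit: Int = 20): Seq[Post] =
    inboxes.getOrElse(userId, Nil).take(limit).flatMap(posts.get)
}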

On Being a Startup in New York

  • NY is a different environment. Lots of finance and advertising. Hiring is challenging because there’s not as much startup experience.
  • In the last few years NY has focused on helping startups. NYU and Columbia have programs for getting students interesting internships at startups instead of just going to Wall Street. Mayor Bloomberg is establishing a local campus focused on technology.

Team Structure

  • Teams: infrastructure, platform, SRE, product, web ops, services.
  • Infrastructure: Layer 5 and below. IP address and below, DNS, hardware provisioning.
  • Platform: core app development, SQL sharding, services, web operations.
  • SRE: sits between service team and web ops team. Focused on more immediate needs in terms of reliability and scalability.
  • Service team: focuses on things that are slightly more strategic, that are a month or two months out.
  • Web ops: responsible for problem detection and response, and tuning.

Software Deployment

  • Started with a set of rsync scripts that distributed the PHP application everywhere. Once the number of machines reached 200 the system started having problems, deploys took a long time to finish and machines would be in various states of the deploy process.
  • The next phase built the deploy process (development, staging, production) into their service stack using Capistrano. Worked for services on dozens of machines, but by connecting via SSH it started failing again when deploying to hundreds of machines.
  • Now a piece of coordination software runs on all machines. Based around Func from RedHat, a lightweight API for issuing commands to hosts. Scaling is built into Func.
  • Deployment is built over Func by saying “do X on a set of hosts,” which avoids SSH. Say you want to deploy software on group A. The master reaches out to a set of nodes and runs the deploy command.
  • The deploy command is implemented via Capistrano. It can do a git checkout or pull from the repository. Easy to scale because they are talking HTTP. They like Capistrano because it supports simple directory based versioning that works well with their PHP app. Moving towards versioned updates, where each directory contains a SHA so it’s easy to check if a version is correct.
  • The Func API is used to report back status, to say these machines have these software versions.
  • Safe to restart any of their services because they’ll drain off connections and then restart.
  • All features run in dark mode before activation.

Development

  • Started with the philosophy that anyone could use any tool that they wanted, but as the team grew that didn’t work. Onboarding new employees was very difficult, so they’ve standardized on a stack so they can get good with those, grow the team quickly, address production issues more quickly, and build up operations around them.
  • Process is roughly Scrum like. Lightweight.
  • Every developer has a preconfigured development machine. It gets updates via Puppet.
  • Dev machines can roll changes, test, then roll out to staging, and then roll out to production.
  • Developers use vim and Textmate.
  • Testing is via code reviews for the PHP application.
  • On the service side they’ve implemented a testing infrastructure with commit hooks, Jenkins, and continuous integration and build notifications.

Hiring Process

  • Interviews usually avoid math, puzzles, and brain teasers. They try to ask questions focused on work the candidate will actually do. Are they smart? Will they get stuff done? But “gets things done” is difficult to assess. The goal is to find great people rather than keep people out.
  • Focused on coding. They’ll ask for sample code. During phone interviews they will use Collabedit to write shared code.
  • Interviews are not confrontational, they just want to find the best people. Candidates get to use all their tools, like Google, during the interview. The idea is developers are at their best when they have tools so that’s how they run the interviews.
  • Challenge is finding people that have the scaling experience they require given Tumblr’s traffic levels. Few companies in the world are working on the problems they are.
    • Example: for a new ID generator they needed a JVM process to generate service responses in less than 1ms at a rate of 10K requests per second, with a 500 MB RAM limit and high availability. They found the serial collector gave the lowest latency for this particular workload. Spent a lot of time on JVM tuning.
  • On the Tumblr Engineering Blog they’ve posted memorials giving their respects for the passing of Dennis Ritchie & John McCarthy. It’s a geeky culture.

Lessons learned

  • Automation everywhere.
  • MySQL (plus sharding) scales, apps don’t.
  • Redis is amazing.
  • Scala apps perform fantastically.
  • Scrap projects when you aren’t sure if they will work.
  • Don’t hire people based on their survival through a useless technological gauntlet.  Hire them because they fit your team and can do the job.
  • Select a stack that will help you hire the people you need.
  • Build around the skills of your team.
  • Read papers and blog posts. Key design ideas like the cell architecture and selective materialization were taken from elsewhere.
  • Ask your peers. They talked to engineers from Facebook, Twitter, LinkedIn about their experiences and learned from them. You may not have access to this level, but reach out to somebody somewhere.
  • Wade, don’t jump into technologies. They took pains to learn HBase and Redis before putting them into production by using them in pilot projects or in roles where the damage would be limited.

I’d like to thank Blake very much for the interview. He was very generous with his time and patient with his explanations. Please contact me if you would like to talk about having your architecture profiled.

http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html

New Tweets per second record, and how!

Recently, something remarkable happened on Twitter: On Saturday, August 3 in Japan, people watched an airing of Castle in the Sky, and at one moment they took to Twitter so much that we hit a one-second peak of 143,199 Tweets per second. (August 2 at 7:21:50 PDT; August 3 at 11:21:50 JST)

To give you some context of how that compares to typical numbers, we normally take in more than 500 million Tweets a day which means about 5,700 Tweets a second, on average. This particular spike was around 25 times greater than our steady state.

During this spike, our users didn’t experience a blip on Twitter. That’s one of our goals: to make sure Twitter is always available no matter what is happening around the world.

New Tweets per second (TPS) record: 143,199 TPS. Typical day: more than 500 million Tweets sent; average 5,700 TPS.

This goal felt unattainable three years ago, when the 2010 World Cup put Twitter squarely in the center of a real-time, global conversation. The influx of Tweets –– from every shot on goal, penalty kick and yellow or red card –– repeatedly took its toll and made Twitter unavailable for short periods of time. Engineering worked throughout the nights during this time, desperately trying to find and implement order-of-magnitude efficiency gains. Unfortunately, those gains were quickly swamped by Twitter’s rapid growth, and engineering had started to run out of low-hanging fruit to fix.

After that experience, we determined we needed to step back and re-architect the site to support the continued growth of Twitter and to keep it running smoothly. Since then we’ve worked hard to make sure that the service is resilient to the world’s impulses. We’re now able to withstand events like Castle in the Sky viewings, the Super Bowl, and the global New Year’s Eve celebration. This re-architecture has not only made the service more resilient when traffic spikes to record highs, but also provides a more flexible platform on which to build more features faster, including synchronizing direct messages across devices, Twitter cards that allow Tweets to become richer and contain more content, and a rich search experience that includes stories and users. And more features are coming.

Below, we detail how we did this. We learned a lot. We changed our engineering organization. And, over the next few weeks, we’ll be publishing additional posts that go into more detail about some of the topics we cover here.

Starting to re-architect

After the 2010 World Cup dust settled, we surveyed the state of our engineering. Our findings:

  • We were running one of the world’s largest Ruby on Rails installations, and we had pushed it pretty far –– at the time, about 200 engineers were contributing to it and it had gotten Twitter through some explosive growth, both in terms of new users as well as the sheer amount of traffic that it was handling. This system was also monolithic where everything we did, from managing raw database and memcache connections through to rendering the site and presenting the public APIs, was in one codebase. Not only was it increasingly difficult for an engineer to be an expert in how it was put together, but also it was organizationally challenging for us to manage and parallelize our engineering team.
  • We had reached the limit of throughput on our storage systems –– we were relying on a MySQL storage system that was temporally sharded and had a single master. That system was having trouble ingesting tweets at the rate that they were showing up, and we were operationally having to create new databases at an ever increasing rate. We were experiencing read and write hot spots throughout our databases.
  • We were “throwing machines at the problem” instead of engineering thorough solutions –– our front-end Ruby machines were not handling the number of transactions per second that we thought was reasonable, given their horsepower. From previous experiences, we knew that those machines could do a lot more.
  • Finally, from a software standpoint, we found ourselves pushed into an “optimization corner” where we had started to trade off readability and flexibility of the codebase for performance and efficiency.

We concluded that we needed to start a project to re-envision our system. We set three goals and challenges for ourselves:

  • We wanted big infrastructure wins in performance, efficiency, and reliability –– we wanted to improve the median latency that users experience on Twitter as well as bring in the outliers to give a uniform experience to Twitter. We wanted to reduce the number of machines needed to run Twitter by 10x. We also wanted to isolate failures across our infrastructure to prevent large outages –– this is especially important as the number of machines we use goes up, because it means that the chance of any single machine failing is higher. Failures are also inevitable, so we wanted to have them happen in a much more controllable manner.
  • We wanted cleaner boundaries with “related” logic being in one place –– we felt the downsides of running our particular monolithic codebase, so we wanted to experiment with a loosely coupled services oriented model. Our goal was to encourage the best practices of encapsulation and modularity, but this time at the systems level rather than at the class, module, or package level.
  • Most importantly, we wanted to launch features faster. We wanted to be able to run small and empowered engineering teams that could make local decisions and ship user-facing changes, independent of other teams.

We prototyped the building blocks for a proof of concept re-architecture. Not everything we tried worked and not everything we tried, in the end, met the above goals. But we were able to settle on a set of principles, tools, and an infrastructure that has gotten us to a much more desirable and reliable state today.
The JVM vs the Ruby VM
First, we evaluated our front-end serving tier across three dimensions: CPU, RAM, and network. Our Ruby-based machinery was being pushed to the limit on the CPU and RAM dimensions –– but we weren’t serving that many requests per machine nor were we coming close to saturating our network bandwidth. Our Rails servers, at the time, had to be effectively single threaded and handle only one request at a time. Each Rails host was running a number of Unicorn processes to provide host-level concurrency, but the duplication there translated to wasteful resource utilization. When it came down to it, our Rails servers were only capable of serving 200 – 300 requests / sec / host.

Twitter’s usage is always growing rapidly, and doing the math there, it would take a lot of machines to keep up with the growth curve.

At the time, Twitter had experience deploying fairly large scale JVM-based services –– our search engine was written in Java, and our Streaming API infrastructure as well as Flock, our social graph system, was written in Scala. We were enamored by the level of performance that the JVM gave us. It wasn’t going to be easy to get our performance, reliability, and efficiency goals out of the Ruby VM, so we embarked on writing code to be run on the JVM instead. We estimated that rewriting our codebase could get us > 10x performance improvement, on the same hardware –– and now, today, we push on the order of 10 – 20K requests / sec / host.

There was a level of trust that we all had in the JVM. A lot of us had come from companies where we had experience working with, tuning, and operating large scale JVM installations. We were confident we could pull off a sea change for Twitter in the world of the JVM. Now, we had to decompose our architecture and figure out how these different services would interact.
Programming model
In Twitter’s Ruby systems, concurrency is managed at the process level: a single network request is queued up for a process to handle. That process is completely consumed until the network request is fulfilled. Adding to the complexity, architecturally, we were taking Twitter in the direction of having one service compose the responses of other services. Given that the Ruby process is single-threaded, Twitter’s “response time” would be additive and extremely sensitive to the variances in the back-end systems’ latencies. There were a few Ruby options that gave us concurrency; however, there wasn’t one standard way to do it across all the different VM options. The JVM had constructs and primitives that supported concurrency and would let us build a real concurrent programming platform.

It became evident that we needed a single and uniform way to think about concurrency in our systems and, specifically, in the way we think about networking. As we all know, writing concurrent code (and concurrent networking code) is hard and can take many forms. In fact, we began to experience this. As we started to decompose the system into services, each team took slightly different approaches. For example, the failure semantics from clients to services didn’t interact well: we had no consistent back-pressure mechanism for servers to signal back to clients and we experienced “thundering herds” from clients aggressively retrying latent services. These failure domains informed us of the importance of having a unified, and complementary, client and server library that would bundle in notions of connection pools, failover strategies, and load balancing. To help us all get in the same mindset, we put together both Futures and Finagle.

Now, not only did we have a uniform way to do things, but we also baked into our core libraries everything that all our systems needed so we could get off the ground faster. And rather than worry too much about how each and every system operated, we could focus on the application and service interfaces.
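
To make that “uniform way” concrete, here is the canonical minimal Finagle HTTP service: everything is a Service, i.e. a function from Request to Future[Response]. This is a sketch using present-day Finagle packages, not the code Twitter wrote at the time.

import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.{Await, Future}

object HelloService {
  // Every server and client is just a function Request => Future[Response],
  // so timeouts, retries, and connection pooling compose the same way everywhere.
  val service: Service[Request, Response] = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val rep = Response()
      rep.contentString = "hello"
      Future.value(rep)
    }
  }

  def main(args: Array[String]): Unit =
    Await.ready(Http.serve(":8080", service))
}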
Independent systems
The largest architectural change we made was to move from our monolithic Ruby application to one that is more services oriented. We focused first on creating Tweet, timeline, and user services –– our “core nouns”. This move afforded us cleaner abstraction boundaries and team-level ownership and independence. In our monolithic world, we either needed experts who understood the entire codebase or clear owners at the module or class level. Sadly, the codebase was getting too large to have global experts and, in practice, having clear owners at the module or class level wasn’t working. Our codebase was becoming harder to maintain, and teams constantly spent time going on “archeology digs” to understand certain functionality. Or we’d organize “whale hunting expeditions” to try to understand large scale failures that occurred. At the end of the day, we’d spend more time on this than on shipping features, which we weren’t happy with.

Our theory was, and still is, that a services oriented architecture allows us to develop the system in parallel –– we agree on networking RPC interfaces, and then go develop the system internals independently –– but, it also meant that the logic for each system was self-contained within itself. If we needed to change something about Tweets, we could make that change in one location, the Tweet service, and then that change would flow throughout our architecture. In practice, however, we find that not all teams plan for change in the same way: for example, a change in the Tweet service may require other services to do an update if the Tweet representation changed. On balance, though, this works out more times than not.

This system architecture also mirrored the way we wanted, and now do, run the Twitter engineering organization. Engineering is set up with (mostly) self-contained teams that can run independently and very quickly. This means that we bias toward teams spinning up and running their own services that can call the back end systems. This has huge implications on operations, however.
Storage
Even if we broke apart our monolithic application into services, a huge bottleneck that remained was storage. Twitter, at the time, was storing tweets in a single master MySQL database. We had taken the strategy of storing data temporally –– each row in the database was a single tweet, we stored the tweets in order in the database, and when the database filled up we spun up another one and reconfigured the software to start populating the next database. This strategy had bought us some time, but, we were still having issues ingesting massive tweet spikes because they would all be serialized into a single database master so we were experiencing read load concentration on a small number of database machines. We needed a different partitioning strategy for Tweet storage.

We took Gizzard, our framework to create sharded and fault-tolerant distributed databases, and applied it to tweets. We created T-Bird. In this case, Gizzard was fronting a series of MySQL databases –– every time a tweet comes into the system, Gizzard hashes it, and then chooses an appropriate database. Of course, this means we lose the ability to rely on MySQL for unique ID generation. Snowflake was born to solve that problem. Snowflake allows us to create an almost-guaranteed globally unique identifier. We rely on it to create new tweet IDs, at the tradeoff of no longer having “increment by 1” identifiers. Once we have an identifier, we can rely on Gizzard then to store it. Assuming our hashing algorithm works and our tweets are close to uniformly distributed, we increase our throughput by the number of destination databases. Our reads are also then distributed across the entire cluster, rather than being pinned to the “most recent” database, allowing us to increase throughput there too.
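
A simplified sketch of those two pieces: a Snowflake-style ID using the commonly documented bit layout (41-bit timestamp, 10-bit worker, 12-bit sequence) and a Gizzard-style hash to choose a destination shard. The shard count and the omitted clock/sequence handling make this illustrative only.

object TweetStorageSketch {
  private val epoch = 1288834974657L // commonly cited Snowflake epoch (Nov 2010)
  private var sequence = 0L

  // Snowflake-style ID: roughly time ordered, but no longer "increment by 1".
  def nextId(workerId: Long): Long = synchronized {
    sequence = (sequence + 1) & 0xfff               // 12-bit per-millisecond sequence
    ((System.currentTimeMillis() - epoch) << 22) |  // 41-bit timestamp
      ((workerId & 0x3ff) << 12) |                  // 10-bit worker id
      sequence
  }

  // Gizzard-style placement: hash the ID to one of N MySQL shards so writes and
  // reads spread across the cluster instead of pinning to the newest database.
  def shardFor(tweetId: Long, shardCount: Int): Int =
    ((tweetId.hashCode % shardCount) + shardCount) % shardCount
}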
Observability and statistics
We’ve traded our fragile monolithic application for a more robust and encapsulated, but also complex, services oriented application. We had to invest in tools to make managing this beast possible. Given the speed with which we were creating new services, we needed to make it incredibly easy to gather data on how well each service was doing. By default, we wanted to make data-driven decisions, so we needed to make it trivial and frictionless to get that data.

As we were going to be spinning up more and more services in an increasingly large system, we had to make this easier for everybody. Our Runtime Systems team created two tools for engineering: Viz and Zipkin. Both of these tools are exposed and integrated with Finagle, so all services that are built using Finagle get access to them automatically.

stats.timeFuture("request_latency_ms") {
// dispatch to do work
}

The above code block is all that is needed for a service to report statistics into Viz. From there, anybody using Viz can write a query that will generate a timeseries and graph of interesting data like the 50th and 99th percentile of request_latency_ms.

Runtime configuration and testing
Finally, as we were putting this all together, we hit two seemingly unrelated snags: launches had to be coordinated across a series of different services, and we didn’t have a place to stage services that ran at “Twitter scale”. We could no longer rely on deployment as the vehicle to get new user-facing code out there, and coordination was going to be required across the application. In addition, given the relative size of Twitter, it was becoming difficult for us to run meaningful tests in a fully isolated environment. We had, relatively, no issues testing for correctness in our isolated systems –– we needed a way to test for large scale iterations. We embraced runtime configuration.

We integrated a system we call Decider across all our services. It allows us to flip a single switch and have multiple systems across our infrastructure all react to that change in near-real time. This means software and multiple systems can go into production when teams are ready, but a particular feature doesn’t need to be “active”. Decider also allows us to have the flexibility to do binary and percentage based switching such as having a feature available for x% of traffic or users. We can deploy code in the fully “off” and safe setting, and then gradually turn it up and down until we are confident it’s operating correctly and systems can handle the new load. All this alleviates our need to do any coordination at the team level, and instead we can do it at runtime.
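
An illustrative sketch of a Decider-style percentage switch, not Twitter’s actual Decider API: the dial is adjusted at runtime, and hashing keeps the same users in the enabled bucket as the percentage moves.

object DeciderSketch {
  // Availability per feature in basis points (0..10000), changed at runtime
  // (e.g. pushed from a config service) rather than by deploying new code.
  @volatile private var availability: Map[String, Int] = Map("new_search" -> 0)

  def setAvailability(feature: String, basisPoints: Int): Unit =
    availability = availability + (feature -> basisPoints)

  def isAvailable(feature: String, userId: Long): Boolean = {
    val threshold = availability.getOrElse(feature, 0)
    val bucket = (((feature.hashCode * 31 + userId.hashCode) % 10000) + 10000) % 10000
    bucket < threshold // deploy fully "off" at 0, then dial up gradually
  }
}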
Today
Twitter is more performant, efficient and reliable than ever before. We’ve sped up the site incredibly across the 50th (p50) through 99th (p99) percentile distributions and the number of machines involved in serving the site itself has been decreased anywhere from 5x-12x. Over the last six months, Twitter has flirted with four 9s of availability.

Twitter engineering is now set up to mimic our software stack. We have teams that are ready for long term ownership and to be experts on their part of the Twitter infrastructure. Those teams own their interfaces and their problem domains. Not every team at Twitter needs to worry about scaling Tweets, for example. Only a few teams –– those that are involved in the running of the Tweet subsystem (the Tweet service team, the storage team, the caching team, etc.) –– have to scale the writes and reads of Tweets, and the rest of Twitter engineering gets APIs to help them use it.

Two goals drive us as we did all this work: Twitter should always be available for our users, and we should spend our time making Twitter more engaging, more useful and simply better for our users. Our systems and our engineering team now enable us to launch new features faster and in parallel. We can dedicate different teams to work on improvements simultaneously and have minimal logjams for when those features collide. Services can be launched and deployed independently from each other (in the last week, for example, we had more than 50 deploys across all Twitter services), and we can defer putting everything together until we’re ready to make a new build for iOS or Android.

Keep an eye on this blog and @twittereng for more posts that will dive into details on some of the topics mentioned above.

Thanks goes to Jonathan Reichhold (@jreichhold), David Helder (@dhelder), Arya Asemanfar (@a_a), Marcel Molina (@noradio), and Matt Harris (@themattharris) for helping contribute to this blog post.

https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

The Architecture Twitter Uses to Deal with 150M Active Users, 300K QPS, a 22 MB/S Firehose, and Send Tweets in Under 5 Seconds

Toy solutions solving Twitter’s “problems” are a favorite scalability trope. Everybody has this idea that Twitter is easy. With a little architectural hand waving we have a scalable Twitter, just that simple. Well, it’s not that simple as Raffi Krikorian, VP of Engineering at Twitter, describes in his superb and very detailed presentation on Timelines at Scale. If you want to know how Twitter works – then start here.

It happened gradually so you may have missed it, but Twitter has grown up. It started as a struggling three-tierish Ruby on Rails website and has become a beautifully service-driven core that we actually go to now to see if other services are down. Quite a change.

Twitter now has 150M world wide active users, handles 300K QPS to generate timelines, and a firehose that churns out 22 MB/sec. 400 million tweets a day flow through the system and it can take up to 5 minutes for a tweet to flow from Lady Gaga’s fingers to her 31 million followers.

A couple of points stood out:

  • Twitter no longer wants to be a web app. Twitter wants to be a set of APIs that power mobile clients worldwide, acting as one of the largest real-time event busses on the planet.
  • Twitter is primarily a consumption mechanism, not a production mechanism. 300K QPS are spent reading timelines and only 6000 requests per second are spent on writes.
  • Outliers, those with huge follower lists, are becoming a common case. Sending a tweet from a user with a lot of followers, that is with a large fanout, can be slow. Twitter tries to do it under 5 seconds, but it doesn’t always work, especially when celebrities tweet and tweet each other, which is happening more and more. One of the consequences is replies can arrive before the original tweet is received. Twitter is changing from doing all the work on writes to doing more work on reads for high value users.
  • Your home timeline sits in a Redis cluster and has a maximum of 800 entries.
  • Twitter knows a lot about you from who you follow and what links you click on. Much can be implied by the implicit social contract when bidirectional follows don’t exist.
  • Users care about tweets, but the text of the tweet is almost irrelevant to most of Twitter’s infrastructure.
  • It takes a very sophisticated monitoring and debugging system to trace down performance problems in a complicated stack. And the ghost of legacy decisions past always haunt the system.

How does Twitter work? Read this gloss of Raffi’s excellent talk and find out…

The Challenge

  • At 150 million users with 300K QPS for timelines (home and search) naive materialization can be slow.
  • Naive materialization is a massive select statement over all of Twitter. It was tried and died.
  • Solution is a write based fanout approach. Do a lot of processing when tweets arrive to figure out where tweets should go. This makes read time access fast and easy. Don’t do any computation on reads. With all the work being performed on the write path ingest rates are slower than the read path, on the order of 4000 QPS.

Groups

  • The Platform Services group is responsible for the core scalable infrastructure of Twitter.
    • They run things called the Timeline Service, Tweet Service, User Service, Social Graph Service, all the machinery that powers the Twitter platform.
    • Internal clients use roughly the same API as external clients.
    • 1+ million apps are registered against the 3rd party APIs
    • Contract with product teams is that they don’t have to worry about scale.
    • Work on capacity planning, architecting scalable backend systems, constantly replacing infrastructure as the site grows in unexpected ways.
  • Twitter has an architecture group. Concerned with overall holistic architecture of Twitter. Maintains technical debt list (what they want to get rid of).

Push Me Pull Me

  • People are creating content on Twitter all the time. The job of Twitter is to figure out how to syndicate the content out. How to send it to your followers.
  • The real challenge is the real-time constraint. Goal is to have a message flow to a user in no more than 5 seconds.
    • Delivery means gathering content and exerting pressure on the Internet to get it back out again as fast as possible.
    • Delivery is to in-memory timeline clusters, push notifications, emails that are triggered, all the iOS notifications as well as Blackberry and Android, SMSs.
    • Twitter is the largest generator of SMSs on a per active user basis of anyone in the world.
    • Elections can be one of the biggest drivers of content coming in and fanouts of content going out.
  • Two main types of timelines: user timeline and home timeline.
    • A user timeline is all the tweets a particular user has sent.
    • A home timeline is a temporal merge of all the user timelines of the people you are following.
    • Business rules are applied. @replies of people that you don’t follow are stripped out. Retweets from a user can be filtered out.
    • Doing this at the scale of Twitter is challenging.
  • Pull based
    • Targeted timeline. Things like twitter.com and home_timeline API. Tweets delivered to you because you asked to see them. Pull based delivery: you are requesting this data from Twitter via a REST API call.
    • Query timeline. Search API. A query against the corpus. Return all the tweets that match a particular query as fast as you can.
  • Push based
    • Twitter runs one of the largest real-time event systems pushing tweets at 22 MB/sec through the Firehose.
      • Open a socket to Twitter and they will push all public tweets to you within 150 msec.
      • At any given time there’s about 1 million sockets open to the push cluster.
      • Goes to firehose clients like search engines. All public tweets go out these sockets.
      • No, you can’t have it. (You can’t handle/afford the truth.)
    • User stream connection. Powers TweetDeck; Twitter for Mac also goes through here. When you login they look at your social graph and only send messages out from people you follow, recreating the home timeline experience. Instead of polling you get the same timeline experience over a persistent connection.
    • Query API. Issue a standing query against tweets. As tweets are created and found matching the query, they are routed out the registered sockets for the query.

High Level for Pull Based Timelines

  • Tweet comes in over a write API. It goes through load balancers and a TFE (Twitter Front End) and other stuff that won’t be addressed.
  • This is a very directed path. Completely precomputed home timeline. All the business rules get executed as tweets come in.
  • Immediately the fanout process occurs. Tweets that come in are placed into a massive Redis cluster. Each tweet is replicated 3 times on 3 different machines. At Twitter scale many machines fail a day.
  • Fanout queries the social graph service that is based on Flock. Flock maintains the follower and followings lists.
    • Flock returns the social graph for a recipient and starts iterating through all the timelines stored in the Redis cluster.
    • The Redis cluster has a couple of terabytes of RAM.
    • Pipelined 4k destinations at a time
    • Native list structures are used inside Redis.
    • Let’s say you tweet and you have 20K followers. What the fanout daemon will do is look up the location of all 20K users inside the Redis cluster. Then it will start inserting the Tweet ID of the tweet into all those lists throughout the Redis cluster. So for every write of a tweet as many as 20K inserts are occurring across the Redis cluster (see the sketch after this list).
    • What is being stored is the tweet ID of the generated tweet, the user ID of the originator of the tweet, and 4 bytes of bits used to mark if it’s a retweet or a reply or something else.
    • Your home timeline sits in a Redis cluster and is 800 entries long. If you page back long enough you’ll hit the limit. RAM is the limiting resource determining how long your current tweet set can be.
    • Every active user is stored in RAM to keep latencies down.
    • Active user is someone who has logged into Twitter within 30 days, which can change depending on cache capacity or Twitter’s usage.
    • If you are not an active user then the tweet does not go into the cache.
    • Only your home timelines hit disk.
    • If you fall out of the Redis cluster then you go through a process called reconstruction.
      • Query against the social graph service. Figure out who you follow. Hit disk for every single one of them and then shove them back into Redis.
      • It’s MySQL handling disk storage via Gizzard, which abstracts away SQL transactions and provides global replication.
    • Because each timeline is replicated 3 times, if a machine has a problem they won’t have to recreate all the timelines on that machine within the datacenter.
    • If a tweet is actually a retweet then a pointer is stored to the original tweet.
  • When you query for your home timeline the Timeline Service is queried. The Timeline Service then only has to find one machine that has your home timeline on it.
    • Effectively running 3 different hash rings because your timeline is in 3 different places.
    • They find the first one they can get to fastest and return it as fast as they can.
    • The tradeoff is fanout takes a little longer, but the read process is fast. About 2 seconds from a cold cache to the browser. For an API call it’s about 400 msec.
  • Since the timeline only contains tweet IDs they must “hydrate” those tweets, that is find the text of the tweets. Given an array of IDs they can do a multiget and get the tweets in parallel from T-bird.
  • Gizmoduck is the user service and Tweetypie is the tweet object service. Each service has its own cache. The user cache is a memcache cluster that has the entire user base in cache. Tweetypie has about the last month and a half of tweets stored in its memcache cluster. These are exposed to internal customers.
  • Some read time filtering happens at the edge. For example, filtering out Nazi content in France, so there’s read time stripping of the content before it is sent out.
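
A sketch of the fanout write path and the O(1) read path described in this list, with plain Scala maps standing in for the Redis cluster; followersOf and hydrate are assumed helper functions, not Twitter’s code.

object FanoutSketch {
  final val MaxTimelineEntries = 800 // home timelines are capped at 800 entries

  // Mirrors the stored entry described above: tweet ID, author ID, and a few flag bits.
  case class Entry(tweetId: Long, authorId: Long, bits: Int)

  // Stand-in for the Redis cluster: one list per user's home timeline.
  private val timelines = scala.collection.mutable.Map.empty[Long, List[Entry]]

  // Write path: O(n) in the author's follower count.
  def fanout(tweetId: Long, authorId: Long, followersOf: Long => Seq[Long]): Unit =
    for (follower <- followersOf(authorId)) {
      val updated = Entry(tweetId, authorId, 0) :: timelines.getOrElse(follower, Nil)
      timelines(follower) = updated.take(MaxTimelineEntries) // roughly LPUSH + LTRIM in Redis terms
    }

  // Read path: O(1) timeline lookup, then hydrate IDs into tweet text via a multiget.
  def homeTimeline(userId: Long, hydrate: Seq[Long] => Seq[String]): Seq[String] =
    hydrate(timelines.getOrElse(userId, Nil).map(_.tweetId))
}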

High Level for Search

  • Opposite of pull. All computed on the read path which makes the write path simple.
  • As a tweet comes in, the Ingester tokenizes and figures out everything they want to index against and stuffs it into a single Early Bird machine. Early Bird is a modified version of Lucene. The index is stored in RAM.
  • In fanout a tweet may be stored in N home timelines of how many people are following you, in Early Bird a tweet is only stored in one Early Bird machine (except for replication).
  • Blender creates the search timeline. It has to scatter-gather across the datacenter. It queries every Early Bird shard and asks do you have content that matches this query? If you ask for “New York Times” all shards are queried, the results are returned, sorted, merged, and reranked. Rerank is by social proof, which means looking at the number of retweets, favorites, and replies.
  • The activity information is computed on the write path and kept in an activity timeline. As you favorite and reply to tweets, an activity timeline is maintained, similar to the home timeline; it is a series of IDs of pieces of activity: a favorite ID, a reply ID, etc.
  • All this is fed into the Blender. On the read path it recomputes, merges, and sorts. Returning what you see as the search timeline.
  • Discovery is a customized search based on what they know about you. And they know a lot, because you follow a lot of people and click on links; that information is used in the discovery search, which reranks based on the information it has gleaned about you.
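
As a rough illustration of the Blender's scatter-gather step described above, here is a hedged sketch in Python. The per-shard search() client interface and the social proof formula are assumptions made for the example, not the actual Blender API:

    # Illustrative scatter-gather over search shards, not the real Blender code.
    # "shards" is a hypothetical list of clients, each exposing search(query).
    from concurrent.futures import ThreadPoolExecutor

    def blend(query, shards, limit=50):
        """Query every shard in parallel, then merge and rerank by social proof."""
        with ThreadPoolExecutor(max_workers=len(shards)) as pool:
            partial_results = pool.map(lambda shard: shard.search(query), shards)
        hits = [hit for part in partial_results for hit in part]
        # Rerank by "social proof", approximated as retweets + favorites + replies.
        hits.sort(key=lambda h: h["retweets"] + h["favorites"] + h["replies"],
                  reverse=True)
        return hits[:limit]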

Search and Pull are Inverses

  • Search and pull look remarkably similar but they have a property that is inverted from each other.
  • On the home timeline:
    • Write. When a tweet comes in there’s an O(n) process to write to the Redis clusters, where n is the number of people following you. Painful for Lady Gaga and Barack Obama, where they are doing tens of millions of inserts across the cluster. All the Redis clusters are backed by disk; the Flock cluster stores the user timeline to disk, but usually timelines are found in RAM in the Redis cluster.
    • Read. Via the API or the web it’s O(1) to find the right Redis machine. Twitter is optimized to be highly available on the read path for the home timeline. The read path is in the tens of milliseconds. Twitter is primarily a consumption mechanism, not a production mechanism: 300K requests per second for reading and 6000 RPS for writing.
  • On the search timeline:
    • Write. When a tweet comes in and hits the Ingester only one Early Bird machine is hit. The write path is O(1). A single tweet is ingested in under 5 seconds, including the queuing and the processing to find the one Early Bird machine to write it to.
    • Read. When a read comes in it must do an O(n) read across the cluster. Most people don’t use search, so they can be efficient about how they store tweets for search, but they pay for it in time. Reading is on the order of 100 msecs. Search never hits disk. The entire Lucene index is in RAM, so scatter-gather reading is efficient as they never hit disk.
  • Text of the tweet is almost irrelevant to most of the infrastructure. T-bird stores the entire corpus of tweets. Most of the text of a tweet is in RAM. If not, they hit T-bird and do a select query to get it back out again. Text is almost irrelevant except perhaps on the Search, Trends, or What’s Happening pipelines. The home timeline barely cares about it at all.

The Future

  • How to make this pipeline faster and more efficient?
  • Fanout can be slow. They try to do it in under 5 seconds but it doesn’t always work. It is very hard, especially when celebrities tweet, which is happening more and more.
  • Twitter follow graph is an asymmetric follow. Tweets are only rendered onto people that are following at a given time. Twitter knows a lot about you because you may follow Lance Armstrong but he doesn’t follow you back. Much can be implied by the implicit social contract when bidirectional follows don’t exist.
  • Problem is for large cardinality graphs. @ladygaga has 31 million followers. @katyperry has 28 million followers. @justinbieber has 28 million followers. @barackobama has 23 million followers.
  • It’s a lot of tweets to write in the datacenter when one of these people tweets. It’s especially challenging when they start talking to each other, which happens all the time.
  • These high fanout users are the biggest challenge for Twitter. Replies to celebrities are constantly being seen before the original tweets. They introduce race conditions throughout the site. If it takes minutes for a tweet from Lady Gaga to fan out, then people are seeing her tweets at different points in time. Someone who followed Lady Gaga recently could see her tweet potentially 5 minutes before someone who followed her far in the past. Say a person on the early receive list replies: the fanout for that reply is processed while her fanout is still occurring, so the reply is injected before the original tweet for the people receiving her tweet later. This causes much user confusion. Tweets are sorted by ID before going out, because IDs are mostly monotonically increasing, but that doesn’t solve the problem at that scale. Queues back up all the time for high value fanouts.
  • They are trying to figure out how to merge the read and write paths: not fanning out the high value users anymore. For people like Taylor Swift they don’t bother with fanout; instead they merge her timeline in at read time (a rough sketch follows this list). This balances the read and write paths and saves tens of percent of computational resources.
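
A rough sketch of what that read-time merge could look like, assuming tweet IDs are roughly time-ordered (as noted above) so a merge by descending ID approximates a merge by time. The function and its inputs are hypothetical, not Twitter's implementation:

    # Hedged sketch of a hybrid read path: the pre-materialized home timeline is
    # merged at read time with the timelines of followed high-fanout users.
    import heapq
    from itertools import islice

    def read_merged_timeline(home_ids, celebrity_timelines, count=20):
        """Each input is a list of tweet IDs, newest (largest) first."""
        merged = heapq.merge(home_ids, *celebrity_timelines, reverse=True)
        return list(islice(merged, count))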

Decoupling

  • Tweets are forked off in many different ways, mostly to decouple teams from each other. The search, push, interest email, and home timeline teams can work independently of each other.
  • For performance reasons the system has been decoupled. Twitter used to be fully synchronous; that stopped 2 years ago for performance reasons. Ingesting a tweet into the tweet API takes up to 145 msecs and then the client is disconnected. This is for legacy reasons. The write path is powered by Ruby through the MRI, a single threaded server, and processing power is eaten up every time a Unicorn worker is allocated. They want to release a client connection as fast as they can. A tweet comes in, Ruby ingests it, sticks it into a queue, and disconnects. They only run about 45-48 processes per box, so they can only ingest that many tweets simultaneously per box, which is why they want to disconnect as fast as they can.
  • The tweets are handed off to the asynchronous pathway where all the stuff we’ve been talking about kicks in.

Monitoring

  • Dashboards around the office show how the system is performing at any given time.
  • If you have 1 million followers it takes a couple of seconds to fanout all the tweets.
  • Tweet input statistics: 400m tweets per day; 5K/sec daily average; 7K/sec daily peak; >12K/sec during large events.
  • Timeline delivery statistics: 30B deliveries/day (~21M/min); 3.5 seconds at p50 (50th percentile) to deliver to 1M; 300K deliveries/sec; at p99 it could take up to 5 minutes (a quick arithmetic check on these rates follows this list).
  • A system called VIZ monitors every cluster. The median request time to the Timeline Service, to get data out of the Scala cluster, is 5 msec. At p99 it’s 100 msec. And p99.9 is where they hit disk, so it takes a couple hundred milliseconds.
  • Zipkin is based on Google’s Dapper system. With it they can taint a request and see every single service it hits, with request times, so they can get a very detailed idea of performance for each request. You can then drill down and see every single request and understand all the different timings. A lot of time is spent debugging the system by looking at where time is being spent on requests. They can also present aggregate statistics by phase, to see how long fanout or delivery took, for example. It was a 2 year project to get the get call for the activities user timeline down to 2 msec. A lot of time was spent fighting GC pauses, fighting memcache lookups, understanding what the topology of the datacenter looks like, and really setting up the clusters for this type of success.
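
A quick arithmetic check relating the delivery figures quoted above; the quoted 300K/sec is a rounded figure for roughly 30B deliveries per day:

    # Relating the timeline delivery statistics quoted above.
    deliveries_per_day = 30e9
    per_minute = deliveries_per_day / (24 * 60)        # ~20.8 million per minute
    per_second = deliveries_per_day / (24 * 60 * 60)   # ~347 thousand per second
    print(f"{per_minute:,.0f}/min, {per_second:,.0f}/sec")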

    http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html

How Twitter Uses Redis to Scale – 105TB RAM, 39MM QPS, 10,000+ Instances

Yao Yue has worked on Twitter’s Cache team since 2010. She recently gave a really great talk: Scaling Redis at Twitter. It’s about Redis of course, but it’s not just about Redis.

Yao has worked at Twitter for a few years. She’s seen some things. She’s watched the growth of the cache service at Twitter explode from it being used by just one project to nearly a hundred projects using it. That’s many thousands of machines, many clusters, and many terabytes of RAM.

It’s clear from her talk that she’s coming from a place of real personal experience and that shines through in the practical way she explores issues. It’s a talk well worth watching.

As you might expect, Twitter has a lot of cache.

Timeline Service for one datacenter using Hybrid List:
  • ~40TB allocated heap
  • ~30MM qps
  • > 6,000 instances
Use of BTree in one datacenter:
  • ~65TB allocated heap
  • ~9MM qps
  • >4,000 instances

You’ll learn more about BTree and Hybrid List later in the post.

A couple of points stood out:

  • Redis is a brilliant idea because it takes underutilized resources on servers and turns them into valuable service.
  • Twitter specialized Redis with two new data types that fit their use cases perfectly. So they got the performance they needed, but it locked them into an older code base and made it hard to merge in new features. I have to wonder, why use Redis for this sort of thing? Just create a timeline service using your own data structures. Does Redis really add anything to the party?
  • Summarize large chunks of log data on the node, using your local CPU power, before saturating the network.
  • If you want something that’s high performance separate the fast path, which is the data path, away from the slow path, which is the command and control path.
  • Twitter is moving towards a container environment with Mesos as the job scheduler. This is still a new approach so it’s interesting to hear about how it works. One issue is the Mesos wastage problem that stems from the requirement to specify hard resource usage limits in a complicated runtime world.
  • A central cluster manager is really important to keep a cluster in a state that’s easy to understand.
  • The JVM is slow and C is fast. Their cache proxy layer is moving back to C/C++.

With that in mind, let’s learn more about how Redis is used at Twitter:

Why Redis?

  • Redis drives Timeline, Twitter’s most important service. Timeline is an index of tweets indexed by an id. Chaining tweets together in a list produces the Home Timeline. The User Timeline, which consists of tweets the user has tweeted, is just another list.

  • Why consider Redis instead of Memcache? The Network Bandwidth Problem and The Long Common Prefix Problem.

  • The Network Bandwidth Problem.

    • Memcache didn’t work as well as Redis for the timeline. The problem was dealing with fanout.

    • Twitter reads and writes happen incrementally and they are fairly small, but the timelines themselves are fairly large.

    • When a tweet is generated it needs to be written to all relevant timelines. The tweet is a small piece of data that is attached to some data structure. On read it’s desirable to load a small batch of tweets. On a scroll down another batch is loaded.

    • The home timeline can be largish, within what is reasonable for a viewer to read in one sitting. Maybe 3000 entries, for example. Which means for performance reasons accessing the databases should be avoided.

    • A read-modify-write cycle for incremental writes, and small reads, on large objects (the timeline) is too expensive and creates a network bottleneck (see the sketch after this list).

    • On a gigalink at 100K+ reads and writes per second, if the average object size is more than 1K, the network becomes the bottleneck.

  • The Long Common Prefix Problem (really two problems)

    • A flexible schema approach is used for data formats. An object has certain attributes that may or may not exist. A separate key can be created for each individual attribute. This requires sending out a separate request for each individual attribute and not all attributes may be in the cache.

    • Metrics that are observed over time have the same name with each sample having a different time stamp. If storing each metric individually the long common prefix is being stored many many times.

    • To be more space efficient in both scenarios, for metrics and a flexible schema, it is desirable to have a hierarchical key space.

  • A dedicated caching cluster underutilizes CPUs. For simple cases, in-memory key-value stores are CPU light. 1% of CPU time on a box can handle more than 1K requests per second for small key values. Though for different data structures the result can be different.

  • Redis is a brilliant idea. It sees what the server can do, but is not doing. For simple key-value stores, there’s a lot of CPU headroom on the server side for a service like Redis.

  • Redis was first used within Twitter in 2010 for the Timeline service. It is also used in the Ads service.

  • The on disk features of Redis are not used. Partly this is because inside Twitter the Cache and Storage services are in different teams so they use whatever mechanisms they think best. Partly this may be because the Storage team thinks another service fits their goals better than Redis.

  • Twitter forked Redis 2.4 and added some features to it, so they are stuck at 2.4 (2.8.14 is the latest stable version). Changes were: two data structure features within Redis; in-house cluster management features; in-house logging and data insight.

  • Hotkeys are a problem, so they are building a tiered caching solution with client side caching that will automatically cache hotkeys.
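
To illustrate the read-modify-write argument above, here is a hedged sketch contrasting a whole-object update against Redis-style incremental list operations. The dict stands in for a plain key-value cache, the redis-py calls show the incremental alternative, and the key names are made up:

    import json
    import redis

    r = redis.Redis()
    memcache_like = {}   # pretend this dict is a remote plain key-value cache

    def append_whole_object(user_id, tweet_id):
        """Read-modify-write: fetch ~3000 entries, insert one, ship them all back."""
        blob = memcache_like.get(f"timeline:{user_id}", "[]")
        timeline = json.loads(blob)
        timeline.insert(0, tweet_id)
        memcache_like[f"timeline:{user_id}"] = json.dumps(timeline)  # whole object crosses the network

    def append_incremental(user_id, tweet_id):
        """Only the new entry crosses the network; the server mutates the list."""
        r.lpush(f"timeline:{user_id}", tweet_id)

    def read_page_incremental(user_id, offset=0, size=20):
        """Reads are small batches too, not the whole object."""
        return r.lrange(f"timeline:{user_id}", offset, offset + size - 1)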

Hybrid List

  • Added Hybrid List to Redis for more predictable memory performance.

  • Timeline is a list of Tweet IDs, so it’s a list of integers. Each ID is small.

  • Redis supports two list encodings: ziplist and linked list. Ziplist is space efficient. The linked list is flexible, but as a doubly linked list it has the overhead of two pointers per entry, which given the size of the ID is very high overhead.

  • To use memory efficiently ziplists are used exclusively.

  • A Redis ziplist threshold is set to the max size of a Timeline, so a Timeline is never bigger than can be stored in a ziplist. This means a product decision, how many tweets can be in a Timeline, is tied to a low level component (Redis). Generally not desirable.

  • Adding to and deleting from a ziplist is inefficient, especially with a very large list. Deleting from a ziplist uses memmove to move data around, to make sure the list is still contiguous. Adding to a ziplist requires a memory realloc call to make enough space for the new entry.

  • There is potential for high write latency due to Timeline size. Timelines vary a lot in size. Most users don’t tweet very much, so their User Timeline is small. Home Timelines, especially those involving celebrities, can be huge. When updating a large timeline and the cache runs out of heap, which is often the case when using a cache, a very large number of very small timelines will be evicted before there’s enough contiguous RAM to handle one big ziplist. As all this cache management takes time, a write operation can have high latency.

  • Since writes are fanned out to a lot of timelines there’s a higher chance to be caught in a write latency trap as memory is used for expanding the timelines.

  • It’s hard to create an SLA for write operations given the high variability of write latencies.

  • Hybrid List is a linked list of ziplists. A threshold is set for how big each ziplist can be in bytes. In bytes, because to be memory efficient it helps to allocate and deallocate blocks of the same size. When a ziplist goes over the threshold, the list spills into the next ziplist. A ziplist is not recycled until it is empty, which means it is possible, through deletion, for each ziplist to end up with only one entry. In practice, tweets aren’t deleted all that often. A toy model of the structure follows this section.

  • Before Hybrid List a workaround was to expire larger timelines more quickly, which freed up memory for other timelines, but was expensive when a user went to view their timeline.
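
The following is a toy model of the Hybrid List idea, purely for illustration: a linked list of fixed-capacity blocks standing in for ziplists. The real structure lives in C inside Twitter's Redis fork and thresholds blocks by bytes rather than entry count:

    class Block:
        """Stand-in for a ziplist: a small, bounded, contiguous chunk of entries."""
        def __init__(self, capacity):
            self.entries = []
            self.capacity = capacity
            self.next = None

    class HybridList:
        """A linked list of fixed-size blocks, so no single allocation grows unbounded."""
        def __init__(self, block_capacity=128):
            self.block_capacity = block_capacity
            self.head = Block(block_capacity)

        def push_front(self, tweet_id):
            # When the current block is full, spill into a fresh one instead of
            # reallocating one ever-larger contiguous region.
            if len(self.head.entries) >= self.head.capacity:
                new_head = Block(self.block_capacity)
                new_head.next = self.head
                self.head = new_head
            self.head.entries.insert(0, tweet_id)

        def __iter__(self):
            block = self.head
            while block is not None:
                yield from block.entries
                block = block.next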

BTree

  • Added BTree to Redis to support range queries on hierarchical keys that return a list of results (a small illustration of the use case follows this section).

  • In Redis the way to deal with secondary keys or fields is a hash map. To have sorted data in order to perform a range query a sorted set is used. Sorted set orders by a score which is a double, so an arbitrary secondary key or an arbitrary name can’t be used for the sorting. Since hash map uses a linear search it’s not great if there are a lot of secondary keys or fields.

  • BTree is the attempt to fix the shortcomings of hash map and sorted set. It’s better to just have one data structure that does what you want. It’s easier to understand and reason about.

  • Borrowed the BSD implementation of BTree and added it to Redis to create a BTree. Supports key lookup as well as range query. Has good lookup performance. The code is relatively simple. The downside is BTree is not memory efficient. It has a lot of meta data overhead due to the pointers.
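
A small illustration of the hierarchical-key range query that the BTree addition serves. Here a sorted key list plus bisect stands in for the BTree; the actual change grafted a BSD BTree implementation onto Redis:

    import bisect

    class RangeIndex:
        """Toy index supporting key lookup plus prefix/range scans over sorted keys."""
        def __init__(self):
            self.keys = []     # kept sorted
            self.values = {}

        def put(self, key, value):
            if key not in self.values:
                bisect.insort(self.keys, key)
            self.values[key] = value

        def get(self, key):
            return self.values.get(key)

        def scan_prefix(self, prefix):
            """Return all (key, value) pairs whose key starts with the given prefix."""
            start = bisect.bisect_left(self.keys, prefix)
            out = []
            for k in self.keys[start:]:
                if not k.startswith(prefix):
                    break
                out.append((k, self.values[k]))
            return out

    # e.g. idx.scan_prefix("metrics:cpu:") returns every sample under that common prefix.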

Cluster Management

  • A cluster is using more than one instance of Redis for a single purpose. If a data set is larger than a single Redis instance can handle or throughput is higher than what a single instance can handle, the key space will need to be partitioned so the data can be stored in more than one shard, across a set of instances. Routing is taking a key and figuring out which shard the data for the key is on.

  • She thinks cluster management is the number one reason Redis adoption hasn’t exploded. When a cluster is available there’s no reason not to migrate all cache use cases to Redis.

  • Tricky to get Redis cluster right. People use Redis because as a data structure server the idea is to perform frequent updates. But a lot of Redis operations are not idempotent. If there’s a network glitch a retry is required and the data can be corrupted.

  • Redis cluster management favors having a centralized manager dictating the global view. With memcache a lot of clusters use a client side approach based on consistent hashing. If there’s inconsistent data, so be it. To provide really good service, a cluster needs features like detecting which shard is down and then replaying operations to get back in sync. After a long enough period spent down, cache state should be cleaned up. Corrupted data in Redis is hard to detect: when a list is missing a chunk, it’s hard to tell.

  • Twitter has made multiple attempts at building a Redis cluster. Twemproxy, which is not used by Twitter internally, was built for Twemcache, and Redis support was added later. Two more solutions were based on proxy style routing. One was associated with the Timeline service and not meant to be general. The second was a generalization of the Timeline solution that provided cluster management, replication, and shard repairing.

  • There are three options in a cluster: servers talk to each other to reach agreement about what the cluster looks like; use a proxy; or do client side cluster management where the clients form a quorum.

  • Didn’t go with a server approach because the philosophy is to keep servers simple, dumb and fast.

  • Didn’t go with the client approach because changes are hard to propagate. Approximately 100 projects at Twitter use a cache cluster. Any change in the client would have to be pushed to 100 clients, and it could take years for changes to propagate. Quick iteration means it’s almost impossible to put code in the client.

  • Went with a proxy style routing approach and partitioning for two reasons. First, a cache service is a high performance service: if you want something that’s high performance, separate the fast path, which is the data path, from the slow path, which is the command and control path. Second, if cluster management is merged into the server it complicates the code for Redis, which is a stateful service; any time you want to fix a bug or upgrade the cluster management code, the stateful Redis service must be restarted too, which will potentially throw away a bunch of data. A rolling restart of a cluster is painful. (A minimal routing sketch follows this section.)

  • There was a concern that the proxy approach inserts another network hop between the client and the server. Profiling showed the extra hop is a myth, at least in their ecosystem. Latency through the proxy to the Redis server was less than .5 milliseconds. At Twitter most of the backend services are Java based and use Finagle to talk to each other. When going through the Finagle path the latency was close to 10 milliseconds. So the extra hop isn’t the problem; inside the JVM is the problem. Outside the JVM you can do pretty much whatever you want, unless of course you go through another JVM.

  • Failure of a proxy doesn’t matter much. On the data path introducing a proxy layer isn’t so bad. The client doesn’t care which proxy they talk to. If a proxy fails after a timeout the client goes to another proxy. No sharding is happening at the proxy level, they are all stateless. To scale throughput simply add more proxies. The tradeoff is additional cost. The proxy layer is allocated resources just to do the forwarding. Cluster management, sharding, and doing the view of the cluster happens outside the proxies. The proxies don’t have to agree with each other.

  • Twitter has instances that have 100K open connections and it works fine. There’s just overhead to pay. There’s no reason to close connections. Just keep them open, it improves latency.

  • Cache clusters are used as a look-aside cache. The caches themselves are not responsible for data replenishment. The client is responsible for fetching a missing key from storage then caching it. If a node goes down the shard is moved to another node. The failed machine is flushed when it comes back so no data is left around. All this is done by the cluster leader. A central viewpoint is really important to keep a cluster in a state that’s easy to understand.

  • Did an experiment with a proxy written in C++. The C++ proxy saw a significant performance increase (no number given). The proxy tier is being moved back to C and C++.
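
A minimal sketch of proxy-style routing along the lines described above: the proxy is stateless, maps a key to a shard on the data path, and receives its cluster view from an external manager on the control path. The hashing scheme, host lists, and method names are illustrative assumptions, not Twitter's proxy:

    import hashlib
    import redis

    class CacheProxy:
        def __init__(self, shard_addresses):
            # Data path only: connections plus a routing table, no stored data.
            self.shards = [redis.Redis(host=h, port=p) for h, p in shard_addresses]

        def _route(self, key):
            digest = hashlib.md5(key.encode()).hexdigest()
            return self.shards[int(digest, 16) % len(self.shards)]

        def get(self, key):
            return self._route(key).get(key)

        def set(self, key, value):
            return self._route(key).set(key, value)

        def update_view(self, shard_addresses):
            # Control path: the cluster manager pushes a new view. Because no data
            # lives here, replacing or restarting a proxy loses nothing.
            self.shards = [redis.Redis(host=h, port=p) for h, p in shard_addresses]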

Data Insight

  • When there’s a call saying the cache system is misbehaving most of the time the cache is fine. Usually the clients are configured wrong. Or they are abusing the cache system by requesting way too many keys. Or requesting the same key over and over again and saturating the server or the link.

  • When you tell someone they are abusing your system they want proof. Which key? Which shard is bad? What kind of traffic leads to this behaviour? Proof requires metrics and analysis that can be shown to customers.

  • An SOA architecture doesn’t give you problem isolation or make debugging easier automatically. You have to have good visibility into every component that makes up the system.

  • Decided to build Insight into caching. The cache is written in C and is fast, so it can provide data that other components can’t. Other components can’t handle the load of providing data for every request.

  • Logging every single command is possible. The cache can log everything at 100K qps. Only meta data is logged, values are not logged (Good joke about the NSA).

  • Avoid locking and blocking. Especially don’t block on disk writes.

  • At 100K qps and 100 bytes per log message, each box will log 10MB of data per second. That’s a lot of data to move off the box: 10% of the network bandwidth would be used just in case something went bad. Economically not feasible.

  • Precompute logs on the box to reduce costs. The assumption is that it is already known what will be computed. A process reads the logs, generates a summary, and periodically sends this view of the box. The view is tiny compared to the original data. (A sketch of this summarization step follows this section.)

  • View data is aggregated by Storm, stored, and there’s a visualization system sitting on top. You can get data like: here are your top 20 keys; here’s your traffic by second, and there’s a peak, which means the traffic pattern is spiky; here’s the number of unique keys, which helps with capacity planning. A lot can be done when every single log is captured.

  • Insight is very valuable for operations. If there are packet drops often that can be linked to either a hot key or spiky traffic behaviour.
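
A hedged sketch of the on-box summarization step: boil a high-volume command log down to top keys, peak per-second traffic, and a unique-key count before shipping anything off the box. The log record layout is an assumption made for the example:

    from collections import Counter

    def summarize(log_lines):
        """Each line is assumed to look like '<epoch_second> <command> <key>'."""
        key_counts = Counter()
        per_second = Counter()
        for line in log_lines:
            ts, _command, key = line.split()[:3]
            key_counts[key] += 1
            per_second[ts] += 1
        return {
            "top_keys": key_counts.most_common(20),    # "here are your top 20 keys"
            "peak_qps": max(per_second.values(), default=0),
            "unique_keys": len(key_counts),            # useful for capacity planning
        }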

Wish List for Redis

  • Explicit memory management.

  • Deployable (Lua) Scripts. Talked about near the start.

  • Multi-threading. Would make cluster management easier. Twitter has a lot of “tall boxes,” where a host has 100+ GB of memory and a lot of CPUs. To use the full capabilities of a server a lot of Redis instances need to be started on a physical machine. With multi-threading fewer instances would need to be started which is much easier to manage.

Lessons Learned

  • Scale demands predictability. The larger the cluster, the more customers, the more predictable and deterministic you want your service to be. When there’s one customer and there’s a problem you can dig into a problem and it’s intriguing. When you have 70 customers you can’t keep up.

  • Tail latencies matter. When you do fanouts to a lot of shards, when one is slow your entire query will be slow.

  • Deterministic configuration is operationally important. Twitter is moving towards a container environment. Mesos is used as the job scheduler. The scheduler fulfills the request for the amount of CPU, memory etc. A monitor kills any job that goes over its resource requirement. Redis causes a problem in a container environment. Redis introduces external fragmentation, meaning you use more memory to store the same amount of data. If you don’t want to be killed you have to compensate for that with oversupply. You have to think my memory fragmentation ratio won’t go over 5%, but I’ll allocate 10% more as a buffer space. Maybe even 20%. Or I think I’ll get 5000 connections per host, but just in case let me allocate memory for 10,000 connections. The result is a huge potential for waste. Super low latency services don’t play well with Mesos today, so these jobs are isolated from other jobs.

  • Knowing your resource usage at runtime is really helpful. In a large cluster bad stuff happens. You think you are safe but things happen and behaviour is unexpected. Most services today can’t degrade gracefully. For example, when a limit of 10GB of RAM is reached then requests are rejected until there’s free RAM. This only fails a small percentage of traffic that’s proportional to the resource that they require. That’s graceful. Garbage collection problems are not graceful, traffic just gets dropped on the floor, this problem affects a lot of teams in a lot of companies every day.

  • Push computation to the data. If you look at relative network speeds, CPU speeds, and disk speeds, it makes sense to do computation before going to disk and before going to the network. An example is summarizing logs on a node before they are pushed to a centralized monitoring service. Lua in Redis is another way to apply computation close to the data (a small example follows this list).

  • Lua is not production ready in Redis today. On demand scripting means service providers can’t guarantee their SLA. A loaded script can do anything. What service provider would want to take the risk of blowing their SLA because of someone else’s code? A deployment model would be better. It would allow for code review and benchmarking, so resource usage and performance could be properly calculated.

  • Redis as the next high performance stream processing platform. It has pub-sub and scripting. Why not?
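
As a small example of pushing computation to the data, here is a Redis Lua script run through redis-py's EVAL that counts matching entries server-side instead of shipping a whole list over the network to count it on the client. This is exactly the on-demand scripting style the talk is wary of, and the key name and threshold are made up:

    import redis

    r = redis.Redis()

    # Count, on the server, how many IDs in a timeline list exceed a threshold.
    COUNT_ABOVE = """
    local ids = redis.call('LRANGE', KEYS[1], 0, -1)
    local n = 0
    for _, id in ipairs(ids) do
      if tonumber(id) > tonumber(ARGV[1]) then n = n + 1 end
    end
    return n
    """

    newer_than = r.eval(COUNT_ABOVE, 1, "timeline:12345", 1234500000)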

http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html