Raw Notes from Beyond LAMP: Scaling Websites Past MySQL at SXSWi 2010

Sorry, I got to “Beyond LAMP” about 25 minutes late - my notes don’t include anything from the first part of the meeting.


Updated: Here are some additional great notes covering the beginning part of the session, as well as some more organized notes from the end part.
  • twitter uses cassandra
    • no disk seeks when you do a write
    • no master, you can do write on any machine
    • when you post a twitter, it gets written into the queue for each of your followers. so if you have 1,000 followers, then it’s written 1,000 times.
    • cassandra designed to use commodity servers
  • monitoring
    • one of the tricks is to know when a machine needs to be replaced when all you have are hundreds of commodity servers
    • monitor, monitor, monitor
    • cpu load, file descripters, bandwidth, database connections, database performance, disk space, etc.
    • need monitoring system, ability to graph, all centrally, so you don’t have to go to individual machines.
  • Being on the front page of digg crashed server (digg sends a lot of traffic)
    • Why not memcache the front page? It gets loaded all day long, and it is always the same.
    • Rewrote the system to only read from database when it’s not memcached. Refresh memcache once per day.
    • Before change, was 60% db writes, 40% reads. After change was 99% db writes, only 1% reads. All the reads were now coming from memcache. 
  • At twitter, expect that people will come along and read what you’ve written. So they do write-thru caching. The tweet is put first into the cache, then into the database. This way they never need to read the database to get the recent tweets.
  • when you get beyond a certain point, you can’t analyze the data on a single machine. you have terabytes and terabytes of data.
    • Hadoop lets you run distributed jobs, that automatically retry when systems go down or fail.
    • Without this, as the data grows, you end up asking simpler and simpler questions. 
    • With this, you can ask more sophisticated questions. 
  • Scaling search...
    • they can process 1 to 2 searches per second
    • search is hard
  • What was the first thing to blow up for you?
    • 1st was mysql, 2nd was apache. 
      • made the switch over to engineX for serving up images. much, much faster. Using Apache was like using a sledgehammer to server up images.
    • connection issues with postgres.
    • migrating data schema when at scale is really hard... turn off indexes before copying data
    • twitter: one thing that kept our ops team awake at night...
      • we are a rails app
      • how do we maintain relationships?
      • we had it normalize with a follower table: user_id, follower_id in a single table.
      • lookups against this table were table
      • they built an intermediate solution... denormalized data structure
      • while they worked on a longer term solution... they built a custom social graph data tool.
      • it need to work across 7 orders of magnitude: from someone with 1 follower to someone with 1M followers
  • Questions
    • Deployment
      • twitter uses murder: bit-torrent for deployment. seed some servers, then those servers help feed others. brought main app deployment time from 12 minutes to 37 seconds. check out twitter opensource - they open source most of their tools
    • Hardware databases
      • twitter is using some, facebook experimenting with them. they are PCI express cards almost as fast as main memory.
    • databases versus key stores
      • it’s natural to go to denormalized - you just want the data you want in the form you want
      • over time, more of the logic goes into the application code, so database indexes are less useful
    • how do you manage when data is on particular servers
      • at twitter, using cassandra, are already consistent
        • there are bunch of new systems that have different tradeoffs. some have eventual consistency, some don’t. if your application can’t handle eventual consistency, then cassandra isn’t for you.
    • did anyone consider any of the top ten database, like say oracle
      • twitter: we strongly prefer open source systems. as we scale, we like to be able to peek under the hood, and see what is going on.
      • facebook: we like open source, we like the way open source projects work together, we like to be nimble - these proprietary systems are not so nimble.
      • it’s a combination of openness, ideology, and cost.
    • berkeleydb versus memcached
      • memcached is just a wrapper on top of berkely db
Post a Comment