Adrian Cockcroft

@adrianco

Battery Ventures

Please, no More Minutes, Milliseconds, Monoliths… Or Monitoring Tools!
#Monitorama May 2014
  • Why give a monitoring talk when I’m known as the Cloud guy?
  • 20 Years of free and open source tools for monitoring
  • “Virtual Adrian” rules
    • disk rule for all disks at once: look for slow and unbalanced usage
    • network rule: look for slow and unbalanced usage
  • No more monitoring tools
    • We have too many already
    • We need more analysis tools
  • Rule #1: Spend more time working on code that analyzes the meaning of metrics than the code that collects, moves, stores, and displays metrics.
  • What’s wrong with minutes?
    • Takes too long to see a problem
    • Something breaks at 2m20s
    • The first 40s of failure don’t trigger anything until the next whole-minute sample (3m)
    • 1st high metric is seen by the agent on the instance
    • 1st high metric makes it to the central server (3m30s)
    • One data point isn’t enough to alert on, so it takes 3 data points (5m30s)
    • More than 5 minutes after the break, we can finally act on it (see the timing sketch after Rule #2)
  • Should be monitoring by the second
  • SaaS based products show what can be done
    • monitoring by the second
  • Netflix: Streaming metrics directly from front end services to a web browser
  • Rule #2: Metric to display latency needs to be less than human attention span (~10s)
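As a rough illustration of why minute granularity is too slow, the sketch below models worst-case time-to-action for different collection intervals. The 30-second transport delay and the 3-data-point confirmation rule are taken loosely from the timeline above; the function name and numbers are illustrative, not from the talk.

```python
# Rough model (illustrative, not the exact numbers from the talk): worst-case
# time from failure to action for a polling monitor that waits for 3 consecutive
# bad data points and then takes ~30s to move the data to a central server.

def worst_case_time_to_action(collection_interval_s, bad_points_needed=3,
                              transport_delay_s=30):
    """The failure can start just after a sample, so add one interval of slack,
    then one interval per confirming data point, then the transport delay."""
    slack = collection_interval_s                    # failure just missed the last sample
    confirmation = bad_points_needed * collection_interval_s
    return slack + confirmation + transport_delay_s

for interval in (60, 10, 1):                         # minutes vs. seconds
    t = worst_case_time_to_action(interval)
    print(f"{interval:>3}s collection interval -> ~{t}s until someone can act")
```

With 60-second samples this model gives roughly four and a half minutes before anyone can act; 1-second samples bring it down to tens of seconds, and shortening the confirmation and transport steps is what gets metric-to-display latency under the ~10 seconds Rule #2 asks for.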
  • What’s wrong with milliseconds?
    • Some JVM tools measure response times in ms
      • Network round trip within a datacenter is less than 1ms
      • SSD access latency is usually less than 1 ms
      • Cassandra response times can be less than 1ms
    • Rounding errors make 1ms granularity insufficient to accurately measure and detect problems (see the sketch below)
  • Rule #3: Validate that your measurement system has enough accuracy and precision
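A minimal sketch of the rounding problem, using invented sub-millisecond latencies: a tool that reports whole milliseconds cannot tell these operations apart, or see a large regression that stays under 1ms.

```python
# Sketch of Rule #3 with made-up numbers: whole-millisecond reporting hides
# sub-millisecond differences (in-datacenter round trips, SSD reads, fast
# Cassandra responses), so regressions below 1ms are invisible.

true_latencies_us = [180, 420, 650, 900, 1100]   # hypothetical latencies in microseconds

for us in true_latencies_us:
    ms_reported = round(us / 1000)               # what a millisecond-granularity tool shows
    print(f"actual {us:>5} us  ->  reported {ms_reported} ms")

# 180us and 420us both show up as "0 ms"; a call that degrades from 650us to
# 1100us still reads "1 ms" -- measure in microseconds to keep enough precision.
```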
  • Monolithic Monitoring Systems
    • Simple to build and install, but problematic
    • What happens if it goes down, or while it is being redeployed?
    • Should be a pool of analysis/display aggregators and a pool of distributed collection systems, all monitoring a large number of applications (see the sketch after this list)
    • Scalability: 
      • problems scaling data collection, analysis, and reporting throughput
      • limitations on the number of metrics that can be monitored
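One way to avoid the monolith, sketched below with hypothetical hostnames (this is not a description of any specific Netflix tool): shard metric streams across a pool of collectors so ingest and analysis scale out, and losing or redeploying one collector only affects a fraction of the metrics.

```python
# Hypothetical sketch of a collector pool: assign each metric stream to one of
# several collectors by hashing its name, instead of funnelling everything
# through a single monolithic monitoring server.

import hashlib

COLLECTORS = ["collector-1.example.internal",
              "collector-2.example.internal",
              "collector-3.example.internal"]

def collector_for(metric_name: str) -> str:
    """Stable assignment of a metric stream to one collector in the pool."""
    digest = hashlib.md5(metric_name.encode()).hexdigest()
    return COLLECTORS[int(digest, 16) % len(COLLECTORS)]

for name in ["api.requests", "api.latency.p99", "cassandra.reads", "jvm.gc.pause"]:
    print(name, "->", collector_for(name))
```

A real pool would use consistent hashing so that adding or removing a collector only moves a small fraction of the streams; the modulo here is just the simplest thing that shows the idea.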
  • In-Band, Out-of-band, or both?
    • In-band: can leave you blind during an outage
    • SaaS is out of band, but can also sometimes go down
    • So the right answer is to have both SaaS and internal monitoring, so that no single outage can take everything out (see the sketch below)
  • Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored.
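A minimal sketch of the “both” answer, with placeholder URLs: every metric is published to an in-band internal endpoint and to an out-of-band SaaS endpoint, and a failure of either path is tolerated.

```python
# Sketch with placeholder endpoints (neither URL is a real service): publish
# each metric both in-band (internal) and out-of-band (SaaS), so no single
# outage leaves you blind.

import json
import urllib.request

ENDPOINTS = [
    "http://metrics.internal.example/api/put",    # in-band, inside our own network
    "https://saas-monitoring.example.com/ingest", # out-of-band SaaS
]

def publish(metric: str, value: float) -> None:
    payload = json.dumps({"metric": metric, "value": value}).encode()
    for url in ENDPOINTS:
        try:
            req = urllib.request.Request(url, data=payload,
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req, timeout=2)
        except OSError:
            # One path being down must not stop the other from receiving data.
            pass

publish("frontend.requests.errors", 3.0)
```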
  • Issues with Continuous Delivery and Microservices
    • High rate of change
      • Code pushes can cause floods of new instances and metrics
      • Short baseline for alert threshold analysis, so everything looks unusual
    • Ephemeral configurations
      • Short lifetimes make it hard to aggregate historical views (see the aggregation sketch after this list)
      • Hand-tweaked monitoring tools take too much work to keep running
    • Microservices with complex calling patterns
      • end-to-end request flow measurements are very important
      • Request flow visualizations get very complicated
      • How many microservices? Some companies go from zero to 450 in a year.
    • “Death Star” Architecture Diagrams
      • You have to spend time thinking about visualizations
      • You need hierarchy: ways to see individual microservices but also groups of services
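One way to cope with short instance lifetimes, sketched below with an assumed naming scheme (cluster names carrying a per-push version suffix, which is not necessarily how any particular shop tags things): roll per-instance metrics up to the long-lived service name before storing them, so the historical view survives instance churn and code pushes.

```python
# Sketch with an assumed tagging scheme: instance ids and per-push cluster
# versions are ephemeral, so aggregate metrics up to the stable service name
# to keep one continuous history across pushes.

from collections import defaultdict

samples = [
    # (instance_id, cluster, metric, value) -- instance ids change on every push
    ("i-0a12", "api-prod-v123", "requests", 410),
    ("i-0b34", "api-prod-v123", "requests", 395),
    ("i-9c56", "api-prod-v124", "requests", 402),  # new push, brand-new instances
]

totals = defaultdict(float)
for _, cluster, metric, value in samples:
    service = cluster.rsplit("-v", 1)[0]   # strip the per-push version suffix
    totals[(service, metric)] += value

for (service, metric), total in totals.items():
    print(service, metric, total)          # api-prod requests 1207.0
```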
  • Autoscaled ephemeral instances at Netflix (the old way)
    • Largest services use autoscaled red/black code pushes
    • average lifetime of an instance is 36 hours
    • Uses trailing load indicators
  • Scryer: Predictive Auto-scaling at Netflix
    • More morning load; Sat/Sun have high traffic
    • Lower load on Wednesday
    • 24 hours of predicted traffic vs. actual
    • Uses forward prediction to scale based on expected load (see the sketch below)
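A simplified sketch of the forward-prediction idea (this is not Scryer’s actual algorithm, and all the numbers are invented): predict the coming hour’s load from the same hour-of-week in recent weeks, then provision capacity for the prediction instead of waiting for trailing indicators to catch up.

```python
# Simplified forward prediction in the spirit of Scryer (not the real algorithm;
# request rates, capacity per instance and headroom below are all invented).

def predict_next_hour(history_by_week, hour_of_week):
    """Average the load seen at this hour-of-week over the past few weeks."""
    past = [week[hour_of_week] for week in history_by_week]
    return sum(past) / len(past)

def instances_needed(predicted_rps, rps_per_instance=500, headroom=1.2):
    """Provision ahead of the predicted load, with some safety margin."""
    return int(predicted_rps * headroom / rps_per_instance) + 1

# Hypothetical data: requests/sec at the same hour in each of the last 3 weeks.
history = [{10: 41000}, {10: 45000}, {10: 43000}]
rps = predict_next_hour(history, hour_of_week=10)
print(f"predicted ~{rps:.0f} req/s -> scale to {instances_needed(rps)} instances ahead of time")
```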
  • Monitoring Tools for Developers
    • Most monitoring tools are built to be used by operations people
      • Focus on individual systems rather than applications
      • Focus on utilization rather than throughput and response time.
      • Hard to integrate and extend
    • Developer oriented monitoring tools
      • Application Performance Measurement (APM) and Analysis
      • Business transactions, response time, JVM internal metrics
      • Logging business metrics directly
      • APIs for integration, data extraction, deep linking and embedding
        • Deep linking: should be able to cut and paste a link that shows anyone exactly the data I’m seeing
        • Embedding: be able to put a chart in a wiki page or elsewhere (see the sketch after this list)
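A small sketch of the developer-oriented ideas above: log a business metric directly from application code as a structured line, and build a deep link that reproduces the exact view so it can be pasted into a wiki page or an incident channel. The dashboard URL and the log format are invented for illustration.

```python
# Sketch of developer-oriented instrumentation (dashboard URL and log format
# are made up): emit business metrics directly from the code, and generate a
# shareable deep link to the exact chart being looked at.

import time
from urllib.parse import urlencode

def record_business_metric(name: str, value: float, log=print) -> None:
    """Log the metric as a structured, machine-parseable line."""
    log(f'metric name="{name}" value={value} ts={int(time.time())}')

def deep_link(metric: str, window: str = "1h") -> str:
    """Return a URL that reproduces exactly the data I'm looking at."""
    query = urlencode({"q": metric, "window": window})
    return f"https://dashboards.example.com/view?{query}"

record_business_metric("signups.completed", 1)
print(deep_link("signups.completed", window="24h"))
```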
  • Dynamic and Ephemeral Challenges
    • Datacenter Assets
      • Arrive infrequently, disappear infrequently
      • Stick around for three years or so before they get retired
      • Have unique IP and MAC addresses
    • Cloud Assets
      • Arrive in bursts. A Netflix code push creates over a hundred per minute
      • Stick around for a few hours before they get retired
      • Often reuse the IP and MAC addresses that were just vacated
      • Use Netflix OSS Edda to record a full history of your configuration (see the sketch below)
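Edda is the Netflix OSS tool that records this history for AWS; the sketch below is only the underlying idea, not Edda’s actual API: keep every timestamped configuration snapshot so a reused IP address can be traced back to whichever short-lived instance held it at a given time.

```python
# Minimal sketch of the idea behind Edda (not Edda's API): append-only,
# timestamped snapshots of cloud configuration, so questions like "which
# instance had this IP last night?" stay answerable after the IP is reused.

import time

history = []    # append-only list of {"ts": ..., "instances": [...]} snapshots

def snapshot(describe_instances):
    """describe_instances() is assumed to return the current instance list."""
    history.append({"ts": time.time(), "instances": describe_instances()})

def who_had_ip(ip, at_ts):
    """Find which instance owned an IP address at a point in the past."""
    past = [s for s in history if s["ts"] <= at_ts]
    if not past:
        return None
    latest = max(past, key=lambda s: s["ts"])
    return next((i["id"] for i in latest["instances"] if i["ip"] == ip), None)

# Hypothetical usage with a fake inventory source:
snapshot(lambda: [{"id": "i-0a12", "ip": "10.0.1.5"}])
time.sleep(0.01)
snapshot(lambda: [{"id": "i-9c56", "ip": "10.0.1.5"}])   # IP reused by a new instance
print(who_had_ip("10.0.1.5", at_ts=time.time()))         # -> i-9c56
```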
  • Distributed Cloud Application Challenges
    • Cloud provider data stores don’t have the usual monitoring hooks: no way to install an agent on AWS MySQL.
    • Dependency on web services as well as code.
    • Cloud applications span zones and regions: monitoring tools also need to span and aggregate zones and regions. 
    • Monit
  • Links
    • http://techblog.netflix.com: Read about Netflix tools and solutions to these problems
    • Adrian’s blog: http://perfcap.blogspot.com
    • Slideshare: http://slideshare.com/adriancockcroft
  • Q&A:
    • Post-commit, pre-deploy statistical tests. What do you test?
      • Error rate. Performance. Latency.
      • Using JMeter to drive the load (see the sketch below)
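A sketch of what such a pre-deploy gate might look like (the tolerances and data are invented; the talk only says error rate and latency are checked under JMeter-driven load): compare the candidate build’s latency and error rate against the baseline and block the deploy on a regression.

```python
# Sketch of a post-commit, pre-deploy gate (tolerances and sample data invented):
# after driving identical JMeter load at the baseline and the candidate build,
# fail the push if p99 latency or the error count regresses beyond a tolerance.

from statistics import quantiles

def passes_gate(baseline_ms, candidate_ms, baseline_errors, candidate_errors,
                latency_tolerance=1.10, error_tolerance=1.05):
    """Allow at most a 10% p99 latency and 5% error-count regression."""
    base_p99 = quantiles(baseline_ms, n=100)[98]
    cand_p99 = quantiles(candidate_ms, n=100)[98]
    latency_ok = cand_p99 <= base_p99 * latency_tolerance
    errors_ok = candidate_errors <= max(baseline_errors, 1) * error_tolerance
    return latency_ok and errors_ok

# Fake response-time samples in milliseconds, as JMeter might report them.
baseline = [12, 14, 13, 15, 14, 13, 16, 14, 13, 15] * 20
candidate = [13, 15, 14, 16, 15, 14, 17, 15, 14, 16] * 20
print("deploy allowed:", passes_gate(baseline, candidate,
                                     baseline_errors=3, candidate_errors=3))
```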