Adrian Cockcroft

@adrianco

Battery Ventures

Please, no More Minutes, Milliseconds, Monoliths… Or Monitoring Tools!
#Monitorama May 2014
  • Why give a monitoring talk when I’m known as the Cloud guy?
  • 20 Years of free and open source tools for monitoring
  • “Virtual Adrian” rules
    • disk rule for all disks at once: look for slow and unbalanced usage
    • network rule: look for slow and unbalanced usage
  • No more monitoring tools
    • We have too many already
    • We need more analysis tools
  • Rule #1: Spend more time working on code that analyzes the meaning of metrics than the code that collects, moves, stores, and displays metrics.
  • What’s wrong with minutes?
    • Takes too long to see a problem
    • Something breaks at 2m20s
    • The first 40s of failure don’t trigger anything until the next whole-minute sample (3m)
    • 1st high metric is seen by the agent on the instance
    • 1st high metric makes it to the central server (3m30s)
    • One data point isn’t enough to alert on, so it takes 3 data points (5m30s)
    • More than 5 minutes after the break, we can finally act on it (see the timing sketch after Rule #2)
  • Should be monitoring by the second
  • SaaS based products show what can be done
    • monitoring by the second
  • Netflix: Streaming metrics directly from front end services to a web browser
  • Rule #2: Metric to display latency needs to be less than human attention span (~10s)
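As a rough illustration of why minute granularity is too slow, the sketch below models worst-case time-to-action for different collection intervals. The 30-second transport delay and the 3-data-point confirmation rule are taken loosely from the timeline above; the function name and numbers are illustrative, not from the talk.

```python
# Rough model (illustrative, not the exact numbers from the talk): worst-case
# time from failure to action for a polling monitor that waits for 3 consecutive
# bad data points and then takes ~30s to move the data to a central server.

def worst_case_time_to_action(collection_interval_s, bad_points_needed=3,
                              transport_delay_s=30):
    """The failure can start just after a sample, so add one interval of slack,
    then one interval per confirming data point, then the transport delay."""
    slack = collection_interval_s                    # failure just missed the last sample
    confirmation = bad_points_needed * collection_interval_s
    return slack + confirmation + transport_delay_s

for interval in (60, 10, 1):                         # minutes vs. seconds
    t = worst_case_time_to_action(interval)
    print(f"{interval:>3}s collection interval -> ~{t}s until someone can act")
```

With 60-second samples this model gives roughly four and a half minutes before anyone can act; 1-second samples bring it down to tens of seconds, and shortening the confirmation and transport steps is what gets metric-to-display latency under the ~10 seconds Rule #2 asks for.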
  • What’s wrong with milliseconds?
    • Some JVM tools measure response times in ms
      • Network round trip within a datacenter is less than 1ms
      • SSD access latency is usually less than 1 ms
      • Cassandra response times can be less than 1ms
    • Rounding errors make 1ms granularity insufficient to accurately measure and detect problems (see the sketch below)
  • Rule #3: Validate that your measurement system has enough accuracy and precision
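A minimal sketch of the rounding problem, using invented sub-millisecond latencies: a tool that reports whole milliseconds cannot tell these operations apart, or see a large regression that stays under 1ms.

```python
# Sketch of Rule #3 with made-up numbers: whole-millisecond reporting hides
# sub-millisecond differences (in-datacenter round trips, SSD reads, fast
# Cassandra responses), so regressions below 1ms are invisible.

true_latencies_us = [180, 420, 650, 900, 1100]   # hypothetical latencies in microseconds

for us in true_latencies_us:
    ms_reported = round(us / 1000)               # what a millisecond-granularity tool shows
    print(f"actual {us:>5} us  ->  reported {ms_reported} ms")

# 180us and 420us both show up as "0 ms"; a call that degrades from 650us to
# 1100us still reads "1 ms" -- measure in microseconds to keep enough precision.
```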
  • Monolithic Monitoring Systems
    • Simple to build and install, but problematic
    • What happens if it goes down, or while it is being redeployed?
    • Should be a pool of analysis/display aggregators and a pool of distributed collection systems, all monitoring a large number of applications (see the sketch after this list)
    • Scalability: 
      • problems scaling data collection, analysis, and reporting throughput
      • limitations on the number of metrics that can be monitored
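One way to avoid the monolith, sketched below with hypothetical hostnames (this is not a description of any specific Netflix tool): shard metric streams across a pool of collectors so ingest and analysis scale out, and losing or redeploying one collector only affects a fraction of the metrics.

```python
# Hypothetical sketch of a collector pool: assign each metric stream to one of
# several collectors by hashing its name, instead of funnelling everything
# through a single monolithic monitoring server.

import hashlib

COLLECTORS = ["collector-1.example.internal",
              "collector-2.example.internal",
              "collector-3.example.internal"]

def collector_for(metric_name: str) -> str:
    """Stable assignment of a metric stream to one collector in the pool."""
    digest = hashlib.md5(metric_name.encode()).hexdigest()
    return COLLECTORS[int(digest, 16) % len(COLLECTORS)]

for name in ["api.requests", "api.latency.p99", "cassandra.reads", "jvm.gc.pause"]:
    print(name, "->", collector_for(name))
```

A real pool would use consistent hashing so that adding or removing a collector only moves a small fraction of the streams; the modulo here is just the simplest thing that shows the idea.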
  • In-Band, Out-of-band, or both?
    • In-band: can leave you blind during an outage
    • SaaS is out of band, but can also sometimes go down
    • So the right answer is to have both SaaS and internal monitoring, so that no single outage can take everything out (see the sketch below)
  • Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored.
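A minimal sketch of the “both” answer, with placeholder URLs: every metric is published to an in-band internal endpoint and to an out-of-band SaaS endpoint, and a failure of either path is tolerated.

```python
# Sketch with placeholder endpoints (neither URL is a real service): publish
# each metric both in-band (internal) and out-of-band (SaaS), so no single
# outage leaves you blind.

import json
import urllib.request

ENDPOINTS = [
    "http://metrics.internal.example/api/put",    # in-band, inside our own network
    "https://saas-monitoring.example.com/ingest", # out-of-band SaaS
]

def publish(metric: str, value: float) -> None:
    payload = json.dumps({"metric": metric, "value": value}).encode()
    for url in ENDPOINTS:
        try:
            req = urllib.request.Request(url, data=payload,
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req, timeout=2)
        except OSError:
            # One path being down must not stop the other from receiving data.
            pass

publish("frontend.requests.errors", 3.0)
```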
  • Issues with Continuous Delivery and Microservices
    • High rate of change
      • Code pushes can cause floods of new instances and metrics
      • Short baseline for alert threshold analysis, so everything looks unusual
    • Ephemeral configurations
      • Short lifetimes make it hard to aggregate historical views (see the aggregation sketch after this list)
      • Hand-tweaked monitoring tools take too much work to keep running
    • Microservices with complex calling patterns
      • end-to-end request flow measurements are very important
      • Request flow visualizations get very complicated
      • How many microservices? Some companies go from zero to 450 in a year.
    • “Death Star” Architecture Diagrams
      • You have to spend time thinking about visualizations
      • You need hierarchy: ways to see individual microservices but also groups of services
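One way to cope with short instance lifetimes, sketched below with an assumed naming scheme (cluster names carrying a per-push version suffix, which is not necessarily how any particular shop tags things): roll per-instance metrics up to the long-lived service name before storing them, so the historical view survives instance churn and code pushes.

```python
# Sketch with an assumed tagging scheme: instance ids and per-push cluster
# versions are ephemeral, so aggregate metrics up to the stable service name
# to keep one continuous history across pushes.

from collections import defaultdict

samples = [
    # (instance_id, cluster, metric, value) -- instance ids change on every push
    ("i-0a12", "api-prod-v123", "requests", 410),
    ("i-0b34", "api-prod-v123", "requests", 395),
    ("i-9c56", "api-prod-v124", "requests", 402),  # new push, brand-new instances
]

totals = defaultdict(float)
for _, cluster, metric, value in samples:
    service = cluster.rsplit("-v", 1)[0]   # strip the per-push version suffix
    totals[(service, metric)] += value

for (service, metric), total in totals.items():
    print(service, metric, total)          # api-prod requests 1207.0
```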
  • Autoscaled ephemeral instances at Netflix (the old way)
    • Largest services use autoscaled red/black code pushes
    • average lifetime of an instance is 36 hours
    • Uses trailing load indicators
  • Scryer: Predictive Auto-scaling at Netflix
    • More morning load; Sat/Sun have high traffic
    • Lower load on Wednesday
    • 24 hours of predicted traffic vs. actual
    • Uses forward prediction to scale based on expected load (see the sketch below)
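A simplified sketch of the forward-prediction idea (this is not Scryer’s actual algorithm, and all the numbers are invented): predict the coming hour’s load from the same hour-of-week in recent weeks, then provision capacity for the prediction instead of waiting for trailing indicators to catch up.

```python
# Simplified forward prediction in the spirit of Scryer (not the real algorithm;
# request rates, capacity per instance and headroom below are all invented).

def predict_next_hour(history_by_week, hour_of_week):
    """Average the load seen at this hour-of-week over the past few weeks."""
    past = [week[hour_of_week] for week in history_by_week]
    return sum(past) / len(past)

def instances_needed(predicted_rps, rps_per_instance=500, headroom=1.2):
    """Provision ahead of the predicted load, with some safety margin."""
    return int(predicted_rps * headroom / rps_per_instance) + 1

# Hypothetical data: requests/sec at the same hour in each of the last 3 weeks.
history = [{10: 41000}, {10: 45000}, {10: 43000}]
rps = predict_next_hour(history, hour_of_week=10)
print(f"predicted ~{rps:.0f} req/s -> scale to {instances_needed(rps)} instances ahead of time")
```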
  • Monitoring Tools for Developers
    • Most monitoring tools are built to be used by operations people
      • Focus on individual systems rather than applications
      • Focus on utilization rather than throughput and response time.
      • Hard to integrate and extend
    • Developer oriented monitoring tools
      • Application Performance Measurement (APM) and Analysis
      • Business transactions, response time, JVM internal metrics
      • Logging business metrics directly
      • APIs for integration, data extraction, deep linking and embedding
        • Deep linking: should be able to cut and paste a link that shows anyone exactly the data I’m seeing
        • Embedding: be able to put a chart in a wiki page or elsewhere (see the sketch after this list)
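A small sketch of the developer-oriented ideas above: log a business metric directly from application code as a structured line, and build a deep link that reproduces the exact view so it can be pasted into a wiki page or an incident channel. The dashboard URL and the log format are invented for illustration.

```python
# Sketch of developer-oriented instrumentation (dashboard URL and log format
# are made up): emit business metrics directly from the code, and generate a
# shareable deep link to the exact chart being looked at.

import time
from urllib.parse import urlencode

def record_business_metric(name: str, value: float, log=print) -> None:
    """Log the metric as a structured, machine-parseable line."""
    log(f'metric name="{name}" value={value} ts={int(time.time())}')

def deep_link(metric: str, window: str = "1h") -> str:
    """Return a URL that reproduces exactly the data I'm looking at."""
    query = urlencode({"q": metric, "window": window})
    return f"https://dashboards.example.com/view?{query}"

record_business_metric("signups.completed", 1)
print(deep_link("signups.completed", window="24h"))
```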
  • Dynamic and Ephemeral Challenges
    • Datacenter Assets
      • Arrive infrequently, disappear infrequently
      • Stick around for three years or so before they get retired
      • Have unique IP and MAC addresses
    • Cloud Assets
      • Arrive in bursts. A Netflix code push creates over a hundred per minute
      • Stick around for a few hours before they get retired
      • Often reuse the IP and MAC addresses that were just vacated
      • Use Netflix OSS Edda to record a full history of your configuration (see the sketch below)
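Edda is the Netflix OSS tool that records this history for AWS; the sketch below is only the underlying idea, not Edda’s actual API: keep every timestamped configuration snapshot so a reused IP address can be traced back to whichever short-lived instance held it at a given time.

```python
# Minimal sketch of the idea behind Edda (not Edda's API): append-only,
# timestamped snapshots of cloud configuration, so questions like "which
# instance had this IP last night?" stay answerable after the IP is reused.

import time

history = []    # append-only list of {"ts": ..., "instances": [...]} snapshots

def snapshot(describe_instances):
    """describe_instances() is assumed to return the current instance list."""
    history.append({"ts": time.time(), "instances": describe_instances()})

def who_had_ip(ip, at_ts):
    """Find which instance owned an IP address at a point in the past."""
    past = [s for s in history if s["ts"] <= at_ts]
    if not past:
        return None
    latest = max(past, key=lambda s: s["ts"])
    return next((i["id"] for i in latest["instances"] if i["ip"] == ip), None)

# Hypothetical usage with a fake inventory source:
snapshot(lambda: [{"id": "i-0a12", "ip": "10.0.1.5"}])
time.sleep(0.01)
snapshot(lambda: [{"id": "i-9c56", "ip": "10.0.1.5"}])   # IP reused by a new instance
print(who_had_ip("10.0.1.5", at_ts=time.time()))         # -> i-9c56
```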
  • Distributed Cloud Application Challenges
    • Cloud provider data stores don’t have the usual monitoring hooks: no way to install an agent on AWS MySQL.
    • Dependency on web services as well as code.
    • Cloud applications span zones and regions: monitoring tools also need to span and aggregate zones and regions. 
    • Monit
  • Links
    • http://techblog.netflix.com: Read about Netflix tools and solutions to these problems
    • Adrian’s blog: http://perfcap.blogspot.com
    • Slideshare: http://slideshare.com/adriancockcroft
  • Q&A:
    • Post-commit, pre-deploy statistical tests. What do you test?
      • Error rate. Performance. Latency.
      • Using JMeter to drive the load (see the sketch below)
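A sketch of what such a pre-deploy gate might look like (the tolerances and data are invented; the talk only says error rate and latency are checked under JMeter-driven load): compare the candidate build’s latency and error rate against the baseline and block the deploy on a regression.

```python
# Sketch of a post-commit, pre-deploy gate (tolerances and sample data invented):
# after driving identical JMeter load at the baseline and the candidate build,
# fail the push if p99 latency or the error count regresses beyond a tolerance.

from statistics import quantiles

def passes_gate(baseline_ms, candidate_ms, baseline_errors, candidate_errors,
                latency_tolerance=1.10, error_tolerance=1.05):
    """Allow at most a 10% p99 latency and 5% error-count regression."""
    base_p99 = quantiles(baseline_ms, n=100)[98]
    cand_p99 = quantiles(candidate_ms, n=100)[98]
    latency_ok = cand_p99 <= base_p99 * latency_tolerance
    errors_ok = candidate_errors <= max(baseline_errors, 1) * error_tolerance
    return latency_ok and errors_ok

# Fake response-time samples in milliseconds, as JMeter might report them.
baseline = [12, 14, 13, 15, 14, 13, 16, 14, 13, 15] * 20
candidate = [13, 15, 14, 16, 15, 14, 17, 15, 14, 16] * 20
print("deploy allowed:", passes_gate(baseline, candidate,
                                     baseline_errors=3, candidate_errors=3))
```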