Time Travel in a Simulated Universe

Posted on December 11, 2016

How does it affect time travel if you start with the assumption that reality as we know it is a computer simulation?

In this case, time travel has nothing to do with physics, and everything to do with software simulations.

Time travel backward would require that the program save all previous states (or at least checkpoints at fine enough granularity to be useful for time traveling) and provide the ability to insert logic and data from the present into states of the program in the past. Seems feasible.

Time travel forward would consist of removing the time traveling person from the program, running the program forward until reaching the future destination, then reinserting the person.

Forward time travel is relatively cheap (because you’d be running the program forward anyhow), but backward time travel is expensive because you keep having to roll the universe back, slowing the forward progress of time. In fact, one person could do a denial of service attack on reality simply by continually traveling to the distant past. Then, every time you come back, you would have to immediately return to the past.
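As a toy illustration of that cost asymmetry, here is a minimal Python sketch. Everything in it (the `Simulation` class, the tick counter, the checkpoint interval) is invented for illustration; it only assumes that the universe is a state that can be copied, checkpointed, and replayed.

```python
import copy


class Simulation:
    """Toy universe: the state is a dict and each tick advances a counter."""

    def __init__(self, checkpoint_every=10):
        self.state = {"tick": 0, "people": {"alice"}}
        self.checkpoint_every = checkpoint_every
        self.checkpoints = {0: copy.deepcopy(self.state)}
        self.ticks_computed = 0  # total work done, including replays

    def step(self):
        self.state["tick"] += 1
        self.ticks_computed += 1
        if self.state["tick"] % self.checkpoint_every == 0:
            self.checkpoints[self.state["tick"]] = copy.deepcopy(self.state)

    def run_until(self, tick):
        while self.state["tick"] < tick:
            self.step()

    def travel_forward(self, person, tick):
        # Remove the traveler, run forward as usual, then reinsert them.
        self.state["people"].discard(person)
        self.run_until(tick)
        self.state["people"].add(person)

    def travel_backward(self, person, tick):
        # Restore the nearest checkpoint at or before the target tick,
        # replay up to the target, then insert the traveler.
        base = max(t for t in self.checkpoints if t <= tick)
        self.state = copy.deepcopy(self.checkpoints[base])
        # Later checkpoints describe a future that no longer exists.
        self.checkpoints = {t: s for t, s in self.checkpoints.items() if t <= base}
        self.run_until(tick)
        self.state["people"].add(person)


sim = Simulation(checkpoint_every=10)
sim.run_until(100)
sim.travel_backward("alice", 25)
sim.run_until(100)
print(sim.ticks_computed)  # work needed to reach tick 100 after one trip back
```

In this sketch a single trip back to tick 25 means the universe has to recompute everything after that point, which is why repeated trips to the distant past would stall forward progress.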

Notes from Defrag, Day One

Posted on November 19, 2014

hertling

These are my session notes from Defrag 2014 (#Defragcon).

I normally break my notes out and add some context to them, but I’m short of time, so I’m simply posting raw notes below.

Followup

Slack: superior chat, with channels and per-channel notifications. Lots of integrations. Seems better than both Campfire and HipChat.

Chris Anderson

3D Robotics

Use drones for farmers to spot irrigation, pest problems, soil differences.
Can’t see the patterns from the ground
Visual and near-infrared.
Push button operation: One button to “do your thing”
What it enables:
- better farming.
- produce more with fewer resources.
- don’t overwater.
- don’t underwater and lose crops.
- don’t apply pesticides everywhere, just where the problem is.
- tailor to the soil.
it’s big data for farmers.
- it turns an open-loop system into a closed-loop system

George Dyson

Author of Turing’s Cathedral

From Analog to Digital

Alan Turing: 1912-1954
Turing: “being digital was more important than being electronic”
“It is possible to invent a single machine which can be used to compute any computable sequence.”
Movie: The Imitation Game — true movie about Alan Turing
Insisted on a hardware random number generator, because software algorithms for generating random numbers cannot be trusted, and neither can the authorities (whom he worked for).
John von Neumann: continued Alan Turing’s work, always gave him credit.
Where Turing was hated by his government, von Neumann got everything from his government: funding of millions of dollars.
Bamberger: made his riches in retail, decided to found an institution of learning
“The usefulness of useless information” — just hire great minds and let them work on whatever they want, and good things will come.
Thanks to the situation in Nazi Germany in the 1930s, it was “cheap” to get Jewish intellectuals.
The second professor hired: Albert Einstein.
In Britain, they took the brightest people to work on encryption. In the US, we took them to Los Alamos to build the atomic bomb.
….lots of interesting history…
By the end of Turing’s life, he had moved past determinism. He believed it was important for machines to be able to make mistakes in order to have intuition and ingenuity.
What’s next?
- Three-dimensional computation.
  - Turing gave us 1-dimension.
  - von Neumann gave us 2-d.
- Template-based addressing
  - In biology, we use template-based addressing. “I want a copy of this protein that matches this pattern.” No need to specify a particular address of a particular protein.
- Pulse-frequency computing
- Analog computing

Amber Case

Esri

Designing Calm Technology

50 billion devices will be online by 2020 — Cisco
Smart watch: how many of the notifications you get are really useful, and how many are bothering you?
Imagine the dystopian kitchen of the future: all the appliances clamoring for your attention, all needing firmware updates before you can cook, and having connectivity problems.
Calm Technology
- Mark Weiser and John Seely Brown describe calm technology as “that which informs but doesn’t demand our focus or attention.” [1]
“Technology shouldn’t require all of our attention, just some of it, and only when necessary.”
The coming age of calm technology…
If the cloud goes down, I should be able to still turn down my thermostat.
Calm technology moves things to the periphery of our attention. Placing things in the periphery allows us to pay less attention to many more things.
A tea kettle: calm technology. You set it, you forget about it, it whistles when it’s ready. No unnecessary alerts.
A little tech goes a long way…
We’re addicted to adding features: consumers want it, we like to build it. But that adds cost to manufacturing and to service and support.
Toilet occupied sign: doesn’t need to be translated, easily understood, even if color-blind.
Light-status systems: Hue Lightbulb connected to a weather report.
Light-LEDs attached to Beeminder report: green, yellow, red. Do you need to pay attention? Instead of checking app 10 times a day, nervous about missing goals.
We live in a somewhat dystopic world: not perfect, but we deal with it.
Two principles
- a technology should inform and encalm
- make use of peripheral attention
Design for people first
- machines shouldn’t act like humans
- humans shouldn’t act like machines
- Amplify the best part of each
Technology can communicate, but doesn’t need to speak.
Roomba: happy tone when done, unhappy tone when stuck.
Create ambient awareness through different senses
- haptics vs auditory alerts
- light status vs. full display
Calm Technology and Privacy
- privacy is the ability not to be surprised. “i didn’t expect that. now i don’t trust it.”
Feature phones
- limited features, text and voice calls, few apps, became widespread over time
Smartphone cameras
- not well known, not everybody had it.
- social norm created that it was okay to have a phone in your pocket. we’re not terrified that people are taking pictures: because we know what it looks like when something is taking a picture.
Google Glass Launch
- Reduced play, confusion, speculation, fear.
- Had the features come out slowly, maybe less fear.
- but the feature came out all at once.
- are you recording me? are you recording everything? what are you tracking? what are you seeing? what are you doing?
- poorly understood.
Great design allows people to accomplish their goals in the least amount of moves
Calm technology allows people to accomplish the same goals with the least amount of mental cost
A person’s primary task should not be computing, but being human.

Helen Greiner

CyPhy Works

Robots Take Flight

commercial grade aerial robots that provide actionable insights for data driven decision making
PARC tethered aerial robot
- persistent real-time imagery and other sensing
- on-going real-time and cloud based analytic service
- 500-feet with microfilament power.
- stays up indefinitely.
2014: Entertaining/Recording
2015/16: Protecting/Inspecting: Military, public safety, wildlife, cell towers, agriculture
2017/18: Evaluating/Managing: Situation awareness, operations management, asset tracking, modeling/mapping.
2019: Packaging/Delivery
“If you can order something and get it delivered within 30 minutes, that’s the last barrier to ordering online.” because i only buy something in a store if I need it right away.
Concept delivery drone: like an osprey, vertical takeoff but horizontal flight.
Tethered drone can handle 20mph winds with 30mph gusts.
- built to military competition spec.
how do you handle tangling, especially in interior conditions?
- externally: spooler is monitoring tension.
- internally: spooler is on the helicopter, so it avoids ever putting tension on the line. disposable filament.

Lorinda Brandon

@lindybrandon

Monkey selfies and other conundrums

Who owns your data?

Your data footprint
- explicit data
- implicit data
trendy to think about environmental footprint.
explicit: what you intentionally put online: a blog post, photo, or social media update.
implicit data
- derived information
- not provided intentionally
- may not be available or visible to the person who provided the data
The Biggest Lie on the internet: I’ve read the terms of use.
But even if you read the terms of use, that’s not where implicit data comes in. That’s usually in the privacy policy.
Before the connected age:
- helicopters flew over roads to figure out the traffic conditions.
Now, no helicopters.
- your phone is contributing that data.
- anonymously.
- and it benefits you with better routing.
Samsung Privacy Policy
- collective brain syndrome: I watched two football games out of the many played over the weekend. The following morning, my Samsung phone showed me the final scores of just the two games I watched.
- Very cool, but sorta creepy.
- I read the policy in detail: it took a couple of hours.
Things Samsung collects:
- device info
- time and duration of your use of the service
- search query terms you enter
- location information
- voice information: such as recording of your voice
- other information: the apps you use, the websites you visit, and how you interact with content offered through a service.
Who they share it with.
- They don’t share it for 3rd party marketing. but they do share for the purpose of their businesses
- Affiliates
- business partners
- Service providers
- other parties in connection with corporate transactions
- other parties when required by law
- other parties with your consent (this is the only one you opt-in to)
Smart Meter – Data and privacy concerns
- power company claims they own it, and they can share/sell it to whom they like.
- What they collect:
  - individual appliances used in the household
  - power usage data is easily available
  - data transmitted inside and outside the grid
- In Ohio, law enforcement using it to locate grow houses.
Your device != your data
Monkey selfies
- Case where photographer was setting up for photo shoot.
- Monkey stole camera, took selfies.
- Photographer got camera back.
- Who owns the copyright on the photos?
- Not the photographer, who didn’t take them.
- Not the monkey, because the monkey didn’t have intent.
- So it’s in the public domain.
Options
- DoNotTrack.us – sends signal that indicates opt-out preference.
- Disconnect.me – movement to get vendors to identify what data and data sharing is happening.
- Opennotice.org – minimal viable consent receipt, which creates a repository of your consent.
- ClearButton.net – MIT project to express desire to know who has your data, work with manufacturers.
Innovate Responsibly
- If you are a creator, be sensitive to people’s needs.
- Even if you are doing altruistic stuff, you’ve still got to be transparent and responsible.

How to Distribute and Make Money from your API

Orlando Kalossakas, Mashape

@orliesaurus

API management
API marketplace: find APIs to use
Devices connect to the internet
- 2013: 8.7B
- 2015: 15B
- 2020: 50B
App stores
- 1.4M: Google play
- 1.2M: Apple
- 300k: Microsoft
- 140K: Amazon
Jeff Bezos:
- “turn everything into APIs, or I fire you.”
- A couple of years later
Mashape.com: hosts over 10,000 private and public APIs
Google / Twitter / Facebook: Billions of API calls per day
Mashape pricing
- 92% free
- 5.6% freemium
- 1.4% paid
Consumers of Mashape APIs more than doubling every year.
API forms:
- As a product: the customer uses the API directly to add capabilities
- As an extension of a product: the API is used in conjunction with the product to add value.
- As promotion: The API is used as a mechanism to promote the product.
paid or freemium flavors
- pay as you go, with or without tiers
- monthly recurring
- unit price
- rev share
- transaction fee
depending on business model, you might end up paying developers to use your API
- if you are expedia or amazon, you’re paying the developers to integrate with you.
Things to consider…
- is your audience right?
- Do your competitors have APIs?
- Could they copy your model easily?
- How does the API fit into your roadmap?
Preparing…
- discovery
- security
- monitoring / qa
- testing
- support
- documentation*
- monetization*
- *most important
How will you publish your API?
- onboarding and documentation are the face of your API
- Mashape: if you have interactive documentation, consumers are more likely to use it.
Achieving great developer experience
- Track endpoint analytics
- track the documentation’s web analytics
- get involved in physical hackathons
- keep api documentation up to date
- don’t break things.

Blend Web IDEs, Open Source and PaaS to Create and Deploy APIs

Jerome Louvel, Restlet

New API landscape:
- web of data (semantic)
- cloud computing & hybrid architectures
- cross-channel user experiences
- mobile and contextual access to services
- Multiplicity of HCI modes (human computer interaction)
- always-on and instantaneous service
Impacts on API Dev
- New types of APIs
  - Internal and external APIs
  - composite and micro APIs
  - experience and open APIs
- Number of APIs increases
  - channels growth
  - history of versions
  - micro services pattern
  - quality of service
- industrialization needed
  - new development workflows
API-driven approach benefits
- a pivot API descriptor
- server skeletons & mock generations
- up-to-date client SDKs and docs
- rapid API crafting and implementation
Code-first or API-first approaches
- can be combined using code introspectors to extract, and code generators to resync.
Crafting an API
- swagger, apiary, raml, miredot, restlet studio
- new generation of tools:
  - IDE-type
  - web-based
- example: swagger editor is GUI app
- RESTlet visual studio

Connecting All Things (drone, Sphero, Raspberry Pi, Philips Hue) to Build a Rube Goldberg Machine

Kirsten Hunter

https://twitter.com/synedra

API evangelist at Akamai
cylon-sphero
node.js
cylon library makes it easy to control robotics

Audit all the things

Posted on May 6, 2014

hertling

Auditing all the things: The future of smarter monitoring and detection

Jen Andre

Founder @threatstack

@fun_cuddles

Started with question on twitter:
- Can you produce a list of all processes running on your network?
- But then expanded… wanted to know everything
Why? Is there a reason to be this paranoid?
- prevention fails.
should you care?
- if you’re a startup about pets and you get hacked, you just change all passwords
- but if you’re a pharmaceutical company, then you really do care.
“We found no evidence that any customer data was accessed, changed or lost”
- Did you look for evidence?
- Do you really know what happened?
- If you log everything (the right things), then you don’t have to do forensic evidence.
“We’re in the cloud!”
Continuous security monitoring
- auditing + analytics + automation
Things to monitor:
- Systems: authentications, process activity, network activity, kernel modules, file system
- Apps: authentications, db requests, http requests
- services: AWS api calls, SaaS api calls
In order to do:
- Intrusion detection
- “active defense”
- rapid incident response
“Use the host, Luke”
apt-get install auditd
- pros:
  - super powerful
  - built into your kernel
  - relatively low overhead
- you can audit logins, system calls.
auditd
- the workings:
  - userland audit daemon and tools <- netlink socket <- kernel thread queue <- audit messages <- kernel threads doing things
  - /var/log/audit
- not so nice:
  - obtuse logging
  - enable rate limiting or it could ‘crash’ your box
    - auditctl -b 1000 -r 1500 # 1000 buffers, 1500 events/sec max
alternative: connect directly to the netlink socket and write your own audit listener
- wrote a JSON format exporter
- luajit! for filtering, transformation & alerting
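To make the JSON-exporter idea concrete, here is a minimal Python sketch that turns one auditd-style log line into a JSON object. It is not the exporter described in the talk (that one used LuaJIT and read from the netlink socket); it only assumes the usual `key=value` layout of lines in /var/log/audit, and the sample line and the function name `parse_audit_line` are illustrative.

```python
import json
import re

# key=value pairs; a value is either double-quoted or runs to the next space.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')
# The header carries the event timestamp and serial number.
HEADER = re.compile(r'msg=audit\((\d+\.\d+):(\d+)\):')

# Illustrative sample line in the auditd log layout.
SAMPLE = ('type=SYSCALL msg=audit(1364481363.243:24287): arch=c000003e '
          'syscall=2 success=no exit=-13 pid=3538 auid=1000 uid=1000 '
          'comm="cat" exe="/bin/cat" key="sshd_config"')


def parse_audit_line(line):
    """Parse one audit log line into a flat dict of strings (plus header fields)."""
    record = {}
    header = HEADER.search(line)
    if header:
        record["timestamp"] = float(header.group(1))
        record["event_id"] = int(header.group(2))
        # Drop the header so it is not parsed again as a key=value pair.
        line = line[:header.start()] + line[header.end():]
    for key, value in PAIR.findall(line):
        record[key] = value.strip('"')
    return record


print(json.dumps(parse_audit_line(SAMPLE), indent=2))
```

Once events are dicts, filtering and alerting become ordinary code instead of log grepping.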
authentications
- who is logging in and from where?
- Can use wtmp
  - can turn into json
- auditd also returns login information
- pam_loginuid will add a session id to all executed commands so you can link real user to sudo’d commands
Detecting attacks
- most often a long time goes by before people realize they’ve been hacked, sometimes years.
- often they get a phone call from the government: hey, you’ve got servers sending data to china.
- the hardest attack to detect is when the attacker is using valid credentials to access it.
- things to think about:
  - is that user running commands he shouldn’t be?
    - ex: why is anyone except chef user running gcc on a production system?
  - why is a server that only accepts inbound connections suddenly making outbound ones?
    - or why connecting to machines other than expected ones?
  - are accounts logging in from unexpected locations? (or at unexpected times)
  - are files being copied to /lib /bin, etc.?
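A few of these questions can be written down as simple rules over parsed events. The sketch below is only an illustration: the event fields (`kind`, `user`, `exe`, `direction`, `host`, `path`) and the function name `suspicious` are hypothetical, not part of auditd or any product mentioned in the talk.

```python
# Directories where new files are unexpected on a production host.
# Only /lib and /bin come from the notes; extend the tuple as needed.
SYSTEM_DIRS = ("/lib/", "/bin/")


def suspicious(event, build_users=("chef",), inbound_only_hosts=()):
    """Return the reasons this event deserves a look (empty list if none)."""
    reasons = []
    kind = event.get("kind")
    if (kind == "exec"
            and event.get("exe", "").endswith("/gcc")
            and event.get("user") not in build_users):
        reasons.append("compiler run by unexpected user")
    if (kind == "connect"
            and event.get("direction") == "outbound"
            and event.get("host") in inbound_only_hosts):
        reasons.append("outbound connection from inbound-only server")
    if kind == "file_write" and event.get("path", "").startswith(SYSTEM_DIRS):
        reasons.append("file written to system directory")
    return reasons


events = [
    {"kind": "exec", "user": "chef", "exe": "/usr/bin/gcc"},
    {"kind": "exec", "user": "alice", "exe": "/usr/bin/gcc"},
    {"kind": "connect", "direction": "outbound", "host": "web1"},
    {"kind": "file_write", "path": "/bin/ls"},
]
for event in events:
    print(event, suspicious(event, inbound_only_hosts=("web1",)))
```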
Now go and audit!

Dan Slimmon on Monitoring

Posted on May 5, 2014

hertling

Car Alarms & Smoke Alarms & Monitoring

Dan Slimmon

@danslimmon

Senior Platform Engineer at Exosite

I work in Ops, so I wear a lot of hats
One of those is data scientist
- Learn data analysis and visualization
- You’ll be right more often and people will believe you’re right even more often than you are
A word problem
- Plagiarism: 90% chance of positive
- No Plagiarism: 20% chance of positive
- Kids plagiarize 30% of the time
- Given a random paper, what’s the probability that you’ll get a negative result?
  - P(positive) = 0.3*0.9 + 0.7*0.2 = 0.27 + 0.14 = 0.41
  - so 59% likely to get a negative result
- If you get a positive result, how likely is it to really be plagiarized?
  - 0.27 / 0.41 = 65.9% likely
  - this is terrible.
  - Teachers will stop trusting the test.
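The word problem can be checked in a few lines of Python. The variable names are mine; the three input numbers are the ones given in the problem.

```python
p_plagiarized = 0.30      # kids plagiarize 30% of the time
p_pos_given_plag = 0.90   # plagiarized paper: 90% chance of a positive
p_pos_given_clean = 0.20  # clean paper: 20% chance of a (false) positive

# Total probability of a positive result on a random paper.
p_positive = (p_plagiarized * p_pos_given_plag
              + (1 - p_plagiarized) * p_pos_given_clean)
p_negative = 1 - p_positive

# Bayes' rule: how much of the positive mass is real plagiarism?
p_plag_given_positive = p_plagiarized * p_pos_given_plag / p_positive

print(f"P(positive)               = {p_positive:.2f}")
print(f"P(negative)               = {p_negative:.2f}")
print(f"P(plagiarized | positive) = {p_plag_given_positive:.3f}")
```

Roughly one positive in three is a false accusation, even though the test catches 90% of real plagiarism.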
Sensitivity & Specificity
- Sensitivity: % of actual positives that are identified as such
- Specificity: % of actual negatives that are identified as such
- Prevalence: percentage of people with problem
- https://williamhertling.com/wp-content/uploads/2014/05/LkxcxLt.png
- Positive Predictive Value: the probability that something is actually wrong, given a positive result.

Car Alarms
- Go off all the time for reasons that aren’t someone stealing your car.
- Most people ignore them.
Smoke Alarms
- You get your ass outside and wait for the alarm to go off and the fire trucks.
We need monitoring tools that are both highly sensitive and highly specific.
Undetected outages are embarrassing, so we tend to focus on sensitivity.
- That’s good.
- But be careful with thresholds.
- Too high, and you miss real problems. Too low, and too many false alarms.
- There’s only one line with thresholds, so only one knob to adjust.
Get more degrees of freedom.
- Hysteresis is a great way to add degrees of freedom.
  - State machines
  - Time-series analysis
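A minimal sketch of hysteresis as a two-state machine, in Python. The class name and the threshold values are made up for illustration; the point is that separate raise and clear thresholds give two knobs instead of one.

```python
class HysteresisAlert:
    """Fire above `raise_above`; stay firing until the value drops below `clear_below`."""

    def __init__(self, raise_above, clear_below):
        if clear_below >= raise_above:
            raise ValueError("clear_below must be lower than raise_above")
        self.raise_above = raise_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value):
        if not self.firing and value > self.raise_above:
            self.firing = True
        elif self.firing and value < self.clear_below:
            self.firing = False
        return self.firing


samples = [85, 92, 88, 91, 75, 69, 80]

alert = HysteresisAlert(raise_above=90, clear_below=70)
with_hysteresis = [alert.update(v) for v in samples]
single_threshold = [v > 90 for v in samples]

print(with_hysteresis)   # one continuous alert
print(single_threshold)  # flaps on and off around the threshold
```

With a single threshold at 90 the same samples raise and clear the alert twice; with hysteresis there is one alert that clears only once the value is clearly back to normal.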
As your uptime increases, you must get more specific.
- Going back to the chart…our positive predictive value goes down when there are fewer actual problems.
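The effect of prevalence is easy to see numerically. This sketch reuses the plagiarism test's numbers as a stand-in for a monitoring check (sensitivity 0.90, specificity 0.80); the prevalence values swept are arbitrary examples.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(real problem | alert), from Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Same check, progressively healthier system (fewer real problems).
for prevalence in (0.30, 0.10, 0.01, 0.001):
    ppv = positive_predictive_value(0.90, 0.80, prevalence)
    print(f"prevalence {prevalence:6.1%} -> PPV {ppv:6.1%}")
```

The check itself never changes, but as real problems become rarer, a growing share of its alerts are false alarms. Only higher specificity brings the predictive value back up.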
A lot of nagios configs combine detecting problem with identifying what the problem is.
- You need to separate those concerns.
- Baron Schwartz says: Your alerting should tell you whether work is getting done.
- Knowing that nginx is down doesn’t tell you whether your site is up. Check to see if your site is up (detecting the problem), which is separate from the source of the problem (nginx isn’t running).
- Alert on problems, not on diagnosis.

Links:

Bischeck: https://www.bischeck.org/

Katherine Daniels on Monitoring

Posted on May 5, 2014

hertling

Katherine Daniels

@beerops

kd@gamechanger.io

The site is going down.
But everything seemed to be fine.
- checked web servers, databases, mongo, more.
What was wrong? The monitoring tool wasn’t telling us.
One idea: monitor more. monitor everything.
- But if you’re looking for a needle in a haystack, the solution is not to add more hay.
- Monitoring everything just adds more stuff to weed through. Including thousands of things that might be not good (e.g. disk quota too high), but aren’t actually what’s causing the problem.
Monitor some of the things. The right things. But which things? If we knew, we’d already be monitoring.
Talk to Amazon…
- “try switching the load balancer”
- “try switching the web server”
We had written a service called healthd that was supposed to monitor api1, and api2.
But we didn’t have logging for healthd, so we didn’t know what was wrong.
We needed more detail.
So we added logging, so we knew which API had a problem.
We also had some people who tried the monitor-everything approach.
They uncovered a user who seemed to be scripting the site.
They added metrics for where the time was being spent with the API handlers
The site would go down for a minute each time things would blip.
We set the timeouts to be lower.
We found some database queries to be optimized.
We found some old APIs that we didn’t need and we removed them.
The end result was that things got better. The servers were mostly happy.
But the real question is: How did we get to a point where our monitoring didn’t tell us what we needed? We thought we were believers in monitoring. And yet we got stuck.
Black Boxes (of mysterious mysteries)
- Using services in the cloud gives you less visibility
Why did we have two different API services…cohabiting…and not being well monitored?
- No one had the goal of creating a bad solution.
- But we’re stuck. So how do we fix it?
- We stuck nginx in front and let it route between them.
What things should you be thinking about?
- Services:
  - Are the services that should be running actually running?
  - Use Sensu or Nagios
- Responsiveness:
  - Is the service responding?
- System metrics:
  - CPU utilization, disk space, etc.
  - What’s worth an alert depends: on a web server it shouldn’t use all the memory, on a mongo db it should, and if it isn’t, that’s a problem.
- Application metrics?
  - Are we monitoring performance, errors?
  - Do we have the thresholds set right?
  - We don’t want to look at a sea of red: “Oh, just ignore that. It’s supposed to be red.”
Work through what happens?
- Had 20 servers running 50 queues each.
- Each one has its own sensu monitor. HipChat shows an alert for each one… a thousand outages.
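One way to avoid the thousand-alert flood is to group alerts before they reach chat. The Python sketch below is a generic illustration with made-up field names (`check`, `server`, `queue`); it is not Sensu's or HipChat's API.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse per-queue alerts into one summary line per check name."""
    sources_by_check = defaultdict(set)
    for alert in alerts:
        sources_by_check[alert["check"]].add((alert["server"], alert["queue"]))
    summaries = []
    for check, sources in sorted(sources_by_check.items()):
        servers = {server for server, _queue in sources}
        summaries.append(
            f"{check}: {len(sources)} queues affected on {len(servers)} servers"
        )
    return summaries


# 20 servers x 50 queues, every monitor firing at once.
alerts = [
    {"check": "queue_backlog", "server": f"server{s}", "queue": f"queue{q}"}
    for s in range(20)
    for q in range(50)
]

print(len(alerts))
for line in group_alerts(alerts):
    print(line)
```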
You must load test your monitoring system: Will it behave correctly under outages and other problems?
“Why didn’t you tell me my service was down?” “Service, what service? You didn’t tell us you were running a service.”

Sensu: https://sensuapp.org/

Adrian Cockcroft on Monitoring Cloud Services

Posted on May 5, 2014

hertling

Adrian Cockcroft

@adrianco

Battery Ventures

Please, no More Minutes, Milliseconds, Monoliths… Or Monitoring Tools!

#Monitorama May 2014

Why at a Monitoring talk when I’m known as the Cloud guy?
20 Years of free and open source tools for monitoring
“Virtual Adrian” rules
- disk rule for all disks at once: look for slow and unbalanced usage
- network rule: slow and unbalanced usage
- …
No more monitoring tools
- We have too many already
- We need more analysis tools
Rule #1: Spend more time working on code that analyzes the meaning of metrics than the code that collects, moves, stores, and displays metrics.
What’s wrong with minutes?
- Takes too long to see a problem
- Something broke at 2m20s.
- 40s of failure didn’t trigger (3m)
- 1st high metrics seen at agent on instance
- 1st high metric makes it to central server (3m30s)
- 1 data collection isn’t enough, so it takes 3 data points (5m30s)
- 5 minutes later, we take action that something is wrong.
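The timeline above can be reproduced with a small model. This is a simplified Python sketch of my own: it assumes samples are taken on multiples of the collection interval, that every sample at or after the failure looks bad, and that each sample takes a fixed time to reach the central server (30 s in the example above).

```python
import math


def seconds_until_alert(failure_at, interval, transport_delay, points_needed):
    """Clock time (in seconds) at which the central server can first alert."""
    # First collection tick at or after the failure.
    first_bad_sample = math.ceil(failure_at / interval) * interval
    # The alert needs several consecutive bad samples.
    last_needed_sample = first_bad_sample + (points_needed - 1) * interval
    return last_needed_sample + transport_delay


failure_at = 2 * 60 + 20  # something broke at 2m20s

per_minute = seconds_until_alert(failure_at, interval=60, transport_delay=30, points_needed=3)
per_second = seconds_until_alert(failure_at, interval=1, transport_delay=30, points_needed=3)

print(per_minute, per_minute - failure_at)  # clock time, delay since failure
print(per_second, per_second - failure_at)
```

With one-minute collection the model lands on 5m30s, matching the notes. With one-second collection and the same transport delay, the delay is dominated by transport rather than by waiting for data points.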
Should be monitoring by the second
SaaS based products show what can be done
- monitoring by the second
Netflix: Streaming metrics directly from front end services to a web browser
Rule #2: Metric to display latency needs to be less than human attention span (~10s)
What’s wrong with milliseconds?
- Some JVM tools measure response times in ms
  - Network round trip within a datacenter is less than 1ms
  - SSD access latency is usually less than 1 ms
  - Cassandra response times can be less than 1ms
- Rounding errors make 1ms insufficient to accurately measure and detect problems.
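A small Python illustration of the problem, using made-up latencies in microseconds: when every response takes less than a millisecond, a timer with 1 ms resolution reports zero both before and after a 2x slowdown.

```python
def to_whole_ms(latency_us):
    """Truncate a microsecond latency to whole milliseconds, as a 1 ms timer would."""
    return latency_us // 1000


before_us = [400, 450, 500, 550]  # hypothetical sub-millisecond responses
after_us = [800, 900, 950, 990]   # the same calls after roughly a 2x slowdown

before_ms = [to_whole_ms(x) for x in before_us]
after_ms = [to_whole_ms(x) for x in after_us]

print(before_ms, after_ms)                    # what a millisecond timer reports
print(sum(before_us) / 4, sum(after_us) / 4)  # what actually happened
```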
Rule #3: Validate that your measurement system has enough accuracy and precision
Monolithic Monitoring Systems
- Simple to build and install, but problematic
- What if it goes down? Gets deployed?
- Should be a pool of analysis/display aggregators, a pool of distribution collection systems, all monitoring a large number of applications.
- Scalability:
  - problems scaling data collection, analysis, and reporting throughput
  - limitations on the number of metrics that can be monitored
In-Band, Out-of-band, or both?
- In-band: can leave you blind during outage
- SaaS: is out of band, but can also sometimes go down.
- So the right answer is to have both: SaaS and internal. No one outage can take everything out.
Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored.
Issues with Continuous Delivery and Microservices
- High rate of change
  - Code pushes can cause floods of new instances and metrics
  - Short baseline for alert threshold analysis: everything looks unusual
- Ephemeral configurations
  - short lifetimes make it hard to aggregate historical views
  - Hand tweaked monitoring tools take too much work to keep running
- Microservices with complex calling patterns
  - end-to-end request flow measurements are very important
  - Request flow visualizations get very complicated
  - How many? Some companies go from zero to 450 in a year.
- “Death Star” Architecture Diagrams
  - You have to spend time thinking about visualizations
  - You need hierarchy: ways to see micro services but also groups of services
Autoscaled ephemeral instances at Netflix (the old way)
- Largest services use autoscaled red/black code pushes
- average lifetime of an instance is 36 hours
- Uses trailing load indicators
Scryer: Predictive Auto-scaling at Netflix
- More morning load Sat/Sun high traffic
- lower load on wednesday
- 24 hours predicted traffic vs. actual
- Uses forward prediction to scale based on expected load.
Monitoring Tools for Developers
- Most monitoring tools are built to be used by operations people
  - Focus on individual systems rather than applications
  - Focus on utilization rather than throughput and response time.
  - Hard to integrate and extend
- Developer oriented monitoring tools
  - Application Performance Measurement (APM) and Analysis
  - Business transactions, response time, JVM internal metrics
  - Logging business metrics directly
  - APIs for integration, data extraction, deep linking and embedding
    - deep linking: should be able to cut and paste link to show anyone exactly the data I’m seeing
    - embedding: to be able to put in wiki page or elsewhere.
Dynamic and Ephemeral Challenges
- Datacenter Assets
  - Arrive infrequently, disappear infrequently
  - Stick around for three years or so before they get retired
  - Have unique IP and MAC addresses
- Cloud Assets
  - Arrive in bursts. A Netflix code push creates over a hundred per minute
  - Stick around for a few hours before they get retired
  - Often reuse the IP and MAC address that was just vacated.
  - Use Netflix OSS Edda to record a full history of your configuration
Distributed Cloud Application Challenges
- Cloud provider data stores don’t have the usual monitoring hooks: no way to install an agent on AWS MySQL.
- Dependency on web services as well as code.
- Cloud applications span zones and regions: monitoring tools also need to span and aggregate zones and regions.
- Monit
Links
- https://techblog.netflix.com: Read about netflix tools and solutions to these problems
- Adrian’s blog: https://perfcap.blogspots.com
- Slideshare: https://slideshare.com/adriancockcroft
Q&A:
- Post-commit, pre-deploy statistical tests. What do you test?
  - Error rate. Performance. Latency.
  - Using JMeter to drive.

Printable Obama for America AWS Architecture

Posted on July 17, 2013

hertling

The Obama for America AWS infrastructure diagram is a bundle of awesome. This mature cloud architecture diagram clearly shows the incredible web application architecture the Obama Campaign was able to create in a very short time.

A tiny portion of the AWS Obama for America 2012 infrastructure diagram

It’s been available online at awsofa.info for a while, which is a nice, browsable Google Maps powered way to see the intricacies of the diagram.

However, if you’ve tried to print it, it’s next to impossible. Using the browser File->Print just captures what is onscreen, so you’re forced to choose between viewing a tiny portion of the diagram in high resolution or the entire diagram in a low, unusable resolution.

I really wanted to print the whole diagram on a plotter, and after playing with the HTML for an hour, I got a good 24″x200″ print hanging on the wall at my office.

Now hanging on the wall at my work.

I’m sharing the high-resolution PDF that was an intermediate step in my workflow. It’s 27″x240″, but if you select scale to paper size on a plotter, this should print well on any large size. Note that it’s considerably wider than it is tall, so it’s helpful to set a custom size.

Download the PDF here. (60MB, PDF)

I hope this helps anyone trying to print it.

Update 2013-07-25: I just learned via JP that:

Miles Ward is the author of the diagram. Thank you Miles for creating such an awesome, useful diagram.
There’s a hidden easter egg in the diagram. Look for a Take 5 candy bar.
You can download a PNG of the diagram courtesy of JP.

Scaling A Web App 1,000x in 3 Days

Posted on February 10, 2012

hertling

When I’m not writing science fiction novels, I work on web apps for a largish company*. This week was pretty darn exciting: we learned on Monday afternoon that we needed to scale up to a peak volume of 10,000 simultaneous visitors by Thursday morning at 5:30am.

This post will be about what we did, what we learned, and what worked.

A little background: our stack is nginx, Ruby on Rails, and MySQL. The web app is a custom travel guide. The user comes into our app and either imports a TripIt itinerary, or selects a destination and enters their travel details. They can express preferences in dozens of different categories (think about rating cuisine types of restaurants, categories of attractions, etc.). They can preview their travel guide live, in the browser, and order a high quality paperback version.

Our site was in beta, and we’ve been getting a few hundred visitors a day. At 2pm on Monday we learned that we would be featured on a very popular TV show on Thursday morning (9:30am EST), and based on their previous experiences, we could expect anywhere from 20,000 to 150,000 unique visitors that day, with a peak load of 10,000 simultaneous visitors. This would be a 100x to 1,000x increase in the number of unique visitors that we would normally serve. We estimated that our peak volume would be 10,000 visitors at once, which would create a load of ~40,000 page requests per minute (rpm).

What we had going for us:

We were already running on Amazon Web Services (AWS), using EC2 behind a load balancer.
We already had auto-scale rules in place to detect when CPU load exceeded a certain percentage on our app servers, and to double the number of EC2 instances we had.
We had a reliable process for deploying (although this became a constraint later).
A month before we had spent a week optimizing performance of our app, and we had already cut page load time by 30%, largely by reducing database calls.

On Monday we got JMeter installed on a few computers. With JMeter you can write a script that mimics a typical customer visit. In our case, we wrote a script that loaded the home page, selected a destination, went through the preferences, and went to check out. Although we didn’t post to any pages (which we would have liked to, if we had more time), a number of database artifacts did get created as the simulated user went through the user interface. (This is important later.)

After some experimentation, we decided that we couldn’t run more than 125 threads on a single JMeter instance. So if we wanted to simulate truly high loads, we needed to run JMeter on several computers. We also needed to make sure not to saturate any part of our local network. At a minimum, we made sure we didn’t have all the machines going through the same router. We also VPNed into an entirely different site, and ran another client from a geographically distant location.
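For a rough sense of how many load generators that implies, here is a back-of-the-envelope Python sketch. The 40,000 rpm target and the 125-thread limit come from the text above; the request rate each JMeter thread sustains is not something we recorded, so `rpm_per_thread` is a placeholder you would have to measure yourself.

```python
import math


def jmeter_instances_needed(target_rpm, threads_per_instance, rpm_per_thread):
    """Number of JMeter machines needed to generate `target_rpm` page requests."""
    per_instance_rpm = threads_per_instance * rpm_per_thread
    return math.ceil(target_rpm / per_instance_rpm)


peak_visitors = 10_000
target_rpm = 40_000
pages_per_visitor_per_min = target_rpm / peak_visitors  # implied by the estimate

# If each thread behaves like one real visitor (4 pages a minute):
visitor_paced = jmeter_instances_needed(target_rpm, 125, rpm_per_thread=4)
# If each thread fires one request per second (hypothetical, no think time):
fast_paced = jmeter_instances_needed(target_rpm, 125, rpm_per_thread=60)

print(pages_per_visitor_per_min, visitor_paced, fast_paced)
```

The spread between the two cases shows why the per-thread request rate matters more than the raw thread count when sizing a load test.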

We quickly learned that:

we needed to change our initial auto-scale group from two servers to four.
we needed to reduce our CPU load threshold from 80% to 50%.
we needed to reduce the time we sustained that load from 5 minutes to 1 minute.
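
To see why those three settings matter, here is a minimal Ruby model of a doubling rule like ours. It is only an illustration and does not call any AWS API: the ScalingRule struct, the instances_after function, and the CPU readings are all hypothetical, and the cap of 16 simply mirrors the number of app servers we ended up scaling to. It also ignores the fact that CPU load drops as instances are added.

```ruby
# Hypothetical model of a "double the fleet" auto-scale rule.
# This is an illustration only and does not call any AWS API.
ScalingRule = Struct.new(:initial_instances, :cpu_threshold, :sustain_minutes)

# cpu_by_minute: one average CPU percentage per minute (made-up sample data).
# Returns the instance count after replaying the readings against the rule.
def instances_after(rule, cpu_by_minute, max_instances = 16)
  instances = rule.initial_instances
  minutes_over = 0
  cpu_by_minute.each do |cpu|
    minutes_over = cpu > rule.cpu_threshold ? minutes_over + 1 : 0
    if minutes_over >= rule.sustain_minutes
      instances = [instances * 2, max_instances].min
      minutes_over = 0 # start a fresh window after each scale-up
    end
  end
  instances
end

old_rule = ScalingRule.new(2, 80, 5) # two servers, 80% CPU, sustained 5 minutes
new_rule = ScalingRule.new(4, 50, 1) # four servers, 50% CPU, sustained 1 minute
spike = [60, 85, 90, 95, 90, 85]     # six minutes of a sudden traffic spike

puts "old rule: #{instances_after(old_rule, spike)} instances"
puts "new rule: #{instances_after(new_rule, spike)} instances"
```

With these made-up readings, the old rule scales up only once in six minutes, while the new rule reaches the cap within two. For a spike that arrives in about a minute, that difference is the whole game.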

After making these changes, and letting the app servers scale to 16 servers, we saw that we were now DB bound. The DB server would quickly go to 100% load. We increased to the maximum AWS DB size (extra extra large, with extra CPUs and memory) but we still swamped the DB server. We watched the JMeter tests run, sending the average response times into the 30 second territory. We knew we were sunk if we didn’t solve this problem. It was mid-day Tuesday.

I said “we knew we were sunk”, but to be honest, this actually took some convincing. Some members of the team felt like any changes so close to the big day would be too risky to make. We could introduce a bug. But a 100% correct application experience that will definitely get buried under load will result in delivering no pages to anyone. That’s 100% risk. After a few hours, we had everyone convinced that we had to solve the scalability issues.

We briefly explored several large-scale changes we could have made, but ultimately judged them too risky:

duplicating our entire stack, and using DynDNS or Amazon’s Route 53 to route between them. Because we didn’t have user accounts or other shared data, we could have done this in theory. If we had, we would later have needed to synchronize a bunch of databases through some scripts to collect all the customer data together. If we had a week or an Ops expert, we probably would have done this.
offloading part of the DB load by creating a read-only replica, and shunting the database calls for certain mostly-static data (e.g. the list of destinations and attractions at those destinations) to the read-only database. We didn’t do this mostly because we didn’t know how we’d point some database queries at one DB server, and others at a different DB server.
using memcache or other mechanisms to hit only memory instead of the database: this too would have required extensive code changes. Given the short timeframe, it felt too risky.

The action we did decide to take:

We had a few easily identified DB queries that ran with every page request, that loaded the same data each time. We rewrote the code to cache these in memory in global variables that stayed resident between calls. So they would get loaded once when the servers rebooted, and never again.
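
A minimal sketch of that kind of in-process cache follows. The DestinationCache module and the load_destinations_from_db method are hypothetical stand-ins for our real code and queries, and the data is a placeholder.

```ruby
# Hypothetical example: cache a mostly-static query result in process memory.
# load_destinations_from_db stands in for a real database query.
module DestinationCache
  @destinations = nil
  @db_calls = 0

  class << self
    attr_reader :db_calls

    # Returns the cached list, hitting the "database" only on first use.
    def destinations
      @destinations ||= load_destinations_from_db
    end

    private

    def load_destinations_from_db
      @db_calls += 1
      ['Paris', 'Rome', 'Tokyo'] # placeholder data
    end
  end
end

3.times { DestinationCache.destinations }
puts "database calls: #{DestinationCache.db_calls}"
```

The trade-off is that each server process holds its own copy, and the data stays stale until that process restarts. For data that rarely changes, that was acceptable to us.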

Around this time, which was Wednesday morning, we started to get smarter about how we used JMeter. Instead of slamming the servers with many hundreds of simultaneous threads from many clients and watching request times spiral out of control, we spun up threads slowly, let the servers stabilize, counted our total throughput in requests per second, checked CPU load at that level, then increased the threads and kept rechecking. We grabbed a flipchart, and started recording data where everyone could see it. Our goal was to determine the maximum RPM we could sustain without increasing response times.
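
The flipchart bookkeeping amounts to something like the following Ruby sketch. The readings are invented for illustration (they are not our measurements), and the 20% tolerance on response time is an arbitrary assumption.

```ruby
# Each reading: JMeter threads, measured requests/second, average response time (ms).
# These numbers are made up for illustration.
Reading = Struct.new(:threads, :rps, :avg_ms)

readings = [
  Reading.new(25, 5, 480),
  Reading.new(50, 10, 500),
  Reading.new(75, 15, 520),
  Reading.new(100, 16, 1900),
  Reading.new(125, 14, 6200)
]

# Highest throughput (in requests/minute) reached while the average response
# time stays within `tolerance` of the lightest-load baseline.
def max_sustainable_rpm(readings, tolerance = 1.2)
  baseline_ms = readings.first.avg_ms
  healthy = readings.take_while { |r| r.avg_ms <= baseline_ms * tolerance }
  healthy.map { |r| r.rps * 60 }.max
end

puts "max sustainable load: #{max_sustainable_rpm(readings)} rpm"
```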

As the day went on, we incrementally added code changes to cache some database requests, as well as other late-breaking requests we needed before we were live on Thursday. This is where our deployment process started to break down.

We have the following environments: local machines, dev, demo, staging, production. Normally, we have a test script that runs (takes about 40 minutes) before pushing to dev. The deployment script takes 15-30 minutes (80% of which is spent waiting for git clone. Why???). And we have to repeat that process to move closer and closer to production.

The only environment that is a mirror of production is staging. If we had a code optimization we needed to performance test, it would take hours to deploy the change to staging. Alternatively, we could (and did) log into the staging servers and apply the changes manually (ssh into the server, vi to edit files, restart Mongrel), but we would need to repeat that across four servers, bring down the auto-scale servers, rebuild the auto-scale AMI, then bring up the auto-scale servers. Even the short-cut process took 20 to 30 minutes.

We did this several times and, carefully keeping track of our data, noticed that we were going in the wrong direction. We were able to sustain only 15 requests per second (900 rpm), and the number kept getting lower as the day went on. Now the astute database administrators who are reading this will chuckle, knowing exactly what is causing this. But we don’t have a dedicated DBA on the team, and my DB skills are somewhat rusty.

We already had New Relic installed. I have always loved New Relic since I used it several years ago when I developed a Facebook app. It’s a performance monitoring solution for Rails and other web app environments. I had really only used it as a development tool, as it lets you see where your app is spending its time.

It’s also intended for production monitoring, and what I didn’t know up until that point is that if we paid for the professional plan, it would identify slow-running page requests and database queries, and provide quite a lot of detail about exactly what had executed.

So, desperate to get more data, we upgraded to the New Relic Pro plan. Instantly (ok, within 60 seconds), we had a wealth of data about exactly what DB requests were slow and where our app was spending its time.

One DB query was taking 95% of all of our DB time. As soon as I saw that, and I remembered that our performance was getting worse during the day, it all snapped into place, and I realized that we had to have an unindexed field we were querying on. Because the JMeter tests were creating entries in the database tables, and because they had been running for hours, we had hundreds of thousands of rows in the database, so we were doing entire table scans.

I checked the DB schema.rb file, and saw that someone had incorrectly created a multi-column index. (In general, I would say to avoid multi-column indexes unless you know specifically that you need one.) If a multi-column index is defined on columns A, B, and C, you can use it to query against A, A and B, or A and B and C. But the columns are order dependent, and you cannot leave out column A.
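
That leftmost-prefix rule can be sketched in a few lines of Ruby. The usable_index_prefix helper and the column names are hypothetical, and real query planners are more nuanced than this, so treat it as a simplified model of the rule rather than a description of any particular database.

```ruby
# Simplified model: a multi-column index can only be used for the leading
# run of its columns that the query actually filters on.
# usable_index_prefix and the column names are hypothetical.
def usable_index_prefix(index_columns, query_columns)
  index_columns.take_while { |col| query_columns.include?(col) }
end

index = [:a, :b, :c]
[[:a], [:a, :b], [:a, :b, :c], [:b, :c], [:c]].each do |query|
  prefix = usable_index_prefix(index, query)
  verdict = prefix.empty? ? 'full table scan' : "index on #{prefix.join(', ')}"
  puts "query on #{query.join(', ')}: #{verdict}"
end
```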

That’s exactly what had happened in our case. So I quickly created a migration to create the index we needed, we ran the migration on our server, let the server have five minutes to build the index, and then we reran the tests. We scaled to 120 requests per second (7,200 rpm), and the DB load was only 16%.

This was Wednesday afternoon, about 3pm. At this point we figured the DB server could handle the load from about 36,000 page requests per minute. We were bumping up against the app server CPU load again.
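
One way to arrive at a figure like that is a straight-line extrapolation from the test result. The sketch below assumes DB load grows linearly with request rate, which is optimistic, and uses an 80% ceiling as an assumed headroom target; both are assumptions, chosen because they are consistent with the numbers above.

```ruby
# Linear extrapolation of database capacity from one load-test data point.
measured_rpm = 7_200     # 120 requests/second from the test run
measured_db_load = 16    # percent DB load observed at that rate
target_db_load = 80      # assumed ceiling, leaving some headroom

estimated_capacity_rpm = measured_rpm * target_db_load / measured_db_load

puts "estimated DB capacity: #{estimated_capacity_rpm} rpm"
```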

At some point during the day we had maxed out the number of EC2 instances we could get. We had another team doing scalability testing that day, and we were hitting some upper limit. We killed unnecessary running instances, we asked everyone else to stop testing, and we requested an increase in our upper limit via AWS, then went to our account rep and asked them to escalate the request.

By the time we left late Wednesday night, we felt confident we could handle the load on Thursday morning.

Thursday came, and we all had Google Analytics, New Relic, and our backdoor interfaces covering all screens. Because it was all happening at 5:30am PST, we were spread across hotel rooms and homes. We were using Campfire for real-time chat.

The first load came at 5:38am, and the spike happened quickly: from 0 visitors we went to over a thousand visitors in just a minute. Our page response times stayed rock steady at 500ms, and we never experienced any delays in serving customer pages.

We had a pricing bug come up, and had to hot patch the servers and reboot them while under load (one at a time, of course), and still were able to keep serving up pages. Our peak load was well under the limits we had load tested, but was well over what we could have sustained on Monday afternoon prior to the changes we made. So the two and a half days of grueling effort paid off.

Lessons learned:

Had we paid for the Pro plan for New Relic earlier, we probably would have saved ourselves a lot of effort, but either way, I do believe New Relic saved our butts: if we hadn’t gotten the data we needed on Wednesday afternoon, we would not have survived Thursday morning.
Load testing needs to be done regularly, not just when you are expecting a big event. At any point someone might make a database change, forget an index, or introduce some other weird performance anomaly that isn’t spotted until we have the system under load.
There needs to be a relatively safe, automated way to deliver expedited deployments to staging and production for times when you must shortcut the regular process. That shouldn’t be dependent on one person using vi to edit files and not making any mistakes.
We couldn’t have been scaling app instances and database servers so quickly if we weren’t using a cloud service.
It really helped to have the team all in one place for this. Normally we’re distributed, and that works fine when it’s not crisis mode, but I wouldn’t have wanted to do this and be 1,000 miles away.

*As usual, I’m speaking as and for myself, even when I’m writing about my day job, not my employer.

SQL Recipe: Calculating Ratios of Things

Posted on May 13, 2011

hertling

Let’s say we have an SQL table of jobs. They can be in any of several states. They have a start time and a completed time. And they have a type of A or B.

type	started	completed	state
A	9:01am	9:02am	completed
A	9:10am	9:14am	aborted
B	9:20am	9:21am	cancelled
A	9:20am	9:22am	completed

Let’s say we want to know for A and B, what percentage of jobs complete. We’d also like to know what percentage of jobs finish under a certain time threshold. Now this is drastically simplified, but you can imagine that job type could be any complex set of variables you want to analyze by: region, language, type of rendering, whatever.

Here’s a query that will calculate the ratio of completed jobs by type:
select type, avg(cast( case when state='completed' then 1 else 0 end as float) ) as completed_ratio from jobs group by type

The way it works:

The case statement sets a value of 1 for the success condition, and 0 for any other.
These are integers, so we cast them as float, so we’ll get the averaging behavior we want.
We avg them, to get the average.
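
The same 1-or-0 averaging trick is easy to check outside the database. Here is a small Ruby sketch over the sample rows from the table; the Job struct and the completed_ratio_by_type name are just for illustration.

```ruby
Job = Struct.new(:type, :started, :completed, :state)

jobs = [
  Job.new('A', '9:01am', '9:02am', 'completed'),
  Job.new('A', '9:10am', '9:14am', 'aborted'),
  Job.new('B', '9:20am', '9:21am', 'cancelled'),
  Job.new('A', '9:20am', '9:22am', 'completed')
]

# Mirrors avg(cast(case ... as float)): map each job to 1.0 or 0.0,
# then average the flags within each type.
def completed_ratio_by_type(jobs)
  jobs.group_by(&:type).map do |type, group|
    flags = group.map { |job| job.state == 'completed' ? 1.0 : 0.0 }
    [type, flags.sum / flags.size]
  end.to_h
end

completed_ratio_by_type(jobs).each do |type, ratio|
  puts "#{type}: #{ratio.round(2)}"
end
```

For the sample data, two of the three A jobs completed and the single B job did not.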

Similarly, if we wanted to calculate the ratio of jobs completed under a certain time threshold, we could do that in the case statement, such as this example which calculates the ratio of jobs whose run time is under 3 minutes:

select type, avg(cast( case when datediff(minute, started, completed) < 3 then 1 else 0 end as float) ) as completed_ratio from jobs group by type
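
To double-check the threshold logic against the sample rows, here is a comparable Ruby sketch. The minutes_since_midnight and ratio_under_threshold helpers are hypothetical, and the time parsing only handles the simple "9:01am" style used in the table. Like the query, it counts every job regardless of state.

```ruby
# Converts a clock string such as "9:01am" into minutes since midnight.
def minutes_since_midnight(clock)
  hour, minute, meridian = clock.match(/\A(\d+):(\d+)(am|pm)\z/).captures
  hour = hour.to_i % 12
  hour += 12 if meridian == 'pm'
  hour * 60 + minute.to_i
end

# Rows are [type, started, completed], taken from the sample table.
rows = [
  ['A', '9:01am', '9:02am'],
  ['A', '9:10am', '9:14am'],
  ['B', '9:20am', '9:21am'],
  ['A', '9:20am', '9:22am']
]

# Ratio of jobs per type whose elapsed time is under the threshold.
def ratio_under_threshold(rows, threshold_minutes)
  rows.group_by { |type, _started, _completed| type }.map do |type, group|
    flags = group.map do |_type, started, completed|
      elapsed = minutes_since_midnight(completed) - minutes_since_midnight(started)
      elapsed < threshold_minutes ? 1.0 : 0.0
    end
    [type, flags.sum / flags.size]
  end.to_h
end

ratio_under_threshold(rows, 3).each do |type, ratio|
  puts "#{type}: #{ratio.round(2)}"
end
```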

Ruby Recursive Hash with SQL-like capabilities

Posted on February 24, 2011

hertling

I recently was working on a Ruby project where I had to query an SQL server, and the connection setup/query/response/close latency was about 5 seconds. I needed to generate many unique data slices for a reporting project, and the naive implementation generated some 6,000 database queries, which would take about eight hours to run.

I could fetch all the data in one query that had a large group by clause, which would return all the data I needed in 60 seconds instead of 8 hours.

But then I had a pile of data in rows. That’s not very handy when you have to slice it by seven different characteristics.

So I wrote a recursive Ruby hash structure with semantics that allow me to retrieve by arbitrary attributes:

h=RecursiveHash.new([:business, :product_line, :format, :country, :state, :count], 0)

h.insert ['ConsumerBusiness', 'Product A', 'HTML', 'UnitedStates', 1, 10]

h.retrieve(:business => 'ConsumerBusiness')
h.retrieve(:business => 'ConsumerBusiness', :format => 'HTML')

The default mode is to sum the responses, but it can also return an array of matches.
It’s really fast, and has a simple interface that, to me, is highly readable.

Here it is – RecursiveHash.rb:

class Array; def sum; inject( nil ) { |sum,x| sum ? sum+x : x }; end; end

class RecursiveHash
 def initialize(attributes, default_value, mode=:sum)
  @attribute_name = attributes[0]
  @child_attributes=attributes[1..-1]
  @default_value=default_value
  @master=Hash.new
  @mode=mode
 end
 
 def insert(values)
  if values.size > 2
   #puts "Inserting child hash at key #{values[0]}, child: #{values[1..-1].join(',')}"
   if @master[values[0]]==nil
    @master[values[0]]=RecursiveHash.new(@child_attributes, @default_value, @mode)
   end
   @master[values[0]].insert(values[1..-1])
  else
   #puts "Inserting value at key #{values[0]}, value: #{values[1]}"
   @master[values[0]]=values[1]
  end
 end
 
 def return_val(obj, attributes)
  if obj.is_a? RecursiveHash
   return obj.retrieve(attributes)
  elsif obj==nil
   return @default_value
  else
   return obj
  end
 end
 
 def retrieve(attributes)
  if attributes[@attribute_name]==nil or attributes[@attribute_name]=='*' or attributes[@attribute_name].is_a? Array
   keys=nil
   if attributes[@attribute_name].is_a? Array
    keys=attributes[@attribute_name]
   else
    keys=@master.keys
   end
   
   v=keys.collect { |key| return_val(@master[key], attributes) }
   #puts "v: #{v.join(',')}"
   return @mode==:sum ? v.sum : v
  else
   return return_val(@master[attributes[@attribute_name]], attributes)
  end   
 end
 
 def pprint(n=0, parent_key="N/A")
  indent = "  " * n
  puts "#{indent}#{parent_key} (holds #{@attribute_name})"
  @master.each_key { |key| 
   if @master[key].is_a? RecursiveHash 
    @master[key].pprint(n+1, key)
   else
    puts "#{indent}    #{key}: #{@master[key]}"
   end
  }
 end
end

William Hertling's Thoughtstream

A writer musing about science fiction, A.I., and the Internet.
