On Tuesday night, Xavier Shay (@xshay) gave a short talk on database anti-patterns.

Here are my rough notes:

  • STI for Shared Data
    • STI = single table inheritance. google “rails sti”.
    • this is sometimes called a “god table”. It is easy to spot: a single table with many null columns, because different kinds of objects are being stored in it.
    • Example: a database for books with columns id, type, name, illustrator. for books, there is no illustrator, so you have a null field.
    • This gets complicated over time. you have to loosen database constraints (you can’t enforce a value for illustrator), and logic is required to handle the null case, even for books.
    • Using class table inheritance is one solution: books is one table, a second table is comics, and the comics table has the illustrator field. In this case, instead of complicating the handling for books with a null illustrator field, we make comics, which needs the extra data, responsible for getting it from its own table.
  • Not Deleting Data
    • People are afraid to delete data
      • in part, afraid of creating dead references, e.g. can’t use has_many
      • business needs to go back in past
    • It’s bad to try to use one database for both reporting and operational needs.
    • Example: a users database, in which records can be marked as deleted.
      • then either user names can’t be unique, or user names can’t be reused. either way, this gets into extra coding and/or relaxing constraints.
      • plus, all your queries become “find users where state is not deleted”, so all queries become more complex and slower.
    • Solution: have two tables, and move the deleted users into an old_users table, which gives you your history.
  • Different data, same database
    • No notes here, sadly.
  • Not locking data
    • Example: if the user hits pay twice, you can easily get a race condition here:
      • order = Order.find(id)
      • order.mark_as_paid!
    • Fix
      • Order.transaction do 
        • order = Order.find(id, :lock => true)
        • order.mark_as_paid!
      • end
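
To see why the lock matters without setting up Rails, here is a rough plain-Ruby analogue. It is only a sketch: the Order class, its Mutex, and the double_click helper are made up for illustration, and the Mutex stands in for the row lock that `:lock => true` requests from the database (a SELECT ... FOR UPDATE). The sleep is there just to widen the race window.

```ruby
# Plain-Ruby analogue of the race above; no Rails or database involved.
# Order and its Mutex are hypothetical stand-ins for a row and its row lock.
class Order
  attr_reader :charges

  def initialize
    @paid = false
    @charges = 0
    @lock = Mutex.new
  end

  # Unsafe: check-then-act with no lock. Two threads can both see
  # @paid == false before either one sets it.
  def mark_as_paid_unsafe!
    return if @paid
    sleep 0.01 # widen the race window, like a slow payment call
    @charges += 1
    @paid = true
  end

  # Safe: the check and the update both happen while holding the lock.
  def mark_as_paid!
    @lock.synchronize do
      return if @paid
      sleep 0.01
      @charges += 1
      @paid = true
    end
  end
end

# Simulate the user hitting "pay" twice at the same moment.
def double_click(order, method_name)
  2.times.map { Thread.new { order.public_send(method_name) } }.each(&:join)
  order.charges
end
```

Calling double_click(Order.new, :mark_as_paid!) charges once. The unsafe variant will usually charge twice, though thread timing means that is not guaranteed on every run.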

CloudCamp 2010
Portland, Oregon
#oscon #cloudcamp #pdx
CloudCamp is a free birds of a feather session at OSCON, the O’Reilly Open Source Conference. I came out of general interest, and because one of the promised tracks is deploying your own cloud using open source tools.
Promo: New user group: pdxdevops
Lightning Talks
open cloud
Sam Johnston
Google, Zurich
  • open source / open cloud: freedom. You can move from one cloud to another. avoids lock-in.
  • unfettered competition leads to commoditization leads to utility computing.
  • case study: free software 
    • open source is a happy medium between free software and proprietary software that leads to useful stuff, good for business.
    • open source is trademarked, giving it some instant recognizability and specific criteria for being open source
  • criteria for open cloud
    • open interfaces (atompub)
    • open formats (open document)
  • http://opencloudinitiative.org
Adrian Cole
Ops Code
5 APIs for Provisioning
  • Provisioning
    • Allows access to cheap resources
    • APIs -> automation
    • Tools exist
  • Manage Complexity
    • multi-cloud APIs
      • abstract what is commoditized
      • provide a consistent substrate
      • reduce complexity and lock-in
  • Dasein Cloud
    • Written by guy who did first JDBC
    • Focuses on services
  • Apache Deltacloud http://deltacloud.org
    • Ruby implementation
    • provides REST endpoint. Can use curl to manipulate the clouds.
  • fog
    • ruby cloud computing library
    • compute and storage across many providers (about 6)
  • jclouds
    • multi-cloud framework
    • zero lock-in to cloud apis
    • written in java
    • runs in google app engine
  • libcloud http://libcloud.org 
    • was a python library, with java coming soon
    • is about compute
    • works with 16 providers
The Simple Cloud API
Doug Tidwell
The Simple Cloud API brings cloud technologies to PHP and the PHPilosophy to the cloud, starting with common interfaces for three cloud application services: File Storage Services, Document Storage Services, Simple Queue Services.
  • Joint effort of Zend, GoGrid, IBM, MS, Nirvanix, and Rackspace.
    • But you can build libraries to support other clouds
  • Supports 3 areas:
    • File storage (s3, nirvanix, azure blob, rackspace)
    • Document storage (s3 doc, azure doc)
    • Simple queues (sqs, …)
  • Uses Factory and Adapter design patterns
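
A rough idea of how Factory plus Adapter fit together for the file storage case, sketched in Ruby rather than PHP. Everything here is hypothetical: the Fake* clients, the adapter classes, and StorageFactory are invented for illustration and are not the Simple Cloud API's real interface.

```ruby
# Hypothetical sketch; none of these names come from the Simple Cloud API.

# Two pretend vendor SDKs with incompatible method names.
class FakeS3Client
  def initialize
    @objects = {}
  end

  def put_object(key, body)
    @objects[key] = body
  end

  def get_object(key)
    @objects.fetch(key)
  end
end

class FakeBlobClient
  def initialize
    @blobs = {}
  end

  def upload_blob(name, data)
    @blobs[name] = data
  end

  def download_blob(name)
    @blobs.fetch(name)
  end
end

# Adapters: each wraps one vendor client behind the same store/fetch interface.
class S3Adapter
  def initialize(client = FakeS3Client.new)
    @client = client
  end

  def store(path, data)
    @client.put_object(path, data)
  end

  def fetch(path)
    @client.get_object(path)
  end
end

class BlobAdapter
  def initialize(client = FakeBlobClient.new)
    @client = client
  end

  def store(path, data)
    @client.upload_blob(path, data)
  end

  def fetch(path)
    @client.download_blob(path)
  end
end

# Factory: choose the adapter from a config string, so application code
# never names a vendor class directly.
module StorageFactory
  ADAPTERS = { "s3" => S3Adapter, "blob" => BlobAdapter }.freeze

  def self.build(name)
    ADAPTERS.fetch(name) { raise ArgumentError, "unknown storage adapter: #{name}" }.new
  end
end

storage = StorageFactory.build("blob")
storage.store("notes.txt", "hello cloud")
puts storage.fetch("notes.txt")
```

Switching clouds then means changing the string passed to the factory, which is the "build libraries to support other clouds" point: add an adapter, register it, and calling code stays the same.
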
Principal consultant with Center Stance
Cloud Consultants: do implementations in the cloud
Not much of an open source person, more of a cloud person.
  • SalesForce.com, SAAS.
  • VisualForce (a templating language) + Apex (Java-like) = how you build add-ons for SalesForce.
  • App Exchange: app marketplace. 
  • managed and unmanaged packages.
    • managed packages are controlled, no code.
  • 940 packages in the app exchange.
  • less than 10% of those are open source: about 80 packages.
Dyn, Inc.
  • Doing DynDNS for over 12 years. 3.5M people using it.
  • Dynect Platforms: hosts companies like twitter, 37signals, zappos.
  • Geotarget multiple clouds
    • Users in EU, go to Amazon EU, users in the Western USA go to GoGrid, users in the Eastern USA go to …
    • Automatically redirect traffic to servers that are running (active failover)
  • DNS can give you a slider for your traffic: how much do you want to send to the cloud vs. your own servers? you can base it on latency, on location, on etc.
  • DNS resolution time is part of overall latency for users. DynDNS is faster (like 32 ms vs 120 ms in example.) that’s 90ms you’re getting back to be able to do more in your own server.
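
The "slider" idea boils down to weighted answers: each lookup is pointed at one pool of servers with a probability you choose. A minimal sketch of that selection, assuming made-up pool names and weights (a real DNS provider does this on its side, and this is not Dyn's API):

```ruby
# Hypothetical sketch of a traffic "slider": weights decide what share of
# lookups get pointed at each pool. Pool names and weights are made up.
def pick_pool(weights, roll = rand)
  total = weights.values.sum.to_f
  cumulative = 0.0
  weights.each do |pool, weight|
    cumulative += weight / total
    return pool if roll < cumulative
  end
  weights.keys.last # guard against floating-point rounding when roll is near 1.0
end

weights = { "cloud" => 30, "own_servers" => 70 }
puts pick_pool(weights)
```

Moving the slider is just changing the weights. Basing it on location or latency would mean choosing a different weights table per region, which this sketch leaves out.
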
Hahahaha: They asked “who considers themselves an expert on the topic of open source and cloud computing?” Five people raised their hands. “OK, you’re the panel. Come on up.”
  • How is CC going to change the choice of the dev platform?
  • Is open source still relevant in cloud computing?
  • Will open source save us from a handful of monopolies?
  • What are the implications on hardware? What will change for hardware?
Stuart Smith, Rackspace: Is open source still relevant?
  • Only if you value freedom.
  • In fact, it is even more important.
  • When your proprietary software vendor goes out of business, you still have the software, you still have the license key.
  • When your proprietary cloud vendor goes out of business, your company is fucked.
Will open source save us from monopolies?
  • Just being free isn’t enough. There have been other free efforts that have been crushed by monopolies.
  • You have to have people adopt it.
  • The only way it is going to work is if everyone gets involved. otherwise cloud computing will be dominated by a few proprietary stacks.
How does this influence our choice of platforms?
  • With some platforms, like Google App Engine, you either drink the koolaid, or you don’t.
We’re going through this change between latency sensitive and bandwidth sensitive. Everything moving to data centers. highly multicore systems. now losing in the market place to classic out of order design. we’re going to see lots more cores, lots more latency sensitive. gpu assisted. more message passing hardware to avoid going through the OS.
Breakout Discussions
Why open? Open stack. Open cloud.
Open is:
  • creative commons license on the specs themselves. if the specs themselves are copyrighted, you can’t even tell your customers about them.
  • patents: you can’t have key technology locked up.
  • trademarks: when you start talking about “amazon compatibility”, you have problems. so the relevant names must be open for use.
  • implementations: you need to have multiple implementations.
  • open design / transparency / open process: so the community can participate, so i can understand the design, what is going on.
    • open process is hard: because standards bodies are in theory open, but they cost $12,000 to join, so it’s not really open.
    • if it’s not open, then other people can’t innovate and move things forward. that’s limited to the standards setters.
  • then what are the options? a different standards body?
    • we had an unconference, and invited people to participate, and we were able to learn from each other and move things forward.
      • (this was on the format used for virtual machines)
  • Open cloud is:
    • open formats
    • open interfaces
    • open source
    • open data
  • “multiple, interoperable implementations, at least one of which is open source”
    • having an open source implementation does give you a real viable alternative. 
    • example: if there was an open format for microsoft office, and they said, well all you have to do is implement microsoft office yourself, then it isn’t really viable, unless there really is an existing open source implementation.
  • part of the core of open source is the right to fork.
    • if you don’t have the right to go, then you are married to the solution (e.g. whoever will buy MySQL)
    • this would include the right to fork a spec
      • let the best API float to the top.

My notes from the @pdxruby talk on 2010/04/06

Machine Learning and Data Mining
Randall Thomas
Engine Yard
  • Randall’s Slides from Talk
  • netflix, amazon, google: recommending movies, books and music, links based on your personal experience
    • the future is about information…not data (how many gigabytes of data do you have sitting around?)
    • if it’s so cool, how come everyone isn’t doing it? it’s hard
  • world’s shortest stats course
    • two types of statistics
      • descriptive: the average height in this room is 5’ 6”
      • inferential: odds are, this horse is going to come in first. 
    • the two tasks
      • classification: you try to come up with a system for classification (cluster analysis, decision trees)
      • prediction: card counting, i predict that this deck is hot
      • or both: we want to both classify the data and draw inferences about new data
  • two types
    • supervised learning
    • unsupervised learning: the way a bayesian filter works… i have no idea what the inputs were, but i can look at the macro behavior, and then make predictions. this is also the way markov models work, the way spam filters work.
  • R
    • heavy-weight lifting tool for statistics
    • has shell for working in statistics
  • 5 numbers, one picture
    • pallas.telperion.info/ruby-stats
  • RSRuby
    • lets you eval R code
  • Computer friendly data descriptions
    • feature vector: simple 0 or 1 for each feature. beer, wine, whiskey, gin are the features. (1 if you like it, 0 if you don’t)
      • attempt bitwise and of vectors
  • Clustering…
    • Simple Geometric: just use the distance formula. If you have 2 dimensions, or 3 dimensions, there is a simple formula. that formula generalizes to N dimensions
    • R code: plot(sort(mydata$profits))
  • Not Simple Geometric Clustering
    • Support Vector Machines: create maximal separation of otherwise inseparable data by projecting onto different planes.
    • You can separate into two groups: one that is good, and one that is bad. one that is people attacking your IP ports, and one that isn’t. one that is spam, one that isn’t.
    • You can apply the SVM over and over again recursively… this turns into a decision tree.
  • Read: 
    • First: Introductory Statistics with R by Peter Dalgaard (2nd edition) – teaching you the basics in a tutorial fashion
    • Second: A Handbook of Statistical Analyses Using R by Brian S Everitt and Torsten Hothorn
      • load the free PDF in R: vignette(package = "HSAUR")
    • The Elements of Statistical Learning by Hastie, Tibshirani, Friedman
      • www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
  • Regression in R
  • Examples of companies doing this…
    • Collective Intellect: doing mining of memes
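
The feature vector and "simple geometric" clustering ideas from the talk are small enough to sketch in plain Ruby. The drink list comes from the notes; the helper names and the two example people are my own, added for illustration.

```ruby
# Features in a fixed order; each person becomes a 0/1 vector over them.
FEATURES = %w[beer wine whiskey gin].freeze

def feature_vector(likes)
  FEATURES.map { |f| likes.include?(f) ? 1 : 0 }
end

# Bitwise AND, element by element: 1 only where both people like the drink.
def shared_likes(a, b)
  a.zip(b).map { |x, y| x & y }
end

# Euclidean distance: the same formula works for 2, 3, or N dimensions.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

alice = feature_vector(%w[beer whiskey]) # [1, 0, 1, 0]
bob   = feature_vector(%w[beer wine])    # [1, 1, 0, 0]

p shared_likes(alice, bob) # [1, 0, 0, 0]
p distance(alice, bob)     # sqrt(2), about 1.414
```

People whose vectors are a short distance apart have similar tastes, which is the starting point for grouping them into clusters.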

Wow, I loved this presentation! Feel like a programming kung fu master… – Will
Revenge Of Kick-Ass Mash-Ups with Punk Rock APIs
Kent Brewster
  • Notable Mash-Ups
    • Google Maps Mash-Up: first recorded AJAX mash-up, probably inspired most of the state of the modern art.
    • Flickr Blog badges
  • Punk Rock: DIY ethic
    • Other generative things
      • lego blocks, erector sets, refrigerator boxes
      • original apple //e
    • is your site sterile?
      • users are cows, not customers
      • real customers are coke and GM
      • any unauthorized use is abuse
  • Your existing API
    • you already have an API: HTML
    • you’re already being screen-scraped
      • you know this
  • If you open up an API, you get pinpoint data about how it is being used.
    • Sterile APIs: HTML, RSS
    • Generative APIs: Free.
    • Punk Rock APIs: use generative APIs to turn sterile APIs into generative APIs.
  • Job interview at netflix: asked to review code. looking through real source code, he found that they had cribbed his own code. hired.
  • Netflix Bubble Widgets
    • single line javascript include
  • Pipes.yahoo.com
    • this is why yahoo is still relevant. they are doing amazing stuff like pipes.
  • Some very little javascript can do amazing things because it relies on Yahoo Pipes to do the heavy lifting.
  • YQL: yahoo query language. amazing tool.
    • select * from twitter.search where q=‘earthquake’;
    • This works because the community contributes tables (see community tables) that actually do the fetching/parsing of the data.
  • bit.ly/kb_twit 
  • bit.ly/kb_sxsw
    • used YQL, and a bit of xpath.
    • filtered results, nice presentation, runs fast.
  • Advice for Hackers
    • Go easy on the server. Since every request comes from a separate IP address, client-side mash-ups look like botnet attacks.
    • Respect robots.txt
      • Pipes and YQL respect robots.txt
    • Create and pass an application ID even if it’s not required. 
    • Let the site know what you’re doing. They might hire you. 
  • Advice for Site Owners
    • Build your API first. Build your site on your API, and then open it up to the community. Example: Flickr.
    • Whitelist Pipes and YQL: It’s the right thing to do.
      • They are giving you a free API caching mechanism
      • Twitter has done it. If you are running up against twitter API limits, try it.
  • How to open an API where you work
    • Build an interesting mash-up
    • Write the documentation for the API you wish you had.
    • Don’t write a spec. Write the actual docs.
    • Give it to the back-end guys.
  • To Be Useful for Client-Side Mash-Ups
    • Return Javascript
    • Wrap the requested JSON in the client’s preferred Javascript callback
  • To be useful for repeated calls… (some complicated stuff I didn’t get)
    • something having to do with square brackets
  • Every Javascript reply must have HTTP Status 200
    • If it comes back with anything else, the browser won’t see the response and the calling script will hang forever.
  • Demo the Last: Missing Kids CAPTCHA
  • Questions…
    • What if a call never returns?
      • You have to set a timeout. Probably requires a global variable. 
    • Examples of business mashups? Examples of doing it to correct a company’s bad UI?
      • People are more interesting to me… so not so aware.
      • Don’t surprise anyone in your IT group. If you show it to your boss, and they think it is awesome, you’ve really stuck the IT group in a corner.
    • If you’re a company, and you’ve never done this before, go talk to Mashery, or other companies like that.
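
The two server-side rules above (wrap the JSON in the caller's callback, and always answer with HTTP 200) can be sketched on the server like this. It is an illustrative helper, not code from the talk: the function names are made up, and the Rack-style [status, headers, body] return value is just a convenient shape.

```ruby
require "json"

# Only allow plain identifier-like callback names, so a caller cannot
# inject arbitrary script through the callback parameter.
SAFE_CALLBACK = /\A[A-Za-z_][A-Za-z0-9_.]*\z/

# Hypothetical helper: wraps data in the caller's callback and returns a
# Rack-style [status, headers, body] triple.
def jsonp_response(data, callback)
  raise ArgumentError, "unsafe callback name" unless SAFE_CALLBACK.match?(callback)
  body = "#{callback}(#{JSON.generate(data)});"
  [200, { "Content-Type" => "application/javascript" }, [body]]
end

# Errors still go out with status 200; the failure is reported inside the
# payload so the calling script actually gets to see it.
def jsonp_error(message, callback)
  jsonp_response({ "error" => message }, callback)
end

status, _headers, body = jsonp_response({ "ok" => true }, "handleReply")
puts status
puts body.first
```

The callback check is an extra precaution of mine, not something from the talk.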

Building Apps in Your Spare Time
  • Gina Trapani
    • write stuff mainly to procrastinate writing
    • Firefox scripts to improve gmail (better gmail 2)
    • ThinkTank – ask your friends
  • Matt Haughey
    • Side projects
      • Wrote fuelly: social miles per gallon.
      • MetaFilter (1999), written when blogs were still new
  • Adam Pash
    • MixTape: playlists shared with friends
    • Belvedere 
    • Texter: shorthand for your computer. Like textexpander for the mac.
  • Why should I develop an app in my spare time?
    • Just built a tool for ourselves (and 25,000 other users).
    • Just wanted something as clean as possible. Not an overbearing UI like slashdot.
    • Fill a need… Gmail
    • Want an archive of tweets.
    • Very important to scratch your own itch
    • Ego motivation… opportunity to get users right away, get feedback
    • You can build anything… that is really exciting.
      • Pash: I am not a programmer by trade, and I am not a great programmer, but I can still make anything.
      • Trapani: it’s amazing what you can do now between APIs and the languages available
    • Don’t expect to make money. Metafilter was a success, but it took 6 years before they made money. There can be a huge slog. If your motivation is only money, you’ll shutter the project. If you build an app you use every day, then at least you can still use it every day.
    • “The internet is so ready to give you an answer to any problem” — Pash
    • You can work on stuff that will further your career
    • If you don’t have an idea you are excited about, then you aren’t going to make it happen.
  • All the beloved things… twitter, flickr… they didn’t start as a plan to make a lot of money.
  • How can I do it?
    • You have to dedicate time.
    • If you are really excited about it, you can find the time.
    • The first thing to go for most people is the television. Two hours of veg time at the end of the day is the easiest thing to go.
    • It can be a relaxing time… just enjoy it, watch TV, plan to put a year into it.
    • Use frameworks… don’t reinvent the wheel. Rapidly prototype. Google what you need to do, and copy and paste code. Use libraries and plugins that exist, there are plugins for everything.
    • Collaboration is a big deal
      • it’s so much more fun to work with someone
      • it’s so helpful to bounce ideas off someone
    • You really don’t need to be a coder or to hire someone to start. You can go from zero to competent in just about any language in about six months. 
    • Dan Bricklin, inventor of the spreadsheet (will: about a billion years ago), was like “iphone development, this sounds interesting“, and went out and bought a MacBook, an iPhone development book, and wrote an app, and put it in the store for $3
    • Did you ever pay anyone?
      • Yeah, I don’t really have the skills or competency anymore in design, so I hire some designers. Same for CSS… I don’t have the skills any more to make this work in dozens of browsers. I sent it to some kids in (the middle of nowhere), and paid them $100.
      • I’ve never hired anyone because I’m cheap, but I barter with people. “I’ll build something for you if you design something for me.”
    • Open source
      • Trapani: everything I’ve done is open source. At lifehacker, we have this big community of people doing open source. Why not use those resources?
        • There is nothing more awesome than waking up to check your email and finding a code contribution.
      • But you can’t rely on that. It’s a big commitment for someone to get the code, work on it, and submit a change.
  • Pitching your idea to your company… to get them to sponsor it
    • You’ve got to make the case for why to do it
    • Google’s 20% time is a good example to cite
    • Or it may be synergistic: e.g. for lifehacker it raises their credibility for their employees to be doing open source
  • Questions…
    • Talk about ownership when you are working at a company
      • Check your company’s policy beforehand. Some have weird policies like even what you do on your own time is owned by the company.
      • If you can convince your company to open source it, then it isn’t an issue at all.
    • I am a developer, and I like to build super-visualize things, but I am not a designer. How can i find someone to work with?
      • There are some sites to help. But that is kind of a crapshoot.
      • you network a lot. 
      • Go to an ignite in Portland. 
      • Look up the portfolio of designers you meet.
      • Don’t go to rubycon to find a designer.
      • Go to social events or design events.
    • Talk about programming where you might not want to open source the code. Talk about some successful examples of that.
      • I had security issues – a giant login system with crappy code. I wanted to keep that code secret.
      • One motivation to make your code good is to open source it.
      • But if you can’t do open source… then you have to hire programmers, or find one fan of your work to work with. and still keep it closed.
    • What about liability…worried about being sued.
      • I made a music sharing site that uses mp3s shared on servers around the country. So I made an LLC, and now MixTape belongs to that LLC. 
      • Having a terms of service can help. Lawyers can help you do terms of service and LLC for less than $1k. 
      • Or copy and paste from Google or someone else. Something is better than nothing.
    • Tradeoff with APIs… you are at the mercy of the service. You get a lot, but then the service could go away.
    • How do you get users? I’m the sole user of like a half a dozen apps.
      • It’s not easy. Integrate them into whatever you do. For fuelly, we made badges people could put on their blogs. Talk about it on twitter.
      • Talk to developers about things you made. No one wants to talk to a PR person. We want to talk to developers.
    • As a designer, I want to learn programming. Where should I go?
      • Google is great. 
    • I’m not hearing why the stuff you make is as awesome as it is. What are the decisions you can make, what are the freedoms you have, that you don’t have to make money off it
      • You are the user. You are the designer. You can make the application what you want it to be. It can be very satisfying.  
    • At what point do you reach break even on the server costs?
      • I’m spending $100/month for the server, and using AdSense will cover the costs. 
      • You can do “donate a dollar” via paypal, but that is sporadic.
      • It’s weird to do a project where covering the hosting cost is considered a success.
      • Amazon referrals, ads, are a passive way to do it.
    • Share a couple of websites that would be good resources
      • prototype
      • jquery
      • open languages have great documentation… documentation plus comments is amazing.
      • free git book online
      • stackoverflow
      • peepcode
      • just google your programming question

Sorry, I got to “Beyond LAMP” about 25 minutes late – my notes don’t include anything from the first part of the meeting.

Updated: Here are some additional great notes covering the beginning part of the session, as well as some more organized notes from the end part.

  • twitter uses cassandra
    • no disk seeks when you do a write
    • no master, you can do write on any machine
    • when you post a tweet, it gets written into the queue for each of your followers. so if you have 1,000 followers, then it’s written 1,000 times.
    • cassandra designed to use commodity servers
  • monitoring
    • one of the tricks is to know when a machine needs to be replaced when all you have are hundreds of commodity servers
    • monitor, monitor, monitor
    • cpu load, file descriptors, bandwidth, database connections, database performance, disk space, etc.
    • need monitoring system, ability to graph, all centrally, so you don’t have to go to individual machines.
  • Being on the front page of digg crashed server (digg sends a lot of traffic)
    • Why not memcache the front page? It gets loaded all day long, and it is always the same.
    • Rewrote the system to only read from database when it’s not memcached. Refresh memcache once per day.
    • Before change, was 60% db writes, 40% reads. After change was 99% db writes, only 1% reads. All the reads were now coming from memcache. 
  • At twitter, expect that people will come along and read what you’ve written. So they do write-thru caching. The tweet is put first into the cache, then into the database. This way they never need to read the database to get the recent tweets.
  • when you get beyond a certain point, you can’t analyze the data on a single machine. you have terabytes and terabytes of data.
    • Hadoop lets you run distributed jobs, that automatically retry when systems go down or fail.
    • Without this, as the data grows, you end up asking simpler and simpler questions. 
    • With this, you can ask more sophisticated questions. 
  • Scaling search…
    • they can process 1 to 2 searches per second
    • search is hard
  • What was the first thing to blow up for you?
    • 1st was mysql, 2nd was apache. 
      • made the switch over to nginx for serving up images. much, much faster. Using Apache was like using a sledgehammer to serve up images.
    • connection issues with postgres.
    • migrating data schema when at scale is really hard… turn off indexes before copying data
    • twitter: one thing that kept our ops team awake at night…
      • we are a rails app
      • how do we maintain relationships?
      • we had it normalized with a follower table: user_id, follower_id in a single table.
      • lookups against this table were slow
      • they built an intermediate solution… denormalized data structure
      • while they worked on a longer term solution… they built a custom social graph data tool.
      • it needs to work across 7 orders of magnitude: from someone with 1 follower to someone with 1M followers
  • Questions
    • Deployment
      • twitter uses murder: bit-torrent for deployment. seed some servers, then those servers help feed others. brought main app deployment time from 12 minutes to 37 seconds. check out twitter opensource – they open source most of their tools
    • Hardware databases
      • twitter is using some, facebook experimenting with them. they are PCI express cards almost as fast as main memory.
    • databases versus key stores
      • it’s natural to go to denormalized – you just want the data you want in the form you want
      • over time, more of the logic goes into the application code, so database indexes are less useful
    • how do you manage when data is on particular servers
      • at twitter, using cassandra, are already consistent
        • there are a bunch of new systems that have different tradeoffs. some have eventual consistency, some don’t. if your application can’t handle eventual consistency, then cassandra isn’t for you.
    • did anyone consider any of the top ten databases, like say oracle
      • twitter: we strongly prefer open source systems. as we scale, we like to be able to peek under the hood, and see what is going on.
      • facebook: we like open source, we like the way open source projects work together, we like to be nimble – these proprietary systems are not so nimble.
      • it’s a combination of openness, ideology, and cost.
    • berkeleydb versus memcached
      • memcached is just a wrapper on top of berkeley db
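
The two caching tricks from the notes above (read from the database only on a cache miss, and write-through caching so recent writes are already in the cache) look roughly like this. It is only a sketch: plain Hashes stand in for memcached and the database, and the CachedStore class and its method names are invented for illustration.

```ruby
# Hashes stand in for memcached and the database; all names are hypothetical.
class CachedStore
  attr_reader :db_reads

  def initialize
    @cache = {}
    @db = {}
    @db_reads = 0
  end

  # Put a row in the database only, as if it had been written long ago.
  def seed_db(key, value)
    @db[key] = value
  end

  # Read path: serve from the cache, fall back to the database on a miss,
  # then remember the answer so the next reader skips the database.
  def read(key)
    return @cache[key] if @cache.key?(key)
    @db_reads += 1
    value = @db[key]
    @cache[key] = value unless value.nil?
    value
  end

  # Write-through: update the cache first, then the database, so readers
  # of fresh data never have to touch the database.
  def write(key, value)
    @cache[key] = value
    @db[key] = value
  end
end

store = CachedStore.new
store.seed_db("front_page", "<html>...</html>")
3.times { store.read("front_page") }
puts store.db_reads # 1: only the first read reached the database
```

The sketch leaves out expiry (the once-a-day refresh mentioned above) and what happens if the cache and the database disagree.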