Editorials

Important distinctions about working in the cloud

Some real-world distinctions come into play as you implement different solutions. We're learning as we go, and here are some of the things we're hitting (and hearing from others).

First, the distinction outlined in yesterday's comments, about providers having to control the performance and resource usage of their services, is at the core of one of the differences we've come up against. Maurice Pelchat outlined it really well in his comment:

The problem with a cloud setup is that they need to throttle everything.

On an on-premise server, you can max out resources like IOPS or CPU or disk bandwidth, because you're the only one using it. You rarely use the maximum of all of them at once. Sometimes an application needs a lot of CPU for a moment, just that. Sometimes it needs high IOPS for a moment, without much disk bandwidth. Sometimes a lot of disk bandwidth is required for a short time, with fewer IOPS. If you are a cloud provider, you don't want a given customer taking all of these types of resources at the same time, which would seriously degrade quality of service for other customers. So you just limit CPU, IOPS, and disk bandwidth. This makes a cloud server a poor solution for short, intensive bursts of processing that demand just one of them.

The net result: a single humble physical machine with a couple of SSDs often outperforms cloud servers costing many times more, making it far from obvious which one gives the best ROI.
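
If you want to see Maurice's point in actual numbers, here's a rough sketch of the kind of burst test you could run on both an on-premise box and a cloud instance, then compare. It's purely illustrative and uses nothing but the Python standard library; the file path, block size, file size, and duration are placeholders to adjust for your own environment.

```python
import os
import random
import time

# Rough burst test: a few seconds of small random writes, then report the
# write rate and throughput achieved. Run the same script on the on-premise
# machine and on the cloud instance and compare. The path, block size, file
# size, and duration below are placeholders, not recommendations.
TEST_FILE = "burst_test.dat"
BLOCK_SIZE = 8 * 1024          # 8 KB writes, roughly a database page
FILE_SIZE = 256 * 1024 * 1024  # 256 MB working set
DURATION = 10                  # seconds of sustained burst

def burst_write_test():
    block = os.urandom(BLOCK_SIZE)
    max_offset = FILE_SIZE - BLOCK_SIZE
    writes = 0
    with open(TEST_FILE, "wb") as f:
        f.truncate(FILE_SIZE)                         # pre-size the file
        start = time.time()
        while time.time() - start < DURATION:
            f.seek(random.randrange(0, max_offset))   # random offsets, not one long sequential write
            f.write(block)
            f.flush()
            os.fsync(f.fileno())                      # force it to the device, not the OS cache
            writes += 1
        elapsed = time.time() - start
    os.remove(TEST_FILE)
    print(f"{writes / elapsed:,.0f} writes/sec, "
          f"{writes * BLOCK_SIZE / elapsed / (1024 * 1024):.1f} MB/sec over {elapsed:.1f}s")

if __name__ == "__main__":
    burst_write_test()
```

On a throttled cloud volume, you may see the first second or two run fast and then the rate drop hard once any burst allowance runs out, which is exactly the behavior Maurice describes.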

Now, I’m a huge believer in this whole scaling opportunity presented by cloud services. We’ve used it, and often, to deal with large events, readership floods and the like. It makes things work and work well when it’s done right.

As a very real example that backs up Maurice's point: when we first moved SSWUG to new infrastructure, new databases, and the new CMS, if you were around then, there's no way you missed the blood, guts, and pain we went through trying to get the sizing right. A flood of readership in the morning with the newsletter, slower later in the day, events bringing even more, and so on. We hit the wall time and time again with hosting, with database services, all of it.

Where a traditional, on-premise server configuration would have "merely" slowed down, we quite literally hit the wall. No more connections, no more processing; the only answer was to scale, scale, and scale some more. As we did, we ended up with massive server and infrastructure changes (and bills) that cost thousands of dollars and more hours than I care to admit. It was (legitimately) the ISP protecting themselves from flash crowds, but it wasn't something I expected. I thought performance would slow down; I never expected to hit absolute walls.

We had to punt to new options and configurations in real time, in front of the world while everyone seemingly watched, as we tried to address things. It was a painful lesson.

But it was the perfect example of the issues Maurice points out. Those protections put in place on your configuration, whether capacity models on Azure or instance sizing on AWS, matter. They respond differently than you might expect coming from on-premise systems. Things don't just slow down and bog down; they stop.

All of this is NOT to say don't do it. It's to say "Understand it. Test it. Break it. Make sure you know how it all works."
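
One crude but useful way to do that, before your readers do it for you, is to go looking for the hard limits yourself. The sketch below is an illustration only, using plain Python sockets against a test endpoint you own; the host, port, and ceiling are placeholders, and where the wall actually sits depends entirely on your provider and tier. It keeps opening connections until one is refused, which tells you whether you degrade gracefully or slam into a wall.

```python
import socket

# Crude "find the wall" probe: keep opening connections to an endpoint until
# one is refused or times out, then report how far we got. Host, port, and
# ceiling are placeholders -- point them at a test instance, never production.
HOST = "your-test-endpoint.example.com"
PORT = 1433          # SQL Server default; use whatever your service listens on
CEILING = 5000       # give up after this many, even if nothing fails
TIMEOUT = 5          # seconds per connection attempt

def find_connection_wall():
    open_sockets = []
    try:
        for i in range(1, CEILING + 1):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(TIMEOUT)
            try:
                s.connect((HOST, PORT))
            except OSError as exc:
                print(f"Hit the wall at connection {i}: {exc}")
                return
            open_sockets.append(s)   # hold it open so the count keeps climbing
            if i % 100 == 0:
                print(f"{i} connections open and holding...")
        print(f"Reached the ceiling of {CEILING} connections without being refused.")
    finally:
        for s in open_sockets:
            s.close()

if __name__ == "__main__":
    find_connection_wall()
```

Point it only at something you're allowed to hammer. Against a database endpoint, the interesting number is how far below your expected peak readership that wall turns out to be.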

What have you seen with your own testing?