Why can't anyone launch an online service without outages?
PS4 was just the latest big launch tainted by network issues; Akamai's chief gaming strategist explains why rollouts still go wrong
Last Friday, Sony launched the PlayStation 4 with a required day-one patch, and the PlayStation Network promptly buckled beneath the resulting rush of new owners trying to get their systems up and running. It's become a common scene in gaming, especially when anticipated new releases require a constant online connection, as with the Diablo III and SimCity launch debacles. And with Xbox One debuting this Friday and requiring its own firmware update right out of the box, Microsoft will become the latest company to see if its online infrastructure can withstand a new product rollout.
Surely these are companies that know their business, have plenty of experience running online services, and are well aware that those services can crumble under the weight of a million gamers. So why does it keep happening? Why haven't they learned how to avoid these black eyes? Will they ever figure it out, or are these embarrassing launch outages just a fact of life now? Earlier this week, GamesIndustry International spoke with Kris Alexander, chief strategist for connected devices and gaming at cloud computing firm Akamai, to get some answers.
"You can plan to scale and have the best laid plans, the best intentions, but it's more than just saying, 'Hey, I've got this pool of resources.'"
Kris Alexander
Alexander said part of the problem arises from the demands of modern video games. To combine responsive gameplay with the sprawling set of services currently on offer, some tasks need to be handled on the user's console, while others are offloaded to remote servers. For those to operate in concert, there needs to be a solid connection between the two.
"Typically the bottlenecks are at the origin, where the server is, and then the last mile, where the clients are," Alexander said. "There's a lot of complexity to the architecture, how you put something like this together and plan for it. You really need to plan for this at the onset. The old frameworks of developing an app and then deciding to make it interactive in multiplayer [doesn't work]. You can't add it on. It's not as if you add it in during the development cycle; you have to plan for it at the beginning. That's one of the most important aspects because some components scale well, while others scale not as well."
Unfortunately, a company can understand that problem and plan for elastically scaling its computing capabilities all it wants, but as Alexander noted, "scaling is only as good as the algorithm behind it."
"You can plan to scale and have the best laid plans, the best intentions, but it's more than just saying, 'Hey, I've got this pool of resources,'" Alexander said. "That pool of resources has to be distributed in some way for users coming in from different areas, not just geographies but different networks. If you all of a sudden have a bunch of users coming over AT&T versus Comcast versus Verizon, and the interconnects to wherever your servers are get overloaded in terms of peering, then that peering might get re-routed around. You haven't planned for that, the ISPs haven't planned for it, and you can run into problems."
No matter how good the planning and preparation are, the key then becomes adaptability. But even when a company has its own business well in hand, Alexander said there's only so much it can do to mitigate problems on the user side: the "last mile" bottleneck he alluded to earlier.
"There are actually more issues in the last mile than people realize," Alexander said. "There's a major issue going on which I think is going to get significantly worse before it gets better. Every app on every device just in your home has no concept that there are other apps or devices that need resources. They all act like they're the only ones who need resources, so when a new app fires up asking for resources in terms of connectivity, it's going to ask for a lot. And what happens is that degrades everything else on your network. Because of this, it creates all kinds of fluctuations in terms of availability of resources. So whether it's your PC, your PlayStation, or your Xbox firing up, it's fighting with a bunch of other things in the house that are constantly asking for resources."
"If there was a perfect model, someone would be making a lot of money off it right now. But there isn't."
Kris Alexander
Neither that problem--Alexander called it "contention"--nor the elasticity problems can be completely solved just by throwing money at them. And even though Akamai has services that can help companies with certain parts of the issue, Alexander doesn't pretend that the company has all the answers.
"I think we're always going to see them at some level or another," Alexander said. "If there was a perfect model, someone would be making a lot of money off it right now. But there isn't. There's just too many variables in terms of what end users might actually do and the conditions of those end users on the devices they're coming in on."
While these sorts of launch snafus are likely to keep happening, Alexander said the industry is still learning from each misstep thanks to analytics. Even in a botched launch, the data collected helps companies make better decisions and assumptions for the next one.
"It's something I'm seeing a lot of companies do," Alexander said. "It's an important part of the equation because the data you collect allows you to create models to improve your planning. It's all about planning, and then it's all about flexibility. Not just what's the plan when you roll out, but what are your contingencies for how you flex when things happen?"
Developer: "We need XXX servers spread out over these geographical zones with YYYY bandwidth to ensure a smooth rollout."
Money people: "That's way too expensive! Surely you can get by with half that."
Developer: "Nope, really can't."
A middle ground is reached, and a glitchy, overloaded launch ensues. All parties agree that this totally won't happen next time... till it does.
The biggest problem is that people seem to think putting up a highly scalable and secure server system is easy, when it is anything but. Too many people hire cowboy kids who think they can build everything in a high-level tool like Rails and who have no real scalability or security experience.
Then, when the site fails, everyone says "why did this happen?" Just like a bridge that falls down, it happened because you didn't pay the experts to do it right. Now you'll spend a lot MORE money trying to recover from your success disaster.
But these days you can use an on-demand host like AWS, Joyent, Azure, etc., so insufficient resources are no longer an excuse. Today, server systems fail under load primarily because of insufficient engineering (which includes insufficient load testing and staging).
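As a rough illustration of the load testing the commenter has in mind, here is a minimal sketch that fires a burst of concurrent requests at a staging endpoint and reports errors and latency. The URL and traffic numbers are placeholders; real pre-launch testing ramps load gradually and models realistic client behavior.

# Minimal load-test sketch against a hypothetical staging endpoint.
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

STAGING_URL = "https://staging.example.com/health"  # placeholder endpoint
CONCURRENCY = 50
REQUESTS = 500

def one_request(_):
    start = time.monotonic()
    try:
        with urlopen(STAGING_URL, timeout=5) as resp:
            resp.read()
        ok = True
    except Exception:
        ok = False
    return ok, time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(REQUESTS)))

latencies = sorted(t for ok, t in results if ok)
errors = sum(1 for ok, _ in results if not ok)
if latencies:
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"errors: {errors}/{REQUESTS}, median: {statistics.median(latencies):.3f}s, p95: {p95:.3f}s")
else:
    print(f"all {REQUESTS} requests failed")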
And TEN launched completely smoothly more than a decade ago.
Here's hoping for a smoother ride on Xbox One.
In layman's terms: "we predict we'll need 200 servers, we only have 100, so we get an extra 50 and then hope for the best. The odds that it won't be a complete disaster are pretty good, and if demand is much higher than we anticipated, then we still win (although PR will have to work longer hours)."
It's perfectly valid pragmatic reasoning, with the one drawback that, if applied regularly, a percentage of people will stop pre-ordering and will wait it out (already happening). It's really always just a matter of weeks (at most) though, so it's a reasonable choice for many and, in turn, it makes the resources needed for launch smaller next time (or the ride smoother for early adopters).
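Putting the commenter's numbers into back-of-the-envelope form; the per-server capacity below is a made-up figure used purely to attach units to the gap.

# The comment's trade-off as back-of-the-envelope arithmetic.
# Per-server capacity is a made-up figure purely to put units on the gap.
predicted_servers = 200     # what the engineers asked for
provisioned_servers = 150   # the 100 on hand plus the extra 50
users_per_server = 10_000   # hypothetical capacity per server

peak_demand = predicted_servers * users_per_server      # 2,000,000 users expected
peak_capacity = provisioned_servers * users_per_server  # 1,500,000 can actually be served
shortfall = peak_demand - peak_capacity

print(f"At the predicted peak, ~{shortfall:,} users ({shortfall / peak_demand:.0%}) "
      "end up queued, throttled, or erroring out -- the glitchy launch week.")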