Why can't anyone launch an online service without outages?

PS4 was just the latest big launch tainted by network issues; Akamai's chief gaming strategist explains why rollouts still go wrong

Feature by Brendan Sinclair, Managing Editor

Published on Nov. 20, 2013

Last Friday, Sony launched the PlayStation 4 with a required day-one patch, and the PlayStation Network promptly buckled beneath the resulting rush of new owners trying to get their systems up and running. It's become a common scene in gaming, especially when anticipated new releases require a constant online connection, as with the Diablo III and SimCity launch debacles. And with Xbox One debuting this Friday and requiring its own firmware update right out of the box, Microsoft will become the latest company to see if its online infrastructure can withstand a new product rollout.

Surely these are companies that know their business, that have plenty of experience dealing with online services, and that are well aware those services can crumble under the weight of a million gamers. So why does it keep happening? Why haven't they learned how to avoid these black eyes? Will they ever figure it out, or are these embarrassing launch outages just a fact of life now? Earlier this week, GamesIndustry International spoke with Kris Alexander, chief strategist for connected devices and gaming at cloud computing firm Akamai, to get some answers.

Alexander said part of the problem arises from the demands of modern video games. To combine responsive gameplay with the sprawling set of services currently on offer, some tasks need to be handled on the user's console, while others are offloaded to remote servers. For those to operate in concert, there needs to be a solid connection between the two.

"Typically the bottlenecks are at the origin, where the server is, and then the last mile, where the clients are," Alexander said. "There's a lot of complexity to the architecture, how you put something like this together and plan for it. You really need to plan for this at the onset. The old frameworks of developing an app and then deciding to make it interactive in multiplayer [doesn't work]. You can't add it on. It's not as if you add it in during the development cycle; you have to plan for it at the beginning. That's one of the most important aspects because some components scale well, while others scale not as well."

Unfortunately, a company can understand that problem and plan for elastically scaling its computing capabilities all it wants, but as Alexander noted, "scaling is only as good as the algorithm behind it."

"You can plan to scale and have the best laid plans, the best intentions, but it's more than just saying, 'Hey, I've got this pool of resources,'" Alexander said. "That pool of resources has to be distributed in some way for users coming in from different areas, not just geographies but different networks. If you all of a sudden have a bunch of users coming over AT&T versus Comcast versus Verizon, and the interconnects to wherever your servers are get overloaded in terms of peering, then that peering might get re-routed around. You haven't planned for that, the ISPs haven't planned for it, and you can run into problems."

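To make Alexander's point concrete, here is a minimal sketch in Python of how a server pool can be large enough in total while one network's interconnect is still overloaded. The ISP names, capacities, and demand figures are invented for illustration and do not come from Akamai or any real launch.

```python
# Hypothetical illustration: every name and number here is an assumption.

def overloaded_links(demand, link_capacity):
    """Return the networks whose demand exceeds their interconnect capacity."""
    return sorted(
        net for net, users in demand.items()
        if users > link_capacity.get(net, 0)
    )

# The total server pool is big enough for everyone...
pool_capacity = 1_000_000

# ...but each ISP reaches the servers over its own interconnect (made-up limits).
link_capacity = {"isp_a": 400_000, "isp_b": 400_000, "isp_c": 400_000}

# Launch-day demand skews heavily toward one network.
demand = {"isp_a": 600_000, "isp_b": 200_000, "isp_c": 100_000}

total_demand = sum(demand.values())
pool_ok = total_demand <= pool_capacity          # aggregate check passes
hot = overloaded_links(demand, link_capacity)    # per-network check does not

print(f"pool sufficient: {pool_ok}, overloaded interconnects: {hot}")
```

In this toy setup the aggregate check passes while one interconnect is 200,000 users over its limit, which is the kind of gap a plan based only on total capacity would miss.
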
No matter how good the planning and preparation are, the key then becomes adaptability. But even when a company has its own business well in hand, Alexander said there's only so much it can do to mitigate problems on the user side, the "last mile" bottleneck he alluded to.

"There are actually more issues in the last mile than people realize," Alexander said. "There's a major issue going on which I think is going to get significantly worse before it gets better. Every app on every device just in your home has no concept that there are other apps or devices that need resources. They all act like they're the only ones who need resources, so when a new app fires up asking for resources in terms of connectivity, it's going to ask for a lot. And what happens is that degrades everything else on your network. Because of this, it creates all kinds of fluctuations in terms of availability of resources. So whether it's your PC, your PlayStation, or your Xbox firing up, it's fighting with a bunch of other things in the house that are constantly asking for resources."

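The home-network behavior Alexander describes can be sketched as a toy model in Python. The first function assumes each app simply grabs what it asks for in the order it starts up; the second uses max-min fair sharing, a standard textbook allocation included here only for contrast and not something Alexander or Akamai proposes. The link speed and request sizes are made up.

```python
# Toy model of a shared home connection; all figures are hypothetical.

def greedy(link, requests):
    """Each app takes what it asks for, in arrival order, until the link is full."""
    left = link
    granted = []
    for wanted in requests:
        got = min(wanted, left)
        granted.append(got)
        left -= got
    return granted

def max_min_fair(link, requests):
    """Satisfy the smallest requests first, then split what remains evenly."""
    granted = [0] * len(requests)
    remaining = link
    pending = sorted(range(len(requests)), key=lambda i: requests[i])
    while pending:
        share = remaining / len(pending)
        i = pending[0]
        if requests[i] <= share:
            # This app needs less than an even share, so it gets all it asked for.
            granted[i] = requests[i]
            remaining -= requests[i]
            pending.pop(0)
        else:
            # Everyone left wants more than an even share; split evenly.
            for j in pending:
                granted[j] = share
            break
    return granted

# A 20 Mbps link: a console patch download starts first, then two video streams.
requests = [18, 5, 5]
print(greedy(20, requests))
print(max_min_fair(20, requests))
```

Under the greedy model the console takes 18 Mbps and the last stream gets nothing; under the fair split the two streams get their 5 Mbps each and the download gets the remaining 10.
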
Neither that problem, which Alexander called "contention," nor the elasticity problems can be completely solved just by throwing money at them. And even though Akamai has some services that can help companies solve certain parts of this problem, Alexander doesn't pretend that the company has all the answers.

"I think we're always going to see them at some level or another," Alexander said. "If there was a perfect model, someone would be making a lot of money off it right now. But there isn't. There's just too many variables in terms of what end users might actually do and the conditions of those end users on the devices they're coming in on."

While these sorts of launch snafus are likely to keep happening, Alexander said the industry is still learning from each misstep thanks to analytics. Even on a botched launch, the data collected gives companies information to help them make better decisions and assumptions for the next one.

"It's something I'm seeing a lot of companies do," Alexander said. "It's an important part of the equation because the data you collect allows you to create models to improve your planning. It's all about planning, and then it's all about flexibility. Not just what's the plan when you roll out, but what are your contingencies for how you flex when things happen?"