Why can't anyone launch an online service without outages?

PS4 was just the latest big launch tainted by network issues; Akamai's chief gaming strategist explains why rollouts still go wrong

Last Friday, Sony launched the PlayStation 4 with a required day-one patch, and the PlayStation Network promptly buckled beneath the resulting rush of new owners trying to get their systems up and running. It has become a common scene in gaming, especially when anticipated new releases require a constant online connection, as with the Diablo III and SimCity launch debacles. And with Xbox One debuting this Friday and requiring its own firmware update right out of the box, Microsoft will become the latest company to find out whether its online infrastructure can withstand a new product rollout.

Surely these are companies that know their business, that have plenty of experience dealing with online services, and that are well aware those services can crumble under the weight of a million gamers. So why does it keep happening? Why haven't they learned how to avoid these black eyes? Will they ever figure it out, or are these embarrassing launch outages just a fact of life now? Earlier this week, GamesIndustry International spoke with Kris Alexander, chief strategist for connected devices and gaming at cloud computing firm Akamai, to get some answers.

Alexander said part of the problem arises from the demands of modern video games. To combine responsive gameplay with the sprawling set of services currently on offer, some tasks need to be handled on the user's console, while others are offloaded to remote servers. For those to operate in concert, there needs to be a solid connection between the two.

"Typically the bottlenecks are at the origin, where the server is, and then the last mile, where the clients are," Alexander said. "There's a lot of complexity to the architecture, how you put something like this together and plan for it. You really need to plan for this at the onset. The old frameworks of developing an app and then deciding to make it interactive in multiplayer [doesn't work]. You can't add it on. It's not as if you add it in during the development cycle; you have to plan for it at the beginning. That's one of the most important aspects because some components scale well, while others scale not as well."

Unfortunately, a company can understand that problem and plan for elastically scaling its computing capabilities all it wants, but as Alexander noted, "scaling is only as good as the algorithm behind it."

"You can plan to scale and have the best laid plans, the best intentions, but it's more than just saying, 'Hey, I've got this pool of resources,'" Alexander said. "That pool of resources has to be distributed in some way for users coming in from different areas, not just geographies but different networks. If you all of a sudden have a bunch of users coming over AT&T versus Comcast versus Verizon, and the interconnects to wherever your servers are get overloaded in terms of peering, then that peering might get re-routed around. You haven't planned for that, the ISPs haven't planned for it, and you can run into problems."
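Alexander's point about a pool of resources can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the ISP names, the capacity and demand figures, and the helper functions are hypothetical and do not describe any real network. It only shows how a pool can have spare capacity in aggregate while one interconnect is still overloaded.

```python
from dataclasses import dataclass


@dataclass
class Interconnect:
    """A hypothetical peering link between one ISP and the server origin."""
    name: str
    capacity_gbps: float  # what the link can carry
    demand_gbps: float    # what launch-day users on that ISP are asking for


def overloaded(links):
    """Return the names of links whose demand exceeds their capacity."""
    return [link.name for link in links if link.demand_gbps > link.capacity_gbps]


def total_headroom(links):
    """Aggregate spare capacity across the whole pool (can hide local overloads)."""
    return sum(l.capacity_gbps for l in links) - sum(l.demand_gbps for l in links)


# Made-up launch-day numbers: the pool as a whole has room to spare,
# but the users are not spread evenly across networks.
links = [
    Interconnect("isp_a", capacity_gbps=100, demand_gbps=140),
    Interconnect("isp_b", capacity_gbps=100, demand_gbps=60),
    Interconnect("isp_c", capacity_gbps=100, demand_gbps=50),
]

hot_links = overloaded(links)
headroom = total_headroom(links)
print(f"aggregate headroom: {headroom} Gbps, overloaded links: {hot_links}")
```

In this toy case the pool is large enough on paper, yet one link is saturated, which is the kind of situation where traffic gets re-routed in ways nobody planned for.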

No matter how good the planning and preparation are, the key then becomes adaptability. But even when the company has its own business well in hand, Alexander said there's only so much it can do to mitigate problems on the user side, that "last mile" bottleneck he alluded to.

"There are actually more issues in the last mile than people realize," Alexander said. "There's a major issue going on which I think is going to get significantly worse before it gets better. Every app on every device just in your home has no concept that there are other apps or devices that need resources. They all act like they're the only ones who need resources, so when a new app fires up asking for resources in terms of connectivity, it's going to ask for a lot. And what happens is that degrades everything else on your network. Because of this, it creates all kinds of fluctuations in terms of availability of resources. So whether it's your PC, your PlayStation, or your Xbox firing up, it's fighting with a bunch of other things in the house that are constantly asking for resources."
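The contention Alexander describes can be sketched with a toy model. The device names and bandwidth figures below are invented, and treating uncoordinated traffic as a split proportional to what each device asks for is a deliberate simplification, not a description of how real home networks or TCP behave. For comparison, the sketch also computes a max-min fair allocation, a standard textbook scheme in which small demands are satisfied first and the leftover capacity goes to the heaviest user.

```python
def proportional_share(capacity, demands):
    """Crude model of uncoordinated apps: each gets capacity in proportion to its ask."""
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    return {name: capacity * d / total for name, d in demands.items()}


def max_min_fair(capacity, demands):
    """Max-min fair allocation: satisfy small demands fully, split the rest evenly."""
    alloc = {}
    remaining = dict(demands)
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        satisfied = {n: d for n, d in remaining.items() if d <= share}
        if not satisfied:
            # Nobody left fits under an equal share, so everyone gets exactly that share.
            for name in remaining:
                alloc[name] = share
            break
        for name, demand in satisfied.items():
            alloc[name] = demand
            cap -= demand
            del remaining[name]
    return alloc


# Hypothetical household on a 20 Mbps link (all figures are made up).
demands = {"console_patch": 50, "video_stream": 5, "voice_chat": 1}

greedy = proportional_share(20, demands)
fair = max_min_fair(20, demands)
for name in demands:
    print(f"{name}: uncoordinated {greedy[name]:.2f} Mbps, max-min fair {fair[name]:.2f} Mbps")
```

Under the uncoordinated split, the console download crowds out the video stream and voice chat. Under the fair split, both smaller flows get what they asked for and the download takes whatever is left.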

Neither that problem--Alexander called it "contention"--nor the elasticity problems can be completely solved just by throwing money at them. And even though Akamai has some services that can help companies solve certain parts of this problem, Alexander doesn't pretend that the company has all the answers.

"I think we're always going to see them at some level or another," Alexander said. "If there was a perfect model, someone would be making a lot of money off it right now. But there isn't. There's just too many variables in terms of what end users might actually do and the conditions of those end users on the devices they're coming in on."

While these sorts of launch snafus are likely to continue happening, Alexander said the industry is still learning from each misstep thanks to analytics. Even on a botched launch, the data collected gives companies information to help them make better decisions and assumptions for the next one.

"It's something I'm seeing a lot of companies do," Alexander said. "It's an important part of the equation because the data you collect allows you to create models to improve your planning. It's all about planning, and then it's all about flexibility. Not just what's the plan when you roll out, but what are your contingencies for how you flex when things happen?"

Latest comments (7)

Jeff Kleist Writer, Marketing, Licensing 8 years ago
It can be summed up thusly:

Developer: "We need XXX servers spread out over these geographical zones with YYYY bandwidth to ensure a smooth rollout."

Money people: "That's way too expensive! Surely you can get by with half that."

Developer: "Nope, really can't."

A middle is reached, glitches and overload ensue. All parties agree that this totally won't happen next time, till it does.
Jeffrey Kesselman Professor - Game Development, Daniel Webster College 8 years ago
To answer your question: People do. I've been the CTO of 3 companies that have smoothly launched 5 massively multiplayer online game sites. One of those was the second fastest growing game on Facebook shortly after launch.

The biggest problem is that people seem to think putting up a highly scalable and secure server system is easy and it is anything but. Too many people hire cowboy kids who think they can build everything in a high-level tool like Rails and have no real scalability or security experience.

Then, when the site fails, everyone says "why did this happen?" Just like a bridge that falls down, it happened because you didn't pay the experts to do it right. Now you'll spend a lot MORE money trying to recover from your success disaster.
Jeffrey Kesselman Professor - Game Development, Daniel Webster College 8 years ago
In re resources and the above comment: That might have been true at one time.

But these days you can use an on-demand host like AWS, Joyent, Azure, etc.

Insufficient resources is not an excuse any more. Today, server systems fail under load primarily because of insufficient engineering (which includes insufficient load testing and staging).

Edited 2 times. Last edit by Jeffrey Kesselman on 20th November 2013 9:56pm

Jeffrey Kesselman Professor - Game Development, Daniel Webster College 8 years ago
By the way... I learned much of what I know about launching such services from brilliant people at the FIRST online internet game service for package games... the Total Entertainment Network or TEN.

And TEN launched completely smoothly more than a decade ago.

Edited 1 times. Last edit by Jeffrey Kesselman on 21st November 2013 1:07am

Rick Lopez Illustrator, Graphic Designer 8 years ago
I'm pretty certain these network issues will be fixed. At least the hardware has worked just fine for me. I'm just a bit disappointed with all the glitches in Battlefield 4, and many of them are not network related. I've lost my game progress and multiplayer level-ups over 3 times now. It's so bad I've put the game down until a patch fixes these issues. Frankly, I don't blame the BF4 glitches all on Sony. This is why I don't mind if a game is delayed, as is the case with Watch Dogs and Second Son. It just makes for a better game in the end. And like I said, these network kinks haven't really broken the fun I've been having with the PS4. The party chat works just fine. The browser is as fluid and seamless as Sony promised, and I'm able to browse between games, apps and media on the fly. Sharing content is also pretty well executed, as is game streaming through Twitch. There are a few kinks in terms of network and software, but I'm assuming these things will be corrected. The hardware is working just fine, mobile device integration and all.
Joe Tay Senior Architect - Infra and Ops, Electronic Arts 8 years ago
Thank you, Alexander! Finally a voice of reason in the midst of the cacophony of discontent. And yet there is so much more to it than just those issues. Speaking in my personal capacity, I don't believe any company out there would want to launch a game or online service knowing it will fail.
Here's hoping for a smoother ride on Xbox One.

Edited 1 times. Last edit by Joe Tay on 21st November 2013 7:59am

Roman Margold Rendering Software Engineer, Sucker Punch Productions 8 years ago
@JeffreyK: I would have believed your reasoning 10 years ago, but not now. I don't believe these infrastructures in companies like Sony or Blizzard are built by kids these days. There are tons of experience behind them. For me the only plausible explanation of the issues that we see with these big launches is the one Sony itself suggested by warning customers and offering them the update download *before* the console shipped (the best solution, if applicable, in my opinion): they just don't want to spend resources (whether on scaling capabilities or actual network elements) on something that would get used for one or two weeks during the launch period. Once the initial wave of requests is gone and everything recovers, everybody's happy and people forget (the level of forgetfulness depends on the scale of the launch problems).
In layman's terms: "We predict we'll need 200 servers, we only have 100, so we get an extra 50 and then hope for the best. The odds that it won't be a complete disaster are pretty good, and if the demand is much higher than what we anticipated then we still win (although PR will have to work longer hours)."
It's perfectly valid pragmatic reasoning, which only has the one drawback that, if applied regularly, a percentage of people will stop pre-ordering and will wait it out (already happening). It's really always just a matter of weeks (at most) though, so it's a reasonable choice for many and, in turn, it makes the resources needed for launch smaller next time (or the ride smoother for early adopters).