Outages Can Happen to Any System

Over the past two weeks, there have been several web site outages that received varying amounts of attention – some of the more widespread outages were:

Microsoft Outlook.com, 8/14 – 8/17 – Microsoft describes the problem as due to a “failure in a caching service… caused these devices to receive an error and continuously try to connect to our service… We have learned from this incident, and have made two key changes to harden our systems against future failure – one that involved increasing network bandwidth in the affected part of the system, and one that involved changing the way error handling is done for devices using Exchange ActiveSync.”

Google, 8/16, 1-5 minutes – Google did not state a cause.

NASDAQ, 8/22, 3 hours – The NASDAQ press release on the outage attributed the problem to “connectivity issues”, and said that the “cause of the issue has been identified and addressed”.

eBay, 8/23, 6 hours – eBay attributed the outage to “technical issues which occurred during regularly scheduled maintenance”.

AWS, US-East-1, 8/25, 51 minutes – The AWS status dashboard posted informational messages describing a problem with a faulty network device.

The internet in China, 8/25 – Domains ending in .cn were subject to a DDoS attack, with outages ranging from 2 to 13 hours. The source of the attack has not yet been tracked down.

The immediate causes of these outages, where they were provided, ranged from software problems, to network problems, to hardware problems, to malicious intent. Some of these issues might have been foreseeable – but hardware can fail at any time, even experienced admins can make mistakes in configuring systems, and users can behave in ways that no rational developer would ever consider, with or without malicious intent.

The next question is – just how much of a problem is each of these categories? Is it more likely that your server will fail, or that you will be subject to a DDoS attack? Yes, you still need to monitor for hardware failures, but if they are significantly less likely than being hacked, shouldn’t more effort be put into monitoring for attacks?

The Annual Incident Reports 2012 from ENISA (European Network and Information Security Agency) looked at “significant incidents in the electronic communications sector” in Europe, and categorized them by cause, duration, and number of user connections affected, in thousands (both mobile and wired users). The breakdown of outages for 2012 was:

Cause                                   % of Incidents   Duration (Hours)   Connections Affected (Thousands)
System Failure (Hardware or Software)   76               9                  2330
Third Party Failure                     13               13                 2808
Malicious Actions                       8                4                  1528
Natural Phenomena                       6                36                 557
Human Error                             5                26                 447

(Incident percentages add to >100 because some incidents had multiple causes.)

Among the conclusions of the report:

System failures were the most common cause of outages, with switch problems being the most common hardware failure.
Third party failures were primarily due to power failures.
Outages due to natural phenomena took the longest to resolve.
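
As a rough check, the rankings behind these conclusions can be reproduced from the table itself. The sketch below copies the figures from the table above; the names INCIDENTS and top_cause are made up for this illustration and do not come from the ENISA report.

```python
# Figures copied from the ENISA 2012 table above.
# Tuple layout: (cause, % of incidents, duration in hours,
#                connections affected in thousands)
INCIDENTS = [
    ("System Failure (Hardware or Software)", 76, 9, 2330),
    ("Third Party Failure", 13, 13, 2808),
    ("Malicious Actions", 8, 4, 1528),
    ("Natural Phenomena", 6, 36, 557),
    ("Human Error", 5, 26, 447),
]


def top_cause(column):
    """Return the cause with the largest value in column index 1, 2 or 3."""
    return max(INCIDENTS, key=lambda row: row[column])[0]


most_common = top_cause(1)   # largest share of incidents
longest = top_cause(2)       # longest duration
widest = top_cause(3)        # most connections affected

# The percentages overlap because some incidents had multiple causes.
percent_total = sum(row[1] for row in INCIDENTS)

print(most_common, longest, widest, percent_total, sep=" | ")
```

Ranking by a single column is deliberately simple. Combining the columns into one "impact" score would require assumptions that the table does not support.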

Given the high percentage of system failures as the cause of outage incidents, monitoring your servers, network and applications should be near the top of your list of priorities for troubleshooting problems. It is still important to guard against hackers, and to track all network and system changes to try to limit admin error. But when a problem occurs, and the creeks aren’t rising and the lights aren’t flickering, the first place to look should be system failure.
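
As a minimal sketch of what the simplest form of such monitoring might look like, the function below checks whether a TCP port on a server accepts connections. The function name tcp_port_is_open and the default timeout are assumptions for illustration; a real monitoring setup would also look at hardware health, application responses, and trends over time.

```python
import socket


def tcp_port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and attempts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, tcp_port_is_open("www.example.com", 443) reports whether that host is accepting HTTPS connections at the moment of the check. A failed check says only that the service is unreachable, not why.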

The post Outages Can Happen to Any System appeared first on Heroix Blog.
