
Outages Can Happen to Any System

Over the past two weeks, there have been several web site outages that received varying amounts of attention – some of the more widespread outages were:
  • Microsoft Outlook.com, 8/14 – 8/17 – Microsoft describes the problem as due to a “failure in a caching service… caused these devices to receive an error and continuously try to connect to our service… We have learned from this incident, and have made two key changes to harden our systems against future failure – one that involved increasing network bandwidth in the affected part of the system, and one that involved changing the way error handling is done for devices using Exchange ActiveSync.”
  • Google, 8/16, 1-5 minutes – Google did not state a cause.
  • NASDAQ, 8/22, 3 hours – The NASDAQ press release on the outage attributed the problem to “connectivity issues”, and said that the “cause of the issue has been identified and addressed”.
  • eBay, 8/23, 6 hours – attributed to “technical issues which occurred during regularly scheduled maintenance”.
  • AWS, US-East-1, 8/25, 51 minutes – The AWS status dashboard posted informational messages citing a faulty network device.
  • The internet in China, 8/25 – Domains ending in .cn were subject to a DDoS attack, with outages ranging from 2 to 13 hours.  The source of the attack has not yet been tracked down.

The immediate causes of these outages, where they were provided, ranged from software problems, to network problems, to hardware problems, to malicious intent.  Some of these issues might have been foreseeable – but hardware can fail at any time, even experienced admins can make mistakes in configuring systems, and users can behave in ways that no rational developer would ever consider, with or without malicious intent.

The next question is – just how much of a problem is each of these categories?  Is it more likely that a server will fail, or that we will be subject to a DDoS attack?  Yes, you still need to monitor for hardware failures, but if they are significantly less likely than an attack, shouldn’t more effort be put into monitoring for attacks?

ENISA’s (European Network and Information Security Agency) Annual Incident Reports 2012 looked at “significant incidents in the electronic communications sector” in Europe, and categorized them by cause, duration, and thousands of users affected (both mobile and wired users).  The breakdown of outages for 2012 was:

Cause                                   % Incidents   Duration (Hours)   # Connections affected (Thousands)
System Failure (Hardware or Software)   76            9                  2330
Third Party Failure                     13            13                 2808
Malicious Actions                       8             4                  1528
Natural Phenomena                       6             36                 557
Human Error                             5             26                 447

(Incident percentages add to >100 because some incidents had multiple causes.)
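
As a rough way to compare the categories, the figures above can be combined in a few lines of Python. This is only a sketch: the `impact_index` function is a hypothetical measure invented here for illustration (share of incidents × hours × thousands of connections), not something the ENISA report defines, and it assumes the duration and connection figures can be treated as typical per-incident values.

```python
# Figures copied from the ENISA 2012 breakdown above:
# (cause, % of incidents, duration in hours, connections affected in thousands)
ENISA_2012 = [
    ("System Failure (Hardware or Software)", 76, 9, 2330),
    ("Third Party Failure", 13, 13, 2808),
    ("Malicious Actions", 8, 4, 1528),
    ("Natural Phenomena", 6, 36, 557),
    ("Human Error", 5, 26, 447),
]


def total_percentage(rows):
    """Sum the incident percentages; a total above 100 means causes overlap."""
    return sum(pct for _, pct, _, _ in rows)


def impact_index(pct, hours, connections_k):
    """Hypothetical index: share of incidents x hours x thousands of connections.

    This is an illustrative assumption, not a metric from the ENISA report.
    """
    return pct * hours * connections_k


def rank_by_impact(rows):
    """Return (cause, index) pairs, largest index first."""
    scored = [
        (cause, impact_index(pct, hours, conns))
        for cause, pct, hours, conns in rows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(f"Sum of incident percentages: {total_percentage(ENISA_2012)}")
    for cause, score in rank_by_impact(ENISA_2012):
        print(f"{cause:<40} {score:>10,}")
```

Under this assumed index, system failure ranks first by a wide margin and malicious actions rank last. A different weighting could reorder the middle categories, so treat the ranking as illustrative only.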

Among the conclusions of the report:

  • System failures were the most common cause of outages, with switch problems being the most common hardware failure.
  • Third party failures were primarily due to power failures.
  • Outages due to natural phenomena take the longest to resolve.

Given the high percentage of system failures among outage incidents, monitoring your servers, network and applications should be near the top of your list of priorities for troubleshooting problems.  It is still important to guard against hackers, and to track all network and system changes to try to limit admin error, but when a problem occurs, if the creeks aren’t rising and the lights aren’t flickering, the first place to look should be system failure.

The post Outages Can Happen to Any System appeared first on Heroix Blog.

