Heiko Specht About the Author

Interested in the factors that relate to performance monitoring, on both a large and a small scale. My hobby is music, hence my small blog (see links). Photography. My family. @Heispe

Major Internet Outage in China

Yesterday one of the biggest outages in history, if not the biggest, hit the Internet in China.

The outage primarily and directly affected most people living in China and browsing the Internet there. Secondarily, it affected all companies doing online business in China.

The reasons for the outage are discussed below but I would like to focus on what has happened and what this means for all of us – including Compuware – with our Web presence in China.

For whatever reason, at around 3pm (China time) about two thirds of all domain requests in China were routed to one single IP address.

That single data center of course went down immediately under the load of requests hammering in within milliseconds. At that very moment the Internet in China went down, with a few exceptions (VPN users and those whose client still had the DNS entries cached).

The outage lasted for more than 8 hours, during core business hours in China.

Interestingly enough, not all domains were affected. Mainly those ending in .com and .net didn’t make it. Others ending in com.cn were “only” partly offline but still had problems. Adobe was technically not reachable. Nokia.com was available in parts of China, but part of its content was directed to the suspect IP.

Even if a domain was reachable because the user’s browser still had the domain/IP mapping cached, other issues appeared.

I mentioned Nokia.com just a second ago. Nokia.com itself was available, but the page was broken and loaded very slowly. The reason: the domain r.nokia.com was falsely directed to the one and only IP that everything got directed to.

 

Fig 1. The HTML of Nokia.com was loaded, but no CSS or JS. That made the page unusable.

You can see how everything got routed to the one IP address: 6 connections for just one single host name. Imagine how many requests hammered that server when you consider that hundreds of millions of Chinese users were opening webpages that each include 21 different host names (the average number of hosts included in a webpage, source: HTTParchive.com). This incident can also be called the biggest DDoS attack in history.
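To make this concrete, here is a minimal sketch of how one might check which of a page's host names currently resolve to the suspect IP (65.49.2.178, the address mentioned further down in this post). The function names `resolve_ipv4` and `poisoned_hosts` are made up for illustration, and the result depends entirely on which resolver and cache the machine running it uses.

```python
import socket

SUSPECT_IP = "65.49.2.178"  # the IP referred to in this post


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver reports for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # Name could not be resolved at all.
        return set()
    # Each entry is (family, type, proto, canonname, (address, port)).
    return {info[4][0] for info in infos}


def poisoned_hosts(hostnames, suspect_ip=SUSPECT_IP, resolver=resolve_ipv4):
    """Return the host names whose resolved addresses include suspect_ip."""
    return [host for host in hostnames if suspect_ip in resolver(host)]
```

Calling `poisoned_hosts(["www.nokia.com", "r.nokia.com"])` from an affected network during the outage would presumably have returned the second name only. The `resolver` parameter is there so that the check can be tried out with a fake lookup table instead of live DNS.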

Now with that said – what else has been damaged?

One of the few pages that were still (partly) available in China was Carrefour. Unfortunately, that page includes google-analytics.com, which was not available.

Fig 2: Google Analytics didn’t work, also hitting the one and only IP. Luckily this request did not block the page from loading.

A browser tries to establish a connection to a host/domain for about 21 seconds; after that it drops the attempt. So if your page included 1 domain that could not be connected to, your user’s browser kept trying for those 21 seconds. If it included 2, 3 or more such domains, you were probably technically available, but your page was not usable.
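The arithmetic behind that can be sketched in a few lines. This is only a rough upper bound under stated assumptions: the 21 seconds is the figure quoted above (the real value depends on the operating system and browser), and the function name `worst_case_stall` and its `parallel` parameter are hypothetical. Real browsers open several connections in parallel, and only blocking resources actually hold up the page.

```python
CONNECT_TIMEOUT_S = 21  # figure quoted in the post; actual value varies by OS and browser


def worst_case_stall(unreachable_hosts, timeout_s=CONNECT_TIMEOUT_S, parallel=1):
    """Upper bound, in seconds, spent waiting on connection attempts that never succeed.

    unreachable_hosts: number of host names on the page that cannot be connected to.
    parallel: how many of those attempts are assumed to run at the same time.
    """
    batches = -(-unreachable_hosts // parallel)  # ceiling division
    return batches * timeout_s
```

With three unreachable hosts tried one after another, `worst_case_stall(3)` gives 63 seconds. If all three are attempted in parallel, `worst_case_stall(3, parallel=6)` gives 21 seconds, which is still far longer than most users will wait.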

 

Fig 3: a very short excerpt of what we have captured with our monitoring solution

The interesting aspect of the graph above is how suddenly the issue appeared and how suddenly it stopped. Before 3pm (China time) no request went to that IP, but all of a sudden a vast number of domains got directed to it.

Since we do not know how many users had the IP cached in their DNS cache, we can only say that the numbers delivered by Google Analytics for China today belong in the trash bin. And with the analytics in the trash bin, it is impossible to estimate the impact of that outage on our business. However, there is a Real User Monitoring solution in place which works from within the delivered and working domain name.

Another business case is SSL. If the domain was reachable for some users and parts of your webpage used SSL encryption, your page probably did not work either. Most current browsers check whether the certificate in use is valid (OCSP), and this check runs against other domain names. These are typically of the form ocsp.certificateorigin.com. Some of those domains were not reachable either, so users got the browser’s warning screen saying that the page’s trustworthiness is in question.

 

Fig. 4: All of a sudden, at 3pm in China, Verisign was offline, again routing to the IP… After 5 long hours the first requests went through to “real” IPs again.

The picture above indicates another issue caused by the false domain routing: the client’s cache of the domain/IP mapping. The client does not care whether the mapping worked or not. It remains in the browser for at least 8 hours. So even if the issue had been fixed within only one hour, the pages the user tried during that hour would not work for another 7 hours, unless the user knows how to clear the DNS cache. Would you know how to clean it up?
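For reference, here is a small sketch that returns the commonly documented commands for flushing the operating system's DNS cache. The helper name `dns_flush_commands` is made up for illustration, and the commands are assumptions about a typical setup: they can differ between OS versions, usually need administrator rights, and the Linux entry only applies where systemd-resolved is in use.

```python
import platform

# Commonly documented commands; they may differ by OS version and usually need admin rights.
FLUSH_COMMANDS = {
    "Windows": [["ipconfig", "/flushdns"]],
    "Darwin": [["dscacheutil", "-flushcache"], ["killall", "-HUP", "mDNSResponder"]],
    "Linux": [["resolvectl", "flush-caches"]],  # only when systemd-resolved is in use
}


def dns_flush_commands(system=None):
    """Return the commands that flush the OS DNS cache for the given platform name.

    system defaults to platform.system(); unknown platforms yield an empty list.
    The commands are only returned, never executed.
    """
    system = system or platform.system()
    return FLUSH_COMMANDS.get(system, [])
```

Note that this only covers the operating system's cache. Browsers keep their own DNS cache as well, so restarting the browser is usually part of the clean-up.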

Summary:

Nearly every Chinese Internet user was affected by the outage. Nearly every company doing online business in China was affected by it.

It is interesting how sensitively the Internet reacts to a tiny DNS accident, how long it took to recover and, most importantly, how quiet everything was outside of China. The web was nearly unavailable to one of the strongest and fastest-growing economies for one business day.

That is probably because only very few companies can estimate the impact on their business: those that use a Real User Monitoring solution that does not store its data under a different domain name.

If a page was available during these issues, it was probably still affected because one of its embedded third-party contents failed to deliver.

The vast majority of the big international Internet players were not available in China, and of course Chinese pages were affected too (those, like Baidu, are not covered here, but the press wrote about them).

While the international press seemed to take the outage very calmly (because of a lack of knowledge?), the Chinese population went mad.

(Just do a Google Search for this IP: 65.49.2.178 and use the time limiter set to last 24h….)

This issue was important not only because Internet business was disrupted for 8 hours, which is bad enough, but also because it once again strains the trust of Chinese people in the Internet.

I suspect that it will take weeks to recover completely from this internet earthquake.

Comments

  1. It will not take weeks to recover. I am in China now and experienced the outage that day. Everything was back to normal by nighttime.
    Technically speaking, the effect will last no more than 12 hours due to the DNS cache strategy.

  2. Hi Andre,

    technically everything is in good shape (must I say temporarily?) again.
    But from a business perspective and in terms of people’s mindset it will take weeks. This outage caused a lot of companies to think twice about how to prevent issues like this from impacting their business. Have you noticed the rumble the outage provoked in the Chinese blog scene? I don’t think that full trust was back after the couple of hours of the outage. This will take weeks.

  3. I live here, and I never even noticed, to be honest. We do our own DNS with unbound caching + dnssec validation though, so it’s likely it never passed the sniff test at our outbound routing. As for GA – Google Analytics is useless in China anyway, so it won’t make any difference – it’s blocked most of the time anyway.

  4. Hi Lawrence,

    thanks for sharing your experience here. There is a good reason why I have said “most” of the People in China.
    From what we see (outageanalyzer.com) there are rarely issues with Google Analytics in China. So for western companies it is not unimportant to run analytics with GA in China, and they certainly must know whether they can trust those analytics, because they often base business decisions on how their users move through their online offering.
