When plan B needs a plan C
Just a few days ago, news began circulating on the internet of a strange problem with Cisco routers, one that was easy to identify because it caused complete failure. Shortly afterwards, Cisco issued a notice for specific products which may still be functioning normally but could fail without warning after approximately 18 months of operation. A list of affected devices has since been published online. But that’s far from the full story. It is now clear that the error originates in Intel Atom C series processors. A faulty clock signal causes the devices to fail at an accelerated rate, and Cisco is not the only affected manufacturer. These Atom processors are used by several other manufacturers, including Juniper and Synology. Their customers now have to decide whether to replace hardware preemptively, sit back and hope for the best, or vent their frustration at the manufacturers.
Due to the nature of the affected devices, the bug can lead to unusual predicaments. If routers, switches or VPN gateways are part of critical redundant infrastructure, the standby hardware is most likely an identical product. If the redundant devices fail at the same time as the primary device, companies will quickly realize that plan B wasn’t enough, and many will not have a plan C.
Things can go wrong, and this applies just as much to Intel as to any other manufacturer. Software problems are also common; Microsoft’s routine patch day is a prime example. But when software goes wrong, the cause is in the code. Such errors can be verified, fixed and rolled out as an update with relative ease. The Intel Atom bug, by contrast, leads to bricked hardware. Processors in embedded hardware cannot be replaced easily. Defective devices are only fit for throwing away, or can be repaired only with great effort involving reflow soldering at a specialist facility. For Intel and the device manufacturers, this could have substantial financial consequences. In a telephone conference, Intel’s CFO spoke of increased failure rates for a specific product and of reserves set aside in Q4 2016 to deal with them.
What lessons can be learned from situations like these? Problems of this kind are so rare that no recommendations can be made for specific products or technologies: this time routers were affected, but tomorrow it might be an SSD controller. It is, however, an important reminder that relying too heavily on a single manufacturer may have consequences, even though this is difficult to avoid in today’s markets. In a situation like this, having a plan C is advisable. Perhaps it is time to test critical IT infrastructure and consider alternatives.