 
What Makes “Carrier Grade” Reliability So Hard?

I recently talked about the need for full Carrier Grade reliability in telecom networks. Today’s telecom networks are incredibly reliable. Over decades, service providers have engineered an extensive range of sophisticated features into their networks, to the point where they guarantee “six-nines” reliability. That means the network is guaranteed to be up 99.9999% of the time, implying a downtime of no more than 32 seconds per year.
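The arithmetic behind the "nines" is easy to check. The short Python sketch below converts an availability expressed as a number of nines into a yearly downtime budget. The function name and the choice of a 365.25-day year are assumptions made for illustration, not part of any telecom standard.

```python
# Sketch: convert "N nines" of availability into a yearly downtime budget.
# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_seconds_per_year(nines: int) -> float:
    """Return the maximum downtime per year allowed by N nines of availability."""
    unavailability = 10.0 ** (-nines)  # e.g. six nines -> 0.000001
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    for nines in (3, 6):
        seconds = downtime_seconds_per_year(nines)
        print(f"{nines} nines: {seconds:.1f} s/year ({seconds / 3600:.2f} h/year)")
```

Six nines works out to a little under 32 seconds per year, while three nines allows several hours.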
In discussing what it means to deliver Carrier Grade reliability, we listed just a few examples of what needs to be provided to meet telecom Carrier Grade system requirements. These broadly fall into four categories: network availability, security, performance and management.
Meeting these requirements represents a critical business challenge for telecom service providers as they refine their plans to progressively introduce Network Functions Virtualization (NFV) into their networks. They know that they need to continue to meet expectations for reliability as they transition to NFV; otherwise they run the risk of losing their high-value customers and seeing increased subscriber churn. That would seriously impact their ability to reduce OPEX and increase subscriber revenues, which are after all the core business objectives behind the NFV initiative.
Unfortunately for both service providers and Telecom Equipment Manufacturers (TEMs), it’s extremely difficult to develop a network infrastructure platform that delivers Carrier Grade reliability. And NFV makes this even harder, because so many software elements within the platform all have to achieve six-nines. These include not only the Operating System itself (e.g. Linux) but also the hypervisor, virtual switch, orchestrator (e.g. OpenStack) and middleware. There’s no way to meet the six-nines goal by using solutions designed for enterprise-class IT applications. You have to start from scratch, developing a platform specifically for this requirement and designing-in the reliability features from the start. This requires not only a major engineering investment but also an in-depth technical understanding of the complex challenges that are involved.
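To see why a stack of software elements makes the target harder, consider a rough model in which the service is up only when every layer is up and the layers fail independently. Both of those are simplifying assumptions; real platforms use redundancy, which changes the math. The Python sketch below, with hypothetical function names, multiplies the per-layer availabilities to get the availability of the whole stack.

```python
import math

# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def series_availability(availabilities):
    """Availability of a stack where every layer must be up (independent failures assumed)."""
    return math.prod(availabilities)


def per_component_target(system_target, n_components):
    """Availability each of n equal layers needs so the whole stack meets system_target."""
    return system_target ** (1.0 / n_components)


if __name__ == "__main__":
    # Illustrative stack: OS, hypervisor, virtual switch, orchestrator, middleware,
    # each assumed to deliver six nines on its own.
    stack = [0.999999] * 5
    combined = series_availability(stack)
    print(f"combined availability: {combined:.7f}")
    print(f"yearly downtime: {(1 - combined) * SECONDS_PER_YEAR:.1f} s")
    print(f"per-layer target for six nines overall: {per_component_target(0.999999, 5):.8f}")
```

Under these assumptions, five layers that each hit six nines still leave the platform as a whole short of six nines, so each layer has to do better than the headline figure.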
Let’s review some of the key technologies that are needed in order to achieve this all-important six-nines reliability in telecom infrastructure:
Looking first at network availability for virtualized applications, the platform requires an optimized hypervisor that minimizes the duration of outages during the live migration of Virtual Machines (VMs). The standard implementation of KVM, for example, doesn’t provide the response time required to minimize downtime during orchestration operations for power management, software upgrades, or reliability spare reconfiguration. In order to respond to failures of physical or virtual elements within the platform, the management software must be able to detect failed controllers, hosts or VMs very quickly and implement hot data synchronization, so that no calls are dropped or data lost when failovers occur. The system must automatically act to recover failed components and to restore sparing capability if that has been degraded. To do this, the platform must provide a full range of Carrier Grade availability APIs (hot sync, VM monitoring etc.), compatible with the needs of the OSS and orchestration systems and VNFs deployed by the service provider. The software design must ensure there is no single point of failure that can bring down a network component, nor any “silent” VM failures that can go undetected.
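As a rough illustration of the failure detection described above, here is a minimal heartbeat monitor in Python. It is only a sketch of the general technique, not the API of any particular platform: the class name, the VM names and the 0.5-second timeout are all hypothetical. Each monitored element reports in periodically, and anything that stays quiet for longer than the timeout is flagged, which is one simple way to catch otherwise “silent” failures.

```python
import time


class HeartbeatMonitor:
    """Flags monitored elements whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical detection threshold, in seconds
        self.clock = clock          # injectable clock so the logic can be tested
        self.last_seen = {}

    def heartbeat(self, name):
        """Record that the element called `name` is alive right now."""
        self.last_seen[name] = self.clock()

    def failed(self):
        """Return the sorted names of elements that have been silent for too long."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    fake_time = [0.0]  # simulated clock, so the demo is deterministic
    monitor = HeartbeatMonitor(timeout_s=0.5, clock=lambda: fake_time[0])
    monitor.heartbeat("vm-a")
    monitor.heartbeat("vm-b")
    fake_time[0] = 0.4
    monitor.heartbeat("vm-a")  # vm-a keeps reporting, vm-b goes silent
    fake_time[0] = 0.7
    print("failed elements:", monitor.failed())
```

A real platform would follow detection with the recovery steps the paragraph describes, such as failing over to a synchronized spare and restoring sparing capacity.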
Network security requirements present major challenges for telecom infrastructure. Carrier Grade security can’t be implemented as a collection of bolt-on enhancements to enterprise-class software. Rather, it must be designed-in from the start as a set of coordinated, fully-embedded features. These features include: full protection for the program store and hypervisor; AAA (Authentication, Authorization and Accounting) security for the configuration and control point; rate limiting, overload and Denial-of-Service (DoS) protection to secure critical network and inter-VM connectivity; encryption and localization of tenant data; secure, isolated VM networks; secure password management; and the prevention of OpenStack component spoofing.
A Carrier Grade network has stringent performance requirements, in terms of both throughput and latency. In an NFV architecture, the host virtual switch (vSwitch) must deliver high bandwidth to the guest VMs over secure tunnels. At the same time, the processor resources used by the vSwitch must be minimized, because service providers derive revenue from resources used to run services and applications, not those consumed by switching. The data plane processing functions running in the VMs must be accelerated to maximize the revenue-generating payload per Watt. In terms of latency constraints, the platform must ensure a deterministic interrupt latency of 10 microseconds or less, in order for virtualization to be feasible for the most demanding CPE and access functions. Finally, live migration of VMs must occur with an outage time of less than 150 ms, using a “share nothing” model in which all of a subscriber’s data and state are transferred as part of the migration. The “share nothing” model, used in preference to the shared storage model common in enterprise software, ensures that legacy applications are fully supported without needing to be rewritten for deployment in NFV.
Finally, key capabilities must be provided for network management. To eliminate the need for planned maintenance downtime windows, the system must support hitless software upgrades and hitless patches. The backup and recovery system must be fully integrated with the platform software. Support must also be implemented for “Northbound” APIs that interface the infrastructure platform to the OSS/BSS and NFV orchestration software, including SNMP, NETCONF, XML, REST APIs, OpenStack plug-ins and ACPI.
You can’t achieve these challenging requirements by starting from enterprise-class software that was originally developed for IT applications. This type of software usually achieves three-nines reliability, equivalent to a downtime of almost nine hours per year.
From a development perspective, addressing these requirements implies that the system should be able to meet the rigor of the TL9000 certification process. When evaluating technologies, it will be important to consider offerings engineered by telecom experts, such as the Wind River Carrier Grade Communications Server, that guarantee the Carrier Grade levels of reliability demanded by both service providers and TEMs.
Blog post originally published on RCR Wireless: http://www.rcrwireless.com/article/20140512/opinion/reader-forum-makes-carrier-grade-reliability-hard/