...
1. Introduction – why telecom industry needs Cloud Native PaaS
...
The O&M interface, data model, data generation, etc. should be pre-developed based on PaaS Management requirements during the design and development stage of a PaaS Service.
5. High Availability
5.1 Overview of High Availability
High availability is needed to ensure service availability and continuity. Enabling mechanisms are distributed throughout the system, at all levels, and the redundancy of the hardware and software subsystems eliminates single points of failure (SPOFs) that can affect the service. Control mechanisms for these redundant components are distributed to the associated control and redundancy management systems. The control processes should be implemented at the lowest level that has sufficient scope to deal with the underlying failure scenarios, with escalation to higher-level mechanisms when a failure cannot be handled at lower levels. This ensures that the fastest mechanisms are always used, because remediation control loop times generally increase when the decision logic is moved up the stack or into broader control domains.
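The escalation principle described above can be illustrated with a minimal sketch. It assumes a simple ordered list of handlers, from the lowest level to the highest; the level names ("pod", "node", "cluster") and the failure labels are hypothetical and are not defined by this document.

```python
from typing import Callable, List, Optional, Tuple

# A handler is (level name, function). The function returns True if that level
# has sufficient scope to remediate the failure. All names are hypothetical.
Handler = Tuple[str, Callable[[str], bool]]


def remediate(failure: str, handlers: List[Handler]) -> Optional[str]:
    """Try handlers from the lowest level upward and return the level that
    handled the failure, or None if no level could handle it."""
    for level, handle in handlers:
        if handle(failure):
            return level  # stop at the lowest level with sufficient scope
    return None


# Illustrative handler chain, ordered from the lowest to the highest level.
handlers: List[Handler] = [
    ("pod", lambda f: f == "process_crash"),
    ("node", lambda f: f in {"process_crash", "nic_down"}),
    ("cluster", lambda f: f in {"process_crash", "nic_down", "node_down"}),
]
```

In this sketch a process crash is absorbed at the lowest level, while a node failure is escalated until a level with enough scope is reached.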
For redundancy, the target is to decrease the amount of redundant resources required. This implies a move away from dual-redundant structures (typically the 1:1 active-standby schemes prevalent in "box" implementations) towards either an N+1 or an N:1 scheme. This is enabled by the reduction or elimination of direct physical attachment associations through the use of cloud native functions, a move towards micro-service architectures, and the ability to add, remove and relocate service instances dynamically, based on offered load and resource utilization. Infrastructure support for such mechanisms is essential for success, and the new, more dynamic configuration of applications and services has many implications for network services as well as for the whole control, orchestration, and assurance software stack.
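The resource saving behind this shift can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the instance counts are assumptions, and overhead is defined here simply as standby instances divided by active instances.

```python
def standby_overhead(active: int, standby: int) -> float:
    """Return redundant capacity as a fraction of active capacity."""
    if active < 1 or standby < 0:
        raise ValueError("need at least one active and no negative standby count")
    return standby / active


# Hypothetical example with 8 active instances.
one_to_one = standby_overhead(active=8, standby=8)   # 1:1 active-standby
n_plus_one = standby_overhead(active=8, standby=1)   # N+1 with a single spare
```

With these assumed numbers, the 1:1 scheme doubles the resource footprint (overhead of 1.0), while N+1 adds one eighth (0.125).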
The availability and continuity mechanisms use "intent"-driven interfaces. The intent specifies the desired state, connectivity, or other aspect of the system from the service user's perspective. The orchestration and control systems determine how to implement the request, using the network resources available at the time of the request, and subsequently ensure that the intent continues to be met while the service is in use.
The associated orchestration and control subsystems are always aware of both the intended configuration and the actual configuration, and can autonomously drive the system to the intended state when deviations occur (due either to failures or to intent changes).
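A minimal sketch of such a control loop step is shown below. It assumes the intent and the observed state are both expressed as a mapping from service name to instance count; this representation, the function name and the action labels are illustrative assumptions, not an XGVela interface.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (operation, service name, instance count)


def reconcile(intended: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Compare intended and actual state and return the actions that would
    move the system toward the intended state."""
    actions: List[Action] = []
    for name in sorted(set(intended) | set(actual)):
        want = intended.get(name, 0)
        have = actual.get(name, 0)
        if want > have:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    return actions
```

The same function covers both causes of deviation named above: a failure lowers the actual count, and an intent change alters the intended count. In either case only the difference is acted upon.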
5.2 Telco High Availability Requirements on PaaS
Telecom applications usually require reliability of up to 99.999%. This high reliability must be achieved by the overall network cloud, whose major components usually include hardware, the virtualization layer, the PaaS layer, the application layer and management systems. These components form a chain. When components are connected in series, the availability of the overall system is the product of the availabilities of the individual components. The overall availability is therefore lower than that of any single component, because if any one component in the chain fails, the whole chain or service is down. So before considering the overall reliability, it is necessary to ensure that each component maintains high reliability. As PaaS is an important component of the network cloud, the following sections introduce the Telco High Availability requirements on PaaS, which are applicable to XGVela and all other PaaS platforms.
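The effect of chaining components in series can be checked with a short calculation. The sketch below assumes, purely for illustration, that each of the five components listed above is individually available 99.999% of the time; these per-component figures are not taken from any measurement.

```python
import math
from typing import Dict

MINUTES_PER_YEAR = 365 * 24 * 60


def series_availability(components: Dict[str, float]) -> float:
    """Availability of components in series is the product of each availability."""
    return math.prod(components.values())


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Assumed per-component availabilities (illustrative values only).
chain = {
    "hardware": 0.99999,
    "virtualization": 0.99999,
    "paas": 0.99999,
    "application": 0.99999,
    "management": 0.99999,
}
overall = series_availability(chain)
```

Under these assumptions the overall availability is roughly 99.995%, which corresponds to about 26 minutes of downtime per year instead of about 5 minutes for a single 99.999% component. This is why each component must exceed the end-to-end target.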
5.2.1 Network Cloud Deployment Scenarios
A network cloud is usually a distributed cloud with multiple sites spread across the country. Knowing the deployment scenarios of the network cloud helps to identify the potential HA requirements of each component in the network cloud, especially PaaS.
Figure 5-1 describes typical deployment scenarios of a telecom network cloud. The telco network cloud can usually be separated into the following types: core cloud, regional cloud, edge cloud and far edge nodes. The features of each deployment scenario are shown in Figure 5-1.
Figure 5-1 Typical Deployment Scenarios of Network Cloud
As each deployment scenario has different resources, carries different applications, and may have a different architecture, the scenarios will have different HA requirements. For example, as the core cloud carries the most important telco core NFs and has sufficient resources, its HA level should be high, while the edge cloud may have a lower HA level due to limited resources.
5.2.2 Deployment of PaaS in Network Cloud
XGVela PaaS can be deployed in the core cloud, the regional cloud, the edge cloud and far edge nodes. The locations may include data centers, central offices, metro or regional POPs, enterprise CPE, ruggedized IoT gateways, etc.
According to the PaaS technical architecture described in Chapter 4, PaaS usually includes PaaS Management, which is necessary for the PaaS platform to provide PaaS services to users, and PaaS Service, which is a group of services requested by users on demand. The resources occupied by PaaS Management are relatively stable, as only management-related functions are deployed. Only one function in PaaS Management may require variable resources: the storage capacity of the Image and Package Repository, which may increase as more PaaS services are integrated onto the PaaS platform. In contrast, the amount of resources used by PaaS Service varies. It depends on user requirements and may range from no resource occupation (when no PaaS service has been ordered by users) to a very large amount. The factors influencing the resources used by PaaS Service include at least the number of PaaS services instantiated on the PaaS platform and the resource configuration of each PaaS service.
As the deployment scenarios in Figure 5-1 have different amounts of resources, PaaS is deployed differently in each deployment scenario.
- Core Cloud & Regional Cloud usually have sufficient resources for PaaS deployment. In these two deployment scenarios, PaaS Management is fully deployed, and both PaaS Management and PaaS Service should be designed for a high level of reliability.
- Edge Cloud usually has limited resources, so PaaS Management is not required to be deployed locally. Edge clouds connected to a core cloud or regional cloud can use the PaaS Management in the core/regional cloud remotely. For an independent edge cloud, for example an enterprise edge cloud, PaaS Management can be deployed in a reduced version and can be co-deployed with other application instances.
- Far Edge has only a few nodes, so all resources should be reserved for applications and the required PaaS services, while PaaS Management should be deployed in a remote core cloud or regional cloud.
5.2.3 Infrastructure HA Categories
Infrastructure is the bottom layer of the whole network cloud, and PaaS, applications and management systems are all deployed on it. The HA condition of the infrastructure therefore influences the HA strategy of all the other components in the network cloud. So, before analyzing the HA requirements of PaaS, it is necessary to look at the different HA solutions for infrastructure.
Telco infrastructure can be divided into three categories for availability of the infrastructure:
- No HA: There is no redundancy in the infrastructure. For this category, high availability for services and applications can be achieved through stateful applications/services that can reconnect to different upstream or downstream devices and applications/services in case of failure. However, upon failure of the device, the application or microservice running in that environment is shut down and no service is available from that instance of the application.
- Partial HA: The platform is highly available, but only for some failure scenarios, such as a single failure or a double failure, in which the infrastructure can continue to operate but in a degraded environment.
- Full HA: The infrastructure operating the cloud environment runs in full HA mode: upon failure of any server hardware component or software component, it is fully capable of continued operation as if nothing had happened.
5.2.4 PaaS HA Requirements
5.2.4.1 General HA Requirements on XGVela PaaS
Before analyzing the HA requirements for XGVela PaaS on the different infrastructure HA categories, the following general HA requirements apply:
- The PaaS SHALL be capable of being deployed in any footprint (VM or bare metal), and HA is equally applicable to both models.
- The PaaS MUST support quorum-based redundancy models for PaaS Management. This implies there are n>=3 PaaS Management instances, where n is odd, such that floor(n/2)+1 PaaS Management instances (for example, 2 of 3 or 3 of 5) are required to be operational to maintain quorum.
- The PaaS SHALL support an Active-Standby redundancy model for PaaS Management. If a quorum-based model is not supported, then PaaS Management shall support an Active-Standby model at a minimum. Active-Standby implies that one management instance is active while another one is on hot or cold standby and can take over operations upon detection of failure or via manual intervention.
- The Active-Standby model can be 1+1 or N+M, where N and M >= 1. In this model, N active controllers are backed up by M standby controllers. While complex to implement, it is well known and understood by Telcos.
- The PaaS MUST allow deployment of stateful and stateless applications. The HA of the PaaS MUST allow for deployment of all types of applications and MUST NOT restrict any application types due to sharing of resources such as registries, databases, etc.
- The PaaS shall perform remedial actions, such as restarting the service Pods, if a service is not working properly or a service Pod fails.
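The quorum arithmetic in the requirements above can be sketched as follows. The function names are illustrative; the only rule assumed is the usual majority quorum, in which more than half of the n PaaS Management instances must be operational.

```python
def quorum_size(n: int) -> int:
    """Minimum number of operational instances needed for a majority quorum."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of instances that can fail while quorum is still maintained."""
    return n - quorum_size(n)


def has_quorum(n: int, operational: int) -> bool:
    """True if enough instances are operational to form a quorum."""
    return operational >= quorum_size(n)
```

For n=3 the quorum is 2 and one failure is tolerated; for n=5 the quorum is 3 and two failures are tolerated. An even n does not increase the number of tolerated failures (n=4 still tolerates only one), which is one reason odd values of n are preferred.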
5.2.4.2 HA Requirements on PaaS Under No HA scenario
In this operational environment, a PaaS or its components are deployed on a single node or server in a remote location such as the far edge. If the server/device fails, there is no service, because there are no redundant servers. A PaaS deployed on a single node operates in a non-redundant environment and is prone to hardware or software failure. In such deployment models, application HA may be accomplished by instantiating applications and network functions across multiple single nodes. These instances communicate with each other via a heartbeat function to detect failures, or connect to each other's upstream or downstream nodes, and are activated either automatically or via orchestration during such failure scenarios.
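A heartbeat-based failure detector of the kind mentioned above can be sketched as follows. The class name, the timeout value and the peer names are assumptions made for illustration; timestamps are passed in explicitly so that the logic stays independent of any particular clock or transport.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each peer node (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def record(self, peer: str, now: float) -> None:
        """Store the time at which a heartbeat from `peer` was received."""
        self.last_seen[peer] = now

    def failed_peers(self, now: float) -> List[str]:
        """Return peers whose last heartbeat is older than the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Illustrative use: node-b stops sending heartbeats after t=0.
monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.5)
suspects = monitor.failed_peers(now=4.0)
```

In this example only node-b is reported as failed, and an orchestrator or a peer instance could then activate the application on another single node.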
5.2.4.3 HA Requirements on PaaS Under Partial HA scenario
Partial HA can be defined as high availability under limited failures. In this scenario, for example the edge cloud, the system is designed for optimal cost, with the minimum components needed to meet the basic HA requirements. Under a single failure or a limited number of failures, the PaaS continues to operate with no degradation in service. Once the failure threshold is crossed, the PaaS goes into graceful shutdown mode or a read-only mode of operation. The following requirements must be supported by the PaaS in this scenario.
- The PaaS Service MUST continue to be operational and continue to provide service to the applications and network functions when connectivity is lost to the PaaS Management.
- When a PaaS service instance reconnects to the PaaS Management, the PaaS Management MUST NOT kill the PaaS service or recreate a new one. The PaaS Management may mark the service as tainted or stale until it can authenticate the service and verify the running status of the service itself and of related applications, and then decide based on policy whether the service needs to be reset or refreshed.
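The reconnection behaviour required above can be sketched as a small state machine. The state names, the record type and the policy flag below are hypothetical and only illustrate the ordering of the steps: mark as stale first, verify next, and apply policy last, without ever deleting the running service on reconnection.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceState(Enum):
    ACTIVE = "active"
    STALE = "stale"                  # reconnected but not yet verified
    RESET_PENDING = "reset_pending"  # policy decided that a reset is needed


@dataclass
class ServiceRecord:
    name: str
    state: ServiceState = ServiceState.ACTIVE


def on_reconnect(record: ServiceRecord) -> None:
    """On reconnection the service is only marked stale, never killed or recreated."""
    record.state = ServiceState.STALE


def after_verification(record: ServiceRecord, authenticated: bool,
                       healthy: bool, reset_on_unhealthy: bool) -> None:
    """Apply the (assumed) policy once authentication and status checks are done."""
    if authenticated and healthy:
        record.state = ServiceState.ACTIVE
    elif reset_on_unhealthy:
        record.state = ServiceState.RESET_PENDING
    # Otherwise the record stays STALE until an operator or a later check decides.
```

The key design point is that the running service is left untouched while its record is stale, so applications keep being served during verification.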
5.2.4.4 HA Requirements on PaaS Under Full HA scenario
Full HA means that the PaaS operates in a highly available environment. These systems are designed with extensive redundancy in hardware and software components. The PaaS Management components operate in N+M or quorum mode. In quorum mode with n>=3 instances (n odd), PaaS Management can tolerate up to (n-1)/2 node failures and still be fully functional. The system is designed in such a way that node, platform and application failures are tolerated, including multiple failures. Full HA designs are more costly and are at the discretion of the Telco. The amount of redundancy varies from deployment to deployment based on cost and operational comfort.
In the full HA scenario, resources of multiple nodes are used to deploy PaaS Management, and the PaaS services can be deployed in redundant modes such as replica sets to ensure high availability. Traffic and sessions are automatically load balanced across replica PaaS service instances based on defined policy. The following requirements must be supported by the PaaS in the Full HA scenario.
- PaaS MUST continue to operate under multiple node/server failures as long as quorum is maintained.
- When PaaS Management quorum is lost, PaaS MUST operate in read-only mode, in which running PaaS services must remain up and continue to provide service to applications. No new PaaS Service can be scheduled in this mode of operation.
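The read-only behaviour on quorum loss can be sketched as an admission check. The function names and request kinds ("read", "schedule") are hypothetical; the sketch assumes a majority quorum over the PaaS Management instances and models only the decision to accept or reject a request.

```python
def management_mode(total: int, operational: int) -> str:
    """Return 'read_write' while a majority quorum holds, else 'read_only'."""
    quorum = total // 2 + 1
    return "read_write" if operational >= quorum else "read_only"


def admit(request_kind: str, total: int, operational: int) -> bool:
    """Reads are always admitted; scheduling a new PaaS Service needs quorum.

    Running PaaS services are not affected by this check: nothing here
    stops or restarts them when quorum is lost.
    """
    if request_kind == "read":
        return True
    if request_kind == "schedule":
        return management_mode(total, operational) == "read_write"
    raise ValueError(f"unknown request kind: {request_kind}")
```

With three management instances, losing one keeps the platform in read-write mode, while losing two switches it to read-only mode, in which scheduling requests are rejected but status queries still succeed.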
5.2.5 Existing Solutions
All the above requirements are applicable to PaaS platforms in general, which includes both General PaaS and Telco PaaS. Most of the general PaaS platforms available today support such a quorum or active-standby controller redundancy model. If the requirements defined in this document are supported by a general PaaS, then with respect to HA there is no distinction between a Telco PaaS and a general PaaS. However, the availability model for a general PaaS may be less stringent than that for a Telco PaaS, and hence the requirements defined in this document are necessary and sufficient.
An open source platform like OKD supports a quorum-based HA model for PaaS Management, and that model fits the requirements of Telco PaaS. Additionally, a StarlingX-based platform may support an N+M, Active-Standby, or load-shared two-controller model for the control plane of the PaaS, which also satisfies the requirements of Telco PaaS. This gives Telco customers a choice of the model that is most convenient for their operational environment.
5.2.6 Proposed HA Architecture for Telco PaaS
In the overall architecture of XGVela, the Telco PaaS components sit on top of a General PaaS environment. In that context, the high availability of the Telco PaaS components depends on the availability of the General PaaS. Hence the General PaaS is required to support HA in a Kubernetes or OpenStack environment.
The lifecycle management of those components is outside the scope of the HA chapter but is covered in the overall architecture. The lifecycle of the Telco PaaS layer, or of components in the Telco PaaS, impacts the availability of the overall services. So, features and functions such as in-service upgrades of components become necessary to ensure that the entire PaaS environment operates in an efficient manner.
The proposed HA architecture for Telco PaaS is the same as that of general PaaS with the following attributes.
- General PaaS may also host XGVela specific control functions such as Telemetry collectors, log collectors, API gateways, multi-cluster management capabilities etc. These components are specific to telco usage and may be part of the XGVela Telco PaaS layer.
- XGVela components can all run as “applications and services” on the General PaaS.
- The components defined above then use the PaaS replica set properties to instantiate multiple replicas of applications and controllers on the General PaaS to deliver HA-based services.
- Any device profile, service profile and user profile configurations consumed by XGVela can be stored in locally attached persistent storage and retrieved at will for initial deployment or re-deployment.
- Telco PaaS components must follow the same guidelines as that of the underlying PaaS/cloud native infrastructure for deployment of their functionality unless otherwise explicitly specified.