This document will be open for review till October 26. Please leave your comment inline. Thanks!
Primary Reviewers: Azhar Sayeed Vance Shipley Sandeep Karkala Seshu Kumar Mudiganti Huifeng Le (Deactivated)
Content:
→ 1. Introduction – why telecom industry needs Cloud Native PaaS
...
→ 4.2.3.4 Telco PaaS Solution Proposed by Intel
→ 4.2.3.5 Telco PaaS Solution Proposed by China Mobile
→ 5. HA (to be added)
1. Introduction – why telecom industry needs Cloud Native PaaS
Since NFV architecture was firstly proposed as ETSI standard in 2012, telecommunication industry and telecom operators have started the cloud transformation of network, moving network functions from dedicated physical devices to standard virtualized environment. By 2021, majority of global telecom operators have built Network Cloud, which is private cloud carrying 4G/5G network functions, value-added network functions, network management systems, network orchestration systems, etc. Operator representatives include AT&T, Verizon, Telstra, BT, DT, China Mobile, China Telecom, China Unicom, NTT, etc. Take China Mobile for example, by the end of 2020, China Mobile has built Network Cloud in 8 major districts within China to carry 37 types of network functions including 5GC, IMS, EPC. The proportion of network cloud is up to 75%, which keeps increasing with the construction of edge cloud.
Network Cloud now follows NFV (Network Function Virtualization) architecture (refers to ETSI GS NFV 002). Network functions are implemented as VNFs (Virtualized Network Function), which use virtual machine as virtualized infrastructure and use OpenStack to manage virtual resources.
With the maturity and popularity of 5G network and edge computing, the customer of telecommunication network expands from individual users to enterprise/industry users. The relatively fixed functions and configurations of core network, which mainly used by individual users, now becomes changeable due to diverse vertical industry use cases. Take 5G core slicing as example, use cases of different industries needs different 5G core network. Network slice used by NB-IOT applications, such as smart metering, requires mainly signaling functions and bandwidth only up to several Kbps. While network slice used by VR and video broadcast requires mainly data transmission capabilities and bandwidth varying from 10Mbps to 100Mbps. To better meet customers’ requirements, Network Cloud needs to improve its agility, flexibility and reliability.
However, existing Network Cloud has the following problems:
- Inflexible network function architecture: VNFs are now developed and delivered in coarse size, which is hard to achieve accurate upgrading, scaling in/out, fault location and etc. And the network functions is not able to support customized design, deployment, configuration to meet diverse requirements from vertical industry.
- Monotonous capability of network cloud: current network cloud provides only infrastructure resources, which is not able to support convenient innovation of network functions/application. Using pure infrastructure is complicated and requires developers and operators obtaining high ability. Besides, Telco’s network cloud involves multiple network functions and infrastructures from multiple vendors, which are all designed in their own form and shape, and these cross-vendor gaps need to be filled by PaaS above infrastructure.
- Complex, costly and slow delivery process: the business processes of existing telecommunication network functions strictly follow the sequence of requirements analysis, design, developing, integration, testing and delivery. Product delivery duration is usually calculated by month. Network requirements of future use cases/ scenarios are changeable. Customer requirements on network may continuously change as their service grow. Traditional development and management processes need to be agile and automatic.
Cloud Native, as the best practice in cloud computing in IT industry, can help network cloud achieve agility, flexibility and reliability. Technologies and concepts such as container, microservice and DevOps can effectively alleviate the above problems. Container, as lightweight, flexible, and commonly used infrastructure can increase agility. Micro service architecture is an effective way in implementing complex software stack. It supports best flexibility in function isolation and evolution. Microservice-type network function supports isolated management, fast customized design/configuration and function combination under different use cases. DevOps can build CI/CD pipeline, and establish continuous feedback during the development and construction process of network cloud, which will improve the end-to-end automation and improve response speed to vertical industry speed. Cloud native is the inevitable direction of network cloud evolution.
Cloud native requires the cloud platform to provide the capabilities required by the applications as much as possible, and the applications use the capabilities provided by the cloud as much as possible so that to grow on the cloud. PaaS, as a bearing platform for required capabilities such as microservice architecture, CI/CD pipeline, network management functionalities and network function reusable modules, is acknowledged as enablement platform of cloud native for network cloud and telecommunication industry.
According to the above analysis, cloud native PaaS, which refers to PaaS providing cloud native capabilities such as containers, microservice, automation tools, and cloud native network function components, is necessary and worth researching in telecommunication industry. Exploring cloud native PaaS in open source community can provide reusable capabilities and reference implementation for the industry.
2. PaaS Condition in Network Cloud
Although it can be predicted that cloud native PaaS can improve the flexibility of network cloud, if looking at current situation of telecommunication industry, it is not clear how to introduce PaaS into network cloud, or what PaaS capabilities are required by network. Most of the network clouds, which have been putting into use, follow NFV architecture and do not take PaaS as an independent system or visible object. These operators and vendors use virtual machines and containers directly as infrastructure, and package all required functional modules and software reliance into VNF delivery image. Figure 1 shows a simple diagram of current VNF condition that only resources are reused among VNFs while VNF related software are dedicated to each VNF.
Figure 2-1 VNF Diagram
However, each vendor has its own internal PaaS to support development/maintenance and provide NF (Network Function) required capabilities/modulars; and some operators have designed internal CI/CD pipeline and testing tools. But few of these capabilities are called as PaaS capabilities in network cloud at current stage.
In the field of standardization, ETSI GR NFV-IFA 029 is the representative standard of PaaS for network cloud. It points out the role of PaaS, which includes: 1) eliminating complex operation of NFVI platforms for network function developers and supports auto management/orchestration of NFVI; 2) shielding infrastructure difference of different vendors; 3) providing NFV network service to external customers (e.g. government, bank or enterprises in a vertical industry). ETSI GR NFV-IFA 029 also proposed three potential NFV architectures enhanced with PaaS, but no decision has been made. There is no detailed PaaS capabilities defined in existing standards.
In the field of open source, CNCF provides large amount of cloud native PaaS capabilities that can be used in practical product implementation directly and indirectly (with enhancement). There is no specific indication of which software is necessary or suitable to play as specific PaaS capability in network cloud. As open source can provides reference implementation and is tightly related to practical operation, it is a good start point to explore cloud native PaaS for network cloud.
According to incomplete investigation among industry, standards and open source, possible types of cloud native PaaS capabilities for network cloud are listed as below:
- PaaS capabilities required to implement NF functions: This type of PaaS capability is necessary to achieve NF functions and logics, for example database, load balancer, protocol processing capability (PFCP, GTPU…), message bus, possible NF functional modulars, and some infrastructure related capabilities (such as hardware acceleration). User of this type of capabilities is mainly developers. These capabilities could be common among different NFs but unique for different venders/operators.
- PaaS capabilities required to manage NF functions: This type of PaaS capability helps to optimize the operation and management of network cloud, but has no direct influence on NF logics. Possible capabilities includes observability capabilities, CI/CD tools, testing tools, FCAPS management tools, etc. User of this type of capabilities is mainly developers and operation staffs. These capabilities can also be used to support NFV management systems such as NFVO, VNFM, EMS, OSS.
- PaaS capabilities to expose NF service to external customers: Representative capabilities of this type include bandwidth management capability, user identification capability, mobility management capability, UPF traffic routing function, and other edge computing network functions. User of this type of capabilities is external customers. Currently this type of capabilities is commonly carried and provided by MEC platform.
3. XGVela Overview
According to previous chapters, as the PaaS related standards and research are relatively indolent, it is a good way for telecommunication industry to start with existing open-source PaaS capabilities and build reference implementations. We could explore the enhancement of existing PaaS capabilities in telecom scenarios as well as new PaaS capabilities dedicatedly used telco network cloud. This is the reason that we start XGVela project.
XGVela is a telecom cloud native PaaS platform for 5G and future network cloud. It is targeting on delivering common and reusable PaaS capabilities required in the processes of network function development/running, network cloud management/maintenance, network cloud capability exposure and etc., so that applications are lightweight and contain only code to deliver the intended business logic.
XGVela was firstly launched in April 2020. It joint Linux Foundation Networking as Sandbox project in January 2021. China Mobile, Mavenir, Redhat, Huawei, Sgiscale, Intel, ZTE, STC, Ericsson, China Telecom, China Unicom, Nokia, WindRiver forms the first TSC group. XGVela got its first batch of seed code, which delivers telecom management PaaS functions, from Mavenir in December 2020.
3.1 XGVela High-Level Architecture
Figure 3-1 XGVela High-Level Architecture
Figure 3-1 shows the high-level architecture of XGVela.
XGVela, as the cloud native PAAS platform, runs on the container environment by default, and interwork with K8S to realize the orchestration and management of containers. Network Functions (NFs) and applications obtains needed common PaaS capabilities from XGVela. Upper layer telco management systems can select XGVela PaaS capabilities to play as their sub-modules to achieve O&M functions; XGVela PaaS capabilities can also be treated as “platform” resources that support to be orchestrated by management systems.
To distinguish from the existing PaaS implementations in the open-source communities, XGVela divides PaaS capabilities into three categories:
- Category 1: General PaaS
Represented by the blue block in Figure 3-1.
General PaaS can provide the abstract tools, services and environment required in the process of development, deployment, running, operation and management of applications and their associated services.
General PaaS capabilities are standard cloud capability which can be offered by any cloud provider, and it will have the capabilities to be shared by a number of industry specific PaaS including Telco PaaS.
In short, General PaaS refers to commonly-used PaaS capabilities by all industries(e.g. service mesh, API GW, LB, observability) and existing open-source PaaS implementations (e.g. Istio, envoy, Zookeeper, Grafana). XGVela takes those implementations for integration instead of re-inventing the wheels.
- Category 2: Adaptation Layer
Represented by the yellow block in Figure 3-1.
Adaptation Layer is unique enhancement of General PaaS capability when applied to telecom scenario. To avoid coupling with General PaaS, the enhancement would be implemented in the form of plug-ins, drivers, and other non-invasive forms.
In telecom scenario, the General PaaS capability is usually used in combination with corresponding adaptation layer enhancement points. For example, many load balancer software supports HTTP, TCP, UDP, while for 5G network functions, e.g. UPF, protocols like PFCP, GTPU are used in control plane and data plane flow. If a developer wants to use existing open-source load balancer in 5G core, PFCP and GTPU protocol analysis capability should be enhanced for that LB and delivered together with the LB to provide service.
- Category 3: Telco PaaS
Represented by green block in Figure 3-1.
Telco PaaS focuses on delivering telecom specific PaaS capabilities, which implements telecom features such as multi-tenancy, multiple network plane, network function topology, network function configuration, etc. These capabilities are used to serve telecom network functions and telecom management systems. Some Telco PaaS needs to interwork with General PaaS to deliver complete PaaS service.
The above three categories of PaaS capabilities together constitute XGVela, which provides all "platform" services for cloud native Telecom workloads. Developers can make combinations freely of general PAAS, general PAAS + adaptation and telco PAAS based on requirements.
3.2 XGVela Project Scope
- Define PaaS platform architecture, necessary functions / interfaces, processes, common software, etc. for Telecom scenarios. Content will cover general PaaS, adaptation layer and telco PaaS, of which General PaaS capabilities refers to existing implementations.
- Explore requirements of adaptation layer and telco PaaS base on telecom use cases, and implement functionality, interface, etc.
- Build reference implementation of telecom cloud native PaaS platform.
4. XGVela Technical Insight
4.1 Technical Architecture
Figure 4-1 XGVela Technical Architecture
XGVela, as PaaS platform for telecom scenarios, its functional framework is basically consistent with most of the commercial PaaS platforms. As shown in figure 4-1, PaaS platform contains two parts: PaaS Management and PaaS Service.
PaaS Management is responsible for managing PaaS services, which ensures PaaS platform to provide PaaS services to customers. PaaS Management complete the onboarding, orchestration, monitoring and maintenance of PaaS services in the background, make sure that applications and developer/operating stuff can select and use PaaS services, and help to manage operation status of PaaS services. Figure 4-1 shows the most basic functions that PaaS Management should have, which are service catalog, service lifecycle management, Image repository and package repository, API gateway, service & resource monitoring, service & resource log & event. For detailed description and requirements of each function, please refer to chapter 4.2.1
PaaS Service represents the capabilities/functions required by applications, developer and operation stuff, which achieves the core value of PaaS platform. It is managed and orchestrated by PaaS Management. The three types of PaaS capabilities concluded in chapter 2.2 belong to PaaS Service which are also separated as three categories (General PaaS, General PaaS + Adaptation Layer, Telco PaaS) based on using scenarios. The number of PaaS services keeps increasing with the diverse of use cases. And PaaS services keep upgrading based on customers’ needs. Currently, some functions/services have been summarized for different categories. For detailed description and requirements of each service, please refer to chapter 4.2.2 and 4.2.3.
If referring to existing commercial PaaS platform, PaaS Management and General PaaS usually together constitute the PaaS platform of IT industry. Therefore, blue block is used to represent these two together. Corresponding to chapter 3.2, what XGVela is doing is to pick required General PaaS capabilities (blue block) by telco use cases, then find the enhancement point of these General PaaS capabilities as adaptation layer, and explore Telco PaaS capabilities.
...
4.2.1PaaS Management
PaaS Management is responsible for the management of PaaS services. It is required to include at lease the following capabilities:
Service Catalog
Service Catalog is a directory of services, which lists all PaaS services provided by PaaS platform to customers. Specific requirements of service catalog includes:
- It is required that Service Catalog support adding new PaaS service to itself, and deleting, modifying, and querying existing PaaS services. The added PaaS services can exist in multiple forms, which can be local operator or helm package, as well as PaaS services on public cloud that adapted remotely through interfaces.
- It is required that Service Catalog support users/applications/external systems to order PaaS services on demand, and trigger the instantiation and configuration of selected PaaS services.
- It is recommended to achieve Service Catalog with well-designed UI.
Service Lifecycle Management
Service Lifecycle Management (Service LCM) is responsible for managing the lifecycle of PaaS services. Specific requirements include:
- It is required that Service LCM to select images and packages from Image & Package Repository based on user’s selection in Service Catalog, and complete instantiation of PaaS services, which are packaged as Operator or Helm Chart. It also requires Service LCM to start, stop, upgrade, delete PaaS services.
- Service LCM uses Kubernetes interface for resource application and orchestration of PaaS services.
- It is required that Service LCM manages scaling in/out of PaaS service instance based on monitoring KPIs and scaling policy set by customers.
- It is recommended to implement UI for service LCM.
- It is required that Service LCM supports access control, which provide unified user management and authentication mechanism, and support RBAC. Users can obtain management permission for PaaS services through role settings.
Image and Package Repository
Image & Package Repository is responsible for the storage and management of PaaS service images and deployment packages. The images and packages can be stored locally in PaaS platform repository, as well was remotely in cloud repositories (GitHub, docker hub, etc.). When user selects PaaS services in Service Catalog and triggers service instantiation process, Service LCM will ask Image & Package Repository for images and packages.
API Gateway
API Gateway is an PaaS service as well as an important PaaS Management function. When using as PaaS Management function, API Gateway is responsible for exposing the service API of PaaS services, and routing user traffic to target processing unit. Specific requirements include:
- It is required that API Gateway to support identity authentication and authorization of API calls, verifying the legitimacy and authority of API calling entity (user, other software, etc.), releasing secure access control, avoiding security threats on PaaS service.
- It is required that API Gateway to support forwarding user request to correct processing unit. The forwarding process supports load balancing, and traffic control actions based on pre-defined traffic management policies, which may include health check, rate limiting, time out & retry, circuit breaker, etc.
- It is recommended that API Gateway to support protocol processing, including HTTP, HTTP2, GRPC, web socket, etc.
- It is required that API Gateway support managing the service API of PaaS services, including API definition (path, parameters, etc.), APU publishing/ suspending/ online/ offline/ withdraw/ etc. This feature is mainly used by developers/operators to provide API-type PaaS services for external usage.
- It is required to support monitoring API usage and API data analysis, which cover performance, usage, alarm, log, etc.
Service & Resource Monitoring
Service & Resource Monitoring is responsible for monitoring PaaS service instances and resources the service used, collecting and displaying monitoring data through well-designed dashboard, and alarm setting. The monitoring KPI and data are determined by PaaS services themselves. This function assists PaaS High Availability
→ 5.1 Overview of High Availability
→ 5.2 Telco High Availability Requirement on PaaS
→ 5.2.1 Network Cloud Deployment Scenarios
→ 5.2.2 Deployment of PaaS in Network Cloud
→ 5.2.3 Infrastructure HA Categories
→ 5.2.4.1 General HA Requirements on XGVela PaaS
→ 5.2.4.2 HA Requirements on PaaS under No HA Scenario
→ 5.2.4.3 HA Requirements on PaaS under Partial HA Scenario
→ 5.2.4.4 HA Requirements on PaaS under Full HA Scenario
→ 5.2.6 Proposed HA Architecture for Telco PaaS
1. Introduction – why telecom industry needs Cloud Native PaaS
Since NFV architecture was firstly proposed as ETSI standard in 2012, telecommunication industry and telecom operators have started the cloud transformation of network, moving network functions from dedicated physical devices to standard virtualized environment. By 2021, majority of global telecom operators have built Network Cloud, which is private cloud carrying 4G/5G network functions, value-added network functions, network management systems, network orchestration systems, etc. Operator representatives include AT&T, Verizon, Telstra, BT, DT, China Mobile, China Telecom, China Unicom, NTT, etc. Take China Mobile for example, by the end of 2020, China Mobile has built Network Cloud in 8 major districts within China to carry 37 types of network functions including 5GC, IMS, EPC. The proportion of network cloud is up to 75%, which keeps increasing with the construction of edge cloud.
Network Cloud now follows NFV (Network Function Virtualization) architecture (refers to ETSI GS NFV 002). Network functions are implemented as VNFs (Virtualized Network Function), which use virtual machine as virtualized infrastructure and use OpenStack to manage virtual resources.
With the maturity and popularity of 5G network and edge computing, the customer of telecommunication network expands from individual users to enterprise/industry users. The relatively fixed functions and configurations of core network, which mainly used by individual users, now becomes changeable due to diverse vertical industry use cases. Take 5G core slicing as example, use cases of different industries needs different 5G core network. Network slice used by NB-IOT applications, such as smart metering, requires mainly signaling functions and bandwidth only up to several Kbps. While network slice used by VR and video broadcast requires mainly data transmission capabilities and bandwidth varying from 10Mbps to 100Mbps. To better meet customers’ requirements, Network Cloud needs to improve its agility, flexibility and reliability.
However, existing Network Cloud has the following problems:
- Inflexible network function architecture: VNFs are now developed and delivered in coarse size, which is hard to achieve accurate upgrading, scaling in/out, fault location and etc. And the network functions is not able to support customized design, deployment, configuration to meet diverse requirements from vertical industry.
- Monotonous capability of network cloud: current network cloud provides only infrastructure resources, which is not able to support convenient innovation of network functions/application. Using pure infrastructure is complicated and requires developers and operators obtaining high ability. Besides, Telco’s network cloud involves multiple network functions and infrastructures from multiple vendors, which are all designed in their own form and shape, and these cross-vendor gaps need to be filled by PaaS above infrastructure.
- Complex, costly and slow delivery process: the business processes of existing telecommunication network functions strictly follow the sequence of requirements analysis, design, developing, integration, testing and delivery. Product delivery duration is usually calculated by month. Network requirements of future use cases/ scenarios are changeable. Customer requirements on network may continuously change as their service grow. Traditional development and management processes need to be agile and automatic.
Cloud Native, as the best practice in cloud computing in IT industry, can help network cloud achieve agility, flexibility and reliability. Technologies and concepts such as container, microservice and DevOps can effectively alleviate the above problems. Container, as lightweight, flexible, and commonly used infrastructure can increase agility. Micro service architecture is an effective way in implementing complex software stack. It supports best flexibility in function isolation and evolution. Microservice-type network function supports isolated management, fast customized design/configuration and function combination under different use cases. DevOps can build CI/CD pipeline, and establish continuous feedback during the development and construction process of network cloud, which will improve the end-to-end automation and improve response speed to vertical industry speed. Cloud native is the inevitable direction of network cloud evolution.
Cloud native requires the cloud platform to provide the capabilities required by the applications as much as possible, and the applications use the capabilities provided by the cloud as much as possible so that to grow on the cloud. PaaS, as a bearing platform for required capabilities such as microservice architecture, CI/CD pipeline, network management functionalities and network function reusable modules, is acknowledged as enablement platform of cloud native for network cloud and telecommunication industry.
According to the above analysis, cloud native PaaS, which refers to PaaS providing cloud native capabilities such as containers, microservice, automation tools, and cloud native network function components, is necessary and worth researching in telecommunication industry. Exploring cloud native PaaS in open source community can provide reusable capabilities and reference implementation for the industry.
2. PaaS Condition in Network Cloud
Although it can be predicted that cloud native PaaS can improve the flexibility of network cloud, if looking at current situation of telecommunication industry, it is not clear how to introduce PaaS into network cloud, or what PaaS capabilities are required by network. Most of the network clouds, which have been putting into use, follow NFV architecture and do not take PaaS as an independent system or visible object. These operators and vendors use virtual machines and containers directly as infrastructure, and package all required functional modules and software reliance into VNF delivery image. Figure 1 shows a simple diagram of current VNF condition that only resources are reused among VNFs while VNF related software are dedicated to each VNF.
Figure 2-1 VNF Diagram
However, each vendor has its own internal PaaS to support development/maintenance and provide NF (Network Function) required capabilities/modulars; and some operators have designed internal CI/CD pipeline and testing tools. But few of these capabilities are called as PaaS capabilities in network cloud at current stage.
In the field of standardization, ETSI GR NFV-IFA 029 is the representative standard of PaaS for network cloud. It points out the role of PaaS, which includes: 1) eliminating complex operation of NFVI platforms for network function developers and supports auto management/orchestration of NFVI; 2) shielding infrastructure difference of different vendors; 3) providing NFV network service to external customers (e.g. government, bank or enterprises in a vertical industry). ETSI GR NFV-IFA 029 also proposed three potential NFV architectures enhanced with PaaS, but no decision has been made. There is no detailed PaaS capabilities defined in existing standards.
In the field of open source, CNCF provides large amount of cloud native PaaS capabilities that can be used in practical product implementation directly and indirectly (with enhancement). There is no specific indication of which software is necessary or suitable to play as specific PaaS capability in network cloud. As open source can provides reference implementation and is tightly related to practical operation, it is a good start point to explore cloud native PaaS for network cloud.
According to incomplete investigation among industry, standards and open source, possible types of cloud native PaaS capabilities for network cloud are listed as below:
- PaaS capabilities required to implement NF functions: This type of PaaS capability is necessary to achieve NF functions and logics, for example database, load balancer, protocol processing capability (PFCP, GTPU…), message bus, possible NF functional modulars, and some infrastructure related capabilities (such as hardware acceleration). User of this type of capabilities is mainly developers. These capabilities could be common among different NFs but unique for different venders/operators.
- PaaS capabilities required to manage NF functions: This type of PaaS capability helps to optimize the operation and management of network cloud, but has no direct influence on NF logics. Possible capabilities includes observability capabilities, CI/CD tools, testing tools, FCAPS management tools, etc. User of this type of capabilities is mainly developers and operation staffs. These capabilities can also be used to support NFV management systems such as NFVO, VNFM, EMS, OSS.
- PaaS capabilities to expose NF service to external customers: Representative capabilities of this type include bandwidth management capability, user identification capability, mobility management capability, UPF traffic routing function, and other edge computing network functions. User of this type of capabilities is external customers. Currently this type of capabilities is commonly carried and provided by MEC platform.
3. XGVela Overview
According to previous chapters, as the PaaS related standards and research are relatively indolent, it is a good way for telecommunication industry to start with existing open-source PaaS capabilities and build reference implementations. We could explore the enhancement of existing PaaS capabilities in telecom scenarios as well as new PaaS capabilities dedicatedly used telco network cloud. This is the reason that we start XGVela project.
XGVela is a telecom cloud native PaaS platform for 5G and future network cloud. It is targeting on delivering common and reusable PaaS capabilities required in the processes of network function development/running, network cloud management/maintenance, network cloud capability exposure and etc., so that applications are lightweight and contain only code to deliver the intended business logic.
XGVela was firstly launched in April 2020. It joint Linux Foundation Networking as Sandbox project in January 2021. China Mobile, Mavenir, Redhat, Huawei, Sgiscale, Intel, ZTE, STC, Ericsson, China Telecom, China Unicom, Nokia, WindRiver forms the first TSC group. XGVela got its first batch of seed code, which delivers telecom management PaaS functions, from Mavenir in December 2020.
3.1 XGVela High-Level Architecture
Figure 3-1 XGVela High-Level Architecture
Figure 3-1 shows the high-level architecture of XGVela.
XGVela, as the cloud native PaaS platform, runs on the container environment by default, and interwork with K8S to realize the orchestration and management of containers. Network Functions (NFs) and applications obtains needed common PaaS capabilities from XGVela. Upper layer telco management systems can select XGVela PaaS capabilities to play as their sub-modules to achieve O&M functions; XGVela PaaS capabilities can also be treated as “platform” resources that support to be orchestrated by management systems.
To distinguish from the existing PaaS implementations in the open-source communities, XGVela divides PaaS capabilities into three categories:
- Category 1: General PaaS
Represented by the blue block in Figure 3-1.
General PaaS can provide the abstract tools, services and environment required in the process of development, deployment, running, operation and management of applications and their associated services.
General PaaS capabilities are standard cloud capability which can be offered by any cloud provider, and it will have the capabilities to be shared by a number of industry specific PaaS including Telco PaaS.
In short, General PaaS refers to commonly-used PaaS capabilities by all industries(e.g. service mesh, API GW, LB, observability) and existing open-source PaaS implementations (e.g. Istio, envoy, Zookeeper, Grafana). XGVela takes those implementations for integration instead of re-inventing the wheels.
- Category 2: Adaptation Layer
Represented by the yellow block in Figure 3-1.
Adaptation Layer is unique enhancement of General PaaS capability when applied to telecom scenario. To avoid coupling with General PaaS, the enhancement would be implemented in the form of plug-ins, drivers, and other non-invasive forms.
In telecom scenario, the General PaaS capability is usually used in combination with corresponding adaptation layer enhancement points. For example, many load balancer software supports HTTP, TCP, UDP, while for 5G network functions, e.g. UPF, protocols like PFCP, GTPU are used in control plane and data plane flow. If a developer wants to use existing open-source load balancer in 5G core, PFCP and GTPU protocol analysis capability should be enhanced for that LB and delivered together with the LB to provide service.
- Category 3: Telco PaaS
Represented by green block in Figure 3-1.
Telco PaaS focuses on delivering telecom specific PaaS capabilities, which implements telecom features such as multi-tenancy, multiple network plane, network function topology, network function configuration, etc. These capabilities are used to serve telecom network functions and telecom management systems. Some Telco PaaS needs to interwork with General PaaS to deliver complete PaaS service.
The above three categories of PaaS capabilities together constitute XGVela, which provides all "platform" services for cloud native Telecom workloads. Developers can make combinations freely of general PAAS, general PAAS + adaptation and telco PAAS based on requirements.
3.2 XGVela Project Scope
- Define PaaS platform architecture, necessary functions / interfaces, processes, common software, etc. for Telecom scenarios. Content will cover general PaaS, adaptation layer and telco PaaS, of which General PaaS capabilities refers to existing implementations.
- Explore requirements of adaptation layer and telco PaaS base on telecom use cases, and implement functionality, interface, etc.
- Build reference implementation of telecom cloud native PaaS platform.
4. XGVela Technical Insight
4.1 Technical Architecture
Figure 4-1 XGVela Technical Architecture
XGVela, as PaaS platform for telecom scenarios, its functional framework is basically consistent with most of the commercial PaaS platforms. As shown in figure 4-1, PaaS platform contains two parts: PaaS Management and PaaS Service.
PaaS Management is responsible for managing PaaS services, which ensures PaaS platform to provide PaaS services to customers. PaaS Management complete the onboarding, orchestration, monitoring and maintenance of PaaS services in the background, make sure that applications and developer/operating stuff can select and use PaaS services, and help to manage operation status of PaaS services. Figure 4-1 shows the most basic functions that PaaS Management should have, which are service catalog, service lifecycle management, Image repository and package repository, API gateway, service & resource monitoring, service & resource log & event. For detailed description and requirements of each function, please refer to chapter 4.2.1
PaaS Service represents the capabilities/functions required by applications, developer and operation stuff, which achieves the core value of PaaS platform. It is managed and orchestrated by PaaS Management. The three types of PaaS capabilities concluded in chapter 2 belong to PaaS Service which are also separated as three categories (General PaaS, General PaaS + Adaptation Layer, Telco PaaS) based on using scenarios. The number of PaaS services keeps increasing with the diverse of use cases. And PaaS services keep upgrading based on customers’ needs. Currently, some functions/services have been summarized for different categories. For detailed description and requirements of each service, please refer to chapter 4.2.2 and 4.2.3.
If referring to existing commercial PaaS platform, PaaS Management and General PaaS usually together constitute the PaaS platform of IT industry. Therefore, blue block is used to represent these two together. Corresponding to chapter 3.2, what XGVela is doing is to pick required General PaaS capabilities (blue block) by telco use cases, then find the enhancement point of these General PaaS capabilities as adaptation layer, and explore Telco PaaS capabilities.
4.2 XGVela PaaS Functional Requirements
4.2.1PaaS Management
PaaS Management is responsible for the management of PaaS services. It is required to include at lease the following capabilities:
Service Catalog
Service Catalog is a directory of services, which lists all PaaS services provided by PaaS platform to customers. Specific requirements of service catalog includes:
- It is required that Service Catalog support adding new PaaS service to itself, and deleting, modifying, and querying existing PaaS services. The added PaaS services can exist in multiple forms, which can be local operator or helm package, as well as PaaS services on public cloud that adapted remotely through interfaces.
- It is required that Service Catalog support users/applications/external systems to order PaaS services on demand, and trigger the instantiation and configuration of selected PaaS services.
- It is recommended to achieve Service Catalog with well-designed UI.
Service Lifecycle Management
Service Lifecycle Management (Service LCM) is responsible for managing the lifecycle of PaaS services. Specific requirements include:
- It is required that Service LCM to select images and packages from Image & Package Repository based on user’s selection in Service Catalog, and complete instantiation of PaaS services, which are packaged as Operator or Helm Chart. It also requires Service LCM to start, stop, upgrade, delete PaaS services.
- Service LCM uses Kubernetes interface for resource application and orchestration of PaaS services.
- It is required that Service LCM manages scaling in/out of PaaS service instance based on monitoring KPIs and scaling policy set by customers.
- It is recommended to implement UI for service LCM.
- It is required that Service LCM supports access control, which provide unified user management and authentication mechanism, and support RBAC. Users can obtain management permission for PaaS services through role settings.
Image and Package Repository
Image & Package Repository is responsible for the storage and management of PaaS service images and deployment packages. The images and packages can be stored locally in PaaS platform repository, as well was remotely in cloud repositories (GitHub, docker hub, etc.). When user selects PaaS services in Service Catalog and triggers service instantiation process, Service LCM will ask Image & Package Repository for images and packages.
API Gateway
API Gateway is an PaaS service as well as an important PaaS Management function. When using as PaaS Management function, API Gateway is responsible for exposing the service API of PaaS services, and routing user traffic to target processing unit. Specific requirements include:
- It is required that API Gateway to support identity authentication and authorization of API calls, verifying the legitimacy and authority of API calling entity (user, other software, etc.), releasing secure access control, avoiding security threats on PaaS service.
- It is required that API Gateway to support forwarding user request to correct processing unit. The forwarding process supports load balancing, and traffic control actions based on pre-defined traffic management policies, which may include health check, rate limiting, time out & retry, circuit breaker, etc.
- It is recommended that API Gateway to support protocol processing, including HTTP, HTTP2, GRPC, web socket, etc.
- It is required that API Gateway support managing the service API of PaaS services, including API definition (path, parameters, etc.), API publishing/ suspending/ online/ offline/ withdraw/ etc. This feature is mainly used by developers/operators to provide API-type PaaS services for external usage.
- It is required to support monitoring API usage and API data analysis, which cover performance, usage, alarm, log, etc.
Service & Resource Monitoring
Service & Resource Monitoring is responsible for monitoring PaaS service instances and resources the service used, collecting and displaying monitoring data through well-designed dashboard, and alarm setting. The monitoring KPI and data are determined by PaaS services themselves. This function assists PaaS platform manager and PaaS service consumer to track the status of PaaS services and resources.
...
- General PaaS shall complete service lifecycle management through PaaS Management. There are usually two ways for users to use a PaaS service: call API of API-type PaaS service and create an instance-type PaaS service. API-type PaaS services (like AI service-facial recognition, image processing, etc.) are managed by API Gateway of PaaS Management. Instance-type PaaS Services (like DB/ LB, etc.) are managed by Service Lifecycle Management.
- General PaaS services shall be packaged as Operator or Helm Chart. Related image and package can be stored in local Image & Package Repository, as well as in remote repository while pre-configuring automatic access to the remote repo.
- All General PaaS service shall support to general monitoring data, logs, events, alarms, etc., and report to Service & Resource Monitoring function and Service & Resource Log & Event function in PaaS Management.
- General PaaS services shall support to be deployed on any Kubernetes cased CaaS (Container as a Service) layer.
- General PaaS shall support custom configuration. The configuration parameters are designed by developersPaaS Service Provider, the configuration contents are provided by users according to application requirements following certain rules.
- It is recommended to select General PaaS software from commonly used CNCF projects.
...
Commonly used open-source tools are Kong, Tyk, 3-Scale, Istio, EMCO.
Service Discovery & Registration
General PAAS shall provide Service Discovery & Registration functions and software to help micro services of application/system to obtain each other's access information
- Service Discovery & Registration functions shall maintain real-time microservice access info, which includes adding address of new microservice, update microservice instance address, deleting information of fault microservice, etc.
- Commonly used open -source software include CoreDNS,etcd,Zookeeper,Netflix,Nacos, among which CoreDNS, etcd, Zookeeper are popular choice.
...
However, typically, in Kubernetes each pod only has one network interface (apart from a loopback) which is not enough for a production ready telco network function. For this problem, there is an open-source solution, Multus, that can solve it. Multus is a container network interface (CNI) plugin for Kubernetes that enables attaching multiple network interfaces to pods. With Multus, you can create a multi-homed pod that has multiple interfaces. Multus CNI has already support many network CNIs, including Calico, Flannel, Userspace CNI (oveovs-dpdk/vpp), OVN-Multi CNI, SRIOV-NIC CNI and SMartNIC CNI. Developers can specify different network CNI for different network planes to realize container multi network planes. For example, control network plane can chose traditional CNIs with lower performance like Calico, Flannel; while data/user network plane can chose CNIs with better forwarding performance, for example SRIOV-NIC CNI.
According to above contents, we can see that Multus ensures pod to implement multiple virtual network interface/card and implement multiple network plane for containers at infrastructure level. However, for developers of network functions, they still need to understand the working principles and using methods of different CNIs. The difficulty of network development has nor been reduced. Therefore, based on Multus solution, a Telco PaaS function named NMaaS is proposed in this Chapter.
NMaaS, Network Management as a Service, is proposed to expose the service of container multiple network planes. It has the following features:
- Exposes NB APIs for Orchestration/Management systems or developers to configure the Infrastructure NIC, add/delete interface to NF at runtime, SRIOV configuration etc.
- Shields the different using methods of different CNIs and provides consistent user experience on NIC management. For example, developers can simply specify the number of vNICs and SLA requirements on these vNICs to complete container vNIC configuration.
- Supports customized value settings for vNIC parameters such as latency, jitter, bandwidth, throughput, performance, and trigger CaaS layer to setup the required underlaying driver/software based on these requirements.
Figure 4-7 Proposed NMaaS Diagram (need update)
Figure 4-7 shows the basic working principle of NMaaS. NMaaS pod receives a vNIC request from orchestration/management systems or command line through standard NMaaS NBI. NMaaS will convert vNIC request into contents that can be understand by CaaS layer, including vNIC type, amount, configutation, etc. And the converted information will be sent to Multus to setup required underlaying driver/software. Then, NMaaS will trigger K8S create new vNIC for target Pod, and update the traffic and vNIC configuration in Pod.
4.2.4 XGVela PaaS Workflow
After summarizing the technical architecture and functional requirements, XGVela related workflows will be covered in this Chapter. These workflows are applicable to all PaaS platforms, which includes XGVela. In this Chapter, we only cover the simple and general workflows that have no telecom features, while the interaction with NFVO, VNFM and other telecom systems will be considered in future release.
Figure 4-7 PaaS Platform Workflow
We simplify the PaaS technical architecture and its relationship with outer management systems as blocks in figure 4-7, which includes NF/application, User, Operator/Management systems, PaaS Service, PaaS Management, and CaaS. The major workflow are listed below.
Service Order Process
Service Order Process ensure PaaS services to be ordered by Users. Firstly, User will browse all PaaS Services through Service Catalog in PaaS Management and select required PaaS Service. Then, User will configure user-defined parameters of the selected PaaS Service, such as interface, performance parameter, reliability parameter, service model (dedicated or shared), etc.
Service Instantiation Process
Service Instantiation Process instantiates PaaS Service. After ordering required PaaS Service, the User will trigger the service instantiation. The Service Lifecycle Management in PaaS Management will instantiate selected PaaS Service. For shared PaaS Service, instantiation means selecting existing and running PaaS service on PaaS platform, completing configuration including RBAC/access/etc., and return access API to User. For dedicated PaaS Service, instantiation means selecting images/packages/descriptors in Image & Package Repository of PaaS Management, sending these files together with user-defined configurations to CaaS, and triggering K8S to complete PaaS Service instantiation. The CaaS layer will reply access API and monitoring data of PaaS Service to PaaS Management.
Service Unsubscribe Process
Service Unsubscribe Process lets Users to unsubscribe the ordered and instantiated PaaS Service through PaaS Management. For shared PaaS Service, PaaS Management will stop PaaS Service for Users through changing the PaaS Service configuration like RBAC. For dedicated PaaS Service, PaaS Management will delete PaaS Service instance on CaaS.
Service Configuration Updating Process
Service Configuration Updating Process can help User/Operator/ Management Systems to update configuration of PaaS Service through configuration center in PaaS Management. The Service LCM and CaaS will execute new configuration on Service and Pod.
Adding Service Process
Adding Service Process lets Operators to add new PaaS Service to PaaS platform. The process includes uploading images and packages of new PaaS Service into Image & Package Repository, onboarding PaaS Service in Service Catalog, setting the basic configuration (including configuration items and parameters) and user-defined configuration items.
Deleting Service Process
Deleting Service Process lets Operators to offline and remove PaaS Service on PaaS platform. PaaS Management will firstly check the usage condition of PaaS Service selected to be deleted. PaaS Service in use cannot be deleted. If the PaaS Service is not used, PaaS Management will then remove it from Service Catalog, and deleting related images and packages in Image & Package Repository.
Service Operation and Maintenance Process
Service & Resource Monitoring/Log/Event in PaaS Management achieve health check, monitoring, log management, alarm management of PaaS Service through the interfaces with PaaS Service instance.
The O&M interface, data model, data generation, etc. should be pre-developed based on PaaS Management requirements during design and development stage of PaaS Service.
...
network planes to realize container multi network planes. For example, control network plane can chose traditional CNIs with lower performance like Calico, Flannel; while data/user network plane can chose CNIs with better forwarding performance, for example SRIOV-NIC CNI.
According to above contents, we can see that Multus ensures pod to implement multiple virtual network interface/card and implement multiple network plane for containers at infrastructure level. However, for developers of network functions, they still need to understand the working principles and using methods of different CNIs. The difficulty of network development has not been reduced. Therefore, based on Multus solution, a Telco PaaS function named NMaaS is proposed in this Chapter.
NMaaS, Network Management as a Service, is proposed to expose the service of container multiple network planes. It has the following features:
- Exposes NB APIs for Orchestration/Management systems or developers to configure the Infrastructure NIC, add/delete interface to NF at runtime, SRIOV configuration etc.
- Shields the different using methods of different CNIs and provides consistent user experience on NIC management. For example, developers can simply specify the number of vNICs and SLA requirements on these vNICs to complete container vNIC configuration.
- Supports customized value settings for vNIC parameters such as latency, jitter, bandwidth, throughput, performance, and trigger CaaS layer to setup the required underlaying driver/software based on these requirements.
Figure 4-7 Proposed NMaaS Diagram
Figure 4-7 shows the basic working principle of NMaaS. NMaaS pod receives a vNIC request from orchestration/management systems or command line through standard NMaaS NBI. NMaaS will convert vNIC request into contents that can be understand by CaaS layer, including vNIC type, amount, configutation, etc. And the converted information will be sent to Multus to setup required underlaying driver/software. Then, NMaaS will trigger K8S create new vNIC for target Pod, and update the traffic and vNIC configuration in Pod.
4.2.3.5 Telco PaaS Solution Proposed by China Mobile
For telecommunication industry, network functions are designed following a paradigm and in distributed architecture. No matter it is PNF, VNF or CNf, each NF is desgined in distributed architecture with multiple modules, which typically includes interfaces module (which is in charge of south-north communication with other NFs), business processing modules (which deals with NF business logics), data storage module (which stores NF data), and operation andf management module (which is in charge of NF operation and management). Within all these functional modules, operation and management module (OAM module), as it has similar functions to different NFs, can be shared as common module.
The common operation and management functions used by NF usually include but not limited to: configuration management, heartbeat management, performance management, log management, alarm management. If checking those open-sourced network function projects, such as OAI and Free5GC, it is shown that operation and management functions are non-business-logic related functions, which are usually not included in network functions source code. However, these OAM functions are necessary in NF productions, which usually are developed by vendors or operators. Besides, according to cloud native application design principles, observability is one of the most important feature an application should have. With these OAM functions, NF can achieve observability easily.
As OAM functions are common, reusable, non-business-log-related and necessary functions to NF, they can be treated as PaaS capabilities provided by cloud service provider to NF developers. Using platform provided OAM fucntions can help the NF developers focue more on core business design while be reliefed from repetitive OAM implementation. So cloud native OAM, which is a group of operation and management functions for NFs, can be treated as Telco PaaS capabilities.
Typical functions of cloud native OAM:
- NF configuration management function: this function manages configurations of NFs. It supports to accept NF configuration command from orchestration systems, manage current and historical configurations of NF, and configuring NF. As NF's configurations are usually stored in configuration files and has vendor-different formats, the management target for this function is only configuration file without defining the file contents. For some newly developed NF strictly following cloud native principle and using Kubernetes/Docker as resources, their configurations can be stored directly in etcd as configmaps.
- NF heartbeat management function: this function helps to monitors the NF's health status. This function has different levels of implementation. The easiest-level implementation is collecting NF-level heartbeat outside NF. This relies on the NF itself to generate heartbeat within each microservice instances, monitoring internal heartbeats, determine NF health status internally with overall microservice heartbeat data. The heartbeat collected on this level can only reflect whether the NF is normal or not, while not knowing whether it is healthy inside. The middle-level implementation is collecting NF microservice-level heartbeat. This relies on each NF microservice to report its heartbeat, with which the function can determine whether the NF is healthy based on some pre-defined rules. The hardest-level implementation, which is also the most cloud native one, is monitors the health status of pod through kubernetes liveness probe, and analyze microservices-level and NF-level health status sutomatically. As current NF are mostly self-contained, the easiest-level implementation is mostly used, while the middle-level implementation and hardest-level implementation are prefered to be followed in the future.
- NF performance management function: this function helps to monitors the metrics of NF. It supports to collect the metrics data and do analysis based on pre-defined rules. Prometheus is the most commonly selected software.
- NF log management function: this function helps to collect and store logs of NF.
- NF alarm management function: this functions helps to collect alarms of NF both at NF-level and resource-level. It also supports to clean the alarm data.
For more design and interface details, please go to:
- https://github.com/XGVela/cloud-native-OAM/blob/main/docs/cloud_native_oam_design.md
- https://github.com/XGVela/cloud-native-OAM/blob/main/docs/API_guide-northbound_interfaces.md
4.2.4 XGVela PaaS Workflow
After summarizing the technical architecture and functional requirements, XGVela related workflows will be covered in this Chapter. These workflows are applicable to all PaaS platforms, which includes XGVela. In this Chapter, we only cover the simple and general workflows that have no telecom features, while the interaction with NFVO, VNFM and other telecom systems will be considered in future release.
Figure 4-8 PaaS Platform Workflow
We simplify the PaaS technical architecture and its relationship with outer management systems as blocks in figure 4-7, which includes NF/application, User, Operator/Management systems, PaaS Service, PaaS Management, and CaaS. The major workflow are listed below.
Service Order Process
Service Order Process ensure PaaS services to be ordered by Users. Firstly, User will browse all PaaS Services through Service Catalog in PaaS Management and select required PaaS Service. Then, User will configure user-defined parameters of the selected PaaS Service, such as interface, performance parameter, reliability parameter, service model (dedicated or shared), etc.
Service Instantiation Process
Service Instantiation Process instantiates PaaS Service. After ordering required PaaS Service, the User will trigger the service instantiation. The Service Lifecycle Management in PaaS Management will instantiate selected PaaS Service. For shared PaaS Service, instantiation means selecting existing and running PaaS service on PaaS platform, completing configuration including RBAC/access/etc., and return access API to User. For dedicated PaaS Service, instantiation means selecting images/packages/descriptors in Image & Package Repository of PaaS Management, sending these files together with user-defined configurations to CaaS, and triggering K8S to complete PaaS Service instantiation. The CaaS layer will reply access API and monitoring data of PaaS Service to PaaS Management.
Service Unsubscribe Process
Service Unsubscribe Process lets Users to unsubscribe the ordered and instantiated PaaS Service through PaaS Management. For shared PaaS Service, PaaS Management will stop PaaS Service for Users through changing the PaaS Service configuration like RBAC. For dedicated PaaS Service, PaaS Management will delete PaaS Service instance on CaaS.
Service Configuration Updating Process
Service Configuration Updating Process can help User/Operator/ Management Systems to update configuration of PaaS Service through configuration center in PaaS Management. The Service LCM and CaaS will execute new configuration on Service and Pod.
Adding Service Process
Adding Service Process lets Operators to add new PaaS Service to PaaS platform. The process includes uploading images and packages of new PaaS Service into Image & Package Repository, onboarding PaaS Service in Service Catalog, setting the basic configuration (including configuration items and parameters) and user-defined configuration items.
Deleting Service Process
Deleting Service Process lets Operators to offline and remove PaaS Service on PaaS platform. PaaS Management will firstly check the usage condition of PaaS Service selected to be deleted. PaaS Service in use cannot be deleted. If the PaaS Service is not used, PaaS Management will then remove it from Service Catalog, and deleting related images and packages in Image & Package Repository.
Service Operation and Maintenance Process
Service & Resource Monitoring/Log/Event in PaaS Management achieve health check, monitoring, log management, alarm management of PaaS Service through the interfaces with PaaS Service instance.
The O&M interface, data model, data generation, etc. should be pre-developed based on PaaS Management requirements during design and development stage of PaaS Service.
5. High Availability
5.1 Overview of High Availability
High availability is usually needed to ensure service availability and continuity. Enabling mechanisms are distributed throughout the system, at all levels, and the redundancy of the hardware and software subsystems eliminate single points of failure (SPOFs) that can affect the service. Control mechanisms for these redundant components are distributed to associated control and redundancy management systems. The control processes should be implemented at the lowest level that has sufficient scope to deal with the underlying failure scenarios, with escalation to higher-level mechanisms when failure cannot be handled at lower levels. This ensures that the fastest mechanisms are always used, as the remediation control loop times generally increase when the decision logic is moved up in the stack or into broader control domains.
For the redundancy, the target is to decrease the amount of redundant resources required. This implies that the direction of the redundancy structures is away from dual-redundant (typically active-standby 1:1 prevalent in the “box” implementations) towards either N+1 or N:1 scheme. This is enabled by the reduction or elimination of the direct physical attachment associations with the use of the cloud native functions, a move towards micro-service types of architectures, and the ability to add, remove and relocate service instances dynamically, based on offered load and resource utilization state. The infrastructure support for such mechanisms is essential for success, and the new, more dynamic configuration of the applications and services has many implications for both network services as well as the whole control, orchestration, and assurance software stack.
The availability and continuity mechanisms use “intent”-driven interfaces. The intent specifies the desired state, connectivity, or other aspect of the system from the service user’s perspective. The orchestration and control systems determine the implementation of the request, using the network resources available at the time of new request, and subsequently ensure that the intent continues to be met while the service is in use.
The associated orchestration and control subsystems are always aware of both the intended configuration and the actual configuration and can autonomously work to drive the system to intended state should there be deviations (either due to failures or due to intent changes).
5.2 Telco High Availability Requirements on PaaS
Telecom applications usually requires reliability up to 99.999%. And this high reliability should be achieved by the overall network cloud, whose major components usually include hardware, virtualization layer, PaaS layer, application layer and management systems. All these components construct a chain. Any components stringed together in series the availability of the overall system is multiple of the availability of each component - hence the overall availability decreases because any one component in the chain fails the overall chain or service is down. So before considering the overall reliability, it is necessary to ensure that each component maintains high reliability. As PaaS is an important component for network cloud, for the following contents, we’ll introduce about Telco High Availability requirements on PaaS, which is applicable to XGVela and all other PaaS platforms.
5.2.1 Network Cloud Deployment Scenarios
When talking about network cloud, we usually are talking about a distributed cloud with multiple sites spreading all over the country. Knowing the deployment scenarios of network cloud can help to know the potential HA requirement of each component in network cloud, especially PaaS.
Figure 5-1 describes typical deployment scenarios of telecom network cloud. The telco network cloud usually can be separated into the following type: core cloud, reginal cloud, edge cloud and far edge nodes. The features of each deployment scenario are displayed in figure 5-1.
Figure 5-1 Typical Deployment Scenarios of Network Cloud
As each deployment scenario has different resource, carries different applications, and may have different architectures, they will have different HA requirements. For example, as core cloud carried the most importance telco core NFs and has sufficient resources, the HA level should be high, while edge mage has lower HA level due to lack of enough resources.
5.2.2 Deployment of PaaS in Network Cloud
XGVela PaaS can be deployed in the core cloud, reginal cloud, as well as edge cloud and far edge nodes. The locations may include data centers, central office, metro or reginal POPs, enterprise CPE, ruggedized IOT gateways etc.
According to PaaS technical architecture described in Chapter 4, PaaS usually includes PaaS Management, which is necessary for PaaS platform to provide PaaS service to users, and PaaS Service, which are a group of service required by users on demand. The resources occupied by PaaS Management is relatively stable as only management related functions are deployed. Only one function in PaaS Management may requires changeable resources – storage capacity of Image and Package Repository, which may increase as more PaaS services are integrated onto PaaS platform. However, the amount of resource used by PaaS Service vary. It depends on user requirements and may vary from no resource occupation (which refers to no PaaS service ordered by user) to very large amount. The influence factors of resources used by PaaS Service at least include the number of PaaS service instantiated on PaaS platform, and the resource configuration of each PaaS service.
As different deployment scenarios in figure 5-1 have different amount of resource, PaaS would be deployed differently in each deployment scenarios.
- Core Cloud & Reginal Cloud usually have sufficient resources for PaaS deployment. In these two deployment scenarios, PaaS Management is fully deployed; PaaS Management and PaaS Service will consider high-level of reliability.
- Edge Cloud usually has limited resources, so PaaS Management is not required to be deployed. For those edge clouds connected to core cloud or reginal cloud, edge cloud can use the PaaS Management in core/reginal cloud remotely. For those independent edge cloud, for example an enterprise edge cloud, PaaS Management can be deployed in a cropped version and can be mixed deployed with other application instances.
- Far Edge only has few nodes, so all resources should be saved to applications and required PaaS service while PaaS Management should be deployed in remote core cloud or reginal cloud.
5.2.3 Infrastructure HA Categories
As infrastructure is the bottom layer of whole network cloud, PaaS, applications, management systems are all deployed on infrastructure, the HA condition of infrastructure will influence the HA strategy of all the other components in network cloud. So, before analyzing the HA requirement of PaaS, it is necessary to look at the different HA solutions of infrastructure.
Telco infrastructure can be divided into three categories for availability of the infrastructure:
- No HA: No HA means there is no redundancy of infrastructure. For this category, the high availability for services and applications can be achieved through stateful applications/services that can reconnect to different upstream or downstream devices and applications/services in case of failure. However, upon failure of the device, the application or the microservice running in that environment is shut down and no service is available from that instance of the application.
- Partial HA: In this context there is high availability of the platform but limited to some failure scenarios such as single failure or double failure where the infrastructure can continue to operate but in a degraded environment.
- Full HA: The infrastructure operating the cloud environment is operating in full HA mode where upon failure of any server hardware component or software component is full capable of continued operations as if nothing happened.
5.2.4 PaaS HA Requirements
5.2.4.1 General HA Requirements on XGVela PaaS
Before analyzing HA requirements for XGVela PaaS on different infrastructure HA categories, there are the following general HA requirements:
- The PaaS SHALL be capable of being deployed in any footprint – VM or bare metal and HA is equally applicable to both models.
- The PaaS MUST support quorum-based redundancy models for PaaS Management. This implies there are n>=3 PaaS Management instances where n is odd such that (n/2)+1 PaaS Management instances are required to be operational for full quorum.
- The PaaS SHALL support Active-Standby redundancy model for PaaS Management. If a quorum-based model is not supported then PaaS Management shall support an active standby model at a minimum. Active-Standby implies one management instance is active while another one is on hot or cold standby and can take over operations upon detection of failure or via manual intervention.
- Active-Standby model can be a 1+1 or N+M where N and M >= 1. This model while complex in implementation is well known and understood by Telcos in which N active controllers are backed up by M standby controllers
- The PaaS MUST allow deployment of stateful and stateless applications. The HA of the PaaS MUST allow for deployment of all types of applications and MUST NOT restrict any application types due to sharing of resources such as registry, databases etc.
- PaaS shall perform remedial actions such as restarting the service Pods if service not working properly, or service pod fails.
5.2.4.2 HA Requirements on PaaS Under No HA scenario
This operational environment is a PaaS or its components deployed in a single nodes or a server in a remote location such as far edge. If the server/device fails there is no service – because there are no redundant servers. If it is a single node deployed then the PaaS is operating in a non-redundant environment and is prone to hardware or software failure. In such deployment models, Application HA may be accomplished where applications and network functions are lighted up across multiple single nodes and communicate with each other via a heartbeat function to detect failures or connect with each other upstream or downstream nodes and are activated either automatically or via orchestration during such failure scenarios.
5.2.4.3 HA Requirements on PaaS Under Partial HA scenario
Partial HA can be defined as high availability under limited failures. In this scenario, such as the edge cloud, the system is designed for optimal costs with minimum components to meet the basic HA requirements. So, under single failure or limited failures the PaaS continues to operate with no degradation in service. Once the threshold is crossed with respect to failure scenarios the PaaS goes into graceful shutdown mode or read only mode of operation. Following are the requirements that must be supported by the PaaS in such scenario.
- The PaaS Service MUST continue to be operational and continue to provide service to the applications and network functions when connectivity is lost to the PaaS Management.
- The PaaS service instance when reconnected back with the PaaS Management MUST NOT kill the PaaS service or recreate a new one. The PaaS Management may mark the service tainted or stale until it can authenticate the service and verify the running status of service itself and related applications, and then decide based on policy if the service needs to be reset or refreshed.
5.2.4.4 HA Requirements on PaaS Under Full HA scenario
Full HA is defined as the PaaS is operational in a highly available environment. These systems are designed with lots of redundancy in hardware and software components. The PaaS Management components operate in N+M or quorum mode where it can tolerate (n/2)-1, n>=3 node failures and still be fully functional. The system would be designed in such a way that node, platform and application failures are tolerated including multiple failures. Full HA designs are more costly and at the discretion of Telco. The amount of redundancy varies from deployment to deployment based on cost and operational comfort.
In full HA scenario, resources of multiple nodes are used to deploy PaaS Management, and the PaaS services can be deployed in redundant modes such as replica sets to ensure high availability. Traffic and sessions are automatically load balanced across replica PaaS service instances based on defined policy. Following are the requirements must be supported by the PaaS in Full HA scenario.
- PaaS MUST continue to operate under multiple node/server failures until quorum is maintained.
- When PaaS Management quorum is lost, PaaS MUST operate in read only mode where PaaS services that are running must remain up and continue to provide service to applications. No new PaaS Service can be scheduled in this mode of operation.
5.2.5 Existing Solutions
All the above requirements are applicable to PaaS platform, which includes both General PaaS and Telco PaaS. Most of the general PaaS available today support such quorum or active standby controller redundancy model. If the requirements defined in this document are supported by general PaaS, then with respect to HA there is no distinction between a Telco PaaS and general PaaS. However, the availability model for general PaaS may be less stringent than a Telco PaaS and hence these requirements defined in this document are necessary and sufficient.
An open source platform like OKD supports a quorum based HA model for PaaS Management that quorum based HA model fits the requirements of Telco PaaS. Additionally a StarlingX based platform may support and N+M or Active Standby or load shared two controller model for the control plane of the PaaS. That also satisfies the requirements of Telco PaaS. This provides a choice to Telco customers in terms of what model is more convenient to their operational environment and they can choose that mode of operation.
5.2.6 Proposed HA Architecture for Telco PaaS
With the overall architecture of XGVela, the Telco PaaS components sit on top of a General PaaS environment. In that context the High availability of Telco PaaS components is dependent on the General PaaS availability. Hence the General PaaS is required to support the HA and Kubernetes or OpenStack environment.
The lifecycle management of those components is outside scope of the HA chapter but is covered in the overall architecture. The lifecycle of Telco PaaS layer or components in the Telco PaaS impacts the availability of the overall services. So, features and functions such as In-Service upgrades of components become necessary to ensure the entire PaaS environment operates in an efficient manner.
The proposed HA architecture for Telco PaaS is the same as that of general PaaS with the following attributes.
- General PaaS may also host XGVela specific control functions such as Telemetry collectors, log collectors, API gateways, multi-cluster management capabilities etc. These components are specific to telco usage and may be part of the XGVela Telco PaaS layer.
- XGVela components can all run as “applications and services” on the General PaaS.
- These components defined above then use the PaaS replica sets properties to instantiate multiple replicas of applications and controllers on the general to deliver HA based services.
- Any device profiles, service profiles and user profile configurations consumed by XGVela can be stored in local attached persistent storage and retrieved at will for initial or re-deployment.
- Telco PaaS components must follow the same guidelines as that of the underlying PaaS/cloud native infrastructure for deployment of their functionality unless otherwise explicitly specified.