PROBLEM STATEMENT
An edge computing application that runs in a MEC cluster and serves clients over a 5G wireless network sometimes needs to be relocated to another MEC cluster. Relocation here means that a new instance of the application is deployed in the target MEC cluster, without necessarily disrupting the application instance in the source cluster. Some examples of such situations are these:
- A User Equipment (UE) that is communicating with an edge application running in a MEC cluster roams to a different cell tower, and there is another MEC cluster that offers lower latency in the new location than the source MEC cluster.
- A new UE opens a PDU session from a cell tower where another MEC cluster can offer lower latency.
- The source MEC cluster needs to be brought down for maintenance.
- The source MEC cluster is approaching its capacity limits.
Note that an application may be composed of multiple microservices defined together in a Helm chart. The application may be part of a composite application, whose component applications may be running in different clusters, including public/private clouds, telco/co-located edge clusters, and on-prem data centers.
We take it as a given that the composite application, of which the relocatable application is a part, was deployed by EMCO.
When the application is relocated, the following requirements must be met:
- Service continuity must be assured to the UE, i.e., the UE should not suffer any loss of service unless the UE itself triggers a change, such as a connection/session reset. The term 'service' here includes not only data traffic but also allied properties of that traffic, such as security and latency.
- As an important corollary, the new instance of the application must be declared to be 'ready' only when all its associated state has been relocated. The state may include not only databases and other configuration, but also configuration of networking policies (such as firewall rules or security policies) in the target environment. Otherwise, service continuity cannot be assured.
- If there are several candidates for the target MEC cluster, the final choice must be made based on the following criteria:
  - Proximity/latency from the UE to the cluster. The proximity is defined in terms of latitude/longitude-based geographical distance, while the latency may be measured in transport time/cost.
  - Global resource utilization in the cluster, such as cluster-average CPU or memory usage.
  - Conformance of the cluster to the original EMCO intents for that application's deployment, such as Hardware Platform Awareness (HPA) intents.
- The deployment of the application in the target cluster must be subject to the original EMCO intents, such as network policy intents and Generic Action Controller intents.
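The selection criteria above can be illustrated with a minimal sketch. It assumes one particular way of combining them, which this document does not prescribe: candidates that fail the intents or exceed a utilization ceiling are dropped, and the nearest remaining cluster wins, with utilization as a tie-breaker. Geographical distance uses the standard haversine formula. The names `Candidate`, `choose_target`, and `max_utilization` are hypothetical and are not part of EMCO.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of a candidate target MEC cluster."""
    name: str
    lat: float              # degrees
    lon: float              # degrees
    cpu_utilization: float  # cluster-average, 0.0 to 1.0
    meets_intents: bool     # result of checking the original intents (e.g. HPA)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def choose_target(ue_lat: float, ue_lon: float, candidates: List[Candidate],
                  max_utilization: float = 0.8) -> Optional[Candidate]:
    """Pick the nearest conformant cluster below the utilization ceiling.

    Assumption: filter first, then rank by (distance, utilization).
    Returns None when no candidate qualifies.
    """
    eligible = [c for c in candidates
                if c.meets_intents and c.cpu_utilization <= max_utilization]
    if not eligible:
        return None
    return min(eligible,
               key=lambda c: (haversine_km(ue_lat, ue_lon, c.lat, c.lon),
                              c.cpu_utilization))
```

A weighted score that trades distance against utilization would be an equally valid reading of the requirements; the filter-then-rank form is used here only because it is easy to reason about.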
ARCHITECTURE
Initially, we will focus on the UE roaming use case. The diagram below depicts this scenario:
The broad steps of any solution to the above problem are these:
- A. Listen for the events that can trigger a relocation, e.g., notifications from the 5G core regarding user mobility.
- B. Determine whether to relocate the application.
- C. Determine the ‘best’ target MEC cluster based on the criteria listed in the problem statement:
  - Proximity/latency criteria.
  - Global cluster utilization criteria.
  - Other app-specific criteria expressed as EMCO intents, e.g., HPA criteria for per-microservice CPU/memory/GPU requirements and resource limits.
- D. Perform the relocation.
We will focus on the class of solutions where steps A and B are done by a workflow, step C is performed by EMCO, and step D is done by another workflow that invokes EMCO to do the relocation. These workflows themselves may have been deployed by EMCO.
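The division of labor among steps A through D can be sketched as a small event handler. This is an illustration only: `MobilityEvent` and the three callables are hypothetical placeholders for the decision workflow (step B), the EMCO placement decision (step C), and the relocation workflow (step D). None of them are actual EMCO or 5G core APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MobilityEvent:
    """Hypothetical shape of a trigger received in step A."""
    ue_id: str
    app: str
    source_cluster: str
    new_cell: str


def handle_event(
    event: MobilityEvent,
    should_relocate: Callable[[MobilityEvent], bool],          # step B
    select_target: Callable[[MobilityEvent], Optional[str]],   # step C
    relocate: Callable[[str, str, str], None],                 # step D
) -> Optional[str]:
    """Run steps B to D for one event; return the target cluster, or None."""
    # Step B: decide whether this event warrants a relocation at all.
    if not should_relocate(event):
        return None
    # Step C: ask the placement logic for the best target cluster.
    target = select_target(event)
    if target is None or target == event.source_cluster:
        return None  # no better cluster available, nothing to do
    # Step D: hand over to the workflow that performs the relocation.
    relocate(event.app, event.source_cluster, target)
    return target
```

Step A is whatever listener receives the notification and calls `handle_event`. Passing the three steps in as callables mirrors the split described above, where different components own different steps.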
As Orange indicated in the EMCO TSC presentation, there is an I-UPF that the UE connects to. Beyond that, there are 3 possible deployment scenarios:
- A. There is an anchor UPF in each MEC cluster, to which the I-UPF forwards traffic via an N9 tunnel. This scenario is part of the ETSI standard. Other industry players are also assuming or advocating for this model. We can assume that the anchor UPF exists in each MEC cluster that is a candidate target for relocation.
- B. There is no anchor UPF in each MEC cluster, but the I-UPF is deployed with a Traffic Steering Controller (which could be in the same pod or cluster as the I-UPF, or be in a separate SD-EWAN hub).
- C. There is neither an anchor UPF nor a Traffic Steering Controller.
We will prioritize Scenario A.
Making edge relocations fully transparent to the UE while also preserving service continuity is very difficult to realize. Instead, we will stipulate that:
- Existing TCP connections and PDU sessions from the UE are left intact.
- No new PDU sessions need to be initiated by the UE, though we may modify the existing PDU session, depending on the solution.
- The UE needs to retry or initiate a new TCP connection within the same (possibly modified) PDU session.
OPEN QUESTIONS
- Relocation Decision: Since the UE needs to retry, how do we know whether the UE will do that? In other words, in Step B, how do we decide which UE or which PDU session can be subjected to relocation?
- Traffic Steering: How do we ensure new TCP connections in the same PDU session are directed towards the relocated app?
  - For Scenario B, one option is to set up a DNS cache in the Traffic Steering Controller for each PDU session.
  - For Scenario A, we should investigate whether we can program the I-UPF to update a PDU session such that existing TCP connections stay unaffected while new TCP connections are forwarded to the relocated app. Ideally, this should be implementation-independent. There is some doubt whether Free5GC supports modification of an existing PDU session.
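The steering behavior being asked for (existing TCP connections stay on the old instance, new connections in the same PDU session go to the relocated app) can be modeled in a few lines. This is a pure behavioral model under the assumption that the steering point can identify connections by their 4-tuple. It does not represent any UPF, PFCP, or Free5GC interface, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (src_ip, src_port, dst_ip, dst_port) identifying one TCP connection
FlowKey = Tuple[str, int, str, int]


@dataclass
class PduSessionSteering:
    """Hypothetical per-PDU-session steering state."""
    default_target: str                      # where new connections are sent
    pinned: Dict[FlowKey, str] = field(default_factory=dict)

    def route(self, flow: FlowKey) -> str:
        """Return the cluster for this connection, pinning it on first sight."""
        return self.pinned.setdefault(flow, self.default_target)

    def relocate(self, new_target: str) -> None:
        """Redirect future connections; already pinned ones are untouched."""
        self.default_target = new_target

    def close(self, flow: FlowKey) -> None:
        """Forget a finished connection so a retry is treated as new."""
        self.pinned.pop(flow, None)


if __name__ == "__main__":
    session = PduSessionSteering(default_target="mec-a")
    old_flow = ("10.0.0.5", 40000, "203.0.113.10", 443)
    new_flow = ("10.0.0.5", 40001, "203.0.113.10", 443)
    print(session.route(old_flow))   # pinned to the source cluster
    session.relocate("mec-b")
    print(session.route(old_flow))   # still the source cluster
    print(session.route(new_flow))   # steered to the relocated app
```

Whether a real I-UPF or Traffic Steering Controller can keep this kind of per-connection state is exactly the open question above; the model only pins down the behavior a solution would need to exhibit.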