Edge Relocation
PROBLEM STATEMENT
There are several situations wherein an edge computing application running in a MEC cluster, which is serving clients over a 5G wireless network, needs to be relocated to another MEC cluster, i.e., a new instance of that application is deployed in the target MEC cluster without necessarily disrupting the application instance in the source cluster. Some examples of such situations are these:
- A User Equipment (UE) that is communicating with an edge application running in a MEC cluster roams to a different cell tower, and there is another MEC cluster that offers lower latency in the new location than the source MEC cluster.
- A new UE opens a PDU session from a cell tower where another MEC cluster can offer lower latency.
- The source MEC cluster needs to be brought down for maintenance.
- The source MEC cluster is approaching its capacity limits.
Note that an application may be composed of multiple microservices defined together in a Helm chart. The application may be part of a composite application, whose component applications may be running in different clusters, including public/private clouds, telco/co-located edge clusters, and on-prem data centers.
We take it as a given that the composite application, of which the relocatable application is a part, was deployed by EMCO.
When the application is relocated, the following requirements must be met:
- Service continuity must be assured to the UE, i.e., the UE should not suffer any loss of service unless it by itself triggers a change, such as a connection/session reset. The term 'service' here includes not only data traffic but also allied features such as security and latency of said traffic.
- As an important corollary, the new instance of the application must be declared to be 'ready' only when all its associated state has been relocated. The state may include not only databases and other configuration, but also configuration of networking policies (such as firewall rules or security policies) in the target environment. Otherwise, service continuity cannot be assured.
- If there are several candidates for the target MEC cluster, the final choice must be made based on the following criteria:
  - Proximity/latency from the UE to the cluster. Proximity is defined in terms of latitude/longitude-based geographical distance, while latency may be measured in transport time/cost.
  - Global resource utilization in the cluster, such as cluster-average CPU or memory usage.
  - Conformance of the cluster to the original EMCO intents for that application's deployment, such as Hardware Platform Awareness (HPA) intents.
- The deployment of the application in the target cluster must be subject to the original EMCO intents, such as network policy intents and Generic Action Controller intents.
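The selection criteria above can be illustrated with a minimal sketch. It assumes that intent conformance has already been evaluated elsewhere and reduced to a boolean, that utilization is a single cluster-average CPU fraction, and that proximity is plain great-circle (haversine) distance. The `Cluster` type, the `pick_target` function, and the `max_utilization` threshold are hypothetical names chosen for illustration; they are not part of EMCO.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Cluster:
    """Hypothetical view of a candidate MEC cluster."""
    name: str
    lat: float
    lon: float
    cpu_utilization: float  # cluster-average, 0.0 to 1.0
    meets_intents: bool     # outcome of an intent check done elsewhere


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def pick_target(ue_lat: float, ue_lon: float,
                clusters: Iterable[Cluster],
                max_utilization: float = 0.8) -> Optional[Cluster]:
    """Filter by conformance and utilization, then pick the nearest cluster.

    Ties on distance are broken by lower utilization.
    """
    eligible = [c for c in clusters
                if c.meets_intents and c.cpu_utilization <= max_utilization]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda c: (haversine_km(ue_lat, ue_lon, c.lat, c.lon),
                       c.cpu_utilization),
    )
```

Treating conformance and utilization as hard filters and distance as the ranking key is only one possible design; a weighted score across all three criteria would be an equally valid reading of the requirement.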
ARCHITECTURE
Initially, we will focus on the UE roaming use case. The diagram below depicts this scenario:
The broad steps of any solution to the above problem are these:
- A. Listen for the events that can trigger a relocation, e.g., notifications from the 5G core regarding user mobility, a new user or PDU session, or application performance issues.
- B. Determine whether to relocate the application.
- C. Determine the "best" target MEC cluster based on many criteria, as stated in the problem description.
  - Proximity/latency criteria.
  - Global cluster utilization criteria.
  - Other app-specific criteria expressed as EMCO intents, e.g., HPA criteria for per-microservice CPU/memory/GPU and imposing resource limits.
- D. Perform the relocation.
We will focus on the class of solutions where step A is done by a workflow or by EMCO, step B is done by a workflow, step C is performed by EMCO, and step D is done by another workflow that invokes EMCO to do the relocation. The said workflows themselves may have been deployed by EMCO.
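The division of labor across steps A to D can be sketched as a small pipeline in which each step is a pluggable callable. This is an illustrative skeleton only: `TriggerEvent`, `handle_event`, and `process_events` are hypothetical names, and the callables stand in for the workflows and EMCO controllers described above rather than for any real EMCO interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class TriggerEvent:
    """Hypothetical event delivered by step A."""
    kind: str   # e.g. "ue_mobility", "new_pdu_session", "app_performance"
    ue_id: str
    app: str


def handle_event(
    event: TriggerEvent,
    should_relocate: Callable[[TriggerEvent], bool],        # step B (workflow)
    select_target: Callable[[TriggerEvent], Optional[str]],  # step C (EMCO)
    relocate: Callable[[str, str], None],                    # step D (workflow -> EMCO)
) -> Optional[str]:
    """Run steps B, C, and D for one event; return the chosen cluster, if any."""
    if not should_relocate(event):
        return None
    target = select_target(event)
    if target is None:  # no suitable cluster was found
        return None
    relocate(event.app, target)
    return target


def process_events(
    events: Iterable[TriggerEvent],
    should_relocate: Callable[[TriggerEvent], bool],
    select_target: Callable[[TriggerEvent], Optional[str]],
    relocate: Callable[[str, str], None],
) -> List[str]:
    """Step A stand-in: consume events and collect the relocation targets."""
    targets = []
    for event in events:
        target = handle_event(event, should_relocate, select_target, relocate)
        if target is not None:
            targets.append(target)
    return targets
```

Keeping each step behind its own callable mirrors the idea that the steps may be implemented by different components (workflows or EMCO controllers) and swapped independently.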
As Orange indicated in the EMCO TSC presentation, there is an I-UPF that the UE connects to. Beyond that, there are 3 possible deployment scenarios:
- A. There is an anchor UPF in each MEC cluster, to which the I-UPF forwards traffic via an N9 tunnel. This scenario is part of the ETSI standard. Other industry players are also assuming or advocating for this model. We can assume that the anchor UPF exists in each MEC cluster that is a candidate target for relocation.
- B. There is no anchor UPF in each MEC cluster, but the I-UPF is deployed with a Traffic Steering Controller (which could be in the same pod or cluster as the I-UPF, or be in a separate SD-EWAN hub).
- C. There is neither an anchor UPF nor a Traffic Steering Controller.
We will prioritize Scenario A.
Making edge relocations fully transparent to the UE while preserving service continuity is very difficult to realize. Rather, we will stipulate that:
- Existing TCP connections and PDU sessions from the UE are left intact.
- No new PDU sessions need to be initiated by the UE, though we may modify the existing PDU session, depending on the solution.
- The UE needs to retry or initiate a new TCP connection within the same (possibly modified) PDU session.
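From the UE side, the last stipulation amounts to retrying with a fresh name lookup on each attempt, so that a new connection can land on the relocated instance. The following is a minimal client-side sketch under that assumption. The function names are hypothetical, and the code only avoids caching inside the process; resolver caches in the operating system or network are outside its control.

```python
import socket
from typing import Callable, List, Tuple

Address = Tuple[str, int]


def resolve_fresh(host: str, port: int) -> List[Address]:
    """Resolve the name at call time; nothing is cached in this process."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(info[4][0], info[4][1]) for info in infos]


def connect_with_retry(
    host: str,
    port: int,
    resolve: Callable[[str, int], List[Address]] = resolve_fresh,
    connect: Callable = socket.create_connection,
    attempts: int = 3,
    timeout: float = 2.0,
):
    """Open a new TCP connection, re-resolving the name on every attempt."""
    last_error = None
    for _ in range(attempts):
        try:
            candidates = resolve(host, port)
        except OSError as exc:
            last_error = exc
            continue
        for address in candidates:
            try:
                return connect(address, timeout)
            except OSError as exc:  # instance gone or unreachable; try next
                last_error = exc
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

The `resolve` and `connect` parameters are injectable mainly so the retry logic can be exercised without a network.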
5GFF APIs
The 5G Future Forum APIs, a.k.a. Edge Discovery Service APIs, are defined by Verizon for AWS Wavelength zones.
Resources: API definition, description of APIs
| Use Case | API Call Sequence | EMCO Involvement |
|---|---|---|
| App Orchestrator registers a 'service profile' with VZ for application to be launched. | See Create-service-profile API. Service Profile will take in the resource needs of the application such as memory, GPU needs & performance needs of the application such as latency. NOTE: CPUs are not mentioned in the API. Needs clarification. | EMCO will do this. EMCO's HPA intents provide only the |
| Determine the 'best' MEC cluster (for initial application deployment) | Task 1: Case A. The app provider only knows which region to deploy, but not the location of the clients. Query available regions with Get-Regions API and then the MEC clusters within those regions with Get-MEC-Platforms API. | Some agent, such as a workflow, could call these APIs and onboard the discovered clusters into EMCO. Also, EMCO intents, such as cluster labels and HPA, could be used to further narrow the candidate clusters. We need a way to get the cluster candidates returned by 5GFF APIs into EMCO DIG intents. |
| | Task 1: Case B. The app provider knows the location of their clients. Get list of Public MEC zones for those locations and desired UE density using Get-MEC-Platforms API. | |
| | Task 2: Launch app in the selected clusters. | EMCO will do this. |
| | Task 3: Update the service registry with the service endpoints of each app instance. Invoke | EMCO will do this. |
| 'Optimize' the application deployment, i.e., Edge Relocation. | (These are the tasks from the Architecture section above.) Task A: Measure application performance issues and/or listen for 5G user mobility. | A workflow will do this. |
| | Task B: Determine whether to relocate the app. | A workflow will do this. |
| | Task C1: Get a new list of suitable MEC clusters. Invoke Get-MEC-Platforms API with region name, service profile ID, subscriber density and/or UE identity. | A new EMCO controller will do this. |
| | Tasks C2, C3: Narrow the list of MEC cluster candidates using other criteria, and pick one target cluster. | EMCO controllers will do this. |
| | Task D: Perform the relocation. After each operation (move, terminate, create new app instance), update the Service Registry using | A workflow will call EMCO to do this and then call the 5GFF APIs. |
| Discover the 'closest' app instance to a client/UE. | The app orchestrator can determine the closest MEC cluster for the client and tell the client which app instance to talk to. (Or the client can call the 5GFF APIs too.) Invoke Get-MEC-Platforms API with UE identity. The identity types can be IP Address, MSISDN, IMEI, MDN, or GPSI. | EMCO can do this. |
SUGGESTIONS
- Initial version: focus on distance as the metric for latency. In the future, we could add other metrics.
OPEN QUESTIONS
- Development Process: Should we move the current 5GFF API implementation into an open environment, where all folks can contribute? Or possibly reimplement the APIs in an open forum?
- MEC Cluster Selection Criteria: Need to add K8s version of target cluster? (See KubeCon talk on x-cloud db migration.)
  - Need to consider application compatibility with target MEC cluster. In the EMCO relocation intent, we can state the range of K8s versions that the app is compatible with.
  - Latitude-longitude distance is one way of measuring the cost/latency of a UE-to-MEC-cluster pair. The general problem formulation would involve a bipartite graph of cells and MEC clusters, where nodes may be added/deleted and edge weights may be dynamically updated.
- Step A: Should Step A be done in EMCO or in a workflow? EMCO could be an AF in the 5GC architecture. RBAC considerations?
  - It is possible to subscribe to any event coming from the AMF, and a third party such as EMCO can do so.
- Relocation Decision: Since the UE needs to start a new connection and not cache previous DNS lookups, how do we know whether the UE will do that? In other words, in Step B, how do we decide which UE or which PDU session can be subjected to relocation?
  - Consider service continuity at 2 levels: PDU session (network level) and app level (TCP reconnection). App level may be outside our scope? We can focus on network level alone.
- DNS: Today we update the app's DNS record in PowerDNS. That has 2 problems:
  - A DNS update for the app implies that all UEs will be diverted to the relocated app. That is not desirable. We want to divert specific UEs to the relocated app while other UEs continue to connect to the existing app.
  - Assuming DNS updates are ok: PowerDNS will not let a new entry take effect while the old entry exists. In other words, the old app instance has to die before the new entry will take effect. How do we handle that?
- Traffic Steering: How do we ensure new TCP connections in the same PDU session are directed towards the relocated app?
  - For Scenario B, one option is to set up a DNS cache in the TSC for each PDU session.
  - For Scenario A, we should investigate whether we can program the I-UPF to update a PDU session such that existing TCP connections stay unaffected while new TCP connections are forwarded to the relocated app. Ideally, this should be implementation-independent. There is some doubt whether Free5GC supports modification of an existing PDU session.
- Connectivity across clusters. Submariner (already done by Orange) vs Cilium vs others. Does it offer a better solution than TSC?
- Load Balancer-based approach: where would that fit in a multi-provider model?
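The bipartite formulation raised under MEC Cluster Selection Criteria can be sketched as a small data structure in which cells and MEC clusters are the two node sets and each edge carries a cost that may change over time. The class and method names below are hypothetical and the cost unit is left open (distance, measured latency, or a transport cost); this is a sketch of the idea, not a proposed EMCO design.

```python
from typing import Dict, Optional, Set


class CellClusterGraph:
    """Bipartite graph: cells on one side, MEC clusters on the other."""

    def __init__(self) -> None:
        self._edges: Dict[str, Dict[str, float]] = {}  # cell -> cluster -> cost
        self._clusters: Set[str] = set()

    def add_cell(self, cell: str) -> None:
        self._edges.setdefault(cell, {})

    def add_cluster(self, cluster: str) -> None:
        self._clusters.add(cluster)

    def remove_cell(self, cell: str) -> None:
        self._edges.pop(cell, None)

    def remove_cluster(self, cluster: str) -> None:
        """Delete the cluster and every edge that points at it."""
        self._clusters.discard(cluster)
        for costs in self._edges.values():
            costs.pop(cluster, None)

    def set_cost(self, cell: str, cluster: str, cost: float) -> None:
        """Create or update the edge weight for a (cell, cluster) pair."""
        if cost < 0:
            raise ValueError("cost must be non-negative")
        self.add_cell(cell)
        self.add_cluster(cluster)
        self._edges[cell][cluster] = cost

    def best_cluster(self, cell: str) -> Optional[str]:
        """Return the lowest-cost cluster reachable from the cell, if any."""
        costs = self._edges.get(cell, {})
        if not costs:
            return None
        return min(costs, key=costs.get)
```

A brief usage example: after `set_cost("cell-1", "mec-a", 12.0)` and `set_cost("cell-1", "mec-b", 7.5)`, `best_cluster("cell-1")` returns `"mec-b"`; updating the first edge to `3.0` makes it return `"mec-a"`.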
MEETING MINUTES
Please see Edge Relocation WG Meeting Minutes.
WEEKLY CALLS
EMCO Edge Relocation WG
Repeats: Weekly on Wednesdays at 15:00 UTC
Ways to join meeting:
1. Join from PC, Mac, iPad, or Android
Meeting link: https://zoom-lfx.platform.linuxfoundation.org/meeting/96320891634
You may be asked to register for the meeting. If so, you must register to join.
To keep this meeting secure, do not share this link!
2. Join via audio
One tap mobile:
US: +12532158782,,96320891634# or +13462487799,,96320891634#
Or dial:
US: +1 253 215 8782 or +1 346 248 7799 or +1 669 900 6833 or +1 301 715 8592 or +1 312 626 6799 or +1 646 374 8656 or 877 369 0926 (Toll Free) or 855 880 1246 (Toll Free)
Canada: +1 647 374 4685 or +1 647 558 0588 or +1 778 907 2071 or +1 204 272 7920 or +1 438 809 7799 or +1 587 328 1099 or 855 703 8985 (Toll Free)
Meeting ID: 96320891634
Meeting Passcode: 195570