
EMCO Resilience

Database persistence

Database persistence is not enabled by default today. We need to validate EMCO with persistence enabled.

  • We have tested with an NFS-based PV in the past, and we still have the NFS-related YAMLs. If there is consensus on NFS as the default storage, we need to set up an NFS environment.
  • With persistence enabled, we can no longer rely on developer-oriented troubleshooting and workarounds that re-install EMCO to blow away the db. Developers should also
    test with persistence enabled.

Recovery from crashes/disruptions

The scenarios that need to be validated:

  • Restart each microservice while it is processing a request
  • In particular: restart the orchestrator while a DIG instantiate request is in flight
  • Restart all microservices together
  • Restart the node on which the EMCO pods are running (assuming a single node for now)

rsync can restart after a crash. As part of its EMCO backup/restore presentation, Aarna has tested blowing away the EMCO namespace (including the EMCO pods and db) and restoring it.

Graceful handling of cluster connectivity failure

When the GitOps model is not used, rsync should apply configurable retry/timeout policies to handle loss of connectivity to a cluster. We have the
/projects/.../{dig}/stop API, but that is only a workaround: the user needs to invoke that API manually.

Question: can we recommend the GitOps approach and leave things as is? If not, we need to fix this.

