2020-10-14 - CNTT - Telemetry/Observability architecture - a path to full automation
Topic Leader(s)
@Sukhdev Kapur (Juniper)
@Zlatko Dukic (DT)
@Walter.kozlowski (Telstra)
@Trevor Cooper (Intel)
@Pankaj.Goyal (AT&T)
@Marcel Wiget (Deactivated) (Juniper)
@Petar Torre (Deactivated) (Intel)
@Al Morton (AT&T)
@Sunku Ranganath (Deactivated) (Intel)
@Tomas Fredberg [Ericsson] (Ericsson)
@Parth yadav
Topic Overview
In this session we will present an architecture for the Telemetry/Observability of VNFs and Cloud Infrastructure (NFVI) that aims at full automation, leading to zero-touch operation.
Slides & Recording
Minutes
How do you ensure that your large scale collection of data doesn't affect the system being observed?
Answer: We use the data that the system already produces. It is kept locally and transmitted periodically, every few hours, or not at all. Debug-level information is usually not produced all the time because it impacts system functionality, so if it is needed we have to perform an online configuration change to raise the verbosity level on the observed system(s). Our main concern is to collect already existing data in a way that does not impact functionality, and to concentrate it in a central place where it can be analyzed, whether by humans or by machines.
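The collection pattern described in this answer (keep records locally, forward them in batches at a long interval, raise verbosity only on demand) can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the class `BufferedForwarder`, the `send` callable, the three-hour default interval, the buffer size and the helper `set_verbosity` are all hypothetical names and values chosen for the example.

```python
import logging
import time
from collections import deque


class BufferedForwarder:
    """Keep records locally and forward them in batches at a fixed interval."""

    def __init__(self, send, interval_s=3 * 3600, max_records=10_000,
                 clock=time.monotonic):
        self._send = send              # hypothetical callable that ships a batch centrally
        self._interval_s = interval_s  # assumed default: every three hours
        self._clock = clock            # injectable clock, useful for testing
        self._buffer = deque(maxlen=max_records)  # oldest records are dropped when full
        self._last_flush = clock()

    def record(self, item):
        """Store one already-produced record locally; no network I/O happens here."""
        self._buffer.append(item)

    def maybe_flush(self):
        """Send the buffered batch only if the interval has elapsed.

        Returns the number of records sent (0 if nothing was sent).
        """
        now = self._clock()
        if now - self._last_flush < self._interval_s or not self._buffer:
            return 0
        batch = list(self._buffer)
        self._send(batch)
        self._buffer.clear()
        self._last_flush = now
        return len(batch)


def set_verbosity(logger_name, debug):
    """Raise or lower a logger's level at runtime, without restarting the process."""
    level = logging.DEBUG if debug else logging.INFO
    logging.getLogger(logger_name).setLevel(level)
```

Because `record` only appends to a bounded in-memory buffer, the cost on the observed system stays small and predictable; the network transfer happens only when `maybe_flush` finds that the interval has elapsed.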
In your management system what component is making the autonomic decisions?
Answer: There are multiple components, and this analytical system is implemented in-house. Roughly, it derives meaningful information from raw data: logs, alarms, events, resource metrics, KPIs, and so on. Analysis is performed both in a deterministic way, i.e. statistically (detection of outliers), and in a non-deterministic way, using Deep Learning RNNs (LSTMs, to be precise) trained in a supervised way. Once sequences have been detected and classified (think of it as anomaly signature recognition), those anomalies are correlated, i.e. the relations between them are detected, in order to arrive at a final “idea” of what is wrong in the observed system. This final idea of what is wrong, which we call an “incident”, can be enriched with instructions about what needs to be done. These instructions are normally targeted at the Orchestrator system, which carries them out. To make the decision even better informed, we also take network traffic into account: we first decode it, isolate the messages that belong to a particular call, and then, using the “collective” knowledge of our call-flow experts built into the system (supervised learning), provide information about what is wrong at the call level. This information may or may not be incorporated into the final decision about the “incident”.
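The deterministic part of this pipeline (statistical outlier detection, followed by correlation of anomalies into an “incident”) can be sketched as below. This is only an illustration under stated assumptions: the z-score test, the time-window grouping rule, the thresholds, and the names `Anomaly`, `Incident`, `detect_outliers` and `correlate` are hypothetical and are not taken from the presented system. The LSTM-based classification and the call-level traffic analysis are not covered.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev


@dataclass
class Anomaly:
    source: str       # e.g. a metric name or alarm type (hypothetical labels)
    timestamp: float  # seconds
    score: float      # how unusual the observation is


@dataclass
class Incident:
    anomalies: list = field(default_factory=list)
    instruction: str = ""  # optional hint for an orchestrator (placeholder)


def detect_outliers(source, samples, threshold=3.0):
    """Flag samples whose z-score magnitude exceeds the threshold.

    samples is a list of (timestamp, value) pairs.
    """
    values = [v for _, v in samples]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [
        Anomaly(source, t, (v - mu) / sigma)
        for t, v in samples
        if abs(v - mu) / sigma > threshold
    ]


def correlate(anomalies, window_s=60.0):
    """Group anomalies into incidents when they occur close together in time.

    Assumption: an anomaly joins the current incident if it follows the
    previous anomaly by at most window_s seconds.
    """
    incidents = []
    for a in sorted(anomalies, key=lambda x: x.timestamp):
        if incidents and a.timestamp - incidents[-1].anomalies[-1].timestamp <= window_s:
            incidents[-1].anomalies.append(a)
        else:
            incidents.append(Incident(anomalies=[a]))
    return incidents
```

In such a sketch, anomalies from different sources (for example a CPU metric and an alarm stream) would be passed together to `correlate`, and each resulting `Incident` could then be annotated with an instruction before being handed to the Orchestrator.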
Can/Will the presented slides be put here?
Answer: Gladly, very soon.
Update 2020-12-14: My apologies, due to administrative challenges this could not take place earlier. Please find the presentation here.