
Observability



Overview

This page currently captures the observability requirements for the EMCO services (https://gitlab.com/groups/project-emco/-/epics/7).

Front-ending the services with Istio provides a useful set of metrics and tracing, and adding the collectors provided by the Prometheus client library to each service extends that set with other fundamental metrics. The open question is which additional metrics and traces will be useful to EMCO operators.

Metrics

The following items are based on Prometheus recommendations for instrumentation.

Queries, errors, and latency

Istio provides these metrics for both the client and the server side. https://istio.io/latest/docs/reference/config/metrics/

Istio metrics can be customized to include other attributes from Envoy, such as the subject field of the peer certificate. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/advanced/attributes

In-progress requests

These do not appear to be available from Istio; further investigation is required.

Queries, errors, and latencies of resources external to process (network, disk, IPC, etc.)

It is not yet clear which external resources need this coverage. Note that downstream HTTP and gRPC requests are already covered by Istio.

The Prometheus Go client library provides built-in collectors for various process and Go runtime metrics: https://pkg.go.dev/github.com/prometheus/client_golang@v1.12.2/prometheus/collectors.

Internal errors and latency

Internal errors should be counted. It is also desirable to count successes so that an error ratio can be calculated.

Totals of info/error/warning logs

It is unclear whether this is a useful metric.

Any general statistics

This bucket includes EMCO-specific information such as the number of projects and the errors and latency of deployment intent group instantiation. Also consider any cache or thread pool metrics. Feedback is welcome on any general metrics of interest to EMCO operators.

Preliminary guidelines:

  • Distinguish between resources and actions. 
  • Action metrics will record requests, errors, and latency similar to general network requests.
  • Resource metrics will record creation, deletion, and possibly modification.
  • Metrics will be labeled with project, composite-app, deployment intent group, etc.

For rsync specifically, measure health/reachability of target clusters.

Tracing

Istio provides some tracing support, but it appears rudimentary: there are no detailed spans for EMCO-related operations.

Preliminary guidelines:

  • Follow the flow of all external and internal API calls.
  • Filter by caller.

Logging

Each log message must contain a timestamp and information identifying the resource involved; for orchestration, this means the project, composite application, etc.

The priority is placed on error logs; logging other significant actions is secondary.
