Overview
This page currently captures requirements for the observability of the EMCO services (https://gitlab.com/groups/project-emco/-/epics/7).
Front-ending the services with Istio provides a useful set of metrics and tracing, and adding the collectors provided by the Prometheus library to each service extends these with other fundamental metrics. The open question is which additional metrics and tracing will be useful to EMCO operators.
Metrics
The following items are based on Prometheus recommendations for instrumentation.
Queries, errors, and latency
Istio provides these metrics for both the client side and the server side. https://istio.io/latest/docs/reference/config/metrics/
Istio metrics can be customized to include other attributes from Envoy, such as the subject field of the peer certificate. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/advanced/attributes
Example PromQL
Service | Type | PromQL | Notes |
---|---|---|---|
HTTP | Queries | sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator"}[5m])) | inbound |
HTTP | Queries | sum(irate(istio_requests_total{reporter="source",source_workload="services-orchestrator"}[5m])) by (destination_workload) | outbound |
HTTP | Errors | sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator",response_code=~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator"}[5m])) | inbound |
HTTP | Errors | sum(irate(istio_requests_total{reporter="source",source_workload=~"services-orchestrator",response_code=~"5.*"}[5m])) by (destination_workload) / sum(irate(istio_requests_total{reporter="source",source_workload=~"services-orchestrator"}[5m])) by (destination_workload) | outbound |
HTTP | Latency | histogram_quantile(0.90, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_workload="services-orchestrator"}[1m])) by (le)) / 1000 | P90 (seconds) |
In-progress requests
These do not appear to be available with Istio; further investigation is required.
Queries, errors, and latencies of resources external to process (network, disk, IPC, etc.)
Unsure which external resources would need this coverage at this time. Note that downstream HTTP and gRPC requests are provided by Istio.
The Prometheus golang library provides built-in collectors for various process and golang metrics: https://pkg.go.dev/github.com/prometheus/client_golang@v1.12.2/prometheus/collectors. A list of metrics provided by cAdvisor is at https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md.
Internal errors and latency
Internal errors should be counted. It is also desirable to count successes so that an error ratio can be calculated.
Totals of info/error/warning logs
Unsure if this is a useful metric.
Any general statistics
This bucket includes EMCO-specific information such as the number of projects and the errors and latency of deployment intent group instantiation. Also consider any cache or thread pool metrics. Feedback is welcome on any general metrics of interest to EMCO operators.
Preliminary guidelines:
- Distinguish between resources and actions.
- Action metrics will record requests, errors, and latency similar to general network requests.
- Resource metrics will record creation, deletion, and possible modification.
- Metrics will be labeled with project, composite-app, deployment intent group, etc.
For rsync specifically, measure health/reachability of target clusters.
Tracing
Istio provides a starting point for tracing by creating a trace for each request in the sidecars. This is insufficient, however, because the trace does not include the outgoing requests made while serving an inbound request. What we'd like to see is a complete trace of, for example, an instantiate request to the orchestrator that includes the requests made to any controllers, etc.
In order to do this it is necessary to pass the tracing headers from the inbound request through to any outbound requests. This will be done with the https://opentelemetry.io/ golang libraries.
Logging
Each log message must contain a timestamp and information identifying the resource, such as the project and composite application in the case of orchestration.
The priority is placed on error logs; logging other significant actions is secondary.