Overview
This page currently captures requirements for the observability of the EMCO services (https://gitlab.com/groups/project-emco/-/epics/7).
...
Istio metrics can be customized to include other attributes from Envoy, such as the subject field of the peer certificate: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/advanced/attributes
In-progress requests
These do not appear to be available with Istio; further investigation is required.
Example PromQL
Service | Type | PromQL | Notes |
---|---|---|---|
HTTP/gRPC* | Queries | sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator"}[5m])) | inbound |
HTTP/gRPC* | Queries | sum(irate(istio_requests_total{reporter="source",source_workload="services-orchestrator"}[5m])) by (destination_workload) | outbound |
HTTP/gRPC* | Errors | sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator",response_code=~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator"}[5m])) | inbound |
HTTP/gRPC* | Errors | sum(irate(istio_requests_total{reporter="source",source_workload=~"services-orchestrator",response_code=~"5.*"}[5m])) by (destination_workload) / sum(irate(istio_requests_total{reporter="source",source_workload=~"services-orchestrator"}[5m])) by (destination_workload) | outbound |
HTTP/gRPC* | Latency | histogram_quantile(0.90, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_workload="services-orchestrator"}[1m])) by (le)) / 1000 | P90 (seconds) |
HTTP/gRPC* | Saturation | | |

*The request_protocol label can be used to distinguish between HTTP and gRPC.
Queries, errors, and latencies of resources external to process (network, disk, IPC, etc.)
...
The Prometheus Go client library provides built-in collectors for various process and Go runtime metrics: https://pkg.go.dev/github.com/prometheus/client_golang@v1.12.2/prometheus/collectors. A list of metrics provided by cAdvisor is at https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md. Additional K8s-specific metrics can be enabled with the https://github.com/kubernetes/kube-state-metrics project.
Example PromQL
Note: some of these require that kube-state-metrics is also deployed.
Pod Resource | Type | PromQL |
---|---|---|
CPU | Utilization | sum(rate(container_cpu_usage_seconds_total{namespace="emco"}[5m])) by (pod) |
CPU | Saturation | sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="emco"}[5m])) by (pod) |
CPU | Errors | |
Memory | Utilization | sum(container_memory_working_set_bytes{namespace="emco"}) by (pod) |
Memory | Saturation | sum(container_memory_working_set_bytes{namespace="emco"}) by (pod) / sum(kube_pod_container_resource_limits{namespace="emco",resource="memory",unit="byte"}) by (pod) |
Memory | Errors | |
Disk | Utilization | sum(irate(container_fs_reads_bytes_total{namespace="emco"}[5m])) by (pod, device) |
Disk | Utilization | sum(irate(container_fs_writes_bytes_total{namespace="emco"}[5m])) by (pod) |
Disk | Saturation | |
Disk | Errors | |
Network | Utilization | sum(rate(container_network_receive_bytes_total{namespace="emco"}[1m])) by (pod) |
Network | Utilization | sum(rate(container_network_transmit_bytes_total{namespace="emco"}[1m])) by (pod) |
Network | Saturation | |
Network | Errors | sum(container_network_receive_errors_total{namespace="emco"}) by (pod) |
Network | Errors | sum(container_network_transmit_errors_total{namespace="emco"}) by (pod) |
Internal errors and latency
...
This bucket includes EMCO-specific information such as the number of projects and the errors and latency of deployment intent group instantiation. Cache and thread pool metrics should also be considered. Feedback is welcome on any general metrics of interest to EMCO operators.
Preliminary guidelines:
- Distinguish between resources and actions.
- Action metrics will record requests, errors, and latency similar to general network requests.
- Resource metrics will record creation, deletion, and possible modification.
- Metrics will be labeled with project, composite-app, deployment intent group, etc.
For rsync specifically, measure health/reachability of target clusters.
Also, keep in mind this cautionary note from the Prometheus project:
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
However, note that well-known projects such as Istio and kube-state-metrics appear to disregard this advice, so the motivation behind the note may warrant further investigation.
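To make the cardinality concern concrete: the worst-case number of time series for a single metric is the product of the number of distinct values each label can take. The Python sketch below computes that bound. The label names follow the guidelines above, but the value counts are made-up numbers for illustration, not measurements from an EMCO deployment.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_value_counts.values())


# Hypothetical counts, for illustration only.
counts = {
    "project": 10,
    "composite_app": 5,
    "deployment_intent_group": 4,
    "action": 9,
}

print(max_series(counts))  # 10 * 5 * 4 * 9 = 1800
```

Adding a single unbounded label, such as a request ID, multiplies this bound by the number of values observed over time.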
Preliminary metrics
This section contains some of the considerations of the guidelines above applied to the orchestrator service.
The actions of a service can be identified from the gRPC requests and HTTP lifecycle requests:
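Service | Action |
---|---|
orchestrator | approve |
orchestrator | instantiate |
orchestrator | migrate |
orchestrator | rollback |
orchestrator | stop |
orchestrator | terminate |
orchestrator | update |
orchestrator | StatusRegister |
orchestrator | StatusDeregister |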
The requests, errors, and latency can be modeled after Istio's istio_requests_total and istio_request_duration_milliseconds, with an additional action name label.
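As a rough illustration of the shape such metrics could take, the Python sketch below keeps a request counter and a latency histogram keyed by action and response code, and renders them in the Prometheus text exposition format. The metric names `emco_requests_total` and `emco_request_duration_milliseconds`, the bucket boundaries, and the `ActionMetrics` class are assumptions made for this example; a real service would more likely use a Prometheus client library.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds.
BUCKETS_MS = (10, 100, 1000, float("inf"))


class ActionMetrics:
    """In-memory request counter and latency histogram per (action, code)."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.buckets = defaultdict(lambda: [0] * len(BUCKETS_MS))
        self.duration_sum = defaultdict(float)

    def observe(self, action, response_code, duration_ms):
        key = (action, str(response_code))
        self.requests[key] += 1
        self.duration_sum[key] += duration_ms
        # Histogram buckets are cumulative: increment every bucket
        # whose upper bound is at least the observed duration.
        for i, bound in enumerate(BUCKETS_MS):
            if duration_ms <= bound:
                self.buckets[key][i] += 1

    @staticmethod
    def _labels(key):
        action, code = key
        return f'action="{action}",response_code="{code}"'

    def render(self):
        keys = sorted(self.requests)
        # All samples of one metric are emitted together, counter first.
        lines = [
            f"emco_requests_total{{{self._labels(k)}}} {self.requests[k]}"
            for k in keys
        ]
        name = "emco_request_duration_milliseconds"
        for k in keys:
            labels = self._labels(k)
            for bound, count in zip(BUCKETS_MS, self.buckets[k]):
                le = "+Inf" if bound == float("inf") else str(bound)
                lines.append(f'{name}_bucket{{{labels},le="{le}"}} {count}')
            lines.append(f"{name}_sum{{{labels}}} {self.duration_sum[k]}")
            lines.append(f"{name}_count{{{labels}}} {self.requests[k]}")
        return "\n".join(lines)


metrics = ActionMetrics()
metrics.observe("instantiate", 200, 42.0)
metrics.observe("instantiate", 500, 1500.0)
print(metrics.render())
```

With an `action` label in place, the error and latency queries shown for Istio could be reused by substituting the metric names and grouping by `action`.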
The resources of a service can be identified from the HTTP resources. The initial labels can be the URL parameters.
Service | Resource | Labels |
---|---|---|
orchestrator | controller | name |
orchestrator | project | name |
orchestrator | compositeApp | version, name, project |
orchestrator | app | name, composite_app_version, composite_app, project |
orchestrator | dependency | name, app, composite_app_version, composite_app, project |
orchestrator | compositeProfile | name, composite_app_version, composite_app, project |
orchestrator | appProfile | name, composite_profile, composite_app_version, composite_app, project |
orchestrator | deploymentIntentGroup | name, composite_app_version, composite_app, project |
orchestrator | genericPlacementIntent | name, deployment_intent_group, composite_app_version, composite_app, project |
orchestrator | genericAppPlacementIntent | name, generic_placement_intent, deployment_intent_group, composite_app_version, composite_app, project |
orchestrator | groupIntent | name, deployment_intent_group, composite_app_version, composite_app, project |
dcm | emco_logical_cloud_resource | project, name, namespace, status |
clm | emco_cluster_provider_resource | name |
clm | emco_cluster_resource | name, clusterprovider |
ncm | emco_cluster_network_resource | clusterprovider, cluster, name, cnitype |
ncm | emco_cluster_provider_network_resource | clusterprovider, cluster, name, cnitype, nettype, vlanid, providerinterfacename, logicalinterfacename, vlannodeselector |
dtc | emco_dig_traffic_group_intent_resource | name, project, composite_app, composite_app_version, dig |
dtc | emco_dig_inbound_intent_resource | name, project, composite_app, composite_app_version, dig, traffic_group_intent, spec_app, app_label, serviceName, externalName, port, protocol, externalSupport, serviceMesh, sidecarProxy, tlsType |
dtc | emco_dig_inbound_intent_client_resource | name, project, composite_app, composite_app_version, dig, traffic_group_intent, inbound_intent, spec_app, app_label, serviceName |
dtc | emco_dig_inbound_intent_client_access_point_resource | name, project, composite_app, composite_app_version, dig, traffic_group_intent, inbound_intent, client_name, action |
ovnaction | emco_network_controller_intent_resource | name, project, composite_app, composite_app_version, dig |
ovnaction | emco_workload_intent_resource | name, project, composite_app, composite_app_version, dig, network_controller_intent, app_label, workload_resource, type |
ovnaction | emco_workload_interface_intent_resource | name, project, composite_app, composite_app_version, dig, network_controller_intent, workload_intent, interface, network_name, default_gateway, ip_address, mac_address |
The metrics for these resources should capture the state of the resource, i.e. metrics for creation, deletion, etc. (emco_controller_creation_timestamp, emco_controller_deletion_timestamp, etc.) as described in the guidelines. This approach is suggested as it is unclear how to apply metrics capturing resource utilization to these resources.
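One way to picture these state metrics is as gauges whose value is a Unix timestamp, similar in spirit to the `kube_pod_created` metric of kube-state-metrics. The Python sketch below is illustrative only: the `ResourceState` class is hypothetical, the example timestamps are arbitrary, and only the `name` label of the controller resource is shown.

```python
class ResourceState:
    """Tracks creation and deletion times (Unix seconds) of named resources."""

    def __init__(self, resource):
        self.resource = resource
        self.created = {}
        self.deleted = {}

    def on_create(self, name, timestamp):
        self.created[name] = timestamp
        # A re-created resource is no longer considered deleted.
        self.deleted.pop(name, None)

    def on_delete(self, name, timestamp):
        self.deleted[name] = timestamp

    def render(self):
        """Render the gauges in the Prometheus text exposition format."""
        lines = []
        for event, times in (("creation", self.created), ("deletion", self.deleted)):
            for name, ts in sorted(times.items()):
                lines.append(
                    f'emco_{self.resource}_{event}_timestamp{{name="{name}"}} {ts}'
                )
        return "\n".join(lines)


controllers = ResourceState("controller")
controllers.on_create("rsync", 1650000000)
controllers.on_delete("rsync", 1650003600)
print(controllers.render())
```

How long a deletion timestamp should be retained after the resource is gone is left open here, since every retained name remains a separate time series.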
The status of a deployment intent group deserves special consideration. The suggested approach is to support the labels necessary to execute queries equivalent to those shown in EMCO Status Queries. This would enable alerting on the various states of the resources composing a deployment intent group.
Metric | Type | Description | Labels |
---|---|---|---|
emco_deployment_intent_group_resource | GAUGE | 0 or 1 | project, composite_app, composite_app_version, composite_profile, name, deployed_status, ready_status, app, cluster_provider, cluster, connectivity, resource_gvk, resource, resource_deployed_status, resource_ready_status |
The deployment intent group shown in Example query - status=deployed would create the following metrics:
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="ConfigMap.v1",resource="firewall-scripts-configmap",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="Deployment.v1.apps",resource="fw0-firewall",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge02",connectivity="available",resource_gvk="ConfigMap.v1",resource="firewall-scripts-configmap",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge02",connectivity="available",resource_gvk="Deployment.v1.apps",resource="fw0-firewall",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="packetgen",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="Deployment.v1.apps",resource="fw0-packetgen",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="packetgen",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="ConfigMap.v1",resource="packetgen-scripts-configmap",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="packetgen",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="Service.v1",resource="packetgen-service",resource_deployed_status="applied",resource_ready_status="ready"}
...
Some example queries:
Description | PromQL |
---|---|
deployedCounts | count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",resource_deployed_status="applied"}) |
readyCounts | count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",resource_ready_status="ready"}) |
readyCounts | count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",resource_ready_status="notready"}) |
apps | count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group"}) by (app) |
clusters filtered by the sink and firewall apps | count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",app="sink"} or emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",app="firewall"}) by (cluster_provider,cluster) |
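To check what these queries would return, the Python sketch below emulates `count(...)` and `count(...) by (app)` over only the seven series listed earlier (the elided series are not included, so real results would differ). The `count` helper is a simplified stand-in written for this example, not a PromQL implementation, and only the labels needed for grouping and filtering are reproduced.

```python
from collections import Counter

# (app, cluster, resource) for the seven series listed earlier; each has
# resource_deployed_status="applied" and resource_ready_status="ready".
SHOWN = [
    ("firewall", "edge01", "firewall-scripts-configmap"),
    ("firewall", "edge01", "fw0-firewall"),
    ("firewall", "edge02", "firewall-scripts-configmap"),
    ("firewall", "edge02", "fw0-firewall"),
    ("packetgen", "edge01", "fw0-packetgen"),
    ("packetgen", "edge01", "packetgen-scripts-configmap"),
    ("packetgen", "edge01", "packetgen-service"),
]

SERIES = [
    {
        "app": app,
        "cluster_provider": "vfw-cluster-provider",
        "cluster": cluster,
        "resource": resource,
        "resource_deployed_status": "applied",
        "resource_ready_status": "ready",
    }
    for app, cluster, resource in SHOWN
]


def count(series, by=(), **matchers):
    """Count series whose labels equal every matcher, grouped by `by` labels."""
    selected = [s for s in series if all(s[k] == v for k, v in matchers.items())]
    return Counter(tuple(s[label] for label in by) for s in selected)


deployed = count(SERIES, resource_deployed_status="applied")
not_ready = count(SERIES, resource_ready_status="notready")
apps = count(SERIES, by=("app",))

print(deployed[()])   # 7
# PromQL returns an empty result (no sample) here rather than 0.
print(not_ready[()])  # 0
print(dict(apps))     # {('firewall',): 4, ('packetgen',): 3}
```

The empty-result behavior matters for alerting: a rule such as "notready count > 0" simply does not fire when there are no matching series, whereas a dashboard panel would show no data rather than zero.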
Tracing
Istio provides a starting point for tracing by creating a trace for each request in the sidecars. This is insufficient on its own, however, because the trace does not include the outgoing requests made while handling an inbound request. The goal is a complete trace of, for example, an instantiate request to the orchestrator that includes the requests made to any controllers.
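Linking outbound requests into the same trace generally requires the service itself to copy the trace context headers from the inbound request onto each outbound request it makes. The Python sketch below shows the idea with plain dictionaries. The header list is the B3 and W3C Trace Context set commonly cited for Istio and should be checked against the tracing backend in use; `propagate_trace_headers` is a hypothetical helper, not part of EMCO or Istio.

```python
# Trace context headers to forward (B3 and W3C Trace Context).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(inbound_headers, outbound_headers=None):
    """Return outbound headers with the inbound trace context copied over.

    Inbound header names are matched case-insensitively.
    """
    outbound = dict(outbound_headers or {})
    inbound = {name.lower(): value for name, value in inbound_headers.items()}
    for name in TRACE_HEADERS:
        if name in inbound:
            outbound[name] = inbound[name]
    return outbound


inbound = {
    "X-B3-TraceId": "80f198ee56343ba864fe8b2a57d3eff7",
    "X-B3-SpanId": "e457b5a2e4d86bd1",
    "Content-Type": "application/json",
}
print(propagate_trace_headers(inbound, {"accept": "application/json"}))
```

In the orchestrator, the equivalent step would sit wherever an inbound HTTP or gRPC request leads to calls to controllers, so that the sidecar spans for those calls share the trace ID of the original request.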
...