Metrics Analysis
As part of the analysis process, Flagger can validate service level objectives (SLOs) like availability, error rate percentage, average response time and any other objective based on app specific metrics. If a drop in performance is noticed during the SLOs analysis, the release will be automatically rolled back with minimum impact to end-users.
Builtin metrics
Flagger comes with two builtin metric checks: HTTP request success rate and duration.
analysis:
metrics:
- name: request-success-rate
interval: 1m
# minimum req success rate (non 5xx responses)
# percentage (0-100)
thresholdRange:
min: 99
- name: request-duration
interval: 1m
# maximum req duration P99
# milliseconds
thresholdRange:
max: 500
For each metric you can specify a range of accepted values with thresholdRange
and
the window size or the time series with interval
.
The builtin checks are available for every service mesh / ingress controller
and are implemented with
Prometheus queries.
Custom metrics
The canary analysis can be extended with custom metric checks.
Using a MetricTemplate
custom resource,
you configure Flagger to connect to a metric provider and run a query that returns a float64
value.
The query result is used to validate the canary based on the specified threshold range.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: my-metric
spec:
provider:
type: # can be prometheus, datadog, etc
address: # API URL
insecureSkipVerify: # if set to true, disables the TLS cert validation
secretRef:
name: # name of the secret containing the API credentials
query: # metric query
The following variables are available in query templates:
name
(canary.metadata.name)namespace
(canary.metadata.namespace)target
(canary.spec.targetRef.name)service
(canary.spec.service.name)ingress
(canary.spec.ingresRef.name)interval
(canary.spec.analysis.metrics[].interval)variables
(canary.spec.analysis.metrics[].templateVariables)
A canary analysis metric can reference a template with templateRef
:
analysis:
metrics:
- name: "my metric"
templateRef:
name: my-metric
# namespace is optional
# when not specified, the canary namespace will be used
namespace: flagger
# accepted values
thresholdRange:
min: 10
max: 1000
# metric query time window
interval: 1m
A canary analysis metric can reference a set of custom variables with templateVariables
. These variables will be then injected into the query defined in the referred MetricTemplate
object during canary analysis:
analysis:
metrics:
- name: "my metric"
templateRef:
name: my-metric
namespace: flagger
# accepted values
thresholdRange:
min: 10
max: 1000
# metric query time window
interval: 1m
# custom variables used within the referenced metric template
templateVariables:
direction: inbound
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: my-metric
spec:
provider:
type: prometheus
address: http://prometheus.linkerd-viz:9090
query: |
histogram_quantile(
0.99,
sum(
rate(
response_latency_ms_bucket{
namespace="{{ namespace }}",
deployment=~"{{ target }}",
direction="{{ variables.direction }}"
}[{{ interval }}]
)
) by (le)
)
Prometheus
You can create custom metric checks targeting a Prometheus server by
setting the provider type to prometheus
and writing the query in PromQL.
Prometheus template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: not-found-percentage
namespace: istio-system
spec:
provider:
type: prometheus
address: http://prometheus.istio-system:9090
query: |
100 - sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace="{{ namespace }}",
destination_workload="{{ target }}",
response_code!="404"
}[{{ interval }}]
)
)
/
sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace="{{ namespace }}",
destination_workload="{{ target }}"
}[{{ interval }}]
)
) * 100
Reference the template in the canary analysis:
analysis:
metrics:
- name: "404s percentage"
templateRef:
name: not-found-percentage
namespace: istio-system
thresholdRange:
max: 5
interval: 1m
The above configuration validates the canary by checking if the HTTP 404 req/sec percentage is below 5 percent of the total traffic. If the 404s rate reaches the 5% threshold, then the canary fails.
Prometheus gRPC error rate example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: grpc-error-rate-percentage
namespace: flagger
spec:
provider:
type: prometheus
address: http://flagger-prometheus.flagger-system:9090
query: |
100 - sum(
rate(
grpc_server_handled_total{
grpc_code!="OK",
kubernetes_namespace="{{ namespace }}",
kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
}[{{ interval }}]
)
)
/
sum(
rate(
grpc_server_started_total{
kubernetes_namespace="{{ namespace }}",
kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
}[{{ interval }}]
)
) * 100
The above template is for gRPC services instrumented with go-grpc-prometheus.
Prometheus authentication
If your Prometheus API requires basic authentication, you can create a secret in the same namespace
as the MetricTemplate
with the basic-auth credentials:
apiVersion: v1
kind: Secret
metadata:
name: prom-auth
namespace: flagger
data:
username: your-user
password: your-password
or if you require bearer token authentication (via a SA token):
apiVersion: v1
kind: Secret
metadata:
name: prom-auth
namespace: flagger
data:
token: ey1234...
Then reference the secret in the MetricTemplate
:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: my-metric
namespace: flagger
spec:
provider:
type: prometheus
address: http://prometheus.monitoring:9090
secretRef:
name: prom-auth
Datadog
You can create custom metric checks using the Datadog provider.
Create a secret with your Datadog API credentials:
apiVersion: v1
kind: Secret
metadata:
name: datadog
namespace: istio-system
data:
datadog_api_key: your-datadog-api-key
datadog_application_key: your-datadog-application-key
Datadog template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: not-found-percentage
namespace: istio-system
spec:
provider:
type: datadog
address: https://api.datadoghq.com
secretRef:
name: datadog
query: |
100 - (
sum:istio.mesh.request.count{
reporter:destination,
destination_workload_namespace:{{ namespace }},
destination_workload:{{ target }},
!response_code:404
}.as_count()
/
sum:istio.mesh.request.count{
reporter:destination,
destination_workload_namespace:{{ namespace }},
destination_workload:{{ target }}
}.as_count()
) * 100
Reference the template in the canary analysis:
analysis:
metrics:
- name: "404s percentage"
templateRef:
name: not-found-percentage
namespace: istio-system
thresholdRange:
max: 5
interval: 1m
Amazon CloudWatch
You can create custom metric checks using the CloudWatch metrics provider.
CloudWatch template example:
apiVersion: flagger.app/v1alpha1
kind: MetricTemplate
metadata:
name: cloudwatch-error-rate
spec:
provider:
type: cloudwatch
region: ap-northeast-1 # specify the region of your metrics
query: |
[
{
"Id": "e1",
"Expression": "m1 / m2",
"Label": "ErrorRate"
},
{
"Id": "m1",
"MetricStat": {
"Metric": {
"Namespace": "MyKubernetesCluster",
"MetricName": "ErrorCount",
"Dimensions": [
{
"Name": "appName",
"Value": "{{ name }}.{{ namespace }}"
}
]
},
"Period": 60,
"Stat": "Sum",
"Unit": "Count"
},
"ReturnData": false
},
{
"Id": "m2",
"MetricStat": {
"Metric": {
"Namespace": "MyKubernetesCluster",
"MetricName": "RequestCount",
"Dimensions": [
{
"Name": "appName",
"Value": "{{ name }}.{{ namespace }}"
}
]
},
"Period": 60,
"Stat": "Sum",
"Unit": "Count"
},
"ReturnData": false
}
]
The query format documentation can be found here.
Reference the template in the canary analysis:
analysis:
metrics:
- name: "app error rate"
templateRef:
name: cloudwatch-error-rate
thresholdRange:
max: 0.1
interval: 1m
Note that Flagger need AWS IAM permission to perform cloudwatch:GetMetricData
to use this provider.
New Relic
You can create custom metric checks using the New Relic provider.
Create a secret with your New Relic Insights credentials:
apiVersion: v1
kind: Secret
metadata:
name: newrelic
namespace: istio-system
data:
newrelic_account_id: your-account-id
newrelic_query_key: your-insights-query-key
New Relic template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: newrelic-error-rate
namespace: ingress-nginx
spec:
provider:
type: newrelic
secretRef:
name: newrelic
query: |
SELECT
filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
sum(nginx_ingress_controller_requests) * 100
FROM Metric
WHERE metricName = 'nginx_ingress_controller_requests'
AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
Reference the template in the canary analysis:
analysis:
metrics:
- name: "error rate"
templateRef:
name: newrelic-error-rate
namespace: ingress-nginx
thresholdRange:
max: 5
interval: 1m
Graphite
You can create custom metric checks using the Graphite provider.
Graphite template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: graphite-request-success-rate
spec:
provider:
type: graphite
address: http://graphite.monitoring
query: |
target=summarize(
asPercent(
sumSeries(
stats.timers.httpServerRequests.app.{{target}}.exception.*.method.*.outcome.{CLIENT_ERROR,INFORMATIONAL,REDIRECTION,SUCCESS}.status.*.uri.*.count
),
sumSeries(
stats.timers.httpServerRequests.app.{{target}}.exception.*.method.*.outcome.*.status.*.uri.*.count
)
),
{{interval}},
'avg'
)
Reference the template in the canary analysis:
analysis:
metrics:
- name: "success rate"
templateRef:
name: graphite-request-success-rate
thresholdRange:
min: 90
interval: 1min
Graphite authentication
If your Graphite API requires basic authentication, you can create a secret in the same namespace
as the MetricTemplate
with the basic-auth credentials:
apiVersion: v1
kind: Secret
metadata:
name: graphite-basic-auth
namespace: flagger
data:
username: your-user
password: your-password
Then, reference the secret in the MetricTemplate
:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: my-metric
namespace: flagger
spec:
provider:
type: graphite
address: http://graphite.monitoring
secretRef:
name: graphite-basic-auth
Google Cloud Monitoring (Stackdriver)
Enable Workload Identity on your cluster, create a service account key that has read access to the Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger service account on Kubernetes. You can take a look at this guide
Annotate the flagger service account
kubectl annotate serviceaccount flagger \
--namespace <namespace> \
iam.gke.io/gcp-service-account=<gcp-serviceaccount-name>@<project-id>.iam.gserviceaccount.com
Alternatively, you can download the json keys and add it to your secret with the key serviceAccountKey
(This method is not recommended).
Create a secret that contains your project-id (and, if workload identity is not enabled on your cluster, your service account json).
kubectl create secret generic gcloud-sa --from-literal=project=<project-id>
Then reference the secret in the metric template. Note: The particular MQL query used here works if Istio is installed on GKE.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: bytes-sent
namespace: test
spec:
provider:
type: stackdriver
secretRef:
name: gcloud-sa
query: |
fetch k8s_container
| metric 'istio.io/service/server/response_latencies'
| filter
(metric.destination_service_name == '{{ service }}-canary'
&& metric.destination_service_namespace == '{{ namespace }}')
| align delta(1m)
| every 1m
| group_by [],
[value_response_latencies_percentile:
percentile(value.response_latencies, 99)]
The reference for the query language can be found here
InfluxDB
The InfluxDB provider uses the flux query language.
Create a secret that contains your authentication token that can be found in the InfluxDB UI.
kubectl create secret generic influx-token --from-literal=token=<token>
Then reference the secret in the metric template.
Note: The particular MQL query used here works if Istio is installed on GKE.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: not-found
namespace: test
spec:
provider:
type: influxdb
secretRef:
name: influx-token
query: |
from(bucket: "default")
|> range(start: -2h)
|> filter(fn: (r) => r["_measurement"] == "istio_requests_total")
|> filter(fn: (r) => r[" destination_workload_namespace"] == "{{ namespace }}")
|> filter(fn: (r) => r["destination_workload"] == "{{ target }}")
|> filter(fn: (r) => r["response_code"] == "500")
|> count()
|> yield(name: "count")
Dynatrace
You can create custom metric checks using the Dynatrace provider.
Create a secret with your Dynatrace token:
apiVersion: v1
kind: Secret
metadata:
name: dynatrace
namespace: istio-system
data:
dynatrace_token: ZHQwYz...
Dynatrace metric template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: response-time-95pct
namespace: istio-system
spec:
provider:
type: dynatrace
address: https://xxxxxxxx.live.dynatrace.com
secretRef:
name: dynatrace
query: |
builtin:service.response.time:filter(eq(dt.entity.service,SERVICE-ABCDEFG0123456789)):percentile(95)
Reference the template in the canary analysis:
analysis:
metrics:
- name: "response-time-95pct"
templateRef:
name: response-time-95pct
namespace: istio-system
thresholdRange:
max: 1000
interval: 1m