Commit

Update InfraNodesNeedResizingSRE
sam-nguyen7 committed Feb 8, 2024
1 parent a795041 commit 337af07
Showing 4 changed files with 20 additions and 20 deletions.
10 changes: 5 additions & 5 deletions deploy/sre-prometheus/100-infra-resizing.PrometheusRule.yaml
@@ -91,7 +91,7 @@ spec:
## If either of the CPU or Memory resource consumption alerts (see below) fire, then trigger an alert for SRE
- expr: (
count(
-ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE", alertstate="firing"}
+ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE1h", alertstate="firing"}
OR
ALERTS{alertname="memory-InfraNodesExcessiveResourceConsumptionSRE", alertstate="firing"}
) >= 1
@@ -103,7 +103,7 @@ spec:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 1h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: "The cluster's infrastructure nodes have been consuming excessive CPU for 1 hours and may need to be vertically scaled to support the existing workers. See linked SOP for details."
@@ -113,7 +113,7 @@ spec:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 16h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: "The cluster's infrastructure nodes have been consuming excessive CPU for 16 hours and may need to be vertically scaled to support the existing workers. See linked SOP for details."
@@ -122,14 +122,14 @@ spec:
expr: sre:node_infra:excessive_consumption_memory > 0
for: 24h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: "The cluster's infrastructure nodes have been consuming excessive memory for 24 hours and may need to be vertically scaled to support the existing workers. See linked SOP for details."
## If the CPU or Memory related "InfraNodesExcessiveResourceConsumptionSRE" alerts are firing, raise a critical ticket to SRE to scale the infra nodes up
- alert: InfraNodesNeedResizingSRE
expr: sre:node_infras:need_resize > 0
-for: 2h
+for: 5m
labels:
severity: critical
namespace: openshift-monitoring
10 changes: 5 additions & 5 deletions hack/00-osd-managed-cluster-config-integration.yaml.tmpl
@@ -34946,7 +34946,7 @@ objects:
="infra"} ) - 1 ) / count ( cluster:nodes_roles{label_node_role_kubernetes_io
="infra"} ) ) )
record: sre:node_infra:excessive_consumption_memory
-- expr: ( count( ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE",
+- expr: ( count( ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE1h",
alertstate="firing"} OR ALERTS{alertname="memory-InfraNodesExcessiveResourceConsumptionSRE",
alertstate="firing"} ) >= 1 )
record: sre:node_infras:need_resize
@@ -34956,7 +34956,7 @@ objects:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 1h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
@@ -34966,7 +34966,7 @@ objects:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 16h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
@@ -34976,15 +34976,15 @@ objects:
expr: sre:node_infra:excessive_consumption_memory > 0
for: 24h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
memory for 24 hours and may need to be vertically scaled to support
the existing workers. See linked SOP for details.
- alert: InfraNodesNeedResizingSRE
expr: sre:node_infras:need_resize > 0
-for: 2h
+for: 5m
labels:
severity: critical
namespace: openshift-monitoring
10 changes: 5 additions & 5 deletions hack/00-osd-managed-cluster-config-production.yaml.tmpl
@@ -34946,7 +34946,7 @@ objects:
="infra"} ) - 1 ) / count ( cluster:nodes_roles{label_node_role_kubernetes_io
="infra"} ) ) )
record: sre:node_infra:excessive_consumption_memory
-- expr: ( count( ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE",
+- expr: ( count( ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE1h",
alertstate="firing"} OR ALERTS{alertname="memory-InfraNodesExcessiveResourceConsumptionSRE",
alertstate="firing"} ) >= 1 )
record: sre:node_infras:need_resize
@@ -34956,7 +34956,7 @@ objects:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 1h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
@@ -34966,7 +34966,7 @@ objects:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 16h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
@@ -34976,15 +34976,15 @@ objects:
expr: sre:node_infra:excessive_consumption_memory > 0
for: 24h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
memory for 24 hours and may need to be vertically scaled to support
the existing workers. See linked SOP for details.
- alert: InfraNodesNeedResizingSRE
expr: sre:node_infras:need_resize > 0
-for: 2h
+for: 5m
labels:
severity: critical
namespace: openshift-monitoring
10 changes: 5 additions & 5 deletions hack/00-osd-managed-cluster-config-stage.yaml.tmpl
@@ -34946,7 +34946,7 @@ objects:
="infra"} ) - 1 ) / count ( cluster:nodes_roles{label_node_role_kubernetes_io
="infra"} ) ) )
record: sre:node_infra:excessive_consumption_memory
-- expr: ( count( ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE",
+- expr: ( count( ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE1h",
alertstate="firing"} OR ALERTS{alertname="memory-InfraNodesExcessiveResourceConsumptionSRE",
alertstate="firing"} ) >= 1 )
record: sre:node_infras:need_resize
@@ -34956,7 +34956,7 @@ objects:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 1h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
@@ -34966,7 +34966,7 @@ objects:
expr: sre:node_infra:excessive_consumption_cpu > 0
for: 16h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
@@ -34976,15 +34976,15 @@ objects:
expr: sre:node_infra:excessive_consumption_memory > 0
for: 24h
labels:
-severity: critical
+severity: warning
namespace: openshift-monitoring
annotations:
message: The cluster's infrastructure nodes have been consuming excessive
memory for 24 hours and may need to be vertically scaled to support
the existing workers. See linked SOP for details.
- alert: InfraNodesNeedResizingSRE
expr: sre:node_infras:need_resize > 0
-for: 2h
+for: 5m
labels:
severity: critical
namespace: openshift-monitoring
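
For orientation, the two rules this commit touches most directly would read roughly as follows once the change is applied. This is a sketch assembled only from the lines visible in the hunks of the generated templates. Indentation is reconstructed, the two entries are shown as separate fragments because their enclosing groups are not visible here, and fields outside the hunks (such as the remaining annotations) are omitted rather than guessed.

```yaml
# Recording rule (fragment): true when either upstream alert is firing.
# After this commit the CPU side keys off the alert name with the "1h" suffix.
- expr: ( count( ALERTS{alertname="cpu-InfraNodesExcessiveResourceConsumptionSRE1h",
    alertstate="firing"} OR ALERTS{alertname="memory-InfraNodesExcessiveResourceConsumptionSRE",
    alertstate="firing"} ) >= 1 )
  record: sre:node_infras:need_resize

# Alerting rule (fragment): stays critical, but the hold time drops from 2h to 5m.
- alert: InfraNodesNeedResizingSRE
  expr: sre:node_infras:need_resize > 0
  for: 5m
  labels:
    severity: critical
    namespace: openshift-monitoring
```

Read this way, the three upstream consumption alerts become `warning` and keep their long `for:` windows (1h, 16h, 24h), while `InfraNodesNeedResizingSRE` remains the only `critical` rule and fires shortly after the 1h CPU alert or the memory alert does.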
