Prometheus Nagios Exporter

An HTTP API which scrapes individual Nagios instances and returns a response in the Prometheus metrics format, with each Nagios check exposed as a separate metric.

Primary URL

https://prometheus-nagios-exporter.in.ft.com

Service Tier

Platinum

Lifecycle Stage

Production

Host Platform

AWS ECS

Delivered By

reliability-engineering

Supported By

reliability-engineering

First Line Troubleshooting

The exporter can be accessed directly either via the Dyn GSLB or in the EU or US specifically.

There is a Splunk dashboard for Nagios connectivity issues, as traffic goes from AWS to the FT data centres over a VPN tunnel and has to pass through firewalls. context deadline exceeded errors in this dashboard indicate a normal timeout and likely do not indicate a connectivity issue.

A few useful queries can be run to determine what the exporter is returning, if anything. These can be run either in the Prometheus console or the Grafana explore UI.

  • Is the exporter down, or has it been down recently?

    up{job="nagios"} == 0
    

    == 0 for down, == 1 for up.

    If this is down it suggests either a problem with the connectivity between Prometheus and the exporter, or that the exporter is not responding.

  • Is the Nagios instance down or inaccessible, or has it been down recently?

    nagios_up{job="nagios"} == 0
    

    == 0 for down, == 1 for up.

    If this is down it suggests either a problem with the connectivity between the exporter and the Nagios instance, bad credentials between the two, or a problem with the Nagios instance API.

    There have previously been connectivity issues caused by firewall rules or VPN problems. By viewing which Nagios instances are affected using the instance label, and by comparing results for the two AWS regions by making queries in both Prometheus instances, it is possible to work out whether there is a pattern suggesting firewall/connectivity issues. Splunk is also useful here, as the exporter logs the Nagios scrape error.

  • Are the expected checks being fetched in the numbers expected?

    count(nagios_check_status{job="nagios"} == 0)
    

    == 0 for down, == 1 for up.
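
The queries above can also be run against the Prometheus HTTP API and compared across the two regions. The following is a minimal sketch, not part of the exporter or its tooling: the base URL passed to `query_url` and the helper names are hypothetical, and it assumes the standard instant-query endpoint `/api/v1/query` and its JSON response shape. The helpers only build URLs and parse response bodies, so fetching the body (with whatever HTTP client and authentication your environment needs) is left to the caller.

```python
import json
from urllib.parse import urlencode

# The three instant queries from the list above.
QUERIES = {
    "exporter_down": 'up{job="nagios"} == 0',
    "nagios_down": 'nagios_up{job="nagios"} == 0',
    "check_count": 'count(nagios_check_status{job="nagios"} == 0)',
}


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API.

    base_url is an assumption: use the address of the regional Prometheus.
    """
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instances(response_body):
    """Return the `instance` label of every series in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("instance", "<no instance label>")
        for series in payload["data"]["result"]
    }


def compare_regions(eu_down, us_down):
    """Group instances by which region's Prometheus reports them as down."""
    return {
        "both": sorted(eu_down & us_down),
        "eu_only": sorted(eu_down - us_down),
        "us_only": sorted(us_down - eu_down),
    }
```

As a rough guide, an instance reported down from only one region points towards the network path from that region (VPN or firewall), while an instance reported down from both regions points more towards the Nagios instance itself. Treat this as a starting hypothesis rather than a diagnosis.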

View the generic troubleshooting information for the AWS ECS cluster (including services running on the cluster) which the application runs on: monitoring-aggregation-ecs.

Second Line Troubleshooting

Nothing further to add.

Bespoke Monitoring

The Heimdall Prometheus has some bespoke alarms which are sent to the #rel-eng-alerts Slack channel via Alertmanager.

These are visible in the Alertmanager UI if they are firing.

There are several Grafana dashboards:

Logs are available in Splunk via the query:

index="operations-reliability" attrs.com.ft.service-name="prometheus-nagios-exporter-service" attrs.com.ft.service-region="*"

There is a bespoke Splunk dashboard for connectivity issues between the exporter and Nagios. context deadline exceeded is a normal timeout error and likely doesn't indicate a connectivity issue.
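
When reading exported log messages by hand, it can help to separate the routine timeouts from everything else. The sketch below is illustrative only: the function name is hypothetical, and it assumes each log message is available as a plain string containing the Go error text. It does not reflect the exporter's actual log schema.

```python
# Error text that the runbook describes as a normal timeout.
TIMEOUT_MARKER = "context deadline exceeded"


def classify_scrape_errors(messages):
    """Split error messages into routine timeouts and other errors.

    messages is an iterable of strings (an assumed format, one per log event).
    """
    timeouts, others = [], []
    for message in messages:
        if TIMEOUT_MARKER in message:
            timeouts.append(message)
        else:
            others.append(message)
    return {"timeout": timeouts, "other": others}
```

Messages in the `other` group are the ones more likely to be worth checking against firewall or VPN changes.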

Contains Personal Data

False

Contains Sensitive Data

False

Architecture

Diagram for the Nagios exporter:

nagios-architecture-diagram

View in Lucidchart.

Note: This setup is mirrored in eu-west-1 and us-east-1 regions.

Failover Architecture Type

ActiveActive

Failover Process Type

FullyAutomated

Failback Process Type

FullyAutomated

Data Recovery Process Type

NotApplicable

Data Recovery Details

Not applicable.

Release Process Type

FullyAutomated

Rollback Process Type

Manual

Release Details

Release:

  • Merge a commit to master
  • CircleCI will build and deploy the commit.

Rollback:

  • Open CircleCI for this project: circleci:prometheus-nagios-exporter
  • Find the build of the commit which you wish to roll back to. The commit message is visible, and the SHA of the commit is displayed to the right.
  • Click on Rerun, under the build status for each workflow
  • Click Rerun from beginning

Key Management Process Type

Manual

Key Management Details

The system's secrets are set at build time as parameters in the service's CloudFormation template.

They come from two sources:

  1. The CircleCI environment variables for the CircleCI project.
  2. The CircleCI context used in the CircleCI config.

See the README for more details.