An HTTP API which scrapes individual Nagios instances and returns a response in the Prometheus metrics format, with each Nagios check exposed as a separate metric.
https://prometheus-nagios-exporter.in.ft.com
Platinum
Production
AWS ECS
The exporter can be accessed directly, either via the Dyn GSLB or via the EU or US region specifically.
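A scrape of the exporter returns the standard Prometheus text exposition format, with `nagios_up` reporting whether the Nagios instance could be reached and one `nagios_check_status` series per check. A rough sketch of what that output might look like (the HELP text and the label names are illustrative assumptions, not taken from the exporter's actual output):

```
# HELP nagios_up Whether the Nagios instance could be scraped (assumed help text)
# TYPE nagios_up gauge
nagios_up 1
# HELP nagios_check_status Status of an individual Nagios check (assumed help text)
# TYPE nagios_check_status gauge
nagios_check_status{host="example-host",check="disk_space"} 0
nagios_check_status{host="example-host",check="ping"} 0
```

The `job` and `instance` labels used in the queries below are added by Prometheus at scrape time rather than appearing in the raw exporter output.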
There is a Splunk dashboard for Nagios connectivity issues, as this traffic goes from AWS to the FT data centres over a VPN tunnel and has to pass through firewalls. `context deadline exceeded` errors in this dashboard indicate a normal timeout and likely do not indicate a connectivity issue.
A few useful queries can be run to determine what the exporter is returning, if anything. These can be run either in the Prometheus console or the Grafana Explore UI; some additional illustrative queries are sketched after the list below.
- Is the exporter down, or has it been down recently?
  `up{job="nagios"} == 0`
  `== 0` for down, `== 1` for up. If this is down it suggests either a problem with the connectivity between Prometheus and the exporter, or that the exporter is not responding.
- Is the Nagios instance down/inaccessible, or has it been down recently?
  `nagios_up{job="nagios"} == 0`
  `== 0` for down, `== 1` for up. If this is down it suggests either a problem with the connectivity between the exporter and the Nagios instance, bad credentials between the two, or a problem with the Nagios instance API.
There have been issues with connectivity to exporters previously due to firewall rules or VPN issues. By viewing which Nagios instances are affected using the `instance` label, and comparing results for the two AWS regions by making queries in both Prometheus instances, it is possible to work out whether there is a pattern suggesting firewall/connectivity issues. Splunk is also useful here, as the exporter logs the Nagios scrape error.
- Are the expected checks being fetched in the numbers expected?
  `count(nagios_check_status{job="nagios"} == 0)`
  `== 0` for down, `== 1` for up.
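If the instantaneous queries above look healthy but something has been flapping, or a per-instance breakdown is needed, the following sketches may also help. These are illustrative query shapes only, not taken from an existing dashboard:

- Has the exporter target been down at any point in the last hour?
  `min_over_time(up{job="nagios"}[1h]) == 0`
- Has any Nagios instance been unreachable from the exporter in the last hour?
  `min_over_time(nagios_up{job="nagios"}[1h]) == 0`
- How many checks is each Nagios instance currently reporting? Useful when comparing the two regions for the firewall/connectivity patterns described above.
  `count by (instance) (nagios_check_status{job="nagios"})`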
View the generic troubleshooting information for the AWS ECS cluster on which the application runs (including services running on the cluster): monitoring-aggregation-ecs.
Nothing further to add.
The Heimdall Prometheus has some bespoke alarms which are sent to the #rel-eng-alerts Slack channel via Alertmanager.
These are visible in the Alertmanager UI if they are firing.
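The actual rules live in the Heimdall Prometheus configuration rather than in this runbook. As a purely illustrative sketch of the shape such an alarm takes (the rule name, hold period, and annotation below are assumptions; only the metric and job label come from the queries above):

```yaml
# Hypothetical example only: the real rule names, thresholds and labels are
# defined in the Heimdall Prometheus configuration.
groups:
  - name: nagios-exporter-example # hypothetical group name
    rules:
      - alert: NagiosInstanceUnreachable # hypothetical alert name
        expr: nagios_up{job="nagios"} == 0
        for: 10m # assumed hold period
        annotations:
          summary: "Nagios instance {{ $labels.instance }} has been unreachable from the exporter for 10 minutes"
```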
There are several Grafana dashboards:
- AWS ECS Task metrics (`us-east-1` metrics are available using the dropdowns).
- Go language runtime metrics
Logs are available in Splunk via the query:
index="operations-reliability" attrs.com.ft.service-name="prometheus-nagios-exporter-service" attrs.com.ft.service-region="*"
There is a bespoke Splunk dashboard for connectivity issues between the exporter and Nagios. `context deadline exceeded` is a normal timeout error and likely doesn't indicate a connectivity issue.
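When digging into the raw logs behind this dashboard, a narrower search can be built on the standard log query above by excluding the routine timeout noise. This is a sketch only; the exact wording of the exporter's error messages is an assumption:

```
index="operations-reliability" attrs.com.ft.service-name="prometheus-nagios-exporter-service" attrs.com.ft.service-region="*" "error" NOT "context deadline exceeded"
```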
False
False
Diagram for the Nagios exporter:
Note: this setup is mirrored in the eu-west-1 and us-east-1 regions.
ActiveActive
FullyAutomated
FullyAutomated
NotApplicable
Not applicable.
FullyAutomated
Manual
Release:
- Merge a commit to master
- CircleCI will build and deploy the commit.
Rollback:
- Open CircleCI for this project: circleci:prometheus-nagios-exporter
- Find the build of the commit which you wish to roll back to. The commit message is visible, and the `sha` of the commit is displayed to the right.
- Click on `Rerun`, under the build status for each workflow.
- Click `Rerun from beginning`.
Manual
The system's secrets are set at build time as parameters in the service's CloudFormation template.
They come from two sources:
- The CircleCI environment variables for the CircleCI project.
- The CircleCI context used in the CircleCI config.
See the README for more details.