{% from "nova/map.jinja" import controller, compute, monitoring with context %}

{%- set is_controller = controller.get('enabled', False) %}
{%- set is_compute = compute.get('enabled', False) %}

{%- if is_controller or is_compute %}
{%- if is_compute and exporters is defined and compute.get('compute_driver', 'libvirt.LibvirtDriver') == 'libvirt.LibvirtDriver' %}
{%- set packages = exporters.get('libvirt', {}).get('packages', ('libvirt-exporter', )) %}
  {%- load_yaml as new_exporters_cfg %}
exporters:
  libvirt:
    enabled: true
{%- if packages is defined %}
    packages:
    {% for pkg in packages %}
    - {{ pkg }}
    {% endfor %}
{%- endif %}
    services:
      qemu:
        enabled: true
        bind:
          address: 0.0.0.0
          port: 9177
  {%- endload %}
{{ new_exporters_cfg|yaml(False) }}
{%- endif %}

server:
  alert:
{%- if is_controller %}
{%- set minor_threshold = monitoring.services_failed_warning_threshold_percent|float %}
{%- set major_threshold = monitoring.services_failed_critical_threshold_percent|float %}
{%- set minor_compute_threshold = monitoring.computes_failed_warning_threshold_percent|float %}
{%- set major_compute_threshold = monitoring.computes_failed_critical_threshold_percent|float %}
{%- set major_endpoint_threshold = monitoring.endpoint_failed_major_threshold|float %}
{% raw %}
    NovaApiOutage:
      if: >-
        max(openstack_api_check_status{name=~"nova.*|placement"}) == 0
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "Nova API outage"
        description: >-
          Nova API is not accessible for all available Nova endpoints in the OpenStack service catalog.
    NovaApiDown:
      if: >-
        openstack_api_check_status{name=~"nova.*|placement"} == 0
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ $labels.name }} endpoint is not accessible"
        description: >-
          Nova API is not accessible for the {{ $labels.name }} endpoint.
    NovaApiEndpointDown:
      if: >-
        http_response_status{name=~"nova-api"} == 0
      for: 2m
      labels:
        severity: minor
        service: nova
      annotations:
        summary: "nova-api endpoint is not accessible"
        description: >-
          The nova-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
{%- endraw %}
    NovaApiEndpointsDownMajor:
      if: >-
        count(http_response_status{name=~"nova-api"} == 0) >= count(http_response_status{name=~"nova-api"}) * {{ major_endpoint_threshold }}
      for: 2m
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{major_endpoint_threshold * 100}}% of nova-api endpoints are not accessible"
        description: >-
          {% raw %}{{ $value }} nova-api endpoints (>= {% endraw %} {{major_endpoint_threshold * 100}}{% raw %}%) are not accessible for 2 minutes.
    NovaApiEndpointsOutage:
      if: >-
        count(http_response_status{name=~"nova-api"} == 0) == count(http_response_status{name=~"nova-api"})
      for: 2m
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "nova-api endpoints outage"
        description: >-
          All available nova-api endpoints are not accessible for 2 minutes.
    NovaServiceDown:
      if: >-
        openstack_nova_service_state == 0
      labels:
        severity: minor
        service: nova
      annotations:
        summary: "{{ $labels.binary }} service is down"
        description: >-
          The {{ $labels.binary }} service on the {{ $labels.hostname }} node is down.
{%- endraw %}
    NovaServicesDownMinor:
      if: >-
        count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * {{minor_threshold}}
      labels:
        severity: minor
        service: nova
      annotations:
        summary: "{{minor_threshold * 100}}%{%- raw %} of {{ $labels.binary }} services are down"
        description: >-
          {{ $value }} {{ $labels.binary }} services (>= {%- endraw %} {{minor_threshold * 100}}%) are down.
    NovaComputeServicesDownMinor:
      if: >-
        count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * {{minor_compute_threshold}}
      labels:
        severity: minor
        service: nova
      annotations:
        summary: "{{minor_compute_threshold * 100}}%{%- raw %} of nova-compute services are down"
        description: >-
          {{ $value }} nova-compute services (>= {%- endraw %} {{minor_compute_threshold * 100}}%) are down.
    NovaServicesDownMajor:
      if: >-
        count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * {{major_threshold}}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{major_threshold * 100}}%{%- raw %} of {{ $labels.binary }} services are down"
        description: >-
          {{ $value }} {{ $labels.binary }} services (>= {%- endraw %} {{major_threshold * 100}}%) are down.
    NovaComputeServicesDownMajor:
      if: >-
        count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * {{major_compute_threshold}}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{major_compute_threshold * 100}}%{%- raw %} of nova-compute services are down"
        description: >-
          {{ $value }} nova-compute services (>= {%- endraw %} {{major_compute_threshold * 100}}{%- raw %}%) are down.
    NovaServiceOutage:
      if: >-
        count(openstack_nova_service_state == 0) by (binary) == on (binary) count(openstack_nova_service_state) by (binary)
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "{{ $labels.binary }} service outage"
        description: >-
          All {{ $labels.binary }} services are down.
{%- endraw %}
{%- set cpu_minor_threshold = monitoring.cpu_minor_threshold|float %}
{%- set cpu_major_threshold = monitoring.cpu_major_threshold|float %}
{%- set ram_major_threshold = monitoring.ram_major_threshold|float %}
{%- set ram_critical_threshold = monitoring.ram_critical_threshold|float %}
{%- set disk_major_threshold = monitoring.disk_major_threshold|float %}
{%- set disk_critical_threshold = monitoring.disk_critical_threshold|float %}
    NovaHypervisorVCPUsFullMinor:
      if: >-
        label_replace(system_load15, "hostname", "$1", "host", "(.*)") > on (hostname) openstack_nova_vcpus * {{ cpu_minor_threshold }}
      labels:
        severity: minor
        service: nova
      annotations:
        summary: "{{ cpu_minor_threshold * 100 }}% of hypervisor VCPUs are used"
        description: "{% raw %}{{ $value }} VCPUs on the {{ $labels.hostname }} node (> {% endraw %} {{ cpu_minor_threshold * 100 }}%) are used."
    NovaHypervisorVCPUsFullMajor:
      if: >-
        label_replace(system_load15, "hostname", "$1", "host", "(.*)") > on (hostname) openstack_nova_vcpus * {{ cpu_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ cpu_major_threshold * 100 }}% of hypervisor VCPUs are used"
        description: "{% raw %}{{ $value }} VCPUs on the {{ $labels.hostname }} node (> {% endraw %} {{ cpu_major_threshold * 100 }}%) are used."
    NovaHypervisorMemoryFullMajor:
      if: >-
        openstack_nova_used_ram > openstack_nova_ram * {{ ram_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ ram_major_threshold * 100 }}% of hypervisor RAM is used"
        description: "{% raw %}{{ $value }}MB of RAM on the {{ $labels.hostname }} node (> {% endraw %} {{ ram_major_threshold * 100 }}%) is used."
    NovaHypervisorMemoryFullCritical:
      if: >-
        openstack_nova_used_ram > openstack_nova_ram * {{ ram_critical_threshold }}
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "{{ ram_critical_threshold * 100 }}% of hypervisor RAM is used"
        description: "{% raw %}{{ $value }}MB of RAM on the {{ $labels.hostname }} node (> {% endraw %} {{ ram_critical_threshold * 100 }}%) is used."
    NovaHypervisorDiskFullMajor:
      if: >-
        openstack_nova_used_disk > openstack_nova_disk * {{ disk_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ disk_major_threshold * 100 }}% of hypervisor disk space is used"
        description: "{% raw %}{{ $value }}GB of disk space on the {{ $labels.hostname }} node (> {% endraw %} {{ disk_major_threshold * 100 }}%) is used."
    NovaHypervisorDiskFullCritical:
      if: >-
        openstack_nova_used_disk > openstack_nova_disk * {{ disk_critical_threshold }}
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "{{ disk_critical_threshold * 100 }}% of hypervisor disk space is used"
        description: "{% raw %}{{ $value }}GB of disk space on the {{ $labels.hostname }} node (> {% endraw %} {{ disk_critical_threshold * 100 }}%) is used."
    NovaAggregateMemoryFullMajor:
      if: >-
        openstack_nova_aggregate_used_ram > openstack_nova_aggregate_ram * {{ ram_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ ram_major_threshold * 100 }}% of aggregate RAM is used"
        description: "{% raw %}{{ $value }}MB of RAM on the {{ $labels.aggregate }} aggregate (> {% endraw %} {{ ram_major_threshold * 100 }}%) is used."
    NovaAggregateMemoryFullCritical:
      if: >-
        openstack_nova_aggregate_used_ram > openstack_nova_aggregate_ram * {{ ram_critical_threshold }}
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "{{ ram_critical_threshold * 100 }}% of aggregate RAM is used"
        description: "{% raw %}{{ $value }}MB of RAM on the {{ $labels.aggregate }} aggregate (> {% endraw %} {{ ram_critical_threshold * 100 }}%) is used."
    NovaAggregateDiskFullMajor:
      if: >-
        openstack_nova_aggregate_used_disk > openstack_nova_aggregate_disk * {{ disk_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ disk_major_threshold * 100 }}% of aggregate disk space is used"
        description: "{% raw %}{{ $value }}GB of disk space on the {{ $labels.aggregate }} aggregate (> {% endraw %} {{ disk_major_threshold * 100 }}%) is used."
    NovaAggregateDiskFullCritical:
      if: >-
        openstack_nova_aggregate_used_disk > openstack_nova_aggregate_disk * {{ disk_critical_threshold }}
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "{{ disk_critical_threshold * 100 }}% of aggregate disk space is used"
        description: "{% raw %}{{ $value }}GB of disk space on the {{ $labels.aggregate }} aggregate (> {% endraw %} {{ disk_critical_threshold * 100 }}%) is used."
    NovaTotalVCPUsFullMinor:
      if: >-
        sum(label_replace(system_load15, "hostname", "$1", "host", "(.*)") and on (hostname) openstack_nova_vcpus) > max(sum(openstack_nova_vcpus) by (instance)) * {{ cpu_minor_threshold }}
      labels:
        severity: minor
        service: nova
      annotations:
        summary: "{{ cpu_minor_threshold * 100 }}% of cloud VCPUs are used"
        description: "{% raw %}{{ $value }} VCPUs in the cloud (> {% endraw %} {{ cpu_minor_threshold * 100 }}%) are used."
    NovaTotalVCPUsFullMajor:
      if: >-
        sum(label_replace(system_load15, "hostname", "$1", "host", "(.*)") and on (hostname) openstack_nova_vcpus) > max(sum(openstack_nova_vcpus) by (instance)) * {{ cpu_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ cpu_major_threshold * 100 }}% of cloud VCPUs are used"
        description: "{% raw %}{{ $value }} VCPUs in the cloud (> {% endraw %} {{ cpu_major_threshold * 100 }}%) are used."
    NovaTotalMemoryFullMajor:
      if: >-
        openstack_nova_total_used_ram > openstack_nova_total_ram * {{ ram_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ ram_major_threshold * 100 }}% of cloud RAM is used"
        description: "{% raw %}{{ $value }}MB of RAM in the cloud (> {% endraw %} {{ ram_major_threshold * 100 }}%) is used."
    NovaTotalMemoryFullCritical:
      if: >-
        openstack_nova_total_used_ram > openstack_nova_total_ram * {{ ram_critical_threshold }}
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "{{ ram_critical_threshold * 100 }}% of cloud RAM is used"
        description: "{% raw %}{{ $value }}MB of RAM in the cloud (> {% endraw %} {{ ram_critical_threshold * 100 }}%) is used."
    NovaTotalDiskFullMajor:
      if: >-
        openstack_nova_total_used_disk > openstack_nova_total_disk * {{ disk_major_threshold }}
      labels:
        severity: major
        service: nova
      annotations:
        summary: "{{ disk_major_threshold * 100 }}% of cloud disk space is used"
        description: "{% raw %}{{ $value }}GB of disk space in the cloud (> {% endraw %} {{ disk_major_threshold * 100 }}%) is used."
    NovaTotalDiskFullCritical:
      if: >-
        openstack_nova_total_used_disk > openstack_nova_total_disk * {{ disk_critical_threshold }}
      labels:
        severity: critical
        service: nova
      annotations:
        summary: "{{ disk_critical_threshold * 100 }}% of cloud disk space is used"
        description: "{% raw %}{{ $value }}GB of disk space in the cloud (> {% endraw %} {{ disk_critical_threshold * 100 }}%) is used."
{%- endif %}
    NovaErrorLogsTooHigh:
      {%- set log_threshold = monitoring.error_log_rate.warn|float %}
      if: >-
        sum(rate(log_messages{service="nova",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > {{ log_threshold }}
{%- raw %}
      labels:
        severity: warning
        service: nova
      annotations:
        summary: "High number of errors in Nova logs"
        description: "The average per-second rate of errors in Nova logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes)."
{%- endraw %}
{%- if is_compute and exporters is defined and compute.get('compute_driver', 'libvirt.LibvirtDriver') == 'libvirt.LibvirtDriver' %}
{%- raw %}
    LibvirtDown:
      if: >-
        libvirt_up == 0
      for: 2m
      labels:
        severity: critical
        service: libvirt
      annotations:
        summary: "Failure to gather Libvirt metrics"
        description: "The Libvirt metric exporter fails to gather metrics on the {{ $labels.host }} node for 2 minutes."
{%- endraw %}
{%- include "prometheus/_exporters_config.sls" %}
{%- endif %}
{%- endif %}