Swiss army knife for verifying MCP cluster health
Features:
* Verify offline minions
* Verify time diff on your minions
* Produce JSON output for ntpq command
* Verify NTP peers state on your minions
* Verify contrail nodes contrail-status output
* Verify galera cluster status
* Verify rabbitmq cluster status
* Produce JSON output for rabbitmqctl commands
* Verify haproxy upstream status
* Produce haproxy JSON stats output
* Verify disk space usage
* Verify disk inodes usage
* Verify load average
* Verify rx/tx drops on network interfaces
* Verify memory usage
Related-Prod: PROD-29236
Change-Id: Id7423665e8d45baee4b96751d9df29112dfa10e5
diff --git a/README.rst b/README.rst
index 4414544..8a31001 100644
--- a/README.rst
+++ b/README.rst
@@ -620,6 +620,141 @@
{{- item }}
%- endfor
+MCP Cluster health checks
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Swiss army knife toolset for verifying MCP cluster health.
+
+.. note:: Health checks are tested with salt modules >= 2017.7.
+
+Install health_checks module:
+
+.. code-block:: bash
+
+ cp health_checks.py /usr/share/salt-formulas/env/_modules/health_checks.py
+ salt -C '*' saltutil.sync_all
+
+By default, exit codes are not caught: salt-call returns exit code 0
+for a module call regardless of errors in its output.
+To control the exit code for scripting, pass
+**--retcode-passthrough** to each salt-call invocation:
+
+.. code-block:: bash
+
+ salt-call health_checks.minions_check --retcode-passthrough
+
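The exit-code behaviour described above can be used to gate a script on several checks at once. The following is a minimal sketch, not part of the module: the `run_checks` function and the `SALT_CALL` override variable are hypothetical names introduced here for illustration, and it assumes each check returns a non-zero retcode on failure when `--retcode-passthrough` is given.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with health_checks): run several
# health checks and report which ones failed.

# SALT_CALL is an assumption of this sketch; it lets the command be
# overridden, for example with a stub during testing.
SALT_CALL="${SALT_CALL:-salt-call}"

run_checks() {
    failed=""
    for check in "$@"; do
        # --retcode-passthrough makes salt-call exit with the module's
        # return code instead of always exiting 0.
        if ! $SALT_CALL "health_checks.${check}" --retcode-passthrough >/dev/null 2>&1; then
            failed="${failed} ${check}"
        fi
    done
    if [ -n "$failed" ]; then
        echo "FAILED:${failed}"
        return 1
    fi
    echo "OK"
    return 0
}

# Example invocation (commented out so sourcing this file has no side effects):
# run_checks minions_check time_diff_check ntp_check || exit 1
```

Collecting all failures before returning, rather than stopping at the first one, gives a single summary line that is convenient in CI logs.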
+Verify that minions are online.
+Use this check to determine which minions are offline:
+
+.. code-block:: bash
+
+ salt-call health_checks.minions_check
+
+Verify time diff on your minions:
+
+.. code-block:: bash
+
+ salt-call health_checks.time_diff_check
+
+In case of failure, dump diff JSON:
+
+.. code-block:: bash
+
+ salt-call health_checks.time_diff_check debug=True --out=json
+
+Get JSON stats from ntpq:
+
+.. code-block:: bash
+
+ salt-call health_checks.ntp_status
+
+Verify NTP peers status on the environment:
+
+.. code-block:: bash
+
+ salt-call health_checks.ntp_check
+ salt-call health_checks.ntp_check min_peers=2 max_stratum=2
+
+Verify contrail nodes contrail-status output:
+
+.. code-block:: bash
+
+ salt-call health_checks.contrail_check debug=True
+
+Verify galera cluster status:
+
+.. code-block:: bash
+
+ salt-call health_checks.galera_check debug=True
+ salt-call health_checks.galera_check cluster_size=3 debug=True
+
+Verify rabbitmq cluster status:
+
+.. code-block:: bash
+
+ salt-call health_checks.rabbitmq_check debug=True
+
+Get rabbitmqctl command output as JSON objects:
+
+.. warning:: This code is experimental: it is a hack that converts Erlang objects to JSON and may fail.
+
+.. code-block:: bash
+
+ salt-call health_checks.rabbitmq_cmd status
+ salt-call health_checks.rabbitmq_cmd cluster_status
+ salt-call health_checks.rabbitmq_cmd list_hashes
+ salt-call health_checks.rabbitmq_cmd list_ciphers
+
+Verify haproxy upstream status:
+
+.. code-block:: bash
+
+ salt-call health_checks.haproxy_check debug=True
+ salt-call health_checks.haproxy_check ignore_no_upstream=True
+
+Get haproxy JSON stats (native python calls to socket):
+
+.. code-block:: bash
+
+ salt-call health_checks.haproxy_status
+ salt-call health_checks.haproxy_status socket_path='/var/run/haproxy/admin.sock' stats_filter=['status']
+
+Verify disk space usage:
+
+.. code-block:: bash
+
+ salt-call health_checks.df_check
+ salt-call health_checks.df_check verify=space space_limit=90 ignore_partitions=['/']
+
+Verify disk inodes usage:
+
+.. code-block:: bash
+
+ salt-call health_checks.df_check verify=inodes
+ salt-call health_checks.df_check verify=inodes inode_limit=10
+
+Verify load average on the environment:
+
+.. code-block:: bash
+
+ salt-call health_checks.load_check
+ salt-call health_checks.load_check la1=4 la5=1 la15=1
+
+Verify ifaces rx/tx drops:
+
+.. code-block:: bash
+
+ salt-call health_checks.netdev_check
+ salt-call health_checks.netdev_check rx_drop_limit=0 tx_drop_limit=0
+
+Verify memory usage:
+
+.. code-block:: bash
+
+ salt-call health_checks.mem_check
+ salt-call health_checks.mem_check used_limit=50
+
+
Encrypted pillars
~~~~~~~~~~~~~~~~~