Extend wait_for maas.py, wait_for_* attempts arg

maas.py: Extend wait_for states with timeout param

Extend the wait_for states with a timeout parameter.
The timeout value is taken from reclass pillar data if defined.
Otherwise, the states use the default value.

Based on Ting's PR [1], slightly refactored.

[1] https://github.com/salt-formulas/salt-formula-maas/pull/34

Signed-off-by: ting wu <ting.wu@enea.com>
Signed-off-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>

maas.py: wait_for_*: Add attempts arg

Introduce a new parameter that sets the maximum number of automatic
recovery attempts for common failures in machine operations.
If not present in pillar data, it defaults to 0 (OFF).

Common error states, possible causes and automatic recovery pattern:
* New
  - usually indicates issues with BMC connectivity (no network route,
    but on rare occasions it happens due to the MaaS API being flaky);
  - fix: delete the machine, (re)process machine definitions;
* Failed commissioning
  - various causes, usually a simple retry works;
  - fix: delete the machine, (re)process machine definitions;
* Failed testing
  - incompatible hardware, missing drivers etc.
  - usually consistent and board-specific;
  - fix: override failed testing
* Allocated
  - on rare occasions nodes get stuck in this state instead of 'Deploy';
  - fix: mark-broken, mark-fixed, if it failed at least once before
    perform a fio test (fixes another unrelated spurious issue with
    encrypted disks from previous deployments), (re)deploy machines;
* Failed deployment
  - various causes, usually a simple retry works;
  - fix: same as for nodes stuck in 'Allocated';

Related: PROD-28390(PROD:28390)

Change-Id: Ifb7dd9f8fcfbbed557e47d8fdffb1f963604fb15
Signed-off-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>
(cherry picked from commit 4fa108e39fbf4da924f0bcbf01ff1625f13910a1)

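The recovery pattern described in the commit message can be read as a lookup from machine status to a recovery action, guarded by a per-machine attempt counter. The following is a minimal sketch of that idea only. The names RECOVERY_ACTIONS and pick_recovery and the action labels are hypothetical and do not appear in maas.py, and no MaaS API call is made here.

```python
# Hypothetical sketch: status -> recovery action, as listed in the commit
# message. The action labels are illustrative names, not MaaS API operations.
RECOVERY_ACTIONS = {
    'New': 'delete_and_reprocess',
    'Failed commissioning': 'delete_and_reprocess',
    'Failed testing': 'override_failed_testing',
    'Allocated': 'mark_broken_fixed_and_redeploy',
    'Failed deployment': 'mark_broken_fixed_and_redeploy',
}


def pick_recovery(status, machine, counter, attempts):
    """Return the recovery action label for a machine, or None.

    counter maps machine name -> recovery attempts made so far.
    attempts is the per-machine maximum; 0 disables recovery.
    """
    if attempts <= 0 or counter.get(machine, 0) >= attempts:
        return None
    action = RECOVERY_ACTIONS.get(status)
    if action is not None:
        # Count only attempts for which a recovery action exists.
        counter[machine] = counter.get(machine, 0) + 1
    return action


counter = {}
first = pick_recovery('New', 'kvm01', counter, attempts=2)
second = pick_recovery('Failed deployment', 'kvm01', counter, attempts=2)
third = pick_recovery('New', 'kvm01', counter, attempts=2)
print(first, second, third)
```

Using counter.get(machine, 0) keeps the bookkeeping in one expression. Once a machine has used up its attempts, the sketch returns None and the caller would simply keep polling until the timeout.
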
commit: 3206fe708608cad0780b4c86acead99a11293ec2
author: Alexandru Avadanii <Alexandru.Avadanii@enea.com> Sun Sep 23 03:57:27 2018 +0200
committer: Ivan Berezovskiy <iberezovskiy@mirantis.com> Mon Sep 30 12:22:57 2019 +0000
tree: 5982c685397046e1f7dfea67adb63829d0d7625a
parent: 99c2fcdabfc32e8e982abc7b9721ad3f62458754
diff --git a/README.rst b/README.rst
index a82ef15..0010113 100644
--- a/README.rst
+++ b/README.rst

@@ -772,12 +772,16 @@
             machines:
               - kvm01
               - kvm02
-            timeout: 1200 # in seconds
+            timeout: {{ region.timeout.ready }}
+            attempts: {{ region.timeout.attempts }}
             req_status: "Ready"
       - require:
         - cmd: maas_login_admin
       ...
 
+The timeout setting is taken from the reclass pillar data.
+If the pillar data is not defined, it will use the default value.
+
 If the module is run without any extra parameters,
 ``wait_for_machines_ready`` will wait for the machines defined in salt.
 In this case, it is useful to skip some machines:
@@ -792,7 +796,8 @@
       module.run:
       - name: maas.wait_for_machine_status
       - kwargs:
-            timeout: 1200 # in seconds
+            timeout: {{ region.timeout.deployed }}
+            attempts: {{ region.timeout.attempts }}
             req_status: "Deployed"
             ignore_machines:
                - kvm01 # in case it's broken or whatever

diff --git a/_modules/maas.py b/_modules/maas.py
index c02f104..7db6bf3 100644
--- a/_modules/maas.py
+++ b/_modules/maas.py

@@ -921,6 +921,7 @@
             req_status: string; Polling status
             machines:   list; machine names
             ignore_machines: list; machine names
+            attempts:   int; max number of automatic hard retries (0 = off)
         :ret: True
                  Exception - if something fail/timeout reached
         """
@@ -929,6 +930,8 @@
         req_status = kwargs.get("req_status", "Ready")
         to_discover = kwargs.get("machines", None)
         ignore_machines = kwargs.get("ignore_machines", None)
+        attempts = kwargs.get("attempts", 0)
+        counter = {}
         if not to_discover:
             try:
                 to_discover = __salt__['config.get']('maas')['region'][
@@ -941,13 +944,43 @@
             total = [x for x in to_discover if x not in ignore_machines]
         started_at = time.time()
         while len(total) <= len(to_discover):
-            for m in to_discover:
+            for machine in to_discover:
                 for discovered in MachinesStatus.execute()['machines']:
-                    if m == discovered['hostname'] and \
-                            discovered['status'].lower() == req_status.lower():
-                        if m in total:
-                            total.remove(m)
-
+                    if machine == discovered['hostname'] and machine in total:
+                        if discovered['status'].lower() == req_status.lower():
+                            total.remove(machine)
+                        elif attempts > 0 and (machine not in counter or counter[machine] < attempts):
+                            status = discovered['status']
+                            sid = discovered['system_id']
+                            cls._maas = _create_maas_client()
+                            if status in ['Failed commissioning', 'New']:
+                                cls._maas.delete(u'api/2.0/machines/{0}/'
+                                    .format(sid))
+                                Machine().process()
+                                LOG.info('Machine {0} deleted'.format(sid))
+                                counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
+                            elif status in ['Failed testing']:
+                                data = {}
+                                action = 'override_failed_testing'
+                                cls._maas.post(u'api/2.0/machines/{0}/'
+                                    .format(sid), action, **data)
+                                LOG.info('Machine {0} overridden'.format(sid))
+                                counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
+                            elif status in ['Failed deployment', 'Allocated']:
+                                data = {}
+                                cls._maas.post(u'api/2.0/machines/{0}/'
+                                    .format(sid), 'mark_broken', **data)
+                                LOG.info('Machine {0} marked broken'.format(sid))
+                                cls._maas.post(u'api/2.0/machines/{0}/'
+                                    .format(sid), 'mark_fixed', **data)
+                                LOG.info('Machine {0} marked fixed'.format(sid))
+                                if machine in counter and counter[machine]:
+                                    data['testing_scripts'] = 'fio'
+                                    cls._maas.post(u'api/2.0/machines/{0}/'
+                                        .format(sid), 'commission', **data)
+                                    LOG.info('Machine {0} fio test'.format(sid))
+                                DeployMachines().process()
+                                counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
             if len(total) <= 0:
                 LOG.debug(
                     "Machines:{} are:{}".format(to_discover, req_status))
@@ -959,7 +992,9 @@
                 "Waiting status:{} "
                 "for machines:{}"
                 "\nsleep for:{}s "
-                "Timeout:{}s".format(req_status, total, poll_time, timeout))
+                "Timeout:{}s ({}s left)"
+                .format(req_status, total, poll_time, timeout,
+                    timeout - (time.time() - started_at)))
             time.sleep(poll_time)
 
 

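The polling loop touched by this hunk waits for a condition, sleeps for poll_time between checks, and now logs how much of the timeout is left. A minimal standalone sketch of that pattern follows. The function wait_until and its check argument are hypothetical and are not part of maas.py. Raising RuntimeError on timeout is an assumption, based only on the docstring's note that an exception is raised when the timeout is reached.

```python
import time


def wait_until(check, timeout, poll_time=1.0):
    """Poll check() until it returns True or timeout seconds elapse.

    Hypothetical helper illustrating the timeout bookkeeping; it is not
    the maas.py implementation.
    """
    started_at = time.time()
    while True:
        if check():
            return True
        elapsed = time.time() - started_at
        if elapsed >= timeout:
            raise RuntimeError('timeout of {}s reached'.format(timeout))
        # Same arithmetic as the "({}s left)" value added to the log message.
        remaining = timeout - elapsed
        # Never sleep past the deadline.
        time.sleep(min(poll_time, remaining))


calls = {'n': 0}


def ready_on_third_call():
    calls['n'] += 1
    return calls['n'] >= 3


wait_until(ready_on_third_call, timeout=5, poll_time=0)
```
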
diff --git a/maas/machines/wait_for_deployed.sls b/maas/machines/wait_for_deployed.sls
index ebeedac..628c1be 100644
--- a/maas/machines/wait_for_deployed.sls
+++ b/maas/machines/wait_for_deployed.sls

@@ -8,6 +8,8 @@
   module.run:
   - name: maas.wait_for_machine_status
   - kwargs:
-        req_status: "Deployed"
+      req_status: "Deployed"
+      timeout: {{ region.timeout.deployed }}
+      attempts: {{ region.timeout.attempts }}
   - require:
     - cmd: maas_login_admin

diff --git a/maas/machines/wait_for_ready.sls b/maas/machines/wait_for_ready.sls
index c5d3c28..3e8a0f1 100644
--- a/maas/machines/wait_for_ready.sls
+++ b/maas/machines/wait_for_ready.sls

@@ -7,5 +7,8 @@
 wait_for_machines_ready:
   module.run:
   - name: maas.wait_for_machine_status
+  - kwargs:
+      timeout: {{ region.timeout.ready }}
+      attempts: {{ region.timeout.attempts }}
   - require:
     - cmd: maas_login_admin

diff --git a/maas/map.jinja b/maas/map.jinja
index b4f3ac7..09a9fa2 100644
--- a/maas/map.jinja
+++ b/maas/map.jinja

@@ -29,6 +29,10 @@
   bind:
     host: 0.0.0.0
     port: 80
+  timeout:
+    deployed: 1800
+    ready: 900
+    attempts: 10
 {%- endload %}
 
 {%- set region = salt['grains.filter_by'](region_defaults, merge=salt['pillar.get']('maas:region', {})) %}

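The map.jinja change relies on pillar values under maas:region overriding the defaults key by key, so a pillar that sets only timeout.deployed still inherits timeout.ready. The sketch below approximates that behaviour with a recursive dictionary merge. merge_defaults is a hypothetical helper written for illustration; it is not Salt's implementation of grains.filter_by, and the pillar values are made up.

```python
import copy


def merge_defaults(defaults, overrides):
    """Return defaults with overrides applied recursively (hypothetical helper)."""
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Merge nested mappings key by key instead of replacing them.
            result[key] = merge_defaults(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Default timeouts as in map.jinja; the pillar override is illustrative.
region_defaults = {'timeout': {'deployed': 1800, 'ready': 900}}
pillar_region = {'timeout': {'deployed': 900}}

region = merge_defaults(region_defaults, pillar_region)
print(region['timeout']['deployed'], region['timeout']['ready'])
```
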
diff --git a/tests/pillar/maas_region.sls b/tests/pillar/maas_region.sls
index 668fc81..e482b61 100644
--- a/tests/pillar/maas_region.sls
+++ b/tests/pillar/maas_region.sls

@@ -35,3 +35,7 @@
       username: maas
       port: 5432
     salt_master_ip: 127.0.0.1
+    timeout:
+      deployed: 900
+      ready: 900
+      attempts: 2