maas.py: Extend wait_for states, add wait_for_* attempts arg

maas.py: Extend wait_for states with timeout param

Extend the wait_for states with a timeout parameter.
The timeout value is taken from reclass pillar data if
defined. Otherwise, the states use the default value.

Based on Ting's PR [1], slightly refactored.

[1] https://github.com/salt-formulas/salt-formula-maas/pull/34

Signed-off-by: ting wu <ting.wu@enea.com>
Signed-off-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>

maas.py: wait_for_*: Add attempts arg

Introduce a new parameter that sets the maximum number of automatic
recovery attempts for common failures with machine operations.
If not present in pillar data, it defaults to 0 (OFF).

Common error states, possible causes and automatic recovery patterns:
* New
  - usually indicates issues with BMC connectivity (no network route),
    but on rare occasions it happens due to the MaaS API being flaky;
  - fix: delete the machine, (re)process machine definitions;
* Failed commissioning
  - various causes, usually a simple retry works;
  - fix: delete the machine, (re)process machine definitions;
* Failed testing
  - incompatible hardware, missing drivers etc.;
  - usually consistent and board-specific;
  - fix: override failed testing;
* Allocated
  - on rare occasions nodes get stuck in this state instead of 'Deploy';
  - fix: mark-broken, mark-fixed; if it failed at least once before,
    perform a fio test (fixes another unrelated spurious issue with
    encrypted disks from previous deployments); (re)deploy machines;
* Failed deployment
  - various causes, usually a simple retry works;
  - fix: same as for nodes stuck in 'Allocated';
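
The mapping above can be summarised as a small lookup table plus a
per-machine attempt counter. The following is only an illustrative sketch:
RECOVERY_STEPS, plan_recovery and the step names are hypothetical and do
not exist in maas.py, which performs the MaaS API calls inline.

```python
# Illustrative sketch only; all names here are hypothetical, not from maas.py.
RECOVERY_STEPS = {
    'New': ['delete', 'process_machines'],
    'Failed commissioning': ['delete', 'process_machines'],
    'Failed testing': ['override_failed_testing'],
    'Allocated': ['mark_broken', 'mark_fixed', 'deploy'],
    'Failed deployment': ['mark_broken', 'mark_fixed', 'deploy'],
}


def plan_recovery(status, machine, counter, attempts=0):
    """Return the recovery steps for a machine, or [] when none apply."""
    used = counter.get(machine, 0)
    # attempts == 0 means automatic recovery is switched off.
    if attempts <= 0 or used >= attempts or status not in RECOVERY_STEPS:
        return []
    steps = list(RECOVERY_STEPS[status])
    # A machine that was already recovered once gets a fio test
    # before it is (re)deployed.
    if used > 0 and 'deploy' in steps:
        steps.insert(steps.index('deploy'), 'fio_test')
    counter[machine] = used + 1
    return steps
```

With attempts=2, a machine seen first in 'New' and later in 'Allocated'
would get the delete/reprocess steps once, then the mark-broken,
mark-fixed, fio test and deploy steps, after which no further recovery
is attempted.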

Related: PROD-28390(PROD:28390)

Change-Id: Ifb7dd9f8fcfbbed557e47d8fdffb1f963604fb15
Signed-off-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>
(cherry picked from commit 4fa108e39fbf4da924f0bcbf01ff1625f13910a1)
diff --git a/_modules/maas.py b/_modules/maas.py
index c02f104..7db6bf3 100644
--- a/_modules/maas.py
+++ b/_modules/maas.py
@@ -921,6 +921,7 @@
             req_status: string; Polling status
             machines:   list; machine names
             ignore_machines: list; machine names
+            attempts:   int; max number of automatic hard retries
         :ret: True
                  Exception - if something fail/timeout reached
         """
@@ -929,6 +930,8 @@
         req_status = kwargs.get("req_status", "Ready")
         to_discover = kwargs.get("machines", None)
         ignore_machines = kwargs.get("ignore_machines", None)
+        attempts = kwargs.get("attempts", 0)
+        counter = {}
         if not to_discover:
             try:
                 to_discover = __salt__['config.get']('maas')['region'][
@@ -941,13 +944,43 @@
             total = [x for x in to_discover if x not in ignore_machines]
         started_at = time.time()
         while len(total) <= len(to_discover):
-            for m in to_discover:
+            for machine in to_discover:
                 for discovered in MachinesStatus.execute()['machines']:
-                    if m == discovered['hostname'] and \
-                            discovered['status'].lower() == req_status.lower():
-                        if m in total:
-                            total.remove(m)
-
+                    if machine == discovered['hostname'] and machine in total:
+                        if discovered['status'].lower() == req_status.lower():
+                            total.remove(machine)
+                        elif attempts > 0 and (machine not in counter or counter[machine] < attempts):
+                            status = discovered['status']
+                            sid = discovered['system_id']
+                            cls._maas = _create_maas_client()
+                            if status in ['Failed commissioning', 'New']:
+                                cls._maas.delete(u'api/2.0/machines/{0}/'
+                                    .format(sid))
+                                Machine().process()
+                                LOG.info('Machine {0} deleted'.format(sid))
+                                counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
+                            elif status in ['Failed testing']:
+                                data = {}
+                                action = 'override_failed_testing'
+                                cls._maas.post(u'api/2.0/machines/{0}/'
+                                    .format(sid), action, **data)
+                                LOG.info('Machine {0} overridden'.format(sid))
+                                counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
+                            elif status in ['Failed deployment', 'Allocated']:
+                                data = {}
+                                cls._maas.post(u'api/2.0/machines/{0}/'
+                                    .format(sid), 'mark_broken', **data)
+                                LOG.info('Machine {0} marked broken'.format(sid))
+                                cls._maas.post(u'api/2.0/machines/{0}/'
+                                    .format(sid), 'mark_fixed', **data)
+                                LOG.info('Machine {0} marked fixed'.format(sid))
+                                if machine in counter and counter[machine]:
+                                    data['testing_scripts'] = 'fio'
+                                    cls._maas.post(u'api/2.0/machines/{0}/'
+                                        .format(sid), 'commission', **data)
+                                    LOG.info('Machine {0} fio test'.format(sid))
+                                DeployMachines().process()
+                                counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
             if len(total) <= 0:
                 LOG.debug(
                     "Machines:{} are:{}".format(to_discover, req_status))
@@ -959,7 +992,9 @@
                 "Waiting status:{} "
                 "for machines:{}"
                 "\nsleep for:{}s "
-                "Timeout:{}s".format(req_status, total, poll_time, timeout))
+                "Timeout:{}s ({}s left)"
+                .format(req_status, total, poll_time, timeout,
+                    timeout - (time.time() - started_at)))
             time.sleep(poll_time)