maas.py: Extend wait_for states with timeout param, add wait_for_* attempts arg
maas.py: Extend wait_for states with timeout param
Extend the wait_for states with a timeout parameter.
The timeout value is taken from reclass pillar data if
defined. Otherwise, the states use the default value.
Based on Ting's PR [1], slightly refactored.
[1] https://github.com/salt-formulas/salt-formula-maas/pull/34
Signed-off-by: ting wu <ting.wu@enea.com>
Signed-off-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>
maas.py: wait_for_*: Add attempts arg
Introduce a new parameter that allows a maximum number of automatic
recovery attempts for the common failures w/ machine operations.
If not present in pillar data, it defaults to 0 (OFF).
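A minimal sketch of the per-machine attempt budget described above, assuming
a plain dict keyed by hostname. The helper names `may_retry` and
`record_attempt` are hypothetical and only illustrate the intended semantics;
they are not part of maas.py:

```python
def may_retry(counter, machine, attempts):
    """Return True while `machine` still has recovery attempts left."""
    # attempts == 0 (the default) turns automatic recovery off entirely.
    return attempts > 0 and counter.get(machine, 0) < attempts


def record_attempt(counter, machine):
    """Count one more recovery attempt for `machine`."""
    counter[machine] = counter.get(machine, 0) + 1


counter = {}
if may_retry(counter, "node1", attempts=2):
    # ... a recovery action would run here ...
    record_attempt(counter, "node1")
```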
Common error states, possible cause and automatic recovery pattern:
* New
- usually indicates issues with BMC connectivity (no network route),
but on rare occasions it happens due to the MaaS API being flaky;
- fix: delete the machine, (re)process machine definitions;
* Failed commissioning
- various causes, usually a simple retry works;
- fix: delete the machine, (re)process machine definitions;
* Failed testing
- incompatible hardware, missing drivers etc.
- usually consistent and board-specific;
- fix: override failed testing
* Allocated
- on rare occasions nodes get stuck in this state instead of 'Deploy';
- fix: mark-broken, mark-fixed; if the machine failed at least once
before, also perform a fio test (fixes another unrelated spurious issue
with encrypted disks from previous deployments); then (re)deploy machines;
* Failed deployment
- various causes, usually a simple retry works;
- fix: same as for nodes stuck in 'Allocated';
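The recovery patterns listed above can be summarized as a lookup table. This
is only an illustrative sketch: `RECOVERY_STEPS` and `recovery_plan` are
hypothetical names, and the step labels are descriptive strings rather than
actual MaaS API calls:

```python
# Status -> ordered recovery steps, mirroring the list in this message.
RECOVERY_STEPS = {
    "New": ["delete", "process_machines"],
    "Failed commissioning": ["delete", "process_machines"],
    "Failed testing": ["override_failed_testing"],
    "Allocated": ["mark_broken", "mark_fixed", "deploy"],
    "Failed deployment": ["mark_broken", "mark_fixed", "deploy"],
}


def recovery_plan(status, previous_attempts=0):
    """Return the recovery steps for `status`, or [] if none apply."""
    steps = list(RECOVERY_STEPS.get(status, []))
    # A machine that already failed once gets a fio test before redeploy.
    if "deploy" in steps and previous_attempts > 0:
        steps.insert(steps.index("deploy"), "commission_with_fio")
    return steps


print(recovery_plan("Failed deployment", previous_attempts=1))
```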
Related: PROD-28390(PROD:28390)
Change-Id: Ifb7dd9f8fcfbbed557e47d8fdffb1f963604fb15
Signed-off-by: Alexandru Avadanii <Alexandru.Avadanii@enea.com>
(cherry picked from commit 4fa108e39fbf4da924f0bcbf01ff1625f13910a1)
diff --git a/_modules/maas.py b/_modules/maas.py
index c02f104..7db6bf3 100644
--- a/_modules/maas.py
+++ b/_modules/maas.py
@@ -921,6 +921,7 @@
req_status: string; Polling status
machines: list; machine names
ignore_machines: list; machine names
+ attempts: int; max number of automatic hard retries
:ret: True
Exception - if something fail/timeout reached
"""
@@ -929,6 +930,8 @@
req_status = kwargs.get("req_status", "Ready")
to_discover = kwargs.get("machines", None)
ignore_machines = kwargs.get("ignore_machines", None)
+ attempts = kwargs.get("attempts", 0)
+ counter = {}
if not to_discover:
try:
to_discover = __salt__['config.get']('maas')['region'][
@@ -941,13 +944,43 @@
total = [x for x in to_discover if x not in ignore_machines]
started_at = time.time()
while len(total) <= len(to_discover):
- for m in to_discover:
+ for machine in to_discover:
for discovered in MachinesStatus.execute()['machines']:
- if m == discovered['hostname'] and \
- discovered['status'].lower() == req_status.lower():
- if m in total:
- total.remove(m)
-
+ if machine == discovered['hostname'] and machine in total:
+ if discovered['status'].lower() == req_status.lower():
+ total.remove(machine)
+ elif attempts > 0 and (machine not in counter or counter[machine] < attempts):
+ status = discovered['status']
+ sid = discovered['system_id']
+ cls._maas = _create_maas_client()
+ if status in ['Failed commissioning', 'New']:
+ cls._maas.delete(u'api/2.0/machines/{0}/'
+ .format(sid))
+ Machine().process()
+ LOG.info('Machine {0} deleted'.format(sid))
+ counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
+ elif status in ['Failed testing']:
+ data = {}
+ action = 'override_failed_testing'
+ cls._maas.post(u'api/2.0/machines/{0}/'
+ .format(sid), action, **data)
+ LOG.info('Machine {0} overridden'.format(sid))
+ counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
+ elif status in ['Failed deployment', 'Allocated']:
+ data = {}
+ cls._maas.post(u'api/2.0/machines/{0}/'
+ .format(sid), 'mark_broken', **data)
+ LOG.info('Machine {0} marked broken'.format(sid))
+ cls._maas.post(u'api/2.0/machines/{0}/'
+ .format(sid), 'mark_fixed', **data)
+ LOG.info('Machine {0} marked fixed'.format(sid))
+ if machine in counter and counter[machine]:
+ data['testing_scripts'] = 'fio'
+ cls._maas.post(u'api/2.0/machines/{0}/'
+ .format(sid), 'commission', **data)
+ LOG.info('Machine {0} fio test'.format(sid))
+ DeployMachines().process()
+ counter[machine] = 1 if machine not in counter else (counter[machine] + 1)
if len(total) <= 0:
LOG.debug(
"Machines:{} are:{}".format(to_discover, req_status))
@@ -959,7 +992,9 @@
"Waiting status:{} "
"for machines:{}"
"\nsleep for:{}s "
- "Timeout:{}s".format(req_status, total, poll_time, timeout))
+ "Timeout:{}s ({}s left)"
+ .format(req_status, total, poll_time, timeout,
+ timeout - (time.time() - started_at)))
time.sleep(poll_time)