Context:
We recently observed cluster updates in which the update was not run because of a problem reading the metadata version. Skipping the update was originally intended to gracefully handle the case where no schema exists yet. In this case, however, a schema was present but probably not yet readable: Elastic had just been updated as well, so the indices were likely still in a recovery state, and retrieving the metadata document from Elastic raised an exception. The update was nevertheless skipped and therefore missed.
AT:
- If reading the metadata document from the index fails, the update should fail hard instead of silently succeeding, so that it gets retried eventually
- If no metadata index or document is present, the update should be skipped without failure
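
The two criteria above could be sketched as follows. This is only an illustration in Python, not the actual update code: the exception types MetadataNotFound and MetadataReadError, the function run_update, and its two callback parameters are all hypothetical names chosen for this example. It assumes the metadata reader can tell "index or document missing" apart from "read failed".

```python
class MetadataNotFound(Exception):
    """Hypothetical: the metadata index or document does not exist."""


class MetadataReadError(Exception):
    """Hypothetical: reading the metadata failed (e.g. indices still recovering)."""


def run_update(read_metadata_version, apply_update):
    """Return "skipped" or "updated"; read failures propagate to the caller.

    Both arguments are hypothetical callbacks: read_metadata_version() returns
    the current schema version, apply_update(version) executes the update.
    """
    try:
        version = read_metadata_version()
    except MetadataNotFound:
        # No schema present yet: skip the update without failing.
        return "skipped"
    # Any other exception (such as MetadataReadError) is deliberately not
    # caught here, so the update fails hard and can be retried later.
    apply_update(version)
    return "updated"
```

The point of the sketch is that only the "not found" case is caught. A broad catch around the read would reproduce the observed behavior, where a temporarily unreadable index looks the same as a missing schema.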