
9.3. Handling Resource Failure

By default, Pacemaker will attempt to recover failed resources by restarting them. However, failure recovery is highly configurable.

9.3.1. Failure Counts

Pacemaker tracks resource failures for each combination of node, resource, and operation (start, stop, monitor, etc.).
You can query the fail count for a particular node, resource, and/or operation using the crm_failcount command. For example, to see how many times the 10-second monitor for myrsc has failed on node1, run:
# crm_failcount --query -r myrsc -N node1 -n monitor -I 10s
If you omit the node, crm_failcount will use the local node. If you omit the operation and interval, crm_failcount will display the sum of the fail counts for all operations on the resource.
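To compare a resource's fail counts across several nodes, the query can be wrapped in a small loop. The following is only a sketch: failcounts_by_node is a hypothetical helper name, not part of Pacemaker, and the node names passed to it are assumed to be valid cluster nodes. It uses only the crm_failcount options shown above, and it omits the operation and interval so that each query returns the sum over all operations.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Pacemaker): print the summed fail
# count of one resource on each of the given nodes.
# Usage: failcounts_by_node RESOURCE NODE...
failcounts_by_node() {
    rsc=$1
    shift
    for node in "$@"; do
        # Label each result with the node it was queried on.
        printf '%s: ' "$node"
        # No -n/-I given, so crm_failcount sums all operations on the resource.
        crm_failcount --query -r "$rsc" -N "$node"
    done
}
```

For example, failcounts_by_node myrsc node1 node2 would run one query per node and print each result after the node name.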
You can use crm_resource --cleanup or crm_failcount --delete to clear fail counts. For example, to clear the above monitor failures, run:
# crm_resource --cleanup -r myrsc -N node1 -n monitor -I 10s
If you omit the resource, crm_resource --cleanup will clear failures for all resources. If you omit the node, it will clear failures on all nodes. If you omit the operation and interval, it will clear the failures for all operations on the resource.
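The way each omitted option widens the cleanup scope can be illustrated with a dry-run sketch. The cleanup_cmd helper below is hypothetical and only prints the crm_resource command line it would build; it does not contact the cluster. As an assumption of this sketch, the interval is added only when an operation is also given.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the crm_resource --cleanup command for
# the given scope. Empty or missing arguments are left out, which widens
# the scope of the cleanup.
# Usage: cleanup_cmd [RESOURCE [NODE [OPERATION [INTERVAL]]]]
cleanup_cmd() {
    rsc=${1:-} node=${2:-} op=${3:-} interval=${4:-}
    cmd="crm_resource --cleanup"
    if [ -n "$rsc" ]; then cmd="$cmd -r $rsc"; fi
    if [ -n "$node" ]; then cmd="$cmd -N $node"; fi
    if [ -n "$op" ]; then
        cmd="$cmd -n $op"
        # Sketch assumption: an interval only qualifies a named operation.
        if [ -n "$interval" ]; then cmd="$cmd -I $interval"; fi
    fi
    printf '%s\n' "$cmd"
}

cleanup_cmd myrsc node1 monitor 10s   # one operation of myrsc on node1
cleanup_cmd myrsc node1               # all operations of myrsc on node1
cleanup_cmd myrsc                     # myrsc on all nodes
cleanup_cmd                           # all resources on all nodes
```

Printing the command instead of running it makes it possible to review the scope before clearing anything.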

Note

Even when cleaning up only a single operation, all failed operations will disappear from the status display. This allows us to trigger a re-check of the resource’s current status.
Higher-level tools may provide other commands for querying and clearing fail counts.
The crm_mon tool shows the current cluster status, including any failed operations. To see the current fail counts for any failed resources, call crm_mon with the --failcounts option. This shows the fail counts per resource (that is, the sum of any operation fail counts for the resource).
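For use in scripts, the status display can be printed once instead of running interactively. This is a minimal sketch: show_failcounts is a hypothetical wrapper name, and it assumes a crm_mon version that supports the --one-shot option alongside --failcounts.

```shell
#!/bin/sh
# Hypothetical wrapper: print the cluster status once, including the
# per-resource fail counts, and then exit.
show_failcounts() {
    crm_mon --one-shot --failcounts
}
```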