Ticket #601 (new defect)

Opened 7 months ago

Last modified 7 months ago

FM/AVSv: SCAP failure can cause duplicate active for 2N redundant model

Reported by: anders
Owned by:
Priority: major
Milestone: PL 3.0.1
Component: FM
Version: 3.0.0-GA
Keywords: AVSv
Cc:
patch waiting for maintainer: no

Description

See discussion thread in:

http://list.opensaf.org/archives/devel/2009-May/004096.html

If the SCAP process crashes then this should lead to IMMEDIATE node
restart (or at least restart of middleware and SAF application at
that node).

The current solution allows applications to continue executing
for 10 seconds. The standby is then promoted to active, in parallel
with an order from the FM on the standby to the FM on the
"active in demise" to restart.

This solution is unreliable (we don't know whether the FM on
the old active will comply), dangerous (since we allow a node
with extremely serious AVSv problems to continue executing), and
defective (since it tends to cause duplicate active execution
in the 2N redundancy model).

The only reason I don't classify the ticket as critical is that the
problem should be rare in a real system. We have only seen the
problem when testing by manually killing SCAP.

I have provided a simple illustrative patch that shows approximately
what should be done. In essence, when AVA detects loss of contact
with (the local) AVND, it should terminate its hosting process.

In addition, one of the processes/AVAs should order the node
restart AND send a message to the peer FM that it is going down,
which will cut short the 10-second waiting time for failover.
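
The proposed behaviour can be sketched roughly as below. This is not
the attached patch, only an illustration of the decision logic. The
names ava_ctx, avnd_loss_plan and plan_on_avnd_loss are made up for
this sketch and are not OpenSAF APIs, and how the single escalating
process is chosen is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process context; not an OpenSAF type. */
struct ava_ctx {
    /* True for the one process on the node chosen to escalate.
     * How that process is selected is an assumption left open. */
    bool is_node_escalator;
};

/* What a process should do once AVA has lost contact with the local AVND. */
struct avnd_loss_plan {
    bool notify_peer_fm;      /* tell the peer FM this node is going down */
    bool order_node_restart;  /* request an immediate node restart */
    bool terminate_self;      /* terminate the hosting process */
};

/* Decide the reaction to loss of contact with the local AVND.
 * Every process terminates; only the escalator additionally notifies
 * the peer FM and orders the node restart. The escalator is expected
 * to carry out those two steps before it terminates. */
struct avnd_loss_plan plan_on_avnd_loss(const struct ava_ctx *ctx)
{
    assert(ctx != NULL);

    struct avnd_loss_plan plan = {
        .notify_peer_fm = false,
        .order_node_restart = false,
        .terminate_self = true,
    };

    if (ctx->is_node_escalator) {
        plan.notify_peer_fm = true;
        plan.order_node_restart = true;
    }
    return plan;
}
```

The early notice to the peer FM is what lets the standby take over
without waiting out the 10 seconds, and terminating every hosting
process means the old active cannot keep running 2N components while
the standby is being promoted.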

Attachments

Change History

Changed 7 months ago by anders

A similar problem exists on payloads, but payloads have no FM.

If the PCAP process crashes, then this is detected by FM on the active controller,
but no action is taken. The broken payload continues to execute indefinitely with applications running.
