Ticket #632 (new defect)

Opened 6 months ago

Last modified 6 months ago

message queue server hangs after certain time

Reported by: umesh.patel@… Owned by:
Priority: blocker Milestone: PL 3.0.1
Component: MQSv Version: 2.0.0
Keywords: Cc: pratima.gupta@…
patch waiting for maintainer: no

Description

Under high load, message queue library threads get stuck on a pthread mutex lock (deadlock). The message queue server hangs and stops taking further messages from the queue.

We are using the message queue in synchronous mode, that is, the client waits for the server to reply. At 100 messages per second, the problem hits after only 10-15 minutes. It is highly repeatable and is observed frequently for message sizes above 3500 bytes.

I am attaching a pstack file for your reference.

Attachments

server_pstack.txt (6.5 kB) - added by Umesh Patel 6 months ago.
pstack of the message queue server
msg_q_deadlock_3.0.txt (8.1 kB) - added by umesh.patel@… 6 months ago.

Change History

Changed 6 months ago by Umesh Patel

pstack of the message queue server

Changed 6 months ago by umesh.patel@…

Changed 6 months ago by umesh.patel@…

I have downloaded the 3.0 GA release and tried the same scenarios. The message queue server hangs in this release as well. I have attached a pstack of the server process. I am also attaching the message queue agent logs, which show continuous error messages after the problem occurs. There is no sign of errors in the director and node director log files.

ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.312 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.312 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.312 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.313 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.313 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.313 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.313 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.313 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.313 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.314 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391419 43 1 MQSv 13Jul2009_16.46.42.313 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391419 43 1 MQSv 13Jul2009_16.46.42.313 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391419 43 1 MQSv 13Jul2009_16.46.42.313 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391383 43 1 MQSv 13Jul2009_16.46.42.314 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391383 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391383 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.314 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391443 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.314 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391427 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391435 43 1 MQSv 13Jul2009_16.46.42.314 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391435 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391435 43 1 MQSv 13Jul2009_16.46.42.314 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391435 43 1 MQSv 13Jul2009_16.46.42.315 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131
ERROR : 0x0002010f 970391435 43 1 MQSv 13Jul2009_16.46.42.315 mqa_api.c: 3786:MQA - The send part of the message sendreceive failed:5
ERROR : 0x0002010f 970391435 43 1 MQSv 13Jul2009_16.46.42.315 mqa_api.c: 3899:MQA - MsgQ Svc msgsendreceive failed:5
ERROR : 0x0002010f 970391431 43 1 MQSv 13Jul2009_16.46.42.315 mqa_mds.c: 912:MQA - Message Send through MDS Failure:131

Changed 6 months ago by anonymous

From the pstack of the process and the source of the services, it is clear that there is a deadlock between two threads (thread #10 and thread #6). Thread #6 has acquired the callback lock, that is, the call in (m_NCS_LOCK(&mqa_cb->cb_lock, NCS_LOCK_WRITE) != NCSCC_RC_SUCCESS), and then calls the API ncs_tmr_free to destroy the timer it registered. In ncs_tmr_free, it tries to take the timer lock, that is, m_NCS_LOCK(&gl_tcb.safe.enter_lock, NCS_LOCK_WRITE);. This lock is already held by thread #10 in sysfTmrExpiry(). Thread #10 then tries to take the callback lock in mqa_node_timeout_handler, so each thread is waiting for the lock the other one holds.

All the remaining threads are waiting for the callback lock.
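The lock ordering described above can be reproduced in miniature. The following is only a sketch, written in Python rather than the C of the actual services: cb_lock and timer_lock are stand-ins for mqa_cb->cb_lock and gl_tcb.safe.enter_lock, and run_inversion, the thread names and the timeout are hypothetical. A timeout is used on the second acquire so that the sketch reports the circular wait instead of hanging the way the real server does.

```python
import threading


def run_inversion(timeout=0.2):
    """Take two locks in opposite order from two threads.

    Returns a dict mapping thread name -> True if the second lock
    was acquired within `timeout` seconds, False otherwise.
    """
    cb_lock = threading.Lock()      # stand-in for mqa_cb->cb_lock
    timer_lock = threading.Lock()   # stand-in for gl_tcb.safe.enter_lock
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            barrier.wait()
            # Ask for the other lock while still holding our own.
            got = second.acquire(timeout=timeout)
            results[name] = got
            if got:
                second.release()
            # Keep the first lock held until both attempts have finished.
            barrier.wait()

    # "thread6": callback lock first, then timer lock (ncs_tmr_free path).
    t6 = threading.Thread(target=worker, args=("thread6", cb_lock, timer_lock))
    # "thread10": timer lock first, then callback lock (timer expiry path).
    t10 = threading.Thread(target=worker, args=("thread10", timer_lock, cb_lock))
    t6.start()
    t10.start()
    t6.join()
    t10.join()
    return results


if __name__ == "__main__":
    print(run_inversion())
```

Neither thread obtains its second lock, because each one holds the lock the other is asking for. With a blocking lock call and no timeout, as in the pstack, both threads wait forever.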

This scenario is repeatable, because under high load a timer can expire before the message has been processed.

Since the timer and message queue services are developed as separate services, their locking mechanisms should be independent. One problem I found is that the timeout callback is called even though the callback lock is taken. The callback lock should be released first, and only then should the timeout callback be called.
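One way to read this suggestion is that no thread should call into the other service while still holding its own lock. The sketch below illustrates that idea only; it is not the actual MQSv or timer code. It is written in Python, and cb_lock, timer_lock, run_without_nesting and the thread names are hypothetical stand-ins. Each thread releases its first lock before asking for the second one, so a circular wait cannot form.

```python
import threading


def run_without_nesting(timeout=2.0):
    """Two threads use both locks, but never hold one while taking the other.

    Returns a dict mapping thread name -> True if the second lock
    was acquired within `timeout` seconds.
    """
    cb_lock = threading.Lock()      # stand-in for the callback lock
    timer_lock = threading.Lock()   # stand-in for the timer lock
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            # Both threads hold their first lock at the same moment,
            # which is the situation that deadlocks in the nested version.
            barrier.wait()
            # Decide here what has to be done in the other service,
            # but do not call into it yet.
        # The first lock is released; only now take the second one.
        got = second.acquire(timeout=timeout)
        results[name] = got
        if got:
            second.release()

    t6 = threading.Thread(target=worker, args=("thread6", cb_lock, timer_lock))
    t10 = threading.Thread(target=worker, args=("thread10", timer_lock, cb_lock))
    t6.start()
    t10.start()
    t6.join()
    t10.join()
    return results


if __name__ == "__main__":
    print(run_without_nesting())
```

Both threads obtain their second lock here. The cost of this pattern is that state protected by the first lock may change while it is released, so it has to be re-checked if the lock is taken again afterwards.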
