Thursday, September 6, 2012

Serviceability gem: WebSphere Application Server hung thread detection and recovery

Customers often ask me a question that goes like this:

Is there a way that hung threads could be manually or automatically killed to prevent the maxed-out condition? For instance, once an alarm state is reached, could a script or a person look at the threads, identify "hung" threads, and kill them, thus breaking the logjam? As I understand it, if a thread is hung then no data is being sent until the thread process completes. And if the thread dies then the request is simply re-sent by the originating application. So killing a thread will not actually hurt anything. Is this correct? And if so, how would we go about doing that? Trying to think outside the box here. If we can't fix this, can we figure out a creative way to prevent it?

My response is as follows: Yes, there is a way out of the logjam. Prior to WebSphere Application Server 8.5 there are two ways to achieve what you are asking for:

Please read and comment. The disadvantage of this approach is that it aborts (that is, kills) the JVM on the first hung thread, which is disruptive because in-flight requests are killed. Please note that when you abort a JVM you may leave your system in an inconsistent state, since in-flight transactions and requests are terminated.

2. Use a Java client program that listens for thread-hung event notifications, automatically monitors possibly hung threads, and restarts the server via the nodeagent MBean: http://www.ibm.com/developerworks/websphere/library/techarticles/0412_kochuba/0412_kochuba.html
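To make the second option more concrete, here is a minimal sketch of the listener side of such a client. It is not the program from the linked article. It uses only the standard javax.management API and assumes that the hung and clear notification types are the strings "websphere.thread.hung" and "websphere.thread.clear" (check NotificationConstants for your release). The class name HungThreadWatcher, the maxHungThreads limit, and the restartAction hook are hypothetical names introduced for illustration; in a real client the hook would invoke the nodeagent MBean and the listener would be registered through the WebSphere admin client.

```java
import javax.management.Notification;
import javax.management.NotificationListener;

/** Sketch: counts hung-thread notifications and fires a restart hook once. */
final class HungThreadWatcher implements NotificationListener {
    // Assumed notification type strings; verify against NotificationConstants.
    static final String THREAD_HUNG = "websphere.thread.hung";
    static final String THREAD_CLEAR = "websphere.thread.clear";

    private final int maxHungThreads;     // hypothetical policy setting
    private final Runnable restartAction; // hypothetical hook (e.g. nodeagent MBean call)
    private int hungCount;
    private boolean restartRequested;

    HungThreadWatcher(int maxHungThreads, Runnable restartAction) {
        this.maxHungThreads = maxHungThreads;
        this.restartAction = restartAction;
    }

    @Override
    public synchronized void handleNotification(Notification n, Object handback) {
        String type = n.getType();
        if (THREAD_HUNG.equals(type)) {
            hungCount++;
            // Request the restart only once, when the limit is first reached.
            if (hungCount >= maxHungThreads && !restartRequested) {
                restartRequested = true;
                restartAction.run();
            }
        } else if (THREAD_CLEAR.equals(type) && hungCount > 0) {
            // A thread reported as hung has since completed its work.
            hungCount--;
        }
    }

    synchronized int hungCount() {
        return hungCount;
    }

    public static void main(String[] args) {
        HungThreadWatcher watcher =
            new HungThreadWatcher(2, () -> System.out.println("restart requested"));
        watcher.handleNotification(new Notification(THREAD_HUNG, "demo", 1L), null);
        watcher.handleNotification(new Notification(THREAD_CLEAR, "demo", 2L), null);
        watcher.handleNotification(new Notification(THREAD_HUNG, "demo", 3L), null);
        watcher.handleNotification(new Notification(THREAD_HUNG, "demo", 4L), null);
        System.out.println("hung threads tracked: " + watcher.hungCount());
    }
}
```

Waiting for more than one hung thread before restarting is a design choice of this sketch: it trades a slower reaction for fewer restarts caused by a single slow request.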

By default the hung thread detection threshold is 10 minutes (600 seconds). You may want to lower that to 5 minutes if you want a very responsive early warning system. To do so, set com.ibm.websphere.threadmonitor.threshold to 300 (the value is in seconds). Please note that com.ibm.websphere.threadmonitor.false.alarm.threshold is a different setting: it counts false alarms, not seconds.
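As a rough sketch of the settings involved, the thread monitor is tuned through custom properties on the application server (typically under Administration > Custom Properties for the server in the admin console; the exact navigation path may differ by release). The fragment below shows a 5-minute threshold and leaves the other two properties at what I understand to be their documented defaults. Treat those defaults as an assumption and check them against the documentation for your version.

```properties
# How often, in seconds, the monitor checks for hung threads (0 disables it)
com.ibm.websphere.threadmonitor.interval=180
# How long, in seconds, a thread may be active before it is reported as hung
com.ibm.websphere.threadmonitor.threshold=300
# Number of false alarms tolerated before the threshold is raised automatically
com.ibm.websphere.threadmonitor.false.alarm.threshold=100
```

A server restart is normally required for changes to these properties to take effect.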

In WebSphere Application Server 8.5 and later, you can use the Intelligent Management health monitoring and management subsystem to monitor the application server environment and take action when certain criteria are met. See Configuring Health Management and Custom health condition subexpression builder.
