Multithreading is one of Java’s most useful features. Processing multiple requests concurrently keeps an application responsive to its users and makes better use of resources. As helpful as the feature is, multithreading issues are notoriously difficult to detect and debug. Concurrency bugs are extremely hard to find reliably by testing because they depend on the non-deterministic scheduling of concurrent threads.
Server-side Java programs are typically multithreaded: backend application servers such as Tomcat and WildFly start a new thread (or pick one from a pool) for every user request. Medium to large systems process a huge number of parallel requests across many threads. If a program is not written correctly for a multithreaded JVM, it will have intermittent defects that elude even the most rigorous testing regimes and sneak into production.
With multithreading issues lurking in your code, you will have serious problems that hurt your customers’ experience. Believe me. I have seen production web applications run for years with multithreading issues troubling their customers daily, while the team was helpless to find the root cause. With due respect to such teams, my point is that multithreading defects are by nature serious and difficult to identify. They are intermittent and sporadic. Another problem is that we cannot anticipate such issues during development unless we know that the object we are using is not thread-safe.
In this blog, I am not talking about the types of multithreading issues or how to avoid them, and I am not advising on related best practices. There is plenty of technical material available online on those topics. Instead, I am talking about the options available to you for finding and fixing such multithreading issues in production, after the fact. Know that no software is bug-free and all types of defects sneak into production.
It is very difficult, in some cases even for experts, to write thread-safe programs. For example, most Java applications use java.text.SimpleDateFormat objects to format date strings for display. But how many of us know that SimpleDateFormat objects are not thread-safe? Many developers think that a cached static SimpleDateFormat instance is the best way to share the resource. So they declare a static field holding one instance, share it across all threads and feel accomplished after their unit tests pass. In reality, this code can fail in production under even moderate load.
The same problem applies to Xerces DOM objects, which are also not thread-safe. It was a shared Xerces DOM object that caused the multithreading issues in the production environment of one of my clients, who had to live with them for years. Read their story here.
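To make the SimpleDateFormat problem concrete, here is a minimal sketch rather than a definitive implementation. It runs a formatter from several threads and counts wrong or failed results. The class name SharedFormatterDemo, the pattern, and the thread and call counts are my own illustrative assumptions, not taken from any real application. The shared SimpleDateFormat run will usually, though not on every run, report failures, while the two alternatives (an immutable java.time.format.DateTimeFormatter and a per-thread SimpleDateFormat) should report none.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

// Hypothetical demo class; all names here are illustrative assumptions.
final class SharedFormatterDemo {
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // Risky: one mutable SimpleDateFormat shared by every thread.
    private static final SimpleDateFormat SHARED_UNSAFE = newUtcFormat();

    // Alternative 1: DateTimeFormatter is immutable and thread-safe.
    private static final DateTimeFormatter SHARED_SAFE =
            DateTimeFormatter.ofPattern(PATTERN).withZone(ZoneOffset.UTC);

    // Alternative 2: each thread gets its own SimpleDateFormat instance.
    private static final ThreadLocal<SimpleDateFormat> PER_THREAD =
            ThreadLocal.withInitial(SharedFormatterDemo::newUtcFormat);

    private static SimpleDateFormat newUtcFormat() {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format;
    }

    static String formatUnsafe(long millis) {
        return SHARED_UNSAFE.format(new Date(millis));
    }

    static String formatSafe(long millis) {
        return SHARED_SAFE.format(Instant.ofEpochMilli(millis));
    }

    static String formatPerThread(long millis) {
        return PER_THREAD.get().format(new Date(millis));
    }

    /** Calls the formatter from several threads and returns the number of wrong or failed results. */
    static int countFailures(LongFunction<String> formatter, int threads, int callsPerThread) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final long base = t * 86_400_000L; // each thread formats a different day
            pool.execute(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    long millis = base + i * 1000L;
                    try {
                        // A private, unshared formatter provides the expected answer.
                        String expected = newUtcFormat().format(new Date(millis));
                        if (!expected.equals(formatter.apply(millis))) {
                            failures.incrementAndGet();
                        }
                    } catch (RuntimeException e) {
                        failures.incrementAndGet(); // corrupted state can also throw
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("shared SimpleDateFormat failures: "
                + countFailures(SharedFormatterDemo::formatUnsafe, 8, 5_000));
        System.out.println("DateTimeFormatter failures: "
                + countFailures(SharedFormatterDemo::formatSafe, 8, 5_000));
        System.out.println("per-thread SimpleDateFormat failures: "
                + countFailures(SharedFormatterDemo::formatPerThread, 8, 5_000));
    }
}
```

Note that a clean run of the shared version proves nothing. Whether the race shows up depends on thread scheduling, which is exactly why unit tests tend to pass while production fails.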
Common symptoms of multithreading issues in production:
1) Data corruption: Data corruption issues are frequent and serious. They happen when multiple threads race to change the state of a shared object. Once the object’s state is corrupted, all kinds of unexpected runtime exceptions are thrown, which seriously impacts the end-user experience. The exceptions thrown when a SimpleDateFormat object is shared include NumberFormatException, ArrayIndexOutOfBoundsException and NullPointerException. Because user requests fail intermittently, with different exceptions at different times, debugging becomes much more difficult. The situation is even worse if these exceptions are swallowed, which leaves no trace at all in the log.
2) Deadlocks: Deadlocks are rare but serious. They happen because of incorrectly written synchronized blocks or because locks are not acquired in a consistent order. The result is threads that block forever waiting for each other, holding on to their locks without doing any work, so the affected requests never complete.
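The deadlock symptom above can be reproduced and confirmed from inside the JVM. The following is a minimal sketch, not a production monitor: two threads take the same two locks in opposite order, and the standard java.lang.management.ThreadMXBean API is then polled to see whether the JVM reports a deadlock. The class name DeadlockDemo, the worker names and the five-second polling window are my own illustrative assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; all names here are illustrative assumptions.
final class DeadlockDemo {

    /** Starts a daemon thread that takes `first`, waits for its peer, then tries to take `second`. */
    private static void startWorker(String name, Object first, Object second,
                                    CountDownLatch bothHoldFirst) {
        Thread worker = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await(); // wait until the peer holds its first lock too
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) { // blocks forever: the peer already holds this lock
                    System.out.println(name + " acquired both locks");
                }
            }
        }, name);
        worker.setDaemon(true); // lets the JVM exit even though the thread stays stuck
        worker.start();
    }

    /** Creates a deadlock on purpose and returns how many threads the JVM reports as deadlocked. */
    static int createAndDetectDeadlock() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Opposite acquisition order is what makes the deadlock possible.
        startWorker("worker-1", lockA, lockB, bothHoldFirst);
        startWorker("worker-2", lockB, lockA, bothHoldFirst);

        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + 5_000;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = threadBean.findDeadlockedThreads(); // null when no deadlock is found
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads reported: " + createAndDetectDeadlock());
    }
}
```

The latch is only there to make the demo deterministic. In a real application the same opposite-order locking deadlocks only when the timing happens to line up, which is why the symptom is so rare and so hard to reproduce.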
But how do we know if our production software has concurrency or multithreading bugs? Take a look at your user-reported issues or your bug database and look for phrases like:
- The application occasionally does not respond
- The application outputs, rarely, an error screen
- The system sometimes behaves in an erratic manner
- I could not reproduce, but it just now glitched out
If you find one or more cases with phrases similar to the examples above, then your application likely has concurrency or related multithreading bugs.
Know that there are not many tools available to detect and root cause multithreading issues. The reason is simple: the issues and their symptoms are intermittent and sporadic because of the underlying system’s non-deterministic scheduling of concurrent threads. Debugging multithreading issues in a development environment is already a very difficult job, and it is even more difficult and painful to detect and debug such issues in production.
1) jVisualVM: jVisualVM is one of the best tools for debugging multithreading issues. It was bundled with the JDK, in the bin directory, up to JDK 8; for later JDKs it is available as a separate VisualVM download. With jVisualVM you can debug both local and remote JVM instances. The problem with jVisualVM is that you must reproduce the defect in order to debug it effectively and find the root cause. A detailed guide to using jVisualVM is out of the scope of this blog. You can find more info about how to use jVisualVM here.
2) Seagence: Seagence is a Realtime Defect Monitoring and Root Cause Automation tool that proactively detects production defects, with root cause, as they manifest in real time. It plugs into any production Java application using a tiny Java agent and starts monitoring.
Seagence brings a new approach to production monitoring. Using its ExecutionPath Technology and machine learning, Seagence detects defects as they occur and raises alerts. How is Seagence capable of doing this? Its ExecutionPath Technology differentiates successfully executed requests from failed requests, and machine learning helps separate them into different groups or clusters.
What separates Seagence from other tools is that it detects not only defects that surface as HTTP 500s or internal server errors but also defects caused by any type of exception, whether swallowed or caught, even when the response is an HTTP 200. With the defect and root cause provided by Seagence in hand, you can fix your broken code without needing any debugging.
Both of the above tools use dynamic analysis. If you are interested in a static code analysis tool instead, SonarQube is a very popular tool that you can plug into your CI/CD pipeline.
No software is bug-free. No matter how much testing you do, defects still sneak into production and create trouble for end users. Your best bet for improving the end-user experience is a proper production monitoring tool that proactively detects defects on its own, with root cause, and can also help you debug your production application when you want to find the root cause of a defect. Seagence is one such tool. You can start using Seagence for free here.