Introduction:
Java’s multithreading is a powerful feature: the ability to process multiple requests concurrently keeps the application responsive to users and allows better utilization of resources. As helpful as the feature is, multithreading issues are notoriously difficult to detect and debug. Concurrency bugs are extremely difficult to find reliably by testing because they depend on the non-deterministic scheduling of concurrent threads.
As we all know, Java server applications are multithreaded: backend application servers like Tomcat, WildFly and others spin up a new thread (or pick one from a pool) for every user request. Medium to large systems process a huge number of parallel requests across many threads. If code is not written correctly for a multithreaded JVM, you will have intermittent defects that elude even the most rigorous testing regimes and sneak into production.
With multithreading issues lurking in your code, you will have serious problems that hurt your customer experience. Believe me. I have seen production web applications that ran for years with multithreading issues troubling their customers daily, while the team was helpless to find the root cause. With due respect to the team, my point is that multithreading defects are by nature serious and difficult to identify. They are intermittent and sporadic. Another problem is that we cannot anticipate such issues at development time unless we already know that the object we are using is not thread-safe.
In this blog, I am not talking about the types of multithreading issues or how to avoid them, and I am not advising on related best practices. There is plenty of technical material available online on those topics. Instead, I am talking about the options available to you for finding and fixing such multithreading issues in production, after the fact. Know that no software is bug-free and all types of defects sneak into production.
Problems:
It is very difficult, in some cases even for experts, to write thread-safe programs. For example, many Java applications use a java.text.SimpleDateFormat object to format date strings for display. But how many of us know that SimpleDateFormat objects are not thread-safe? Many developers actually think that a cached static instance of SimpleDateFormat is the best way to share resources. So they declare a static field holding one instance, share it across all threads, and feel accomplished after their unit tests pass. In reality, this code can fail in production even under moderate load.
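To make the pattern concrete, here is a minimal sketch of the risky static field next to one common thread-safe alternative. It assumes Java 8 or later, where java.time.format.DateTimeFormatter is immutable and safe to share. The class name DateFormatting, the "yyyy-MM-dd" pattern, and the helper methods are made up for illustration. Other fixes exist too, such as creating a new SimpleDateFormat per call or keeping one per thread in a ThreadLocal.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DateFormatting {

    // Risky pattern: one mutable SimpleDateFormat shared by every thread.
    // Kept here for contrast only; nothing in this sketch calls it.
    static final SimpleDateFormat SHARED_UNSAFE = new SimpleDateFormat("yyyy-MM-dd");

    // DateTimeFormatter is immutable, so a single shared instance is safe.
    static final DateTimeFormatter SHARED_SAFE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    static String format(LocalDate date) {
        return SHARED_SAFE.format(date);
    }

    // Formats the same date from many threads and counts the results
    // that differ from the expected string.
    static int countMismatches(int threads, int tasks) {
        LocalDate date = LocalDate.of(2024, 1, 15);
        String expected = "2024-01-15";
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<String> task = () -> format(date);
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(pool.submit(task));
            }
            int mismatches = 0;
            for (Future<String> result : results) {
                if (!expected.equals(result.get())) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Mismatches: " + countMismatches(8, 1000));
    }
}
```

If you swap SHARED_SAFE for SHARED_UNSAFE in a load test like this, you may or may not see failures on a given run, which is exactly why unit tests tend to miss the problem.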
The same problem exists with Xerces DOM objects, which are also not thread-safe. It was this Xerces DOM object that caused the multithreading issues in the production environment of one of my clients, who had to live with them for years. Read their story here.
Common symptoms of multithreading issues in Java:
- Data corruption: Data corruption issues are frequent and serious. They happen when multiple threads race to change the state of a shared object. When the object state is corrupted, all kinds of odd runtime exceptions are thrown, which seriously impacts the end-user experience. The exceptions thrown when a SimpleDateFormat object is shared include NumberFormatException, ArrayIndexOutOfBoundsException and NullPointerException, among others. Because user requests fail intermittently with different exceptions at different times, debugging is difficult. The situation becomes even worse if these exceptions are swallowed, which leaves no trace at all in the log.
- Deadlocks: Deadlocks are rare but serious. They happen because of incorrectly written synchronization blocks or because locks are not acquired in a consistent order. The result is threads that stay blocked, each waiting for a lock held by another, so the affected requests hang without doing any work.
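For the deadlock symptom in particular, the JVM can tell you which threads are stuck. The sketch below is one way to check from inside the application, using the standard java.lang.management.ThreadMXBean API. The class name DeadlockCheck, the thread names, and the demo that deliberately takes two locks in opposite order are all made up for illustration. It also assumes a JVM that supports monitoring of ownable synchronizers, which findDeadlockedThreads() requires.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockCheck {

    // Returns one line per deadlocked thread, or an empty string if none are found.
    static String report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is detected
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            sb.append(info.getThreadName())
              .append(" is waiting for ").append(info.getLockName())
              .append(" held by ").append(info.getLockOwnerName())
              .append('\n');
        }
        return sb.toString();
    }

    // Polls report() until a deadlock shows up or the timeout expires.
    static String waitForReport(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        String found = report();
        while (found.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            found = report();
        }
        return found;
    }

    // Demo only: two daemon threads take the same two locks in opposite order.
    static void createDeadlock() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startDaemon("deadlock-demo-1", lockA, lockB, bothHoldFirstLock);
        startDaemon("deadlock-demo-2", lockB, lockA, bothHoldFirstLock);
    }

    private static void startDaemon(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this lock
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    public static void main(String[] args) {
        createDeadlock();
        System.out.print(waitForReport(5000));
    }
}
```

In a real application you might call report() from a scheduled health check and log the result. Keep in mind this only covers deadlocks; it does nothing for the data corruption symptom.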
But how do we know if our production software has concurrency or multithreading bugs? Take a look at the user-reported issues in your bug database and look for phrases like:
- The application occasionally does not respond
- The application outputs, rarely, an error screen
- The system sometimes behaves in an erratic manner
- I could not reproduce, but it just now glitched out
If you find one or more cases with phrases similar to the examples above, then your application likely has concurrency or related multithreading bugs.
Your Options:
There are actually not many tools available to detect and root-cause multithreading issues. The reason is simple: the issues and their symptoms are intermittent and sporadic because the underlying system schedules concurrent threads non-deterministically. Debugging multithreading issues is very difficult even in a development environment, and detecting and debugging them in production is more difficult and painful still.
- jVisualVM: The good news is that jVisualVM ships in the bin directory of older JDK installations (from JDK 9 onward it is a separate download, VisualVM). With jVisualVM, you can connect to local and remote JVM instances, analyze threads, and take thread dumps. The problem is that you have to reproduce the defect to debug it effectively and find the root cause. Because multithreading defects are intermittent and sporadic, they are very difficult to reproduce. Even if you are lucky enough to reproduce the defect and take thread dumps, analyzing those dumps to find the root cause still rests on your shoulders, because jVisualVM is only a debugging tool. Hence detecting multithreading issues with jVisualVM is time-consuming and still painful.
- Seagence: Seagence is a Realtime Defect Monitoring Platform that proactively detects the hard-to-find and hard-to-fix defects missed by other approaches, including concurrency and multithreading defects. The good news is that Seagence detects these defects even before end users report them and provides the root cause, eliminating the need for manual debugging. So there is no need to reproduce the defect. It also plugs into any production Java application using a tiny Java agent and starts monitoring. The Seagence agent is compatible with the agents used by major APM and Observability vendors like Datadog and New Relic.
Seagence brings a new approach to production monitoring. Using its unique ExecutionPath Technology and machine learning, Seagence detects every defect as it occurs and sends you an alert. How does Seagence do this? Seagence’s ExecutionPath Technology differentiates successfully executed requests from failed requests, and machine learning helps separate them into different groups or clusters.
What separates Seagence from other tools is that it detects not only defects that surface as HTTP 500s or internal server errors but also defects caused by any type of exception, whether swallowed or caught, even when the response is an HTTP 200. With the defect and root cause provided by Seagence in hand, you fix your broken code without needing any debugging.
Also Read: Difference between Error Monitoring and Defect Monitoring
Conclusion:
No software is bug-free. No matter how much testing you do, defects still sneak into production and create trouble for end users. Your best bet for improving your end-user experience is a proper production monitoring tool that proactively detects all defects on its own, with root cause, and that can also help you debug your production application when you want to find the root cause of a defect. Seagence is the only tool built to find and help you fix difficult issues like concurrency and multithreading defects in production. You can start using Seagence for free here.