Introduction:
Identifying defects and tracing them to their root cause is one of the most important yet painful tasks in software engineering, and it is essential to maintaining good-quality software. Yet engineers still rely on old-fashioned troubleshooting techniques such as poring over tons of log files, gathering steps to reproduce the defect, and troubleshooting in customer environments, all of which take huge amounts of time and increase MTTR (Mean Time To Repair).
To improve MTTR, software developers use application monitoring tools. A number of application monitoring tools are available in this space, and different tools take different approaches to monitoring. Based on the approach, these tools are broadly categorized into Error Monitoring, Defect Monitoring, Log Monitoring, Production Debugging and APM tools. Discussing all of these approaches would take multiple blog posts, so this post focuses on Error Monitoring vs Defect Monitoring.
What is Error Monitoring?
Error monitoring (often also called error tracking or error tracing) is a system that reports and tracks uncaught exceptions and errors in your application to help ensure that your software runs seamlessly at all times. Error monitoring tools report these events (see below for what an event is) in real time, aggregate and index them for easy searching, and present them in a dashboard when developers want to troubleshoot a defect. These tools also group similar events, which eliminates a lot of clutter and noise.
What is an Event: Every monitoring tool reports application runtime data to its own servers and calls these data items events. Different tools collect different types of data, so the definition of an event also varies from tool to tool. For Error Monitoring tools, an event is a reported occurrence of an exception or error. For Defect Monitoring tools, an event is a reported invocation of an endpoint along with all exceptions and errors thrown during that invocation.
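The two definitions of an event can be pictured as two data shapes. The following is only a minimal Python sketch: the class names, the fields and the `handling` labels are assumptions made for illustration and do not describe any particular tool's wire format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExceptionRecord:
    """One thrown exception or error (hypothetical shape)."""
    type_name: str
    message: str
    handling: str  # "uncaught", "caught" or "swallowed"


@dataclass
class ErrorEvent:
    """Error Monitoring: one event per reported exception or error."""
    exception: ExceptionRecord


@dataclass
class InvocationEvent:
    """Defect Monitoring: one event per endpoint invocation."""
    endpoint: str
    status_code: int
    exceptions: List[ExceptionRecord] = field(default_factory=list)


# One request that swallows one exception and then fails on another.
records = [
    ExceptionRecord("KeyError", "'user_id'", "swallowed"),
    ExceptionRecord("ValueError", "bad total", "uncaught"),
]

# Error Monitoring view: only the uncaught exception becomes an event.
error_events = [ErrorEvent(r) for r in records if r.handling == "uncaught"]

# Defect Monitoring view: a single event carrying every exception.
invocation_event = InvocationEvent("/checkout", 500, records)
```

In this sketch the same request yields one error event but one invocation event that holds both exceptions, which is the difference the two definitions describe.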
If you want to report anything other than an uncaught exception or error, for example an exception that you caught or some context you want to add to an event, these tools provide extensive APIs and SDKs so that you can write code to report the data you want. However, these API/SDK libraries become a dependency of your application and must be present when you build it.
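To make the code change concrete, here is a rough Python sketch of reporting a caught exception by hand. `ErrorReporter` and `capture_exception` are hypothetical stand-ins defined inside the example; they are not the API of any real SDK, and a real client would send the data over the network rather than keep it in a list.

```python
from typing import Dict, List, Optional


class ErrorReporter:
    """Hypothetical stand-in for a vendor SDK client; not a real library."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, object]] = []

    def capture_exception(self, exc: BaseException,
                          context: Optional[Dict[str, str]] = None) -> None:
        # A real SDK would transmit this; the sketch only stores it.
        self.sent.append({
            "type": type(exc).__name__,
            "message": str(exc),
            "context": dict(context or {}),
        })


reporter = ErrorReporter()


def parse_quantity(raw: str) -> int:
    """Fall back to 1 on bad input, but report the caught exception."""
    try:
        return int(raw)
    except ValueError as exc:
        # Without this explicit call the caught exception is never reported.
        reporter.capture_exception(exc, context={"raw": raw})
        return 1


quantity = parse_quantity("three")
```

The application code now imports and calls the reporting library directly, which is the dependency the paragraph above refers to.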
They also alert on suspicious or abnormal behavior, using knowledge supplied by engineers in the form of alert rules, thresholds and suspicious-exception configurations. A broken rule or a breached threshold signals a likely defect in the application code, so engineers can jump into action immediately.
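A threshold rule of this kind might look roughly like the Python sketch below. The rule format (a per-exception-type count limit within one monitoring window) and the function name `breached_rules` are assumptions for illustration; real tools offer their own rule languages.

```python
from collections import Counter
from typing import Dict, Iterable, List


def breached_rules(exception_types: Iterable[str],
                   thresholds: Dict[str, int]) -> List[str]:
    """Return the exception types whose count in this window exceeds the
    threshold an engineer configured for them."""
    counts = Counter(exception_types)
    return sorted(
        name for name, limit in thresholds.items() if counts[name] > limit
    )


# Exception types seen in one monitoring window (made-up data).
window = ["TimeoutError"] * 6 + ["KeyError"] * 2

# Engineer-supplied rules: alert when a type occurs more than N times.
rules = {"TimeoutError": 5, "KeyError": 10}

alerts = breached_rules(window, rules)
```

Note that the alert is only as good as the rules: an exception type nobody thought to configure never triggers anything.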
What is Defect Monitoring?
Defect Monitoring is a system that detects defects and their root cause in real time to help ensure that your software runs seamlessly at all times. Defect Monitoring tools record and report every invocation of every endpoint in your application. They also record all thrown exceptions and errors (caught, uncaught and swallowed) and the context you need for every invocation. The reported events are aggregated, indexed and presented in a dashboard when needed. These tools group the events of an endpoint based on execution similarity.
These tools use instrumentation and runtime agents to collect the data they need for analysis. Hence they don’t add a dependency to your application and don’t require any code changes. Most importantly, they automate root cause analysis: they find defects together with their root cause in real time and raise alerts. They segregate all invocations of an endpoint, grouping successful invocations into success groups and failed invocations into defect groups, and present the groups in the dashboard. It doesn’t matter whether the exception at the root of the defect is caught, uncaught or swallowed. And they do so without needing any input or knowledge from developers, such as alert rules, thresholds or suspicious-exception configurations.
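The grouping idea can be illustrated with a deliberately naive Python sketch. This post does not describe the actual algorithm, so two things here are assumptions: "execution similarity" is approximated by the ordered tuple of exception types thrown during an invocation, and the `failed` flag is taken as already known, whereas a real tool would have to infer it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Invocation:
    endpoint: str
    thrown: Tuple[str, ...]  # exception types in the order they were thrown
    failed: bool             # assumed to be known; see the note above


def group_by_execution(invocations: List[Invocation]):
    """Group invocations of each endpoint by their exception signature and
    split the groups into success groups and defect groups."""
    groups: Dict[Tuple[str, Tuple[str, ...]], List[Invocation]] = defaultdict(list)
    for inv in invocations:
        groups[(inv.endpoint, inv.thrown)].append(inv)

    success, defect = {}, {}
    for key, members in groups.items():
        # A group counts as a defect group if any member failed.
        target = defect if any(m.failed for m in members) else success
        target[key] = len(members)
    return success, defect


calls = [
    Invocation("/checkout", (), False),
    Invocation("/checkout", (), False),
    Invocation("/checkout", ("KeyError",), False),  # thrown, then recovered
    Invocation("/checkout", ("KeyError", "ValueError"), True),
]

success_groups, defect_groups = group_by_execution(calls)
```

Grouping by the whole invocation rather than by a single exception is what lets a caught or swallowed exception show up as part of a defect group.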
How do these tools help developers when code breaks?
- Defects due to HTTP 500
Finding a defect caused by an uncaught exception or an HTTP 500 (Internal Server Error) response is easy. Both types of tools find and report these issues efficiently.
- Defects due to swallowed or caught exceptions
What if a defect occurs because of a swallowed exception and the server returns HTTP status 200? Developers will have a nightmare figuring out the root cause, because there will be no trace of that exception in the log or in the error monitoring tool’s dashboard. Defect Monitoring tools, however, record all caught, uncaught and swallowed exceptions, organize them by invocation, automate root cause analysis, and find defects caused by any type of exception or error, even if the server returns HTTP 200.
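A swallowed exception of this kind can be as small as the following Python sketch. The handler `get_price` and the `PRICES` table are hypothetical and exist only to show the pattern.

```python
from typing import Dict, Tuple

PRICES = {"book": 12.0}


def get_price(item: str) -> Tuple[int, Dict[str, float]]:
    """Hypothetical endpoint handler returning (HTTP status, body)."""
    price = 0.0
    try:
        price = PRICES[item]
    except KeyError:
        # Swallowed: not logged, not re-raised, not reported anywhere.
        pass
    # The caller gets a 200 with a wrong price instead of an error.
    return 200, {"price": price}


status, body = get_price("pen")
```

Nothing uncaught ever leaves the handler, so a tool that reports only uncaught exceptions sees a healthy request even though the response is wrong.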
- Exceptions are not junk
It is also important to note that applications throw tons of exceptions and still recover. 99.9% of these exceptions do not cause any issue. Hence the industry’s general opinion has been that recording all thrown exceptions is useless junk. But a defect can occur due to any exception, and developers need all possible debugging details to figure out the root cause. While Error Monitoring tools record only uncaught exceptions by default, Defect Monitoring tools efficiently record all exceptions and errors by default, both for root cause automation and for further debugging if it becomes necessary. So exceptions are not junk; they are quality data needed for ML.
- Dependency and code changes
If you want to report anything other than an uncaught exception or error, Error Monitoring tools provide extensive APIs and SDKs for making the required code changes. But the big question for the developer is which exceptions to report. If developers knew which exceptions would cause problems in production, they would fix them before releasing rather than face surprises. Defect Monitoring tools, by contrast, record and organize all exceptions and errors by invocation, which helps developers if troubleshooting becomes necessary.
The table below lists the differences between Error Monitoring and Defect Monitoring tools:
Criterion | Error Monitoring | Defect Monitoring |
1) Finds defects due to uncaught exceptions | Yes | Yes |
2) Finds defects due to any exception | No | Yes |
3) Requires code changes | Yes | No |
4) Takes input/knowledge from developer | Yes | No |
5) Reports exceptions | Uncaught only (by default) | Caught, Uncaught and Swallowed |
6) Reports errors | Yes | Yes |
7) Integration | Compile-time library | Runtime agent |
8) Grouping | Events grouped based on exception similarity | Events grouped based on execution/invocation similarity |
9) How do they help to troubleshoot? | Come in handy for troubleshooting after an end user reports a defect | Find defects with their root cause as they manifest in production, so you can fix them before an end user reports them |
10) Requires debugging for root cause | Yes | No |
Conclusion:
There are numerous tools for error monitoring and they all do a great job. They aggregate vast amounts of data from multiple sources and index it for efficient searching, which helps developers do root cause analysis. They also find and report abnormal behavior with some input from the developer. But they lack the quality data needed to find defects together with their root cause. Defect monitoring tools, on the other hand, find defects and their root cause and eliminate the need for debugging. Seagence is one such tool; find detailed information about Seagence here.