How to Ctrl-F "Error" in a log and find a $1M defect?


Every software engineer's million-dollar question is how to find that million-dollar defect. In that quest, they sift through gigabytes of logs or use a log management tool to ease the manual reading. Because an application throws large numbers of errors and exceptions (often for valid reasons), the job becomes complicated and tedious. In such a case, Ctrl-F "Error" will return a million occurrences of the word, not the one you are searching for.

One type of exception that doesn't cause any headache is the one that is properly handled and recovered from. This is the most common type an application runs into. The callee returns information about an abnormal event, and the caller takes appropriate action to recover from it. For example, when an application wants to load a configuration, it needs to be informed if that configuration is unavailable so that the caller can take the corrective action of loading a replacement. This is done by, for example, catching the configuration-unavailable exception thrown by the callee. Such conditions are anticipated, the program is written to recover from them, and they don't break the code. Out of 10,000 errors/exceptions thrown, 9,999 will be of this type.
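The configuration example can be sketched in a few lines of Python. This is only an illustration: `ConfigUnavailableError`, `load_config`, `load_config_with_fallback`, and the in-memory `_CONFIG_STORE` are hypothetical names invented here, not part of any real library.

```python
class ConfigUnavailableError(Exception):
    """Raised by the callee when the requested configuration cannot be loaded."""


# Hypothetical in-memory store standing in for files or a config service.
_CONFIG_STORE = {"default": {"timeout_seconds": 30}}


def load_config(name):
    """Callee: report the abnormal event to the caller instead of hiding it."""
    try:
        return _CONFIG_STORE[name]
    except KeyError as exc:
        raise ConfigUnavailableError(f"configuration {name!r} is unavailable") from exc


def load_config_with_fallback(name, fallback="default"):
    """Caller: anticipate the failure and recover by loading a replacement."""
    try:
        return load_config(name)
    except ConfigUnavailableError:
        return load_config(fallback)
```

An exception is still thrown here, and it may well be logged, but the program recovers on its own. This is the kind of entry that inflates the Ctrl-F "Error" count without pointing to a defect.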


Then what types of situations cause code to break? Defects and failures usually occur when an application runs into uncharted waters (an unanticipated condition) and the code is not written to handle that condition properly. Let's take a look at them.

1) Swallowed (Logged or Not logged)

These are silent killers. The callee does not handle the exception and denies the caller a chance to handle it; the exception gets swallowed. Because it signals an abnormal condition and is swallowed, the application never gets to handle the situation, the code breaks, and the customer experience is visibly affected. Such defects come to light when a customer reports the problem.
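A minimal Python sketch of a swallowed exception, next to one way of avoiding it. The function names and the order-total scenario are hypothetical and chosen only for illustration.

```python
import logging

logger = logging.getLogger("orders")


def parse_quantity(raw):
    """Callee: int() raises ValueError on input such as 'abc'."""
    return int(raw)


def order_total_swallowing(raw_quantity, unit_price):
    """Anti-pattern: the exception is swallowed, so the caller never learns of it."""
    try:
        quantity = parse_quantity(raw_quantity)
    except ValueError:
        quantity = 0  # silently continue with a made-up value, nothing is logged
    return quantity * unit_price


def order_total_propagating(raw_quantity, unit_price):
    """Alternative: log with the stack trace, then let the caller decide."""
    try:
        quantity = parse_quantity(raw_quantity)
    except ValueError:
        logger.exception("invalid quantity %r", raw_quantity)
        raise
    return quantity * unit_price
```

In the swallowing version, bad input turns into a total of zero with no log entry at all, so the first sign of trouble is a customer asking why the order is wrong.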

2) Uncaught

These are nasty. If you are lucky, your framework will catch and log the exception and deliver a decent error page. In the absence of a framework, the exception will unwind the call stack and print a nasty stack trace to the user.
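What a framework does on your behalf can be approximated by a last-resort handler at the request boundary. The sketch below assumes a hypothetical `handle_request` wrapper and a hypothetical `broken_handler`; real frameworks differ in the details.

```python
import logging

logger = logging.getLogger("app")


def handle_request(handler, request):
    """Last-resort boundary: log the full stack trace, show the user a plain error page."""
    try:
        return 200, handler(request)
    except Exception:
        # logger.exception records the message together with the stack trace.
        logger.exception("unhandled exception while serving %r", request)
        return 500, "Something went wrong. Please try again later."


def broken_handler(request):
    """Hypothetical handler that hits an unanticipated condition."""
    return request["user"]["name"]  # KeyError when "user" is missing
```

The user sees a generic message, while the stack trace that matters for the fix lands in the log instead of on the screen.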

3) Re-thrown without wrapping the old one

This is dangerous too and, like swallowed exceptions, these are silent killers. The old exception holds vital information that is lost if it is not wrapped in the new exception being thrown. That leaves no trace of the actual error condition in the log and misleads troubleshooting.
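The difference between losing and keeping the original exception can be sketched in Python. Note one language-specific assumption: Python 3 chains exceptions implicitly, so the loss is reproduced here with `from None`; in languages such as Java the cause is dropped whenever it is not passed to the new exception. `PaymentError`, `charge_card`, and the two `pay_*` functions are hypothetical names.

```python
class PaymentError(Exception):
    """Hypothetical application-level exception."""


def charge_card(amount):
    """Low-level callee that fails with the real cause."""
    raise ConnectionError("gateway timed out after 30s")


def pay_losing_cause(amount):
    """Anti-pattern: the new exception replaces the old one and hides its details."""
    try:
        charge_card(amount)
    except ConnectionError:
        raise PaymentError("payment failed") from None


def pay_keeping_cause(amount):
    """Wraps the original so a logged traceback shows both exceptions."""
    try:
        charge_card(amount)
    except ConnectionError as exc:
        raise PaymentError("payment failed") from exc
```

With `pay_keeping_cause`, the logged traceback includes the gateway timeout as the direct cause. With `pay_losing_cause`, the log only says "payment failed", which is exactly the misleading trail described above.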

The application does not know how to recover from these situations; they result in failures and come to light through the end user. A stack trace, if available in the log, is a beautiful thing to someone trying to fix the issue. An error message without a stack trace in the log, or worse yet, silence, is the cloaked harbinger of hours, days, even weeks of hair-pulling debugging, and you are likely cursing somebody (maybe even yourself). These situations are the cause of the million-dollar defect you are searching for, and they need to be corrected by fixing the code.

Below are the events that unfold before the code is fixed.

1) Know that there is a defect in the code. Usually the end user reports it through technical support channels.

2) Get the log and see whether you are lucky enough to find the root-cause stack trace (it is like looking for a needle in a haystack). You have to get the log from production.

3) Get the steps to reproduce. More often than not, tickets get closed with the infamous "could not reproduce". You get the steps from the end user.

4) Get the exact use case and the submitted data that led to the defect. You get these from the end user.

5) Perform root cause analysis (RCA) to find the root cause of the defect.

These steps take huge amounts of time and increase MTTR (Mean Time To Repair).

How to improve MTTR

Modern tools are arriving and disrupting the application monitoring space, and they are an organization's best chance to reduce MTTR. Monitoring can provide 24/7 insight into the health of the system. Using this real-time information, an organization can establish its MTTR, allowing engineers to run preventive maintenance and plan repairs proactively. Take a look at the tools below.

OverOps: Indexes events (errors and exceptions) and allows easy searching, in addition to showing run-time variables and their values.

Rookout: Allows you to place breakpoints in production code and capture a snapshot of run-time variables and their values without pausing the application.

Both of these tools help with RCA.

Seagence: Automates all five steps above. Using ML, Seagence identifies that $1M defect along with its root cause (the exception stack trace, logged or not) in real time as the failure manifests in production, and drops a crisp, clear root cause email in your inbox. The benefits of using Seagence are listed below.

1) No need for the end user to report the failure. Seagence identifies that million-dollar failure and its root cause in real time as it manifests in production.

2) No need to sift through gigabytes of logs, as you already have the defect and its root cause.

3) Steps to reproduce. By continuously monitoring your application, Seagence provides the functions that were invoked in the lead-up to the defect. This is useful when you want to test your fix.

4) Input data. By continuously monitoring your application, Seagence provides the input data (query parameters and form data) submitted by the user that led to the defect. This is useful when you want to test your fix.

5) No need for root cause analysis by an engineer. This task is eliminated, as Seagence finds the root cause for you in real time.

So yes, you can find that million-dollar defect using Seagence without even a button click. Seagence drops the failure and root cause email in your inbox. Interested? Request a demo.