Peculiar production bug

Peculiar production bug

One of the more nuanced bugs I had to grapple with recently

"The art of debugging is figuring out what you really told your program to do rather than what you thought you told it to do." - Andrew Singer

Recently I found myself in a hardcore session of head scratching. Maybe one day when I'm big I will know how to find a bug like this faster... or better, how to pick it up during unit testing or integration testing phases... or better yet, how to avoid a bug like this slipping into the code in the first place. But each solved bug is a learning experience and I've certainly learnt a bit from this incident.

As it happens this bug presented in some legacy code which I have the (mis)fortune of maintaining. But no sooner had I solved the problem and understood what caused it than I had to go back to some of my own projects to fix exactly the same bug.

The application in question is written in Go and the bug pertains to a SQL query against a Microsoft SQL Server 2017 database - not that the version is particularly important.

I might add that we're using the denisenkom/go-mssqldb SQL driver but the issue here is related to the Go database/sql package.

We have a fairly straight forward query like this:

rows, err := db.Query(
    "SELECT ... FROM Receipts WHERE MerchantNumber = @MerchantNumber AND ...",
    sql.Named("MerchantNumber", someStringVariable),
)

I'm leaving out some details on the select list and the WHERE clause of course but the pertinent bits are shown above. The Receipts table has an index on MerchantNumber which is a VARCHAR(20) field. So here is our first mistake; Our Dev and Testing environments have truncated and sanitised versions of the Receipts table. As such our various stages of testing would never have picked up the bug which explains how it slipped through into the production environment.

All the information to identify the bug is already available above, and of course I would now be able to pick it up in an instant. But unfortunately in the real world where I did not have the luxury of hindsight, it took me a fair bit longer.

The problem we were seeing is that the query above started becoming very slow to execute to the point where it timed out for bigger customers. When I issued the same query (or so I thought) in SSMS, it performed with lightning speed as we'd expect. So what was going on?

Whenever a query shows different performance when issued from code vs SSMS, I immediately start thinking of cached execution plans in SQL Server. We started investigating that and sure enough, we discovered that SQL Server was not using the index on the MerchantNumber column when executing the query. This would explain perfectly why the response would be so slow, given the size of the Receipts table. But there was just no logical explanation we could find for that. It certainly wasn't because of a cached execution plan that for some reason was wrong and kept being reused.

Remember I mentioned that I thought I had issued the same query in SSMS? Well, let's get back to that. Here's what I issued in SSMS:

DECLARE @MerchantNumber VARCHAR(20) = '12345'
SELECT ... FROM Receipts WHERE MerchantNumber = @MerchantNumber AND ...

And it executed in milliseconds. We looked at the execution plan and, sure enough, the index was being utilised. So we thought of looking a bit more closely at the received query that SQL Server executes when issued from code and, after stripping away all of the unimportant bits, this is what remained:

(@MerchantNumber nvarchar(5))SELECT ... FROM Receipts WHERE MerchantNumber = @MerchantNumber AND ...

Wait! What? What is with that nvarchar(5)? And that's when the penny dropped. The database/sql package provides no way to specify the type of a named parameter (or numbered parameter for that matter), at least not that I am aware of. The type is simply inferred from the value that is provided, in this case the string "12345" which is passed on to SQL Server as a nvarchar(5).

If this disparity of types were reversed, as in the table column had a type of NVARCHAR while the parameter passed had a type of VARCHAR, this would not have been a problem. But because of the order of precedence of data types in SQL Server, the lower order VARCHAR column values had to be cast to NVARCHAR before a comparison can be made which renders the query non-SARGable.

But now what? As I mentioned, it doesn't look like the database/sql package provides for a way to specify the type of a parameter. In C#, I normally specify a parameter type as in

SqlCommand.Parameters.Add("@MerchantNumber", SqlDbType.Varchar, 20).Value = someStringVariable;

as opposed to

SqlCommand.Parameters.AddWithValue("@MerchantNumber", someStringVariable);

The best solution I could find for this bug was to change the SQL query so as to explicitly force the type conversion in the query.

So this:

SELECT ... FROM Receipts WHERE MerchantNumber = @MerchantNumber AND ...

becomes this:

SELECT ... FROM Receipts WHERE MerchantNumber = CAST(@MerchantNumber AS VARCHAR) AND ...

And bingo! Problem solved. I'll be honest though that I'm not a great fan of this solution and I'm a tad disappointed that Go doesn't provide a mechanism for specifying parameter types. I'll have to keep my eye on this one.