I ran a query on our SQL Server 2012 instance that returned no results. This turned out to be incorrect: I SHOULD have gotten 16 records. I changed the query and now get the expected answer, but I am at a loss to understand why my original query did not work.

So my ORIGINAL query which returned no results was:

SELECT
    WPB.[ID number]
FROM
    [Fact].[REPORT].[WPB_LIST_OF_IDS] WPB
WHERE
    [ID number] NOT IN (SELECT DISTINCT IdNumber 
                        FROM MasterData.Dimension.Customer DC)

The reworked query is this:

SELECT
    WPB.[ID number]
FROM
    [Fact].[REPORT].[WPB_LIST_OF_IDS] WPB
LEFT JOIN
    MasterData.Dimension.Customer DC ON WPB.[ID number] = DC.IdNumber
WHERE
    DC.IdNumber IS NULL

Can anyone tell me WHY the first query (which incidentally runs in fractions of a second vs the 2nd which takes a minute) does not work? I don't want to repeat this mistake in the future!

  • The second query doesn't work either. If it takes 1 minute, you are missing indexes: one or both of the ID fields aren't indexed. In any case, if you want help with SQL you should provide table schemas, indexes, sample data and desired output. If you want help with performance, you should first check the execution plan. Commented Jul 11, 2018 at 10:35
  • In NOT IN (select distinct ...), the DISTINCT is redundant and can affect performance. Anyway, performance will differ because the queries are logically different. If you don't want to repeat that mistake in the future, then DO NOT USE NOT IN with a subquery. NEVER! Commented Jul 11, 2018 at 10:39
  • BTW, SELECT DISTINCT IdNumber causes an unnecessary DISTINCT operation. You don't care how many 1 are returned, you only care whether there are any or none. The query optimizer will either ignore that DISTINCT or end up performing a useless sort/distinct operation. If any IdNumber entry in a Dimension table is NULL, you have a very serious problem. Dimensions shouldn't have nulls; they should have explicit records for Missing, NotApplicable, NotFound values. Again, without schema and data people can only guess. Commented Jul 11, 2018 at 10:46

1 Answer

Don't use NOT IN with a subquery. It doesn't work the way you expect with NULL values. If any value returned by the subquery is NULL, then no rows are returned at all.
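
To see the effect in isolation, here is a minimal sketch using two table variables as hypothetical stand-ins for the tables in the question. The column types and sample values are assumptions, since the real schema was not posted.

```sql
-- Hypothetical stand-ins for the question's tables (types and data are assumed).
DECLARE @wpb TABLE ([ID number] varchar(20) NOT NULL);
DECLARE @customer TABLE (IdNumber varchar(20) NULL);

INSERT INTO @wpb VALUES ('A'), ('B');
INSERT INTO @customer VALUES ('A'), (NULL);

-- Returns no rows: for 'B', the comparison against the NULL entry is UNKNOWN,
-- so the NOT IN predicate never evaluates to TRUE.
SELECT w.[ID number]
FROM @wpb w
WHERE w.[ID number] NOT IN (SELECT c.IdNumber FROM @customer c);

-- Returns 'B': with the NULL filtered out of the subquery, NOT IN behaves as expected.
SELECT w.[ID number]
FROM @wpb w
WHERE w.[ID number] NOT IN (SELECT c.IdNumber
                            FROM @customer c
                            WHERE c.IdNumber IS NOT NULL);
```

Filtering with IS NOT NULL is only shown to confirm the cause; it is easy to forget, which is why NOT EXISTS is the safer habit.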

Instead, use NOT EXISTS. This has the semantics that you expect:

select wpb.[ID number]
from [Fact].[REPORT].[WPB_LIST_OF_IDS] wpb
where not exists (select 1
                  from MasterData.Dimension.Customer dc
                  where wpb.[ID number] = dc.IdNumber
                 );

Of course, the left join method also works.
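
As a quick check that both forms tolerate a NULL in the lookup column, the following sketch runs them against hypothetical table variables. The names, types and sample rows are assumptions, not the real tables.

```sql
-- Hypothetical sample data: one matching ID, one missing ID, and a NULL in the lookup table.
DECLARE @wpb TABLE ([ID number] varchar(20) NOT NULL);
DECLARE @customer TABLE (IdNumber varchar(20) NULL);

INSERT INTO @wpb VALUES ('A'), ('B');
INSERT INTO @customer VALUES ('A'), (NULL);

-- NOT EXISTS: returns 'B', because no customer row satisfies the equality for 'B'.
SELECT w.[ID number]
FROM @wpb w
WHERE NOT EXISTS (SELECT 1
                  FROM @customer c
                  WHERE w.[ID number] = c.IdNumber);

-- LEFT JOIN ... IS NULL: also returns 'B'; the NULL customer row never joins to anything.
SELECT w.[ID number]
FROM @wpb w
LEFT JOIN @customer c ON w.[ID number] = c.IdNumber
WHERE c.IdNumber IS NULL;
```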

2 Comments

Gordon is always a few seconds ahead with his answers. Just to explain the NOT IN: it is converted to AND statements, so column NOT IN (value1, value2, NULL) is translated into (column <> value1 AND column <> value2 AND column <> NULL), so you'll get TRUE AND TRUE AND FALSE, as column <> NULL is always false.
@DmitrijKultasev: To be very, atrociously, nit-pickingly precise: column <> NULL is always UNKNOWN if ANSI_NULLS is ON (yes, UNKNOWN -- no, I don't know why they didn't just name this NULL). In practice this distinction is irrelevant, because I'm not aware of any context where UNKNOWN is not treated the same way as FALSE. And, of course, T-SQL has no separate Boolean data type that you could use to store and observe the result.
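
The expansion described in these comments can be observed indirectly with CASE, which only takes the WHEN branch when the predicate is TRUE. This is a small sketch with literal values chosen for illustration, assuming the default ANSI_NULLS ON setting.

```sql
SELECT
    -- NOT IN over a list containing NULL: never TRUE for a value that is not in the list.
    CASE WHEN 1 NOT IN (2, 3, NULL) THEN 'TRUE' ELSE 'not TRUE' END AS not_in_result,
    -- The equivalent chain of comparisons: TRUE AND TRUE AND UNKNOWN.
    CASE WHEN 1 <> 2 AND 1 <> 3 AND 1 <> NULL THEN 'TRUE' ELSE 'not TRUE' END AS expanded_result,
    -- A single comparison with NULL, and its negation: neither evaluates to TRUE.
    CASE WHEN 1 <> NULL THEN 'TRUE' ELSE 'not TRUE' END AS comparison,
    CASE WHEN NOT (1 <> NULL) THEN 'TRUE' ELSE 'not TRUE' END AS negated_comparison;
```

All four columns come back as 'not TRUE'. A WHERE clause keeps only rows whose predicate is TRUE, so the NOT IN filter discards every row.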
