Invalid results from SQL Server "NOT IN" clause

Question

I have run a query on our SQL Server 2012 which returned no results. I discovered that this was incorrect and I SHOULD have gotten 16 records. I changed the query and get the answer expected but I am at a loss to understand why my original query did not work as expected.

So my ORIGINAL query which returned no results was:

SELECT
    WPB.[ID number]
FROM
    [Fact].[REPORT].[WPB_LIST_OF_IDS] WPB
WHERE
    [ID number] NOT IN (SELECT DISTINCT IdNumber 
                        FROM MasterData.Dimension.Customer DC)

The reworked query is this:

SELECT
    WPB.[ID number]
FROM
    [Fact].[REPORT].[WPB_LIST_OF_IDS] WPB
LEFT JOIN
    MasterData.Dimension.Customer DC ON WPB.[ID number] = DC.IdNumber
WHERE
    DC.IdNumber IS NULL

Can anyone tell me WHY the first query (which incidentally runs in fractions of a second vs the 2nd which takes a minute) does not work? I don't want to repeat this mistake in the future!

The second query doesn't work either. If it takes 1 minute, it means you are missing indexes: one or both of the ID fields aren't indexed. In any case, if you want help with SQL you should provide table schemas, indexes, sample data and desired output. If you want help with performance you should first check the execution plan.
– Panagiotis Kanavos, Commented Jul 11, 2018 at 10:35
In NOT IN (select distinct ...), the distinct is redundant and can affect performance. Anyway, performance will differ as the queries are logically different. If you don't want to repeat that mistake in the future then DO NOT USE NOT IN with a subquery. NEVER!
– Dmitrij Kultasev, Commented Jul 11, 2018 at 10:39
BTW, SELECT DISTINCT IdNumber causes an unnecessary DISTINCT operation. You don't care how many rows are returned, only whether there are any or none. The query optimizer will either ignore that DISTINCT or end up performing a useless sort/distinct operation. If any IdNumber entry in a Dimension table is NULL you have a very serious problem. Dimensions shouldn't have NULLs; they should have explicit records for Missing, NotApplicable, NotFound values. Again, without schema and data people can only guess.
– Panagiotis Kanavos, Commented Jul 11, 2018 at 10:46

Gordon Linoff · Accepted Answer · 2018-07-11 10:33:55Z

5

Don't use NOT IN with a subquery. It doesn't work the way you expect with NULL values: if any value returned by the subquery is NULL, then no rows are returned at all.
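The behaviour can be reproduced with a small sketch. The table variables and their contents below are hypothetical stand-ins for the two tables in the question, not the real schema, and default session settings are assumed.

```sql
-- Hypothetical stand-ins for WPB_LIST_OF_IDS and Dimension.Customer.
DECLARE @WpbIds   TABLE (IdNumber varchar(20) NULL);
DECLARE @Customer TABLE (IdNumber varchar(20) NULL);

INSERT INTO @WpbIds   (IdNumber) VALUES ('A1'), ('B2'), ('C3');
INSERT INTO @Customer (IdNumber) VALUES ('A1'), (NULL);

-- Returns zero rows. 'A1' is found, so NOT IN is FALSE for it.
-- For 'B2' and 'C3' the comparison against NULL makes the predicate
-- UNKNOWN, and WHERE only keeps rows where the predicate is TRUE.
SELECT w.IdNumber
FROM @WpbIds AS w
WHERE w.IdNumber NOT IN (SELECT c.IdNumber FROM @Customer AS c);

-- Returns 'B2' and 'C3' once the NULL is filtered out of the subquery.
SELECT w.IdNumber
FROM @WpbIds AS w
WHERE w.IdNumber NOT IN (SELECT c.IdNumber
                         FROM @Customer AS c
                         WHERE c.IdNumber IS NOT NULL);
```

The second statement is shown only to isolate the NULL as the cause; it is a workaround, not the recommended rewrite.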

Instead, use NOT EXISTS. This has the semantics that you expect:

select wpb.[ID number]
from [Fact].[REPORT].[WPB_LIST_OF_IDS] wpb
where not exists (select 1
                  from MasterData.Dimension.Customer dc
                  where wpb.[ID number] = dc.IdNumber
                 );

Of course, the left join method also works.
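As a rough check that both rewrites tolerate a NULL in the lookup table, here is a minimal sketch. The table variables and sample values are hypothetical, chosen only for illustration.

```sql
-- Hypothetical stand-ins for the two tables; the customer side contains a NULL.
DECLARE @WpbIds   TABLE (IdNumber varchar(20) NULL);
DECLARE @Customer TABLE (IdNumber varchar(20) NULL);

INSERT INTO @WpbIds   (IdNumber) VALUES ('A1'), ('B2'), ('C3');
INSERT INTO @Customer (IdNumber) VALUES ('A1'), (NULL);

-- NOT EXISTS: the NULL customer row never satisfies the equality,
-- so it simply does not count as a match. Returns 'B2' and 'C3'.
SELECT w.IdNumber
FROM @WpbIds AS w
WHERE NOT EXISTS (SELECT 1
                  FROM @Customer AS c
                  WHERE c.IdNumber = w.IdNumber);

-- LEFT JOIN anti-join: unmatched rows get NULL on the right side.
-- Also returns 'B2' and 'C3'.
SELECT w.IdNumber
FROM @WpbIds AS w
LEFT JOIN @Customer AS c ON c.IdNumber = w.IdNumber
WHERE c.IdNumber IS NULL;
```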


2 Comments

Dmitrij Kultasev Over a year ago

Gordon is always a few seconds ahead with his answers. Just to explain the NOT IN: it is converted to AND statements, so column NOT IN (value1, value2, NULL) will be translated into (column <> value1 AND column <> value2 AND column <> NULL), so you'll get TRUE AND TRUE AND FALSE, as column <> NULL is always false.
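The expansion described in this comment can be observed directly. This is a small sketch with a hypothetical variable @col and literal values picked for illustration, with ANSI_NULLS set explicitly so the comparison against the literal NULL follows the standard rules.

```sql
SET ANSI_NULLS ON;

DECLARE @col int = 3;  -- hypothetical value that is not in the list

-- NOT IN with a list that contains NULL: the predicate is not TRUE,
-- so the ELSE branch is taken and the result is 'filtered out'.
SELECT CASE WHEN @col NOT IN (1, 2, NULL)
            THEN 'kept' ELSE 'filtered out' END AS not_in_result;

-- The equivalent chain of <> comparisons gives the same result,
-- because @col <> NULL does not evaluate to TRUE.
SELECT CASE WHEN @col <> 1 AND @col <> 2 AND @col <> NULL
            THEN 'kept' ELSE 'filtered out' END AS expanded_result;
```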

Jeroen Mostert Over a year ago

@DmitrijKultasev: To be very, atrociously, nit-pickingly precise: column <> NULL is always UNKNOWN if ANSI_NULLS is ON (yes, UNKNOWN -- no, I don't know why they didn't just name this NULL). In practice this distinction is irrelevant, because I'm not aware of any context where UNKNOWN is not treated the same way as FALSE. And, of course, T-SQL has no separate Boolean data type that you could use to store and observe the result.
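One place where the difference can be observed is under NOT: NOT FALSE would be TRUE, whereas NOT UNKNOWN stays UNKNOWN. The sketch below uses a hypothetical variable @col and assumes ANSI_NULLS is ON.

```sql
SET ANSI_NULLS ON;

DECLARE @col int = 3;  -- hypothetical value, any non-NULL int behaves the same

-- If @col <> NULL were FALSE, the second branch would fire.
-- It is UNKNOWN, and NOT UNKNOWN is still UNKNOWN, so neither WHEN
-- is satisfied and the ELSE branch returns 'UNKNOWN'.
SELECT CASE
         WHEN @col <> NULL       THEN 'TRUE'
         WHEN NOT (@col <> NULL) THEN 'FALSE'
         ELSE 'UNKNOWN'
       END AS comparison_result;
```

This is the same mechanism that makes NOT IN return no rows: negating an UNKNOWN membership test never produces TRUE.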
