Select count Distinct based on combination of two columns

Question

I have the below query that gets counts of distinct USERID's from the various tables and sums them to a grand total. I am expecting a total of 35 as the results, however I am only getting 30 as a result from this query. What it appears to be doing is when it finds the same USERID in more than one row in any table, it is counting them only once (It is fine that USERID's appear more than once in a table based on how it was structured).

I would like to get Distinct values based on the combination of USERID and EXAM_DT, as this combination will satisfy the uniqueness I need.

SQL:

SELECT 'TOTAL', '', COUNT (DISTINCT G.USERID) + COUNT (DISTINCT H.USERID) + 
COUNT (DISTINCT J.USERID) + COUNT (DISTINCT M.USERID) + COUNT (DISTINCT 
P.USERID) + COUNT(DISTINCT S.USERID) + COUNT (DISTINCT V.USERID) + COUNT ( 
DISTINCT Y.USERID) 

FROM PS_JOB F INNER JOIN PS_EMPLMT_SRCH_QRY F1 ON (F.USERID = 
F1.USERID AND F.EMPL_RCD = F1.EMPL_RCD ) 
LEFT OUTER JOIN  PS_GHS_HS_ANN_EXAM G ON  F.USERID = G.USERID AND G.EMPL_RCD 
 = F.EMPL_RCD  
LEFT OUTER JOIN  PS_GHS_HS_ANTINEO H ON  F.USERID = H.USERID AND H.EMPL_RCD 
 = F.EMPL_RCD  
LEFT OUTER JOIN  PS_GHS_HS_AUDIO J ON  F.USERID = J.USERID AND J.EMPL_RCD = 
 F.EMPL_RCD  
LEFT OUTER JOIN  PS_GHS_HS_DOT M ON  F.USERID = M.USERID AND M.EMPL_RCD = 
 F.EMPL_RCD  
LEFT OUTER JOIN  PS_GHS_HS_HAZMAT P ON  F.USERID = P.USERID AND P.EMPL_RCD = 
 F.EMPL_RCD  
LEFT OUTER JOIN  PS_GHS_HS_PREPLACE S ON  F.USERID = S.USERID AND S.EMPL_RCD 
 = F.EMPL_RCD  
LEFT OUTER JOIN  PS_GH_RESP_FIT V ON  F.USERID = V.USERID AND V.EMPL_RCD = 
 F.EMPL_RCD  
LEFT OUTER JOIN  PS_GHS_HS_ASBESTOS Y ON  F.USERID = Y.USERID AND Y.USERID = 
 F.USERID

WHERE ( ( F.EFFDT = 
    (SELECT MAX(F_ED.EFFDT) FROM PS_JOB F_ED 
    WHERE F.USERID = F_ED.USERID 
      AND F.EMPL_RCD = F_ED.EMPL_RCD 
      AND F_ED.EFFDT <= SUBSTRING(CONVERT(CHAR,GETDATE(),121), 1, 10)) 
AND F.EFFSEQ = 
    (SELECT MAX(F_ES.EFFSEQ) FROM PS_JOB F_ES 
    WHERE F.USERID = F_ES.USERID 
      AND F.EMPL_RCD = F_ES.EMPL_RCD 
      AND F.EFFDT = F_ES.EFFDT) ))

My results:

 (No column name)      (No column name)      (No column name)
  TOTAL                                       30

Here is an example from one of the tables in the query that contains the USERID 816455 twice, but only counting (in above query) one distinct occurrence of it (when I need the distinct to be based on the combination of USERID and EXAM_DT)

 USERID       USER_RCD       EXAM_DT       EXAM_TYPE_CD       EXPIRE_DT
 001          0              2018-04-17    ANN                2019-04-17
 03           0              2018-04-03    ANN                2019-04-27
 816455       0              2018-03-02    ANN                2018-03-31
 816455       0              2018-03-26    ANN                2018-06-30
 410908       0              2018-03-05    ANN                2019-05-30

I would like to avoid having to use subqueries to do the aggregation on the joins if possible as I need to add the sql to a tool that doesn't support that use. Any help is appreciated!

EDIT:

As LukStorms suggested I tried "Method 1" from his answer as follows:

 SELECT count (distinct concat(G.USERID, G.EXAM_DT)) 
 + count (distinct concat(H.USERID, H.EXAM_DT)) + count (distinct 
concat(J.USERID, J.EXAM_DT)) + count (distinct concat(M.USERID, M.EXAM_DT))
 + count (distinct concat(P.USERID, P.EXAM_DT)) + count (distinct 
concat(S.USERID, S.EXAM_DT)) + count (distinct concat(V.USERID, V.EXAM_DT)) 
 + count (distinct concat(Y.USERID, Y.EXAM_DT))       AS 'Total_Unique'
 FROM PS_JOB F

 LEFT OUTER JOIN  PS_GHS_HS_ANN_EXAM H ON  F.USERID = H.USERID AND 
  H.EMPL_RCD = F.EMPL_RCD
 LEFT OUTER JOIN  PS_GHS_HS_ANTINEO G ON  F.USERID = G.USERID AND G.EMPL_RCD 
  = F.EMPL_RCD
 LEFT OUTER JOIN  PS_GHS_HS_AUDIO J ON  F.USERID = J.USERID AND J.EMPL_RCD = 
  F.EMPL_RCD 
 LEFT OUTER JOIN  PS_GHS_HS_DOT M ON  F.USERID = M.USERID AND M.EMPL_RCD = 
  F.EMPL_RCD 
 LEFT OUTER JOIN  PS_GHS_HS_HAZMAT P ON  F.USERID = P.USERID AND P.EMPL_RCD 
  = F.EMPL_RCD  
 LEFT OUTER JOIN  PS_GHS_HS_PREPLACE S ON  F.USERID = S.USERID AND S 
  .EMPL_RCD = F.EMPL_RCD  
 LEFT OUTER JOIN  PS_GH_RESP_FIT V ON  F.USERID = V.USERID AND V.EMPL_RCD = 
  F.EMPL_RCD  
 LEFT OUTER JOIN  PS_GHS_HS_ASBESTOS Y ON  F.USERID = Y.USERID  

WHERE ( ( F.EFFDT = 
    (SELECT MAX(F_ED.EFFDT) FROM PS_JOB F_ED 
    WHERE F.USERID = F_ED.USERID 
      AND F.EMPL_RCD = F_ED.EMPL_RCD 
      AND F_ED.EFFDT <= SUBSTRING(CONVERT(CHAR,GETDATE(),121), 1, 10)) 
AND F.EFFSEQ = 
    (SELECT MAX(F_ES.EFFSEQ) FROM PS_JOB F_ES 
    WHERE F.USERID = F_ES.USERID 
      AND F.EMPL_RCD = F_ES.EMPL_RCD 
      AND F.EFFDT = F_ES.EFFDT) ))

From the above query I am getting a total count of 42, not 30. I looked at the data without the COUNT aggregation and it appears to retrieving a blank row in the tables, along with the concatenated data.

@D-Shih I provided some sample data above from one of the tables. What exactly are you looking to see? — Nick
– Nick, Commented May 8, 2018 at 12:18
@RyanWilson I added Group By G.EXAM_DT, however this yields 7 'TOTAL' rows each with different values (12,6,7,3,3,1,3), that if you sum them do add to the correct total 35. I wanted it to be all summed on one row though. — Nick
– Nick, Commented May 8, 2018 at 12:26
Do a Select SUM of those 7 results returned, then you have everything you need. — Ryan Wilson
– Ryan Wilson, Commented May 8, 2018 at 12:37

LukStorms · Accepted Answer · 2018-05-09 12:00:48Z

So you want to count distinct based on a combination of USERID and EXAM_DT?

But a count(distinct ...) only allows one field.

So then combine the 2 fields.
You can use concat for that.

Or the alternative. Group em on the date, then sum the totals.

Simplified example snippet:

declare @T table (id int identity(1,1) primary key, userid int, exam_dt datetime);

insert into @T (userid, exam_dt) values
(100, GETDATE()),(200, GETDATE()),(100, GETDATE()-1),(200, GETDATE()+0.001),(NULL,NULL);

select * from @T;

-- Method 1.1
select count(distinct concat(userid,'_',cast(exam_dt as date))) as total_unique from @T where userid is not null;

-- Method 1.2 : Adjustment because of the left joins. When there's no match then the values of the joined table would appear as NULL
select count(distinct nullif(concat(userid,'_',cast(exam_dt as date)),'_')) as total_unique from @T;

-- Method 2
select sum(total) as total_unique 
from(
    select count(distinct t.userid) as total 
    from @T t
    group by cast(t.exam_dt as date)
) q;

Returns 3.

Because userid 100 has 2 records with different dates, therefore counts as 2.
While userid 200 has 2 records with the same date, therefore counts as 1.

Simplified example snippet with joins:

declare @T  table (id int identity(1,1) primary key, userid int, empl_rcd int default 0, exam_dt date);
declare @F1 table (id int identity(1,1) primary key, userid int, empl_rcd int default 0, exam_dt date);
declare @F2 table (id int identity(1,1) primary key, userid int, empl_rcd int default 0, exam_dt date);
insert into @T  (userid, exam_dt) values (100, GETDATE()),(200, GETDATE()),(100, GETDATE()-1),(200, GETDATE()),(300, GETDATE());
insert into @F1 (userid, exam_dt) values (100, GETDATE()),(200, GETDATE()),(200, GETDATE()+1);
insert into @F2 (userid, exam_dt) values (100, GETDATE()),(300, GETDATE()+1),(300, GETDATE()+2);

select (total0 + total1 + total2) as total, q.*
from (
    select 
    count(distinct nullif(concat(t0.userid,'_',t0.exam_dt),'_')) as total0,
    count(distinct nullif(concat(f1.userid,'_',f1.exam_dt),'_')) as total1,
    count(distinct nullif(concat(f2.userid,'_',f2.exam_dt),'_')) as total2
    from @T t0
    left join @F1 f1 on (f1.userid = t0.userid and f1.empl_rcd = t0.empl_rcd)
    left join @F2 f2 on (f2.userid = t0.userid and f2.empl_rcd = t0.empl_rcd)
) q;

Careful with concatenating values to count distinct, make sure that no variations can generate the same value (for example concat(10, 11) with concat(101,1). I always add some weird character in the middle (like a pipe or slash) just to make sure.
@EzLo Oh, that's a good point! On most sql server installations I doubt that would become a problem when using a date. Since when you concat a date it gets printed by default in the ISO format 'YYYY-MM-DD'. But that default could be changed by the systemadmin. Thanks for the insight. I've added a separator in the concat, just to be sure.
I used Method 1 above, and while it works for one field, I noticed that when I do addition with the other fields it's counting Blank ('') rows as well. I'll edit my post above and provide what I'm getting...
@Nick Hmm. Well with a CONCAT, when 1 of the values is NULL then the result won't be NULL. F.e. concat('X',NULL) gives 'X' , but 'X' + NULL gives NULL. Or maybe you need to use the NULLIF function to turn empty strings to NULL. Because NULL's aren't counted, but empty strings are.
@LukStorms I edited my original post above with the new sql, maybe you can see why it might be causing the additional empty strings to be counted. If I do Select count (distinct concat(G.EMPLID, G.EXAM_DT) on the individual tables seperately, the total's are correct, but when left outer joined to PS_JOB is when I get the additional counts.

Collectives™ on Stack Overflow

Select count Distinct based on combination of two columns

1 Answer 1

10 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Related