How do I return identical values between two string aggregated columns in SQL?

Question

I have 2 columns A & B in a SQL Server table that contain a string aggregated list of codes.

The codes in the lists are separated by a semicolon.

The string aggregated lists can take any length and the codes can be in any order.

What I would like to do is to create a new column C which contains the codes that appear somewhere in the same row of the lists of Column A and Column B.

Example:

Column_A	Column_B	Column_C
a;b;c;d;e	c;a;e;i;k	c;a;e
d;e;f;g	e;h;i;j;d	e;d

The example above returns "c", "a" and "e" for the first row of column C, because these codes are present in the same row of both Column A and Column B.

The same for the second row, here "e" and "d" are overlapping in Column A and B and thus returned in Column C.

I have tried something that works, but it does not seem like the best solution in terms of efficiency and performance. Especially because I have many (1m+) rows to check this for and the length of code lists to compare can be very long.

SELECT 
    STRING_AGG(CAST([value] AS NVARCHAR(MAX)),'; ') AS Overlapping_Code 
FROM
    (SELECT a.value 
     FROM MyTable t 
     CROSS APPLY STRING_SPLIT(t.Column_A, ';') a

     INTERSECT

     SELECT b.value 
     FROM MyTable t 
     CROSS APPLY STRING_SPLIT(t.Column_B, ';') b 
) ab

I am looking for a better solution in terms of performance and elegance to compare the string aggregated lists for two columns across (many) rows.

Why are you storing your data in a denormalised format would be my question. Why not use a normalised design? Then getting values in one set that appear in another is trivial. "I have tried something that works, but it does not seem like the best solution in terms of efficiency and performance. " That, likely, is more the fault of your design than anything. — Thom A
– Thom A ♦, Commented Sep 4, 2023 at 13:45
The table that I am working with is a temporary table which is a result of a long query. I had many rows before aggregation, each row containing one code. I thought it might be easier to flatten the data to one row with all codes combined in one string and then compare row by row if the codes are returning somewhere. I am not sure I understand you, can you give an example of what a normalised design would look like? — Yosh07
– Yosh07, Commented Sep 4, 2023 at 13:49
A normalised design would have 3 tables, 1 with all the value for A, another with all the values for B (both with respective IDs) and then a 3rd with the IDs that are linked. — Thom A
– Thom A ♦, Commented Sep 4, 2023 at 13:51
Something like this perhaps: SELECT column_a, column_b, string_agg(value, ';') within GROUP(ORDER BY value) FROM ( SELECT column_a, column_b, y.value FROM ( VALUES (N'a;b;c;d;e', N'c;a;e;i;k', N'c;a;e') , (N'd;e;f;g', N'e;h;i;j;d', N'e;d') ) t (Column_A,Column_B,Column_C) CROSS apply string_split(column_a + ';' + column_b, ';') y GROUP BY y.value, column_a, column_b HAVING count(*) > 1 ) x GROUP BY column_a, column_b — siggemannen
– siggemannen, Commented Sep 4, 2023 at 13:53

Thom A · Accepted Answer · 2023-09-04 14:33:05Z

One method would be to split your strings, check if the value is in the other column's dataset and then reaggregate. This differs from your attempt as this doesn't reference the same table twice, just once. This isn't ideal and the order will be arbitrary, as you are on SQL Server 2019 (not 2022):

SELECT *
FROM (VALUES('a;b;c;d;e','c;a;e;i;k','c;a;e'),
            ('d;e;f;g','e;h;i;j;d','e;d'))V(A,B,C)
     CROSS APPLY (SELECT STRING_AGG(BB.value,';') AS C
                  FROM STRING_SPLIT(V.B,';') BB
                  WHERE EXISTS (SELECT 1
                                FROM STRING_SPLIT(V.a,';') AA
                                WHERE BB.value = AA.value))E;

Ideally, as I mentioned, you should be treating your data in a normalised form. I normalise your data using the sample you've given us, and then you can use an EXISTS to get the values that are in both sets:

SELECT *
INTO #YourBadData
FROM (VALUES(1,'a;b;c;d;e','c;a;e;i;k','c;a;e'),
            (2,'d;e;f;g','e;h;i;j;d','e;d'))V(ID,A,B,C);
GO
--Create normalised tables
CREATE TABLE dbo.TableA (ID int, Code char(1));
CREATE TABLE dbo.TableB (ID int, Code char(1));
CREATE TABLE dbo.RelatedIDs (A int, B int);
GO
INSERT INTO dbo.TableA (ID, Code)
SELECT YBD.ID,
       SS.Value
FROM #YourBadData YBD
     CROSS APPLY STRING_SPLIT(YBD.A,';') SS;
GO
INSERT INTO dbo.TableB (ID, Code)
SELECT YBD.ID,
       SS.Value
FROM #YourBadData YBD
     CROSS APPLY STRING_SPLIT(YBD.B,';') SS;

GO
INSERT INTO dbo.RelatedIDs (A,B) --This is overly simplified, as I assume A and B would likely
                                 --have different IDs and come from different sources
SELECT ID, ID
FROM dbo.#YourBadData; 
GO

SELECT B.ID,
       B.Code
FROM dbo.TableB B
WHERE EXISTS (SELECT 1
              FROM dbo.RelatedIDs RI
                   JOIN dbo.TableA A ON RI.A = A.ID
              WHERE RI.B = B.ID
                AND A.Code = B.Code);


GO
--Clean up
DROP TABLE #YourBadData;
DROP TABLE dbo.TableA;
DROP TABLE dbo.TableB;
DROP TABLE RelatedIDs

db<>fiddle

This confirms that I overcomplicated matters by doing a string aggregation of the codes in the first place. As your first step is actually bringing it back to my original data format. Thanks for taking the time to share.

Yitzhak Khabinsky · Accepted Answer · 2023-09-04 14:39:41Z

Please try the following solution based on the set operator INTERSECT

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, colA VARCHAR(20), colB VARCHAR(20));
INSERT @tbl (colA, colB) VALUES
('a;b;c;d;e', 'c;a;e;i;k'),
('d;e;f;g', 'e;h;i;j;d');
-- DDL and sample data population, end

SELECT t.*,
    (
        SELECT STRING_AGG(value, ';') WITHIN GROUP (ORDER BY value) AS common_codes
        FROM (
            SELECT value
            FROM STRING_SPLIT(ColA, ';')
            INTERSECT
            SELECT value
            FROM STRING_SPLIT(ColB, ';')
        ) AS common_codes
    ) AS ColC
FROM @tbl as t;

Output

ID	colA	colB	ColC
1	a;b;c;d;e	c;a;e;i;k	a;c;e
2	d;e;f;g	e;h;i;j;d	d;e

nvogel · Accepted Answer · 2023-09-04 20:50:55Z

Here is another possible solution:

CREATE TABLE MyTable (a VARCHAR(100) NOT NULL, b VARCHAR(100) NOT NULL /* PRIMARY KEY ?? */);

INSERT INTO MyTable VALUES ('a;b;c;d;e','c;a;e;i;k'),('d;e;f;g','e;h;i;j;d');

SELECT *,
    (SELECT STRING_AGG(x.value,';')
    FROM STRING_SPLIT(a,';') AS x
    , STRING_SPLIT(b,';') AS z
    WHERE x.value=z.value)
FROM mytable AS t;



a          b          c          
---------- ---------- -------
a;b;c;d;e  c;a;e;i;k  c;a;e
d;e;f;g    e;h;i;j;d  e;d

geek45 · Accepted Answer · 2023-09-04 14:05:21Z

-1

Try to use CHARINDEX to check which characters are in A and B.

SELECT
    Column_A,
    Column_B,
    (
        SELECT
            STRING_AGG(code, ';') WITHIN GROUP (ORDER BY code) AS common_codes
        FROM (
            SELECT DISTINCT value AS code
            FROM STRING_SPLIT(Column_A, ';')
            WHERE CHARINDEX(';' + value + ';', ';' + Column_B + ';') > 0
        ) AS common_codes
    ) AS Column_C
FROM yourdata;

answered Sep 4, 2023 at 14:05

geek45

7234 silver badges16 bronze badges

Collectives™ on Stack Overflow

How do I return identical values between two string aggregated columns in SQL?

4 Answers 4

1 Comment

Comments

Comments

Comments

Hot Network Questions