0

I have a set of tables aimed at representing a profile storage system. I have created a view that should display all the important parts of this profile, that being things like the name, email, number of followers and the list of activities they are a part of.

However, I've noticed that the values in the activities cell are duplicating. The same thing appears multiple times when each activity name should only appear once. In the below table the Nulls at the bottom are intentional.

Weight Height Activities TotalFollowing TotalFollowers
68 170 Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Hunting, Hunting, Hunting, Hunting, Hunting, Hunting, Hunting, Hunting 4 2
63 179 Horse Riding, Horse Riding, Horse Riding, Hiking, Hiking, Hiking, Swimming, Swimming, Swimming, Fishing, Fishing, Fishing 1 3
72 130 NULL 3 1
NULL NULL NULL 1 2

There is a user that is identified by it's MemberID, a link table called User_Activity which only contains MemberID and ActivityID, the latter of which points to the activity table that holds the name for an activity under a certain ID.

ActivityID ActivityName
1 Horse Riding
2 Hiking
3 Bird Watching
4 Backpacking
5 Swimming
6 Fishing
7 Hunting

And the user activity tables holds these.

MemberID ActivityID
1 1
1 3
1 7
2 1
2 2
2 5
2 6

The SQL for this view is as follows.
I assume the issue has come from the fact that the user follows table and the references I use there and up pointing back to the user, so it counts their activities again and adds them to the column depending on how many followers they have, since these two example records repeat a different number of times.

CREATE VIEW [View_Profile_Information] AS
SELECT U.MemberID, 
U.FirstName, 
U.LastName, 
L.Town, 
L.County, 
L.Country, 
U.Email, 
U.AboutMe, 
U.DistanceUnit, 
U.ActivityTimePreference, 
U.Weight,        
U.Height, 
STRING_AGG(A.ActivityName, ', ') AS Activities, 
COUNT(DISTINCT UF.Follows_MemberID) AS TotalFollowing, 
COUNT(DISTINCT UF2.This_MemberID) AS TotalFollowers
FROM CW1.[User] U

LEFT JOIN CW1.[Location] L
ON U.LocationID = L.LocationID

LEFT JOIN CW1.[User_Activity] UA
ON U.MemberID = UA.MemberID    
              
LEFT JOIN CW1.[Activity] A
ON UA.ActivityID = A.ActivityID  

LEFT JOIN CW1.[User_Follows] UF 
ON U.MemberID = UF.This_MemberID

LEFT JOIN CW1.[User_Follows] UF2
ON U.MemberID = UF2.Follows_MemberID

GROUP BY U.MemberID, U.FirstName, LastName, Town, County, Country, Email, AboutMe, DistanceUnit, ActivityTimePreference, Weight, 
       Height
3
  • 2
    Please do not upload images of code/data/errors when asking a question. If you want to supply sample data, please do so as DDL and DML statements, or at least as a markdown table. Images are not useful or helpful. Commented Nov 14, 2023 at 14:21
  • Also, on a seperate note, Whitespace and Linebreaks are paramount to making readable text; not just in code. Please get into the habit of making good use of both. Poor/bad formatting is not going to help you or others when you need to be able to quickly read and understand your code. Using indentation and line breaks really helps easily distinguish specific code blocks and sections, and makes finding errors far easier when a line only contains 10's of characters, rather than 100's. Commented Nov 14, 2023 at 14:24
  • You are essntially just making a giant cross-join between all the tables then trying to distinct it afterwards. You need to preaggregate instead. Commented Nov 14, 2023 at 14:25

1 Answer 1

1

What you have is a query that contains multiple independent one-to-many joins. This is near guaranteed to cause data duplication in the results. Using DISTINCT or GROUP BY to try to resolve the duplicates is rarely the correct solution for unexpected duplicate data.

What you need to do is to moved each relationship branch into separate subqueries. In this case those subqueries can either be included directly in the select list or wrapped up in a CROSS APPLY.

Something like:

CREATE VIEW [View_Profile_Information] AS
SELECT
    U.MemberID, U.FirstName, U.LastName, 
    L.Town, L.County, L.Country, 
    U.Email, U.AboutMe, U.DistanceUnit,
    U.ActivityTimePreference, U.Weight, U.Height,
    (
        SELECT STRING_AGG(A.ActivityName, ', ')
        FROM CW1.[User_Activity] UA
        JOIN CW1.[Activity] A
           ON UA.ActivityID = A.ActivityID
        WHERE UA.MemberID = U.MemberID
    ) AS Activities,
    (
        SELECT COUNT(*)
        FROM CW1.[User_Follows] UF 
        WHERE UF.This_MemberID = U.MemberID
    ) AS TotalFollowing, 
    (
        SELECT COUNT(*)
        FROM CW1.[User_Follows] UF2
        WHERE UF2.Follows_MemberID = U.MemberID
    ) AS TotalFollowers
FROM CW1.[User] U
LEFT JOIN CW1.[Location] L
    ON U.LocationID = L.LocationID

You can also use CROSS APPLY to calculate the aggregated values separately down in the FROM section of the query, so that the final select list is cleaner and more readable.

CREATE VIEW View_Profile_Information AS
SELECT
    U.MemberID, U.FirstName, U.LastName, 
    L.Town, L.County, L.Country, 
    U.Email, U.AboutMe, U.DistanceUnit,
    U.ActivityTimePreference, U.Weight, U.Height,
    A.Activities, TFG.TotalFollowing, TFW.TotalFollowers
FROM CW1.User U
LEFT JOIN CW1.Location L
    ON U.LocationID = L.LocationID
CROSS APPLY (
    SELECT STRING_AGG(A.ActivityName, ', ') AS Activities
    FROM CW1.User_Activity UA
    JOIN CW1.Activity A
       ON UA.ActivityID = A.ActivityID
    WHERE UA.MemberID = U.MemberID
) AS A
CROSS APPLY (
    SELECT COUNT(*) AS TotalFollowing
    FROM CW1.User_Follows UF 
    WHERE UF.This_MemberID = U.MemberID
) AS TFG
CROSS APPLY (
    SELECT COUNT(*) AS TotalFollowers
    FROM CW1.User_Follows UF2
    WHERE UF2.Follows_MemberID = U.MemberID
) AS TFW

Note that I have removed the DISTINCTs and GROUP BY, and COUNT(*) is now just counting rows.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you ever so much for this! I think I understand what's going on here, never thought about using another select statement within the main one. I felt like I might have had something iffy going on since in my head I pictured the joins to be a constantly growing tree, couldn't for the life of me figure out what I could do instead. So in general then, joins should be avoided whenever there is a one to many, and select statements are preferred since they make it one to one? I will bear this in mind moving forward. Thanks a lot!
Multiple joins are okay when, as long as they build linearly on one another, such as Customer (1)=>(many) Order and Order (1)=>(many) OrderDetail. The trouble occurs when you have multiple independent one-to-many branches, such as Customers (1)=>(many) Orders and Customers (1)=>(many) Payments.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.