2
\$\begingroup\$

Would like feedback on my SQL (1) efficiency and (2) readability, and if the conditions I featured in my CASE statement could be transferred to a join, etc.

Here's the problem I'm solving:

Say you have a table bst with a column of nodes and column or corresponding parent nodes:

node   Parent
1       2
2       5
3       5
4       3
5       NULL

We want to write SQL such that we label each node as a “leaf”, “inner” or “Root” node, such that for the nodes above we get:

Node    Label
1       Leaf
2       Inner
3       Inner
4       Leaf
5       Root

Here's my solution:

WITH join_table AS 
(
SELECT 
    a.node a_node,
    a.parent a_parent,
    b.node b_node, 
    b.parent b_parent
 FROM
    bst a 
 LEFT JOIN 
    bst b ON a.parent = b.node 
 )

 SELECT 
    a_node as Node, 
    CASE 
        WHEN b_node IS NOT NULL and b_parent IS NOT NULL THEN 'Leaf'
        WHEN b_node IS NOT NULL and b_parent IS NULL THEN 'Inner'
        WHEN b_node IS NULL and b_parent IS NULL THEN 'Root'
        ELSE 'Error' 
    END AS Label 
 FROM 
    join_table 
\$\endgroup\$
5
  • 1
    \$\begingroup\$ What dialect of SQL are you targeting? (Apart from that, this is a really well-asked SQL question, so well done!) \$\endgroup\$ Commented Jan 30, 2019 at 16:58
  • \$\begingroup\$ @TobySpeight Let's just say Hive, but Postgres or MySQL would work too -- more focused on the underlying logic \$\endgroup\$ Commented Jan 30, 2019 at 17:14
  • 1
    \$\begingroup\$ It is possible to rewrite your query without CASE, however, it will not help the performance since you are reading the whole tables anyway. The sequential scan is inevitable in this case. \$\endgroup\$ Commented Jan 30, 2019 at 19:17
  • 1
    \$\begingroup\$ @RadimBača Feel free to post such a response \$\endgroup\$ Commented Jan 31, 2019 at 16:17
  • 1
    \$\begingroup\$ @zthomas.nc it would be a solution using UNION ALL which would lead to three sequential scans instead of one. \$\endgroup\$ Commented Jan 31, 2019 at 18:19

1 Answer 1

1
\$\begingroup\$

This solution will work only on the dataset that you've mentioned. If you apply this code to another dataset it will throw up erroneous results. Here is why:

  1. Condition b_node IS NULL and b_Parent is NULL will always give correct result i.e. 'Root'
  2. The conditions for 'Leaf' and 'Inner' will change depending on the dataset. let's assume 5 is not the parent root, instead it is an inner node and there is a parent node 6 above it and 6 also has a child node 7

So the table becomes

 1. a_node  a_parent b_node b_parent




 1. 1        2       2       5
 2. 2        5       5       6
 3. 3        5       5       6
 4. 4        3       3       5
 5. 5        6       6       NULL
 6. 6        NULL    NULL    NULL
 7. 7        6       6       NULL

here 2 and 3 are not nulls and will be marked as 'leaf' erroneously.

A better solution is this case is to use CASE to identify those nodes in a_node which can never be a parent in a_parent since they are leaf nodes.

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.