7
\$\begingroup\$

The task is to create sub-tables from an original table (like below) that contain the missing data columns (missing data is highlighted in red). The important part of this task is to make the smallest number of output tables based on the combinations of missing data points across different columns for different rows.

Input Table

In the above example the optimal output will create 4 tables such as the ones that are shown below. These tables include common columns that are missing across different examples.

Output Tables

This is the code that works to output these tables. An interesting element which this code does not cover and why it can be improved is that the Example20 row is correctly split across Table2 and Table4 as they had been created previously (and IsSubset returned True). However, had Example20 appeared earlier in the data set, a table would have been created for it with the two columns that it has missing ("G" and "I") and then we would have ended up with 5 output tables as we would also have had tables that covered I and G individually which is not the optimal solution. We are looking for an optimised code that finds optimal solution consistently regardless of the order of the rows.

Sub SeparateGroupedMissingDataToNewWorkbookOptimised()
    Dim ws As Worksheet
    Dim wbNew As Workbook
    Dim wsOutput As Worksheet
    Dim lastRow As Long, lastCol As Long
    Dim currentRow As Long
    Dim col As Integer
    Dim outputRow As Long
    Dim missingKey As String
    Dim dict As Object
    Dim missingDict As Object
    Dim existingKeys As Variant
    Dim headers() As String
    Dim foundSubset As Boolean
    Dim i As Integer, j As Integer
    
    ' Define input sheet
    Set ws = ThisWorkbook.Sheets("Missing Data")
    
    ' Create a new workbook for output
    Set wbNew = Workbooks.Add
    
    ' Initialise dictionaries to keep track of missing data
    Set dict = CreateObject("Scripting.Dictionary")
    Set missingDict = CreateObject("Scripting.Dictionary")
    
    ' Find the last row and column in the input data
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    lastCol = ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column
    
    ' Loop through each row to identify missing data patterns
    For currentRow = 2 To lastRow
        missingKey = ""
        
        ' Construct a key that represents missing columns
        For col = 3 To lastCol ' Start from 3 to skip "Name" and "Code"
            If ws.Cells(currentRow, col).Value = "" Then
                missingKey = missingKey & ws.Cells(1, col).Value & "|"
            End If
        Next col
        
        ' If there is missing data
        If missingKey <> "" Then
            foundSubset = False
            existingKeys = dict.Keys
            
            ' Check all existing patterns for subset matches
            For i = 0 To dict.Count - 1
                Dim existingKey As String
                existingKey = existingKeys(i)
                
                ' Check if current missingKey is a subset of existingKey or vice versa
                If IsSubset(existingKey, missingKey) Then
                    foundSubset = True
                    ' Output the row to the existing subset sheet
                    Set wsOutput = missingDict(existingKey)
                    outputRow = wsOutput.Cells(wsOutput.Rows.Count, 1).End(xlUp).Row + 1
                    
                    wsOutput.Cells(outputRow, 1).Value = ws.Cells(currentRow, 1).Value ' Name
                    wsOutput.Cells(outputRow, 2).Value = ws.Cells(currentRow, 2).Value ' Code
                    
                    ' Fill in the missing columns as blanks
                    headers = Split(existingKey, "|")
                    For j = 0 To UBound(headers) - 1
                        If headers(j) <> "" Then
                            For col = 3 To lastCol
                                If wsOutput.Cells(1, col).Value = headers(j) Then
                                    wsOutput.Cells(outputRow, col).Value = "" ' Missing column data
                                End If
                            Next col
                        End If
                    Next j
                End If
            Next i
            
            ' If no matching subset found, create a new table for this missing data pattern
            If Not foundSubset Then
                dict.Add missingKey, dict.Count + 1
                
                ' Create a new worksheet for each unique pattern of missing data
                Set wsOutput = wbNew.Sheets.Add(After:=wbNew.Sheets(wbNew.Sheets.Count))
                wsOutput.Name = "Missing_" & dict.Count
                
                ' Add headers to the new sheet
                wsOutput.Cells(1, 1).Value = "Name"
                wsOutput.Cells(1, 2).Value = "Code"
                
                ' Add the column names that are missing
                headers = Split(missingKey, "|")
                For col = 0 To UBound(headers) - 1
                    If headers(col) <> "" Then
                        wsOutput.Cells(1, col + 3).Value = headers(col)
                    End If
                Next col
                
                ' Initialize missing data dictionary
                missingDict.Add missingKey, wsOutput
                
                ' Output the current row to the new sheet
                outputRow = 2
                wsOutput.Cells(outputRow, 1).Value = ws.Cells(currentRow, 1).Value ' Name
                wsOutput.Cells(outputRow, 2).Value = ws.Cells(currentRow, 2).Value ' Code
                
                ' Fill in the missing columns as blanks
                For j = 0 To UBound(headers) - 1
                    If headers(j) <> "" Then
                        wsOutput.Cells(outputRow, j + 3).Value = "" ' Missing column data
                    End If
                Next j
            End If
        End If
    Next currentRow
    
    ' Adjust column widths for better readability in each output sheet
    For Each wsOutput In wbNew.Sheets
        wsOutput.Cells.EntireColumn.AutoFit
    Next wsOutput
    
    MsgBox "Missing data grouped by missing columns in a new workbook!", vbInformation
End Sub

' Function to check if one key is a subset of another
' key1: A string representing a set of missing column names separated by "|"
' key2: Another string representing a different set of missing column names separated by "|"
' Returns: True if key1 is a subset of key2, False otherwise
Function IsSubset(key1 As String, key2 As String) As Boolean
    ' Declare arrays to hold the column names extracted from key1 and key2
    Dim arr1() As String, arr2() As String
    ' Loop variables
    Dim i As Integer, j As Integer
    ' Flag to indicate if a match is found
    Dim found As Boolean
    
    ' Split the keys into arrays of column names using the "|" delimiter
    arr1 = Split(key1, "|") ' Array of missing column names from key1
    arr2 = Split(key2, "|") ' Array of missing column names from key2
    
    ' If key1 has more elements than key2, it cannot be a subset
    If UBound(arr1) > UBound(arr2) Then
        ' key1 should be smaller or equal in length to key2 to be a subset
        IsSubset = False
        Exit Function
    End If
    
    ' Loop through each element in arr1 (representing key1)
    For i = 0 To UBound(arr1) - 1
        ' Skip empty elements caused by trailing or leading delimiters
        If arr1(i) <> "" Then
            found = False ' Reset the found flag for each element in arr1
            ' Loop through each element in arr2 (representing key2)
            For j = 0 To UBound(arr2) - 1
                ' If the current element in arr1 matches an element in arr2
                If arr1(i) = arr2(j) Then
                    found = True ' Set the found flag to True
                    Exit For ' No need to continue searching in arr2 for this element
                End If
            Next j
            ' If no match is found in arr2 for the current element in arr1
            If Not found Then
                IsSubset = False ' key1 is not a subset of key2
                Exit Function ' Exit the function early
            End If
        End If
    Next i
    
    ' If all elements in arr1 are found in arr2, key1 is a subset of key2
    IsSubset = True
End Function

\$\endgroup\$
2
  • 1
    \$\begingroup\$ Thanks you for your post. Good job on a fairly complex problem. I noticed that you are also going to post it on StackOverflow. My solution should be done by tomorrow. \$\endgroup\$ Commented Sep 14, 2024 at 15:06
  • \$\begingroup\$ Thanks @TinMan, appreciate the advice in your answer below and look forward to reading your solution (did I see you post a solution to this on StackOverflow? I just got round to reading it and can't see it anymore!). \$\endgroup\$ Commented Sep 17, 2024 at 13:20

2 Answers 2

4
\$\begingroup\$

Not really an answer but too long for a comment

On the algorithm, I'll ignore the point that this isn't doing what you want (isn't guaranteed to give best answer) since that kind of thing is off topic for this site. It would be easy to wrap your code in a loop to check every permutation of columns and rows to see which produces the fewest tables. But that would not be very efficient.

I would recommend thinking about this as a binary matrix where a 1 is a gap, a 0 is data. Your problem is then to find which ordering of rows and columns in the matrix minimises the number of rectangular areas containing only 1s. I imagine by using a binary matrix you'll find some analogy in graph theory that allows you to use a graph search algorithm to find the optimal shape.

https://stackoverflow.com/q/66021340/6609896 https://stackoverflow.com/q/11481868/6609896

\$\endgroup\$
8
  • \$\begingroup\$ I misread the post. I'm going to have to think about this a bit. \$\endgroup\$ Commented Sep 13, 2024 at 15:22
  • \$\begingroup\$ @TinMan Yeah "It should be too difficult" I think is accurate, this is not a straightforward problem I don't think! \$\endgroup\$ Commented Sep 13, 2024 at 15:28
  • 1
    \$\begingroup\$ Difficulty 8.25. I should be able to remove the subset from the superset and get the right result but that would require a the data be sorted each time. I'll rewrite it tomorrow using a binary search. \$\endgroup\$ Commented Sep 17, 2024 at 10:01
  • \$\begingroup\$ @TinMan uh oh the number is rising! I hope you figure it out it's an interesting problem \$\endgroup\$ Commented Sep 17, 2024 at 12:01
  • \$\begingroup\$ Uncle! I don't have the credentials (or any credentials for that matter) to solve this problem. Difficult ∞. I think that any solution that I come up with would just be a theory because there could always be an edge case to break it. This seems relevant: Set cover problem. Let me know if you solve it. \$\endgroup\$ Commented Sep 19, 2024 at 0:32
3
\$\begingroup\$

SeparateGroupedMissingDataToNewWorkbookOptimised() - SGMDTNWO

That is a very long and descriptive name (48 characters). I see why its so long, the subroutine is performing many tasks. I try and keep my procedures under 40 lines for readability. Ideally, a method perform a minimum number of task. The fewer tasks a method that are performed the easier it is to debug and modify.

Breaking the code into multiple procedure will also make it more flexible.

Let's start by extracting the core functionality to a function. We are after all primarily looking at how to create a Data Model.

Function GetMissingDataTables(Data As Variant) as Variant()

This function can have a smaller name because it is doing less tasks. The beauty of it is it makes it far easier and flexible to test. Notice, that we're passing in a data array and not a range. This allows us to add more use cases. For instance, we could port the code to Access.

MDTables = GetMissingDataTables(ws.Range("A1").CurrentRegion)

We can now pass our model to multiple views:

Sub CreateMDWorkbookView(wb as Workbook, MDTables() As Variant)

Passing in the a workbook and model adds more flexibility yet.

Sub CreateMDWorksheetView(ws as WorkSheet, MDTables() As Variant)

Maybe you decide that you need all the tables on a single worksheet, just write another method and pass in you data model.

Consider writing a class to manage your code.

MissingDataReport

• Sub LoadData(Data As Variant)

• Sub CreateWorkbookView(wb as Workbook, MDTables() As Variant)

• Sub CreateWorksheetView(ws as WorkSheet, MDTables() As Variant)

• Function TableCount() as Long

• Function GetTable(TableIndex as Long) As Variant()

• Function GetTables() As Variant()

• Function GetTableWithoutMissingData() As Variant()

When tackling a complex problem, I will scaffold out the code. I probably won't use many of the methods but helps me focus on specific tasks. Doing it this way makes me more productive. If I get stuck a one part, I can work on other parts of the code to clear my head.

Dim i As Integer, j As Integer

If I could, I would edit every post about arrays on the internet to use r and c. It's a little thing but it really make the code easier to read.

Dim r As Long, c As Long

By the way, use Long and not Integer. An Integer max value is only 32,767 where as Long is 2,147,483,647.

Consider:

Dim i As Long, j As Long, k As Long, l As Long

Compared to:

Dim r1 As Long, c1 As Long, r2 As Long, c2 As Long

Magic Numbers

Magic numbers are hard-coded values that are not explained. Magic number should be replaced by consts.

Const FirstDataColumn as Long = 3 For currentRow = FirstDataColumn To lastRow

Key Columns

Having hard coded key columns really limits the use case for the code. Consider using key columns and excluded columns to calculate missing data columns. This would take you code to a whole new level.

Add a Reference to the Microsoft Scripting Runtime

This enables Intellisense to do what it does so well; make us more productive.

Scripting.Dictionary Intellisense

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.