Skip to content

Support key-value set dimension type and operators (e.g. HSTORE) #2012

@joshk0

Description

@joshk0

Is your feature request related to a problem? Please describe.
I work on datasets that have per row labels, which are key-value pairs of type string. Particularly, I use this in BigQuery where the underlying type is ARRAY<STRUCT<key STRING, value STRING>>, but I can see parallels in PostgreSQL HSTORE.

Describe the solution you'd like
I see great value in allowing fields of this type to be treated as a high level 'set' type in Cube, and filtered on using set operators.

When I have a dataset of this type, there are several operations I want to do:

Filter by rows which include a certain key-value pair in their labels

This would be implemented with a construct such as:

{
  filters: [{
    member: "MyCube.labels.KEYNAME"
    operator: "equals",
    values: ["accepted_value_1", "accepted_value_2"]
  }]
}

Group by label key's values

Choose as a dimension, the values of a certain key that exist in the labels. If they don't exist, the value can be null or blank. For example, take this sample data:

Cost Labels
1.50 {env: "prod", owner: "josh"}
3.75 {env: "dev", owner: "josh"}
2.11 {env: "prod", owner: "pavel"}

And the cube query snippet:

{
  measures: ["MyCube.cost"], // is defined as a `sum` type
  dimensions: ["MyCube.labels.env"],
}

The result would be:

MyCube.labels.env MyCube.cost
prod 3.61
dev 3.75

Describe alternatives you've considered

LEFT JOIN UNNEST

Up until now, it is recommended (in BigQuery) to use LEFT JOIN UNNEST(labels) with dimensions to break down rows with extra dimensionality. Then I can access labels.key and labels.value as normal in cube. This is bad for me, because I have a measure of cost (as above), whose value will be replicated for every single key-value label pair. While that can then be handled, it would be very brittle.

UNNEST subquery with fake dimension

A good way that has worked for me is to use UNNEST in a subquery. I don't consider this an alternative, more than 'a way I can do what I need with what exists today'.

I actually think this approach should be borrowed for official implementation of set type.

Cube example:

cube('MyCube', {
  sql: `SELECT cost, labels FROM my_cool_table
     WHERE
        -- This is how we implement 'set membership for label key and value' without official Cube support
        ARRAY_LENGTH(ARRAY(SELECT label FROM (SELECT CONCAT(key, ":", value) AS label FROM UNNEST(labels)) WHERE ${FILTER_PARAMS.MyCube.matchLabels.filter('label')})) > 0

  dimensions: {
   
    labelString: {
      -- This is how we implement 'group by certain label keys'  without official Cube support
      sql: `TO_JSON_STRING(ARRAY(SELECT CONCAT(key, ":", value) FROM UNNEST(${CUBE}.labels) WHERE ${FILTER_PARAMS.MyCube.groupLabelKeys.filter('key')}))`,
      type: `string`,
    },

    // Fake dimension used only to gather the `labels` keys to group by the values of those keys across rows
    groupLabelKeys: {
      // HACK: We only use this via FILTER_PARAMS, so make any WHERE involving this dimension, a no-op!
      sql: `${FILTER_PARAMS.MyCube.groupLabelKeys.filter((values) => `${values}`)}`
      type: `string`
    },
    
    // Fake dimension used only to gather the label keys:values to test set membership within `labels`
    matchLabels:
      // HACK: We only use this via FILTER_PARAMS, so make any WHERE involving this dimension, a no-op!
      sql: `${FILTER_PARAMS.MyCube.matchLabels.filter((values) => `${values}`)}`
      type: `string`
    },
  }

In real implementation, we can encode the data collection of groupLabelKeys and matchLabels into the dimensions and filters specs seamlessly, and under the hood we use:

  • TO_JSON_STRING to flatten labels to a printable single column
  • ARRAY_LENGTH((SELECT key, value FROM labels WHERE...)) = N for label membership matching
  • grouping by could be done one of several ways, which I'll discuss more if it seems interesting. But in hack implementation, we reduced the labels presented in labelString based on the group-keys.

Additional context

Slack thread

Metadata

Metadata

Assignees

No one assigned

    Labels

    help wantedCommunity contributions are welcome.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions