Description
Is your feature request related to a problem? Please describe.
I work on datasets that have per row labels, which are key-value pairs of type string. Particularly, I use this in BigQuery where the underlying type is ARRAY<STRUCT<key STRING, value STRING>>
, but I can see parallels in PostgreSQL HSTORE.
Describe the solution you'd like
I see great value in allowing fields of this type to be treated as a high level 'set' type in Cube, and filtered on using set operators.
When I have a dataset of this type, there are several operations I want to do:
Filter by rows which include a certain key-value pair in their labels
This would be implemented with a construct such as:
{
filters: [{
member: "MyCube.labels.KEYNAME"
operator: "equals",
values: ["accepted_value_1", "accepted_value_2"]
}]
}
Group by label key's values
Choose as a dimension, the values of a certain key that exist in the labels. If they don't exist, the value can be null or blank. For example, take this sample data:
Cost | Labels |
---|---|
1.50 | {env: "prod", owner: "josh"} |
3.75 | {env: "dev", owner: "josh"} |
2.11 | {env: "prod", owner: "pavel"} |
And the cube query snippet:
{
measures: ["MyCube.cost"], // is defined as a `sum` type
dimensions: ["MyCube.labels.env"],
}
The result would be:
MyCube.labels.env |
MyCube.cost |
---|---|
prod | 3.61 |
dev | 3.75 |
Describe alternatives you've considered
LEFT JOIN UNNEST
Up until now, it is recommended (in BigQuery) to use LEFT JOIN UNNEST(labels) with dimensions to break down rows with extra dimensionality. Then I can access labels.key and labels.value as normal in cube. This is bad for me, because I have a measure of cost
(as above), whose value will be replicated for every single key-value label pair. While that can then be handled, it would be very brittle.
UNNEST subquery with fake dimension
A good way that has worked for me is to use UNNEST in a subquery. I don't consider this an alternative, more than 'a way I can do what I need with what exists today'.
I actually think this approach should be borrowed for official implementation of set
type.
Cube example:
cube('MyCube', {
sql: `SELECT cost, labels FROM my_cool_table
WHERE
-- This is how we implement 'set membership for label key and value' without official Cube support
ARRAY_LENGTH(ARRAY(SELECT label FROM (SELECT CONCAT(key, ":", value) AS label FROM UNNEST(labels)) WHERE ${FILTER_PARAMS.MyCube.matchLabels.filter('label')})) > 0
dimensions: {
labelString: {
-- This is how we implement 'group by certain label keys' without official Cube support
sql: `TO_JSON_STRING(ARRAY(SELECT CONCAT(key, ":", value) FROM UNNEST(${CUBE}.labels) WHERE ${FILTER_PARAMS.MyCube.groupLabelKeys.filter('key')}))`,
type: `string`,
},
// Fake dimension used only to gather the `labels` keys to group by the values of those keys across rows
groupLabelKeys: {
// HACK: We only use this via FILTER_PARAMS, so make any WHERE involving this dimension, a no-op!
sql: `${FILTER_PARAMS.MyCube.groupLabelKeys.filter((values) => `${values}`)}`
type: `string`
},
// Fake dimension used only to gather the label keys:values to test set membership within `labels`
matchLabels:
// HACK: We only use this via FILTER_PARAMS, so make any WHERE involving this dimension, a no-op!
sql: `${FILTER_PARAMS.MyCube.matchLabels.filter((values) => `${values}`)}`
type: `string`
},
}
In real implementation, we can encode the data collection of groupLabelKeys
and matchLabels
into the dimensions and filters specs seamlessly, and under the hood we use:
TO_JSON_STRING
to flatten labels to a printable single columnARRAY_LENGTH((SELECT key, value FROM labels WHERE...)) = N
for label membership matching- grouping by could be done one of several ways, which I'll discuss more if it seems interesting. But in hack implementation, we reduced the labels presented in
labelString
based on the group-keys.
Additional context