Grouping into more groups in one iteration

Question

I had a need to group the same dataset in several groups. So instead of repeatedly query the dataset, I made an extension that could do it once. The caveat is, that the result is materialized in dictionaries, because I didn't managed to find an way to avoid that. Maybe you can?

public static IDictionary<string, Dictionary<object, HashSet<T>>> MultiGroupBy<T>(this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupers)
{
  if (source == null) throw new ArgumentNullException(nameof(source));
  if (groupers == null) throw new ArgumentNullException(nameof(groupers));

  IDictionary<string, Dictionary<object, HashSet<T>>> results = new Dictionary<string, Dictionary<object, HashSet<T>>>();

  using (var enumer = source.GetEnumerator())
  {
    while (enumer.MoveNext())
    {
      foreach ((var label, var func) in groupers)
      {
        if (!results.TryGetValue(label, out var dict))
        {
          dict = new Dictionary<object, HashSet<T>>();
          results[label] = dict;
        }

        var key = func(enumer.Current);
        if (!dict.TryGetValue(key, out var set))
        {
          set = new HashSet<T>();
          dict[key] = set;
        }

        set.Add(enumer.Current);
      }
    }

  }

  return results;
}

Use case:

static void TestMultiGrouping()
{
  string[] data = 
  {
    "Black",
    "White",
    "Yellow",
    "green",
    "Red",
    "blue",
    "cyan",
    "Magenta",
    "Orange"
  };

  foreach (var result in data.MultiGroupBy(
    ("First UCase", s => s.Length > 0 && char.IsUpper(s[0])), 
    ("Length", s => s.Length), 
    ("Length Four", s => s.Length == 4), 
    ("Contains 'e'", s => s.Contains('e')),
    ("Num n's", s => s.Count(c => c == 'n'))))
  {
    Console.WriteLine($"Results for {result.Key}:");
    foreach (var dict in result.Value)
    {
      Console.WriteLine($"{dict.Key}: {dict.Value.Count} [{(string.Join(", ", dict.Value))}]");
    }
    Console.WriteLine();
  }
}

If you're trying to avoid materializing this data, I think that is better suited to a push interface (IObservable) rather than a pull interface (IEnumerable). Rather than returning a dict mapping keys to sets of elements, you can try returning a dict mapping keys to streams of elements. — Alexander
– Alexander, Commented Jul 19, 2019 at 5:05
@Alexander, tanks for input. Your suggestion could be a solution, but the datasets datatype (IEnumerable) was given. Maybe the header of the question is a little misleading - the primary goal was performance optimization rather than "one iteration", but I thought, that "one iteration" was the way to optimize - which it still may be if memory allocation matters. — user73941
– user73941, Commented Jul 19, 2019 at 6:14
RxNet adds IEnumerable.toObservable(), which would be a perfect fit. I think the performance characteristics would be pretty good. Using a cold observable (one that doesn't do anything until there's a subscriber), nothing happens until anyone is interested in the result, and they won't have to be concurrently stored in memory — Alexander
– Alexander, Commented Jul 19, 2019 at 6:45

Pieter Witvoet · Accepted Answer · 2019-07-18 12:43:14Z

If you only want to enumerate source once, then you'll have to cache it somehow. Either by materializing it immediately, as you do, or whenever the first group is enumerated, but that's more complicated.

If you don't mind duplicate entries in your groups and throwing on duplicate labels, then your code can be simplified to the following:

public static IDictionary<string, IEnumerable<IGrouping<object, T>>> MultiGroupBy<T>(
    this IEnumerable<T> source,
    params (string label, Func<T, object> keySelector)[] groupings)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupings == null) throw new ArgumentNullException(nameof(groupings));

    var materializedSource = source.ToArray();
    return groupings.ToDictionary(
        grouping => grouping.label,
        grouping => materializedSource.GroupBy(grouping.keySelector));
}

This materializes source up-front, but each grouping is lazily evaluated. Some quick enumeration tests with randomly generated strings show a roughly 40% speed improvement. I haven't measured memory consumption, but I expect that to be a bit higher due to the extra references/values stored in materializedSource.

I suspect the main reason for the speed difference is that your code performs a lookup into results for every item/grouping combination, something that separate GroupBy calls don't need to do.

Other notes:

That using GetEnumerator/while MoveNext construction can be simplified to a foreach loop.
You do not guard against duplicate labels, so you can end up with mixed results (and even mixed key types).

Nifty. But I can't get the same result as you when it comes to performance: If the source is a materialized array itself - yes, but if not, then mine is about double as fast as yours and for large sets even more. But one thing yours showed me, is that I can't use HashSet<T>as the inner vector, because it handles doublets wrongly for value types. `HashSet<T> made sense in the original context though. — user73941
– user73941, Commented Jul 18, 2019 at 13:46
About foreach vs. GetEnumerator() - it is my experience that the latter is slightly (sometimes much) faster - which is strange because foreach calls GetEnumerator(). Duplicate labels - thanks - has to be dealt with. — user73941
– user73941, Commented Jul 18, 2019 at 13:46
It's getting into micro-optimisation, but there's an argument for using ToList() in preference to ToArray() for temporary private eager evaluation results: the implementation is effectively almost the same, but ToArray() does a final copy to a shorter array for most inputs. — Peter Taylor
– Peter Taylor, Commented Jul 18, 2019 at 14:17

Peter Taylor · Accepted Answer · 2019-07-18 14:13:41Z

public static IDictionary<string, Dictionary<object, HashSet<T>>> MultiGroupBy<T>(this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupers)

I don't understand the mixture of interfaces (IDictionary) and implementations (Dictionary, HashSet), nor the mixture of generics (<T>) and non-generics (object). Why is it not

public static IDictionary<string, IDictionary<K, ISet<T>>> MultiGroupBy<T, K>(this IEnumerable<T> source, params (string Label, Func<T, K> Getter)[] groupers)

?

  IDictionary<string, Dictionary<object, HashSet<T>>> results = new Dictionary<string, Dictionary<object, HashSet<T>>>();

  using (var enumer = source.GetEnumerator())
  {
    while (enumer.MoveNext())
    {
      foreach ((var label, var func) in groupers)
      {
        if (!results.TryGetValue(label, out var dict))
        {
          dict = new Dictionary<object, HashSet<T>>();
          results[label] = dict;
        }

        ...

I'm not quite sure why you want to return an empty dictionary if the source is empty. As a caller of your library, I'd probably rather get a dictionary mapping the grouper names to empty dictionaries.

That also simplifies the initialisation:

  var results = groupers.ToDictionary(grouper => grouper.Item1, _ => new Dictionary<object, HashSet<T>>());

  using (var enumer = source.GetEnumerator())
  {
    while (enumer.MoveNext())
    {
       ...
    }

  }

KISS. foreach is much kinder on the maintenance programmer, who doesn't have to check for correct usage patterns of the unsugared API. Using MoveNext() / Current for speed is the epitome of premature optimisation unless benchmarking shows that it's a bottleneck, in which case there should be a comment explaining the bottleneck to justify the more complex code.

Moreover, if this is a bottleneck then it seems likely that the dictionary lookups in results for every single element in the source will be slower than the overhead of foreach, so you could start by replacing results with a List<(string Label, Func<T, K> Getter, IDictionary<K, ISet<T>> Groups)> and just convert it to a dictionary after the loop.

      foreach ((var label, var func) in groupers)

var (label, func) saves the repetition.

After my proposed refactors and some minor tidying of whitespace, I get

public static IDictionary<string, IDictionary<K, ISet<T>>> MultiGroupBy<T, K>(this IEnumerable<T> source, params (string Label, Func<T, K> Getter)[] groupers)
{
  if (source == null) throw new ArgumentNullException(nameof(source));
  if (groupers == null) throw new ArgumentNullException(nameof(groupers));

  var results = groupers.ToDictionary(grouper => grouper.Item1, _ => (IDictionary<K, ISet<T>>)new Dictionary<K, ISet<T>>());

  foreach (var elt in source)
  {
    foreach (var (label, func) in groupers)
    {
      var dict = results[label];
      var key = func(elt);
      if (!dict.TryGetValue(key, out var set))
      {
        set = new HashSet<T>();
        dict[key] = set;
      }

      set.Add(elt);
    }
  }

  return results;
}

Fine solution. You're right about the interface/class mix. One thing though: I can't use the type parameter K because the keys may not be of the same type - that's why I use object. The instantiation of the results dictionary the way you do, is definitely smarter and cleaner code - I'll use that, but apparently I can't measure any significant difference from mine. About ISet/HashSet see my comment to Pieter Witvoet answer. Tanks for interesting points - as always. — user73941
– user73941, Commented Jul 18, 2019 at 14:47
LINQ provides public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer);. Why not use that one? — dfhwze
– dfhwze, Commented Jul 18, 2019 at 14:49
@HenrikHansen, nothing stops you calling it with object for K. I admit that the downside is that the inference is less likely to work, so it's usually going to be necessary to use explicit type arguments. — Peter Taylor
– Peter Taylor, Commented Jul 18, 2019 at 15:25
It was intended on Pieter's answer. See my answer for an explanation. — dfhwze
– dfhwze, Commented Jul 18, 2019 at 15:26

dfhwze · Accepted Answer · 2019-07-18 15:24:25Z

GroupBy vs ToLookup

From reference source: Linq Enumerable

Dictionary<object, HashSet<T>>> can be replaced with a ILookup<object, T>.

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector) 
{
    // impl ..
}

There is also an overload in case you require compliant behavior with HashSet<T>.

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, 
    IEqualityComparer<TKey> comparer)
{
    // impl ..
}

This is much faster than GroupBy. Have a look at the implementation of the latter.

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector) 
{
    return new GroupedEnumerable<TSource, TKey, TSource>(source, keySelector, IdentityFunction<TSource>.Instance, null);
}

And GroupedEnumerable wraps Lookup.

public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator() 
{
    return Lookup<TKey, TElement>.Create<TSource>(source, keySelector, elementSelector, comparer).GetEnumerator();
}

Refactored Code

Pieter's answer could be updated with a performance boost substituting GroupBy with ToLookup, also including Peter's micro-optimized ToList.

public static IDictionary<string, ILookup<object, T>> MultiLookupBy<T>(
    this IEnumerable<T> source, params (string Label, Func<T, object> Getter)[] groupings)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (groupings == null) throw new ArgumentNullException(nameof(groupings));

    var materializedSource = source.ToList();
    return groupings.ToDictionary(
        grouping => grouping.Label, 
        grouping => materializedSource.ToLookup(grouping.Getter));
}

And your test code would change a bit.

 sb.AppendLine($"Results for {result.Key}:");
 foreach (var dict in result.Value)
 {
     sb.AppendLine($"{dict.Key}: {dict.Count()} [{(string.Join(", ", dict))}]");
 }
 sb.AppendLine();

I get performance close to the initial OP with this refactored code.

This seems to be the "testwinner" both according to elegance and performance (for large datasets about 20 %). My problem is now to accept an answer, because I think ,you have contributes evenly to the best solution. I have written a number [1,99] in notepad. Comment a number, and closest will be accepted :-) — user73941
– user73941, Commented Jul 18, 2019 at 15:55
@HenrikHansen you should usually accept the answer, which is the most useful to you and future readers, which stumble upon this problem. People landing here via search will often take the accepted answer as providing a final "best" solution - so the accepted answer should include all relevant facts and not just be a partial answer. - If you are worried about fame, the accepted answer could be edited to a community answer, with credits to both posters. — Falco
– Falco, Commented Jul 19, 2019 at 11:24

t3chb0t · Accepted Answer · 2019-07-20 06:44:17Z

3

Consistency with LINQ

I find in order to be consistent with other LINQ APIs and to make the usage of this extension more intuitive you should slightly adjust the parameter names and the return value and rename it to ToLookups.

ToLookup calls the Func keySelector and since this extension is accepting a collection, I suggest the name keySelectors.

As far as the return value is concerned, I would use ILookup twice here so that the behaviour of the result is consistent.

Unexpected behaviour due to `HashSet`

If you require unique elements then the source should be prefiltered. Ignoring them here is not something I would expect from a grouping. On the contrary, it should group them together because this is what grouping is for. A HashSet could also change the order of elements which the builtin grouping wouldn't so it's another surprise here.

Suggested code

This is how I think it should look like:

public static ILookup<string, ILookup<object, T>> ToLookups<T>
(
    this IEnumerable<T> source, 
    params (string Name, Func<T, object> KeySelector)[] keySelectors
)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (keySelectors == null) throw new ArgumentNullException(nameof(keySelectors));

    var materializedSource = source.ToList();       
    return 
        keySelectors
            .Select(t => (t.Name, Lookup: materializedSource.ToLookup(t.KeySelector)))
            .ToLookup(t => t.Name, t => t.Lookup);
}

answered Jul 20, 2019 at 6:44

t3chb0t

44.7k9 gold badges84 silver badges191 bronze badges

\$\begingroup\$ Check my answer on HashSet behavior. You can use ToLookup with an overload that takes a IEqualityComparer<TKey> comparer. \$\endgroup\$

dfhwze
– dfhwze

2019-07-20 06:48:09 +00:00
Commented Jul 20, 2019 at 6:48
\$\begingroup\$ @dfhwze I'm not quite sure what it has to do with the HashSet? The IEqualityComparer<T> It's for grouping, whereas HashSet is for throwing away duplicates and a grouping shouldn't be doing that. \$\endgroup\$

t3chb0t
– t3chb0t

2019-07-20 06:52:47 +00:00
Commented Jul 20, 2019 at 6:52
\$\begingroup\$ You are right, it's on the Key. Doh! \$\endgroup\$

dfhwze
– dfhwze

2019-07-20 06:56:35 +00:00
Commented Jul 20, 2019 at 6:56
\$\begingroup\$ Thanks for the answer. According to HashSet, you're absolutely right - see my first comment to Pieter Witvoet's answer. I'll consider the renaming, but I think I prefer my name of the method itself. I look forward to benchmark you solution a little later. \$\endgroup\$

user73941
– user73941

2019-07-20 07:06:19 +00:00
Commented Jul 20, 2019 at 7:06

Add a comment |

user73941 · Accepted Answer · 2019-07-21 04:45:17Z

FYI. I've now tested the different versions of the algorithm from the different answers, and the result is as follows:

Data Size: 10
Name       Iterations        Average            Min            Max          Total        Std Dev    Units
Pieter Wit:        50        0.38341        0.05530       16.09750       19.17070        2.24480    [Milliseconds]
dfhwze    :        50        0.09890        0.01250        3.96660        4.94510        0.55250    [Milliseconds]
Peter Tayl:        50        0.14559        0.01500        6.16400        7.27940        0.85970    [Milliseconds]
T3chb0t   :        50        0.18089        0.01240        8.06260        9.04470        1.12590    [Milliseconds]
Original  :        50        0.11584        0.01640        4.54850        5.79220        0.63330    [Milliseconds]

Data Size: 100
Name       Iterations        Average            Min            Max          Total        Std Dev    Units
Pieter Wit:        50        0.52665        0.48760        0.78700       26.33230        0.05190    [Milliseconds]
dfhwze    :        50        0.14118        0.11800        0.24010        7.05920        0.02070    [Milliseconds]
Peter Tayl:        50        0.15725        0.14010        0.35670        7.86250        0.03030    [Milliseconds]
T3chb0t   :        50        0.13385        0.11880        0.18680        6.69250        0.01470    [Milliseconds]
Original  :        50        0.15542        0.14090        0.32780        7.77100        0.02600    [Milliseconds]

Data Size: 1000
Name       Iterations        Average            Min            Max          Total        Std Dev    Units
Pieter Wit:        50        4.86897        4.56660        5.49500      243.44840        0.19180    [Milliseconds]
dfhwze    :        50        1.22802        1.14460        1.55030       61.40110        0.10070    [Milliseconds]
Peter Tayl:        50        1.51039        1.41420        1.83450       75.51970        0.10540    [Milliseconds]
T3chb0t   :        50        1.33878        1.13730        2.61480       66.93920        0.21000    [Milliseconds]
Original  :        50        1.53352        1.39930        1.93510       76.67620        0.12120    [Milliseconds]

Data Size: 10000
Name       Iterations        Average            Min            Max          Total        Std Dev    Units
Pieter Wit:        50       53.30435       48.53940       59.39360     2665.21760        2.12420    [Milliseconds]
dfhwze    :        50       13.29163       11.58010       17.93610      664.58150        1.42940    [Milliseconds]
Peter Tayl:        50       15.99885       13.73030       19.87350      799.94260        1.62800    [Milliseconds]
T3chb0t   :        50       13.35479       11.60260       17.27620      667.73940        1.33350    [Milliseconds]
Original  :        50       16.06655       14.10760       21.15530      803.32750        1.57870    [Milliseconds]

Data Size: 100000
Name       Iterations        Average            Min            Max          Total        Std Dev    Units
Pieter Wit:        50      759.18213      671.44490      972.02490    37959.10640      106.57280    [Milliseconds]
dfhwze    :        50      184.68625      157.19610      240.79290     9234.31240       27.82440    [Milliseconds]
Peter Tayl:        50      247.55367      207.27300      296.28640    12377.68350       38.71610    [Milliseconds]
T3chb0t   :        50      200.40129      159.78880      241.07520    10020.06430       31.49570    [Milliseconds]
Original  :        50      250.01759      208.41280      324.99400    12500.87940       39.78020    [Milliseconds]

Data Size: 500000
Name       Iterations        Average            Min            Max          Total        Std Dev    Units
Pieter Wit:        50     4241.30253     3572.39540     4887.39420   212065.12660      382.99050    [Milliseconds]
dfhwze    :        50     1009.33538      798.42660     1143.81710    50466.76910      124.30220    [Milliseconds]
Peter Tayl:        50     1344.13312     1085.37460     1562.34310    67206.65590      185.08020    [Milliseconds]
T3chb0t   :        50     1002.87650      784.16660     1195.38060    50143.82510      136.03740    [Milliseconds]
Original  :        50     1354.36220     1072.92070     1536.09860    67718.10980      171.94550    [Milliseconds]

Test Data: randomly generated strings of length [0, 20), and the testcase was:

  foreach (var result in data.MultiGroupBy(
    ("First UCase", s => s.Length > 0 && char.IsUpper(s[0])),
    ("Length", s => s.Length),
    ("Length Four", s => s.Length == 4),
    ("Contains 'e'", s => s.Contains('e')),
    ("Num 'n's", s => s.Count(c => c == 'n'))))
  {
    foreach (var dict in result.Value)
    {
      sum += dict.Value.Count;
    }
  }

In order to get equivalent results, I changed the HashSet in the original with a List.

It's somehow a little disappointing that my efforts to do it in one iteration didn't pay off.

I'm too lazy to try it myself, but what happens when you replace the results dictionaries with an array in the original and Peter Taylor's answers? There's really no reason to use a dictionary over the labels in the loop, when the groupers are already in an ordered list. — VisualMelon
– VisualMelon, Commented Jul 21, 2019 at 9:17
@VisualMelon: Good suggestion. I've tried with the original, and it seems to improve it by about 7-10 percent. — user73941
– user73941, Commented Jul 21, 2019 at 14:12

Stack Exchange Network

Grouping into more groups in one iteration

5 Answers 5

GroupBy vs ToLookup

Refactored Code

Consistency with LINQ

Unexpected behaviour due to `HashSet`

Suggested code

You must log in to answer this question.

Hot Network Questions

Grouping into more groups in one iteration

5 Answers 5

GroupBy vs ToLookup

Refactored Code

Consistency with LINQ

Unexpected behaviour due to HashSet

Suggested code

You must log in to answer this question.

Related

Hot Network Questions

Unexpected behaviour due to `HashSet`