How to split one iterator into two?

Question

How would you split one iterator into two without iterating twice or using additional memory to store all the data?

Solution when you can store everything in memory:

l = [{'a': i, 'b': i * 2} for i in range(10)]
def a(iterator):
    for item in iterator:
        print(item)
def b(iterator):
    for item in iterator:
        print(item)

a([li['a'] for li in l])
b([li['b'] for li in l])

Or, if you can iterate twice:

class SomeIterable(object):
    def __iter__(self):
        for i in range(10):
            yield {'a': i, 'b': i * 2}


def a(some_iterator):
    for item in some_iterator:
        print(item)


def b(some_iterator):
    for item in some_iterator:
        print(item)


s = SomeIterable()

a((si['a'] for si in s))
b((si['b'] for si in s))

But how would I do this if I want to iterate only once?

Without iterating twice or using additional memory? You don't.
– Mark Ransom, Commented Jun 15, 2015 at 22:37
If a must complete before b begins, this is literally impossible. If that isn't a requirement, the problem is still a huge pain; you either need to rewrite a and b, or you need to use threads.
– user2357112, Commented Jun 15, 2015 at 22:38
Your question is puzzling, but have you tried itertools.tee?
– hpaulj, Commented Jun 15, 2015 at 22:47
By the way, your SomeIterator class is not actually an iterator. An iterator has a __next__ (or next in Python 2) method, in addition to an __iter__ method that returns itself. Your class is more appropriately described as an "iterable".
– Blckknght, Commented Jun 15, 2015 at 22:49
@lqdc I understand that the a and b that you're showing are simple examples to show what you want and not the real ones, but does a really need to go through the whole dataset before running b? Would it be ok to supply just one element to a, then to b, and then go to the next in the generator?
– Ricardo Cárdenes, Commented Jun 15, 2015 at 23:01

user2357112 · Accepted Answer · 2015-06-15 23:37:32Z

From the clarification in the comments, a and b are external library functions you cannot rewrite, but it's okay to interleave their execution. In that case, what you want is possible, but it pretty much requires threads:

import multiprocessing.pool  # for ThreadPool (threads, not processes)
import queue  # named Queue in Python 2

_endofinput = object()  # sentinel marking the end of the stream

def _queueiter(q):
    # Yield items from q until the sentinel arrives.
    while True:
        item = q.get()
        if item is _endofinput:
            break
        yield item

def parallel_execute(funcs, iterable, maxqueue):
    '''Interleaves the execution of funcs[0](iterable), funcs[1](iterable), etc.

    No function is allowed to lag more than maxqueue items behind another.
    (This will require adjustment if a function might return before consuming
    all input.)

    Makes only one pass over iterable.

    '''

    # Loop variables are named q so they do not shadow the queue module.
    queues = [queue.Queue(maxsize=maxqueue) for func in funcs]
    queueiters = [_queueiter(q) for q in queues]
    threadpool = multiprocessing.pool.ThreadPool(processes=len(funcs))

    # Python 3 dropped tuple-unpacking lambdas, so starmap_async unpacks the pairs.
    results = threadpool.starmap_async(lambda f, x: f(x), zip(funcs, queueiters))

    for item in iterable:
        for q in queues:
            q.put(item)

    for q in queues:
        q.put(_endofinput)

    threadpool.close()
    return results.get()
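
To see the whole approach run end to end, here is a compact, self-contained Python 3 variant of the same queue-per-consumer idea. It is only a sketch: run_interleaved, _drain and CountingSource are hypothetical names, it uses concurrent.futures instead of multiprocessing.pool, and a and b are simple stand-ins for the external library functions. The lambdas do the 'a'/'b' projection so that a and b themselves stay untouched.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

_END = object()  # sentinel marking the end of input

def _drain(q):
    # Turn a queue into an iterator that stops at the sentinel.
    while True:
        item = q.get()
        if item is _END:
            return
        yield item

def run_interleaved(funcs, iterable, maxqueue=2):
    # One bounded queue per consumer: put() blocks while a consumer lags,
    # so at most maxqueue items per consumer are buffered at any time.
    # Assumption: every function consumes its whole input; one that stops
    # early would leave put() blocked forever.
    queues = [queue.Queue(maxsize=maxqueue) for _ in funcs]
    with ThreadPoolExecutor(max_workers=len(funcs)) as pool:
        futures = [pool.submit(f, _drain(q)) for f, q in zip(funcs, queues)]
        for item in iterable:  # the only pass over the input
            for q in queues:
                q.put(item)
        for q in queues:
            q.put(_END)
        return [fut.result() for fut in futures]

class CountingSource:
    # Hypothetical source that records how many times it is iterated.
    passes = 0

    def __iter__(self):
        CountingSource.passes += 1
        for i in range(10):
            yield {'a': i, 'b': i * 2}

def a(values):
    # Stand-in for a library function that needs the whole stream.
    return sum(values)

def b(values):
    # Second stand-in; also consumes everything before returning.
    return sorted(values)

result_a, result_b = run_interleaved(
    [lambda items: a(d['a'] for d in items),
     lambda items: b(d['b'] for d in items)],
    CountingSource(),
)
print(result_a, result_b, CountingSource.passes)
```

The passes counter ends at 1, which is the point of the exercise: both functions see every item, yet the source is iterated a single time and only a few items are queued at once.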

Blckknght · Answer · 2015-06-15 23:18:28Z

If the functions that consume your two iterators are not under your control and don't return control of the program to your code before consuming all of the iterator contents, there is no way to do what you want. You'll either need to hold all of the data in memory in between function calls or regenerate the iterator for the second function.

Now, if your functions were generators (that yield back to your code after consuming some small number of items from the input), you could make it work with itertools.tee. There might also be some other partial workarounds if you can call one or both of your functions with various parts of the input data at a time and then somehow compile the results of the repeated calls together into the desired output. Otherwise you're probably out of luck.
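
A minimal sketch of the itertools.tee case, assuming the consumers can be written as generators that yield after every input item. The names source, running_sum and running_max are hypothetical and only illustrate the shape; they are not the asker's real functions.

```python
import itertools

def source():
    # Stand-in for the question's single-pass data.
    for i in range(5):
        yield {'a': i, 'b': i * 2}

def running_sum(values):
    # Hypothetical generator-style consumer: yields after every input item.
    total = 0
    for v in values:
        total += v
        yield total

def running_max(values):
    # Second hypothetical consumer with the same yield-per-item shape.
    best = None
    for v in values:
        best = v if best is None else max(best, v)
        yield best

left, right = itertools.tee(source())
sums = running_sum(item['a'] for item in left)
maxes = running_max(item['b'] for item in right)

# zip advances both consumers one step at a time, so neither tee branch
# runs far ahead of the other and tee's internal buffer stays small.
final_sum = final_max = None
for final_sum, final_max in zip(sums, maxes):
    pass

print(final_sum, final_max)
```

If one branch were instead run to completion before the other started, as with a(...) followed by b(...), tee would have to buffer every item for the lagging branch, which is exactly the memory cost the question wants to avoid.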

Ricardo Cárdenes · Answer · 2015-06-15 23:10:13Z

Ok, if your functions are stateless, but still expect an iterable as argument, and that's the whole problem, then this should do:

for si in s:
    a([si['a']])
    b([si['b']])

edited Jun 15, 2015 at 23:10

answered Jun 15, 2015 at 22:44

5 Comments

lqdc Over a year ago

well, they need the whole iterator to generate output. Like you cannot add to a trie once it's generated because it's read only after that.

Ricardo Cárdenes Over a year ago

Well, if the functions are really stateless, then I don't see how this shouldn't work. Either that, or there's still something we need to learn about your problem

lqdc Over a year ago

alright, replace a with def a(iterable): return marisa_trie.Trie(iterable). So I was wrong when saying they are stateless. They expect the whole iterable to generate output and then it's read only after that. An example is from this package: github.com/kmike/marisa-trie

Ricardo Cárdenes Over a year ago

I think the only important part there is if marisa_trie.Trie(iterable) returns the same as [marisa_trie.Trie([x]) for x in iterable]. Then you can call a and b separately for each key, and this works.

lqdc Over a year ago

That would generate n tries for n items in the iterable. Basically I want one object at the end, and you cannot pass another iterable to it once it's finished with the first one.
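
The objection in this last comment can be reproduced with any build-once container. The sketch below uses frozenset purely as a stand-in for a read-only structure such as marisa_trie.Trie (an assumption for illustration; marisa_trie itself is not used), and build and keys are hypothetical names.

```python
def build(iterable):
    # Stand-in for a constructor that consumes its whole input once and
    # returns a read-only object (a frozenset here, not a real trie).
    return frozenset(iterable)

keys = ['ab', 'abc', 'b']

whole = build(iter(keys))              # one object holding every key
per_item = [build([k]) for k in keys]  # n separate one-key objects

print(len(whole), len(per_item), [len(p) for p in per_item])
```

Calling the function once per item gives n small objects rather than one complete one, so the per-item loop only helps when the partial results can be merged afterwards.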
