Python Generators
One of Python's strengths is its powerful generators. Even a basic understanding of them unlocks the ability to elegantly handle even huge datasets. But they hold some surprises about what is executed when; understanding them at a deeper level allows you to use and debug them more effectively.
Iterables and Iterators
First, some basics. An iterable is an object like a list or a tuple that can be iterated over in a for loop. We will see shortly that there are other types of iterables.
a = [1, 2, 3]
for i in a:
    print(i)
# 1
# 2
# 3
An iterator itr is an object with a __next__() method which returns the next object to be iterated over. If there are no more objects, __next__() will instead raise a StopIteration exception. The pythonic way to call it is the builtin next function: next(itr). There is a deep connection between iterables and iterators; part of the iterable contract is that an iterable must expose an __iter__() method which returns an iterator. The pythonic way to call this is the builtin iter() function: iter(a).
a = [1, 2]
itr = iter(a)
print(next(itr))
# 1
print(next(itr))
# 2
print(next(itr))
# StopIteration
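To make the iterator contract concrete, here is a minimal sketch of a hand-written iterator. The class name Countdown and its behavior are made up for illustration; the only real requirements are the __iter__ and __next__ methods described above.

```python
class Countdown:
    """Hypothetical iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # Returning self lets the object be used directly in a for loop.
        return self

    def __next__(self):
        if self.current <= 0:
            # Signal that there is nothing left to iterate over.
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


itr = Countdown(3)
print(next(itr))
# 3
print(list(itr))
# [2, 1]
```

Note that list(itr) only sees the values that have not been consumed yet.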
In fact, the for loop over an iterable a can be implemented as:
# Vanilla for loop
for x in a:
    f(x)
# or the iterator version
itr = iter(a)
while True:
    try:
        x = next(itr)
    except StopIteration:
        break
    f(x)
To make life slightly easier, iterators must also be iterables; i.e. they must expose the __iter__ method, which in this case will just return themselves:
iter([1, 2, 3]) is [1, 2, 3]
# False
itr = iter([1, 2, 3])
iter(itr) is itr
# True
One important difference between iterators and most iterables is that iterators are always one-shot, while iterables might be able to be consumed more than once. If you call next on an iterator (in a for loop or otherwise), you are consuming one iteration, and it can't be consumed again. When you iterate over most iterables (like lists, etc.), you create a new iterator, which starts from the beginning.
a = [1, 2]
b = iter([1, 2])
def print1(itbl):
    for x in itbl:
        print(x)
        break
print1(a)
# 1
print1(a)
# 1
print1(a)
# 1
print1(b)
# 1
print1(b)
# 2
print1(b)
# Nothing! We have exhausted b already.
In fact, for many iterables, you can create multiple independent iterators:
a = [1, 2, 3]
b = iter(a)
c = iter(a)
next(b)
# 1
next(c)
# 1
next(c)
# 2
next(c)
# 3
next(c)
# StopIteration
next(b)
# 2
next(b)
# 3
next(b)
# StopIteration
To summarize, an iterable is something that exposes an iterator via iter, and an iterator is something that you can call next on.
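As a rough illustration of that split, the sketch below defines an iterable that is not itself an iterator. The class Squares is hypothetical; it simply hands out a fresh list iterator every time iter is called on it, so it can be consumed repeatedly, while each iterator it produces is still one-shot.

```python
class Squares:
    """Hypothetical iterable: hands out a fresh iterator on every iter() call."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Build a new, independent iterator each time.
        return iter([i * i for i in range(self.n)])


squares = Squares(3)
print(list(squares))
# [0, 1, 4]
print(list(squares))
# [0, 1, 4]

itr = iter(squares)
print(list(itr))
# [0, 1, 4]
print(list(itr))
# []
```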
Generators
If you use the yield keyword in a function, calling that function does not return a normal value, but instead returns a generator.
def f():
    yield 1
g = f()
type(g)
# <class 'generator'>
Generators are in fact iterators.
print(next(g))
# 1
print(next(g))
# StopIteration
Each time you call the function, you'll get a new, independent, iterator.
def f():
    yield 1
g1 = f()
g2 = f()
print(next(g1))
# 1
print(next(g1))
# StopIteration
print(next(g2))
# 1
print(next(g2))
# StopIteration
One of the most important features of a generator is that it lazily yields values. Between next calls, it suspends execution, but maintains its internal state. The classic example is the infinite counter, which lazily produces an infinite sequence of integers.
def counter():
    c = 0
    while True:
        yield c
        c += 1
count = counter()
print(next(count))
# 0
print(next(count))
# 1
This will never raise a StopIteration; it will continue to count forever. Be careful not to iterate it in a for loop without a break statement!
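One way to consume an infinite generator safely is sketched below, using itertools.islice from the standard library to take a bounded prefix, alongside a for loop with an explicit break. The counter function is repeated so the snippet runs on its own; the variable name first_five is just for illustration.

```python
from itertools import islice


def counter():
    c = 0
    while True:
        yield c
        c += 1


# islice stops after 5 items, so the infinite generator is never exhausted.
first_five = list(islice(counter(), 5))
print(first_five)
# [0, 1, 2, 3, 4]

# The same idea with an explicit break.
for n in counter():
    if n >= 3:
        break
    print(n)
# 0
# 1
# 2
```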
Let's examine the sequence of operations of this simple generator more closely.
def counter():
    print('Starting counter')
    c = 0
    while True:
        yield c
        c += 1
        print('Incremented counter to', c)
count = counter()
y = next(count)
# Starting counter
print('Counted', y)
# Counted 0
y = next(count)
# Incremented counter to 1
print('Counted', y)
# Counted 1
Let's examine the flow:
1. count = counter() initializes the generator, assigning to count an iterator. Notice it does not execute anything inside of counter.
2. y = next(count) iterates the iterator, executing the code in counter up to the first time we yield. It assigns the yielded value 0 to y.
3. print('Counted', y) prints the yielded value 0.
4. y = next(count) executes the loop from the first yield to the second. It also assigns the yielded value 1 to y.
5. print('Counted', y) prints the yielded value 1.
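The suspend-and-resume behavior in the flow above can also be observed directly. The following sketch uses inspect.getgeneratorstate from the standard library to report the generator's state at each step; the print calls inside counter are dropped here to keep the output short, and the close() call at the end is an addition for illustration.

```python
import inspect


def counter():
    c = 0
    while True:
        yield c
        c += 1


count = counter()
# Nothing inside counter has run yet.
print(inspect.getgeneratorstate(count))
# GEN_CREATED

next(count)
# Execution is paused at the yield, with c preserved.
print(inspect.getgeneratorstate(count))
# GEN_SUSPENDED

count.close()
# A closed generator cannot be resumed.
print(inspect.getgeneratorstate(count))
# GEN_CLOSED
```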
Sending
This is a basic generator that you can use for iteration. But yield also has a special power: it can receive values from outside the function and assign them to variables inside. Let's make a basic consumer:
def printer():
    print('Starting printer')
    while True:
        x = yield
        print('Printing', x)
p = printer()
y = next(p)
# Starting printer
print('Outside', y)
# Outside None
y = next(p)
# Printing None
print('Outside', y)
# Outside None
This creates a generator that doesn't yield anything useful (every yield produces None). Let's follow the flow.
1. p = printer() creates p, which does not do any execution.
2. y = next(p) executes printer up to the line x = yield, where it suspends and waits for the next iteration. Since we didn't specify anything to yield, that's the equivalent of yield None, and y is assigned None.
3. print('Outside', y): since y is None, we print Outside None.
4. y = next(p) resumes printer from the yield. We assign the result of the yield expression to x. This is also None, because we didn't send anything (foreshadowing!), so Printing None is printed. Since we are yet again not yielding anything, None is yielded and assigned to y.
5. print('Outside', y): since y is None, we print Outside None.
For yield to actually receive information, we need to use the send method of the generator. In the above example, we didn't send anything to p, because we wanted to emphasize that printer is just a normal generator, but that yield is an expression whose value can be assigned to a variable. Now let's actually get to send:
p = printer()
y = p.send(None)
# Starting printer
print('Outside', y)
# Outside None
y = p.send('a')
# Printing a
print('Outside', y)
# Outside None
p.send(2)
# Printing 2
Now we're sending things to the generator to be processed. Let's follow the flow.
1. p = printer() creates p, which does not do any execution.
2. y = p.send(None) starts the generator's execution, exactly like next(p) would. In fact, you could replace this with next(p) and it would be the same. You need to "prime" the generator in this way; if you try to send a non-None value to a just-started generator, it will raise a TypeError.
3. print('Outside', y) prints the value of y, which is None. send returns whatever the generator yields next, and the bare yield in printer yields None.
4. y = p.send('a') sends the value 'a' to the generator. The line x = yield means that it will assign to x the value that is sent, in this case 'a', and so that is printed before we loop around to the yield statement again.
5. print('Outside', y) shows that y is still None, and will always be so, because printer never yields anything else.
6. p.send(2) sends the value 2, which is printed as expected.
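The printer generator only ever yields None, which can hide the fact that send both delivers a value and returns the next yielded one. The sketch below is a hypothetical running-average consumer (the name averager and its internals are made up for this example) that yields something useful back. It also shows the TypeError raised when a generator is not primed.

```python
def averager():
    """Hypothetical consumer that yields the running mean of the values sent to it."""
    total = 0.0
    n = 0
    mean = None
    while True:
        # `yield mean` hands the current mean to the caller of send(),
        # then evaluates to whatever the next send() passes in.
        x = yield mean
        total += x
        n += 1
        mean = total / n


avg = averager()
try:
    # Sending a non-None value before priming is an error.
    avg.send(10)
except TypeError:
    print('TypeError: the generator has not been primed yet')

print(next(avg))  # prime: run up to the first yield
# None
print(avg.send(10))
# 10.0
print(avg.send(20))
# 15.0
```

Each send call resumes the generator at the yield, runs one pass of the loop, and returns the value produced by the following yield.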
Now how do we tie these together? Here's a very simple way:
count = counter()
p = printer()
p.send(None)
# > Starting printer
for x in range(3):
    p.send(next(count))
# > Starting counter
# > Printing 0
# > Incremented counter to 1
# > Printing 1
# > Incremented counter to 2
# > Printing 2
We say that count is a producer (because it is yielding results to the outside), and p is a consumer, because we are sending values to it. Of course, we can have a generator that consumes a producer.
def times2(gen):
    print('Starting times2')
    while True:
        x = next(gen)
        print('Multiplying %s by 2' % x)
        yield 2*x
count = counter()
t2 = times2(count)
next(t2)
# > Starting times2
# > Starting counter
# > Multiplying 0 by 2
# 0
next(t2)
# > Incremented counter to 1
# > Multiplying 1 by 2
# 2
The while statement above is actually just a cumbersome for loop. Leaving out the print calls, it could be written as:
def times2(gen):
    for x in gen:
        yield 2*x
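Because each stage is just a generator wrapping another iterable, stages can be chained into a lazy pipeline. The sketch below assumes the print-free versions of counter and times2, and adds a made-up second stage, plus1, purely for illustration. It also shows that the for-loop version ends cleanly when its input is finite, whereas a bare next(gen) inside a generator would surface as a RuntimeError on Python 3.7 and later once the input runs out.

```python
from itertools import islice


def counter():
    c = 0
    while True:
        yield c
        c += 1


def times2(gen):
    for x in gen:
        yield 2 * x


def plus1(gen):
    # Hypothetical extra stage, not part of the original example.
    for x in gen:
        yield x + 1


# Nothing runs until values are pulled from the end of the pipeline.
pipeline = plus1(times2(counter()))
print(list(islice(pipeline, 4)))
# [1, 3, 5, 7]

# With a finite input, the for loop simply finishes.
print(list(times2(iter([1, 2, 3]))))
# [2, 4, 6]
```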