Streaming JSON with Flask

I have a SQLAlchemy query (which can be used as an iterator) that could return a large result set. The first version of the code was very simple: Release objects have a to_dict() method which returns a dictionary, so I appended each one to a list and jsonified the result:

# releases = <SQLAlchemy query object>

output = []
for r in releases:
    output.append(r.to_dict())

return jsonify(releases=output), 200

(context on github)

This result set could potentially grow to the point where fitting it in memory would be impractical. Even with only a thousand releases there is already a significant lag before we start getting results.

Unfortunately, Flask’s jsonify() function doesn’t support streaming, so we have to do it manually as described in the Flask documentation. I thus came up with a simple generator like so:

# query = <something>

def generate():
    yield '{"releases": ['
    for release in query:
        yield json.dumps(release.to_dict()) + ', '
    yield ']}'

return Response(generate(), content_type='application/json')

The problem is that trying to json.loads() the output of this will result in “ValueError: No JSON object could be decoded”, because the last element in the list is followed by a trailing comma, which JSON does not allow. No .join() for us!
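
To see the failure concretely, here is a minimal sketch that swaps the SQLAlchemy query for a plain list of dictionaries. The rows list and the generate_naive name are stand-ins for illustration only, not part of the real code:

```python
import json

# Hypothetical stand-in for the query results: plain dicts instead of Release objects.
rows = [{"version": "1.0"}, {"version": "1.1"}]

def generate_naive(items):
    # Same shape as the generator above: every element is followed by a comma.
    yield '{"releases": ['
    for item in items:
        yield json.dumps(item) + ', '
    yield ']}'

# Join the chunks the way a client would see them once the response is complete.
body = ''.join(generate_naive(rows))

try:
    json.loads(body)
    parsed_ok = True
except ValueError:
    # On Python 3 this is json.JSONDecodeError, a subclass of ValueError.
    parsed_ok = False

print(body)
print(parsed_ok)
```

The body ends in ', ]}', and that dangling comma is what the parser rejects.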

Thus we need to detect the last iteration, and omit the comma.

How does one do this? I found a handy answer on Stack Overflow, which describes what is called a “lagging generator”. On each iteration we yield the previous item, which lets us look one step ahead.
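
As a rough illustration of the idea in isolation, the sketch below wraps any iterable and reports whether each item is the last one. The lagging name and the (item, is_last) pair format are my own choices for this example and are not taken from the Stack Overflow answer:

```python
def lagging(iterable):
    """Yield (item, is_last) pairs, running one step behind the underlying iterator."""
    iterator = iter(iterable)
    try:
        prev = next(iterator)
    except StopIteration:
        return  # nothing to yield for an empty iterable
    for item in iterator:
        # A newer item exists, so the one we are holding back cannot be the last.
        yield prev, False
        prev = item
    # The underlying iterator is exhausted, so the held-back item is the last one.
    yield prev, True

print(list(lagging("abc")))
# [('a', False), ('b', False), ('c', True)]
```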

So I modified the generator, and came up with the following:

def generate():
    releases = iter(query)
    prev_release = next(releases)  # get first result

    yield '{"releases": ['

    # Iterate over the releases
    for release in releases:
        yield json.dumps(prev_release.to_dict()) + ', '
        prev_release = release

    # Now yield the last iteration without comma but with the closing brackets
    yield json.dumps(prev_release.to_dict()) + ']}'

Now we can detect the last iteration and omit the comma, emitting the closing brackets in its place.

There’s just one problem. When the query returns no results (a reasonable situation), the first next(releases) call will raise StopIteration before we’ve output any JSON. Code that expects a valid JSON document will thus fail.
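
The empty case can be reproduced without a database. This sketch assumes a generate_lagging function that mirrors the generator above but takes any iterable of dictionaries; the name and the sample data are illustrative only:

```python
import json

def generate_lagging(items):
    # Lagging generator as above, but over any iterable of dicts (an assumption for this sketch).
    iterator = iter(items)
    prev = next(iterator)  # raises StopIteration when items is empty
    yield '{"releases": ['
    for item in iterator:
        yield json.dumps(prev) + ', '
        prev = item
    yield json.dumps(prev) + ']}'

# Non-empty input produces a valid JSON document.
ok = json.loads(''.join(generate_lagging([{"version": "1.0"}, {"version": "1.1"}])))
print(ok)

# Empty input never produces one.
try:
    # On older Pythons the stray StopIteration silently ends the generator,
    # so the body is an empty string, which is not valid JSON.
    empty_body = ''.join(generate_lagging([]))
except RuntimeError:
    # Python 3.7+ (PEP 479) converts the stray StopIteration into RuntimeError.
    empty_body = None

print(repr(empty_body))
```

Either way, the client gets nothing it can parse.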

The solution is therefore to catch the first StopIteration, yield a valid “empty” JSON result set, and return to end the generator. (Explicitly raising StopIteration inside a generator is no longer safe: since Python 3.7, PEP 479 turns it into a RuntimeError.) The final solution is thus:

def generate():
    """
    A lagging generator to stream JSON so we don't have to hold everything in memory

    This is a little tricky, as we need to omit the last comma to make valid JSON,
    thus we use a lagging generator, similar to http://stackoverflow.com/questions/1630320/
    """
    releases = iter(query)
    try:
        prev_release = next(releases)  # get first result
    except StopIteration:
        # StopIteration here means the length was zero, so yield a valid releases doc and stop
        yield '{"releases": []}'
        return

    # We have some releases. First, yield the opening json
    yield '{"releases": ['

    # Iterate over the releases
    for release in releases:
        yield json.dumps(prev_release.to_dict()) + ', '
        prev_release = release

    # Now yield the last iteration without comma but with the closing brackets
    yield json.dumps(prev_release.to_dict()) + ']}'

return Response(generate(), content_type='application/json')

(github link)
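
For a quick sanity check outside Flask, the same logic can be exercised against plain Python data. In this sketch, stream_releases and collect are hypothetical helper names, the items are dictionaries rather than Release objects, and the empty case ends with a bare return, which behaves the same on Python 2 and 3:

```python
import json

def stream_releases(items):
    """Yield a JSON document of the form {"releases": [...]} piece by piece."""
    iterator = iter(items)
    try:
        prev = next(iterator)  # get first result
    except StopIteration:
        # No items at all: emit a valid empty document and stop.
        yield '{"releases": []}'
        return
    yield '{"releases": ['
    for item in iterator:
        yield json.dumps(prev) + ', '
        prev = item
    # Last item: no comma, but with the closing brackets.
    yield json.dumps(prev) + ']}'

def collect(items):
    # Join the streamed chunks and parse them, as a client would after receiving the full body.
    return json.loads(''.join(stream_releases(items)))

print(collect([]))
print(collect([{"version": "1.0"}]))
print(collect({"n": i} for i in range(3)))
```

The last call passes a generator expression, which has no length, to show that the approach only relies on iteration.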
