Question about scroll API and find_each / find_in_batches #651

Closed
@howtwizer

Elasticsearch 5.1

Following docs:
http://www.rubydoc.info/gems/elasticsearch-api/Elasticsearch/API/Actions#scroll-instance_method

When I try to reproduce the example 'Call the scroll API until all the documents are returned', I notice that this loop

# Call the `scroll` API until empty results are returned
while r = client.scroll(scroll_id: r['_scroll_id'], scroll: '5m') and not r['hits']['hits'].empty? do
  puts "--- BATCH #{defined?($i) ? $i += 1 : $i = 1} -------------------------------------------------"
  puts r['hits']['hits'].map { |d| d['_source']['title'] }.inspect
  puts
end

does not include the results of this initial call:

# Open the "view" of the index with the `scan` search_type
r = client.search index: 'test', search_type: 'scan', scroll: '5m', size: 10

So in the end we are missing the first `size` documents, i.e. the ones returned by the initial search call.

Example:
If the index contains ['test1', 'test2', 'test3', ..., 'test100'],
calling the scroll API with an initial size of 10 returns ['test11', 'test12', ..., 'test100']; the first 10 results are missing.

I get the same results in the Elasticsearch console: the first scroll call does not include the results of the initial call, so the scroll method itself seems to work as intended.
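To make the expected behaviour concrete, here is a minimal sketch of a loop that processes the hits of the initial search response before calling `scroll`. It assumes the initial response already carries the first page of hits, as I observe on 5.1. `FakeScrollClient` and `each_scroll_batch` are hypothetical names made up for this illustration (the fake client only mimics the response shape so the snippet runs without a cluster); with a real client, only the `search` and `scroll` calls would be used.

```ruby
# Hypothetical stand-in for the real client, so the sketch runs without a cluster.
# It only mimics the response shape of `search` and `scroll`.
class FakeScrollClient
  def initialize(titles)
    @titles = titles
    @offset = 0
    @size = 10
  end

  # Assumption: the initial scrolled search already returns the first page of hits.
  def search(index:, scroll:, size:)
    @size = size
    @offset = 0
    next_page
  end

  # Each scroll call returns the next page, or empty hits when exhausted.
  def scroll(scroll_id:, scroll:)
    next_page
  end

  private

  def next_page
    page = @titles[@offset, @size] || []
    @offset += page.size
    { '_scroll_id' => 'fake-scroll-id',
      'hits' => { 'hits' => page.map { |t| { '_source' => { 'title' => t } } } } }
  end
end

# Yield every batch, starting with the hits of the initial search response.
def each_scroll_batch(client, index:, size: 10, scroll: '5m')
  r = client.search(index: index, scroll: scroll, size: size)
  until r['hits']['hits'].empty?
    yield r['hits']['hits']
    r = client.scroll(scroll_id: r['_scroll_id'], scroll: scroll)
  end
end

client = FakeScrollClient.new((1..25).map { |i| "test#{i}" })
titles = []
each_scroll_batch(client, index: 'test') do |hits|
  titles.concat(hits.map { |d| d['_source']['title'] })
end
puts titles.size
```

The only difference from the documented loop is the order: the current response is yielded first, and `scroll` is called afterwards, so the first page is not skipped.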

But my question is about find_each.
According to the docs:

          # Iterate effectively over models using the `find_in_batches` method.
          #
          # All the options are passed to `find_in_batches` and each result is yielded to the passed block.
          #
          # @example Print out the people's names by scrolling through the index
          #
          #     Person.find_each { |person| puts person.name }
          #
          #     # # GET http://localhost:9200/people/person/_search?scroll=5m&search_type=scan&size=20
          #     # # GET http://localhost:9200/_search/scroll?scroll=5m&scroll_id=c2Nhbj...
          #     # Test 0
          #     # Test 1
          #     # Test 2
          #     # ...
          #     # # GET http://localhost:9200/_search/scroll?scroll=5m&scroll_id=c2Nhbj...
          #     # Test 20
          #     # Test 21
          #     # Test 22
          #

But in fact the output starts from the second batch:

          #     # Test 20
          #     # Test 21
          #     # Test 22

I think the problem is that the 'response' variable gets overwritten in find_in_batches.
