
Issue with cubestore TableImport during preAggregation with unload: support more CSV quote options #7071


Description

@igorcalabria

Hi, I'm implementing unload support for the Presto/Trino driver and I'm seeing an odd issue with cubestore's import behavior. I've followed the same strategy as the Athena driver, but using CREATE TABLE instead of UNLOAD (since Trino doesn't implement UNLOAD).

What I'm seeing is that, sometimes, cubestore drops an arbitrary number of records from each file it imports. I've double-checked its logs and all files are reported as complete:

<pid:1> Running job completed (14.535796782s): IdRow { ...

To give you more context, the unload strategy is as follows (there's a rough sketch of the flow after the list):

  • Execute a CREATE TABLE ... AS ... query on Trino that writes the data in TEXTFILE format with gzip compression
  • Get the columns from the created table
  • List the files in the table's directory and generate signed S3 URLs
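
For concreteness, here is a minimal TypeScript sketch of that flow. It is not the actual driver code: the `queryFn` helper, the `hive.unload` catalog/schema, the `external_location` table property, and the returned shape are all illustrative assumptions; only the general CTAS-then-presign approach comes from the list above. The S3 part assumes AWS SDK v3.

```typescript
import { S3Client, ListObjectsV2Command, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

// queryFn stands in for the driver's own query method (assumption).
async function unloadPreAggregation(
  queryFn: (sql: string) => Promise<any[]>,
  selectSql: string,
  tableName: string,
  bucket: string,
  prefix: string,
) {
  // 1. Write the pre-aggregation as TEXTFILE data via CTAS. Gzip compression is
  //    assumed to come from the Hive connector's compression settings rather
  //    than a table property; pinning the S3 location via external_location is
  //    also an assumption about the setup.
  await queryFn(`
    CREATE TABLE hive.unload.${tableName}
    WITH (format = 'TEXTFILE', external_location = 's3://${bucket}/${prefix}/${tableName}')
    AS ${selectSql}
  `);

  // 2. Read the column names/types back from the created table.
  const columns = await queryFn(`DESCRIBE hive.unload.${tableName}`);

  // 3. List the files Trino wrote and pre-sign a GET URL for each one,
  //    so cubestore can import them straight from S3.
  const s3 = new S3Client({});
  const listed = await s3.send(new ListObjectsV2Command({
    Bucket: bucket,
    Prefix: `${prefix}/${tableName}/`,
  }));
  const csvFiles = await Promise.all(
    (listed.Contents ?? []).map((obj) =>
      getSignedUrl(s3, new GetObjectCommand({ Bucket: bucket, Key: obj.Key! }), { expiresIn: 3600 }),
    ),
  );

  return { columns, csvFiles };
}
```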

This is basically the Athena driver's strategy, just with CREATE TABLE instead of UNLOAD. I've compared the data Cube imported with the data in the table, but haven't found anything obvious yet. When the issue happens, about 50% of the records are missing in Cube overall, but broken down per imported file it varies a lot (from about 20% to almost 80%).

I realize this issue is basically impossible to reproduce, but I'm looking for advice on where cubestore might report that it successfully imported a file when the import was only partially successful.


Labels: cube store, help wanted, pre-aggregations
