Description
Hi, I'm implementing unload support for the presto/trino driver and I'm seeing an odd issue with cubestore's import behavior. I've followed the same strategy as athena's driver, but using CREATE TABLE instead of UNLOAD (since trino doesn't implement this).
What I'm seeing is that cubestore sometimes drops an arbitrary number of records from each file that it imports. I've double-checked its logs and all files are reported as complete:
<pid:1> Running job completed (14.535796782s): IdRow { ...
To give you more context, the unload strategy is as follows:
- Execute a CREATE TABLE ... AS ... query on trino that writes the data using TEXTFILE format with gzip compression
- Get the columns from the created table
- List the files in the table dir and generate the signed s3 urls
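For illustration, the first step could be sketched roughly as below. This is only a sketch under assumptions: `build_unload_sql` and the table name are hypothetical, and it assumes the Hive connector, where `format = 'TEXTFILE'` is a table property. How gzip compression is enabled (connector or session configuration) is not shown.

```python
def build_unload_sql(table: str, select_sql: str) -> str:
    """Build a CREATE TABLE ... AS statement that writes TEXTFILE output.

    Hypothetical helper for illustration; gzip compression is assumed to be
    configured on the connector or session and is not set here.
    """
    return f"CREATE TABLE {table} WITH (format = 'TEXTFILE') AS {select_sql}"


# Example with a made-up table name and query.
sql = build_unload_sql("hive.tmp.unload_orders", "SELECT id, status FROM orders")
```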
This is basically the same strategy as athena's driver but using CREATE TABLE instead. I've compared cube's imported data with the data in the table, but couldn't find anything obvious yet. When the issue happens, about 50% of the records are missing on cube, but broken down per imported file it varies a lot (from about 20% to almost 80%).
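The per-file comparison can be sketched as follows. It is a minimal sketch, assuming each unloaded file is gzip-compressed text with one record per line; `count_records` and `missing_ratio` are hypothetical helper names, and the imported count would have to come from a separate query against cube.

```python
import gzip
import io


def count_records(gz_bytes: bytes) -> int:
    """Count the lines (records) in a gzip-compressed text file."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rb") as fh:
        return sum(1 for _ in fh)


def missing_ratio(expected: int, imported: int) -> float:
    """Fraction of expected records that are absent after the import."""
    if expected == 0:
        return 0.0
    return (expected - imported) / expected


# Example: a small in-memory file with three records, of which
# only two are assumed to have been imported.
sample = gzip.compress(b"1,a\n2,b\n3,c\n")
expected = count_records(sample)
ratio = missing_ratio(expected, imported=2)
```

Running this for every unloaded file gives the per-file loss ratio to compare against the 20% to 80% range described above.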
I realize that this issue is basically impossible to reproduce, but I'm looking for advice on the points at which cubestore may report that it successfully imported a file when the import was only partially successful.