Support backwards compatible reads for unnanotated repeated primitive fields in Parquet#20943
Conversation
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
76c7f3a to
a13e0a1
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
a13e0a1 to
e48dcb7
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
|
Fixed failing unit tests by adding an additional check in Other changes:
|
e48dcb7 to
8ac025f
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
8ac025f to
f628064
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
f628064 to
80682f3
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
80682f3 to
b439b81
Compare
|
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
b439b81 to
ab353ce
Compare
There was a problem hiding this comment.
The approach looks good to me, @raunaqmorarka PTAL. @martint can you process the CLA on this one too?
ab353ce to
063a45f
Compare
pettyjamesm
left a comment
There was a problem hiding this comment.
Looks good to me, thanks @mxmarkovics !
This is needed so that changes in constructField implementation are shared by both Iceberg and Hive (e.g. trinodb#20943) Additionally, this adds validation of field ids for Iceberg array and map types.
Description
Per the Parquet Spec:
A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field. So the following schemas should be treated equivalently.However, Trino currently does not support reading the old format and will throw the following error if attempting to read the field as a
array<int>:If instead the field is defined in the table as simply the primitive type itself e.g.
intthen another unhelpful error will be thrown:This issue is also present in Presto, however the second case will actually return incorrect results instead of throwing the second error.
Spark is able to correctly handle this however, e.g.:
Have added two old parquet file that was written with the old format showing the reproduction and used for unit testing. Tested on AWS Athena.
There are several issues here:
ParquetTypeUtilsthe table declared typearray<int>assumes that the ParquetColumnIOwill be aGroupColumIO, however in this case theColumnIOis aPrimitiveColumnIOwhich is repeated.PrimitiveFieldwith aGroupFieldas it would be in the case of the new schema format causes theGroupFieldto have the same repetition and definition levels of thePrimitiveFieldchild, where it should have 1 less for each since it is one level above. This change is also consistent with the change made in Spark to handle this case as well: apache/spark@d246010.Note: There is a similar code path in the Iceberg connector in
IcebergParquetColumnIOConverter, however I was unable to reproduce this issue using Iceberg as it currently does not allow this case, so we don't need to handle the Iceberg case in Trino. https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java#L74-L77.Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: