Parquet-cli unable to read variant shredding tests 86 and 126? #97

@scovich

While adding support for variant array unshredding to arrow-rs, I discovered that parquet-cli is unable to correctly read the parquet files for cases 86 and 126, both due to the same index out of bounds error:

% parquet cat parquet-testing/shredded_variant/case-086.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file parquet-testing/shredded_variant/case-086.parquet
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
        at org.apache.parquet.cli.Main.run(Main.java:169)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/ryan.johnson/arrow-rs/parquet-testing/shredded_variant/case-086.parquet
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
        at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
        ... 3 more
Caused by: java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 0
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
        at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
        at java.base/java.util.Objects.checkIndex(Objects.java:385)
        at java.base/java.util.ArrayList.get(ArrayList.java:427)
        at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
        at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
        at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
        at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:72)
        at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:66)
        at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:308)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:141)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:105)
        at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:186)
        at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:105)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:156)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
        ... 9 more

The backtrace for case-126.parquet is identical.

Looking at the arrow-rs debug printouts of the arrays, I don't see anything obviously wrong, though.

arrow-rs debug printout

For case-086.parquet, the input data is:

typed_value array: ListArray
[
  StructArray
-- validity:
[
  valid,
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  [0],
  null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
  "comedy",
  null,
  "drama",
]
],
]

And for case-126.parquet, we have:

typed_value array: ListArray
[
  StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Struct([Field { name: "a", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]) }, Field { name: "b", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]) }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "a" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Int32)
PrimitiveArray<Int32>
[
  1,
  2,
]
]
-- child 1: "b" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
  "comedy",
  "drama",
]
]
]
],
  StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  [2, 1, 2, 0, 4, 13, 115, 116, 114],
  [2, 1, 3, 0, 5, 44, 40, 77, 0, 0],
]
-- child 1: "typed_value" (Struct([Field { name: "a", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]) }, Field { name: "b", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]) }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "a" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Int32)
PrimitiveArray<Int32>
[
  3,
  4,
]
]
-- child 1: "b" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
  "action",
  "horror",
]
]
]
],
]
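
As a sanity check on the two non-null `value` entries in the second list element of case-126, the bytes can be decoded by hand. The following is only a rough sketch, assuming my reading of the Variant binary encoding is right: an object header with 1-byte field ids and offsets, short strings with the length in the upper 6 bits of the header, and primitive type id 11 being a date stored as little-endian int32 days since the epoch. The function names `decode_small_object` and `decode_scalar` are hypothetical helpers, not part of arrow-rs or parquet-java. The field ids cannot be resolved to names here because the metadata dictionary is not part of the printout.

```python
import struct

def decode_scalar(buf: bytes):
    """Decode one Variant value at the start of buf (short string or date only)."""
    header = buf[0]
    basic_type = header & 0x03
    if basic_type == 1:  # short string: length lives in the upper 6 bits
        length = header >> 2
        return ("short_string", buf[1:1 + length].decode("utf-8"))
    if basic_type == 0:  # primitive: type id lives in the upper 6 bits
        type_id = header >> 2
        if type_id == 11:  # assumed: date, int32 days since 1970-01-01
            (days,) = struct.unpack_from("<i", buf, 1)
            return ("date_days", days)
        return ("primitive", type_id)
    raise ValueError(f"basic type {basic_type} not handled in this sketch")

def decode_small_object(buf: bytes):
    """Decode a Variant object that uses 1-byte field ids and offsets."""
    header = buf[0]
    # Only the simplest layout is handled: basic type 2 (object), all size bits zero.
    if header & 0x03 != 2 or header >> 2 != 0:
        raise ValueError("layout not handled in this sketch")
    n = buf[1]                                 # number of fields
    field_ids = list(buf[2:2 + n])             # ids into the metadata dictionary
    offsets = list(buf[2 + n:2 + n + n + 1])   # n + 1 offsets into the value area
    data = buf[2 + n + n + 1:]
    return {fid: decode_scalar(data[offsets[i]:]) for i, fid in enumerate(field_ids)}

row0 = bytes([2, 1, 2, 0, 4, 13, 115, 116, 114])
row1 = bytes([2, 1, 3, 0, 5, 44, 40, 77, 0, 0])
print(decode_small_object(row0))  # {2: ('short_string', 'str')}
print(decode_small_object(row1))  # {3: ('date_days', 19752)}
```

Under those assumptions, both residual values are well-formed single-field objects, which is consistent with the arrow-rs printout not showing anything obviously wrong.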
