Proof-of-concept Parquet GEOMETRY logical type implementation #43977
Co-authored-by: Gang Wu <ustcwg@gmail.com>
@wgtmac I have added ColumnIndex and covering support. Please help review this PR when you have time, thank you.
I just finished reviewing it for the 2nd pass. Thanks for the great work!
My main concern is the difference from the Java PoC, which generates min/max values in the statistics and page index as if the GEOMETRY column were a pure BYTE_ARRAY column. Otherwise we need to revise the spec to add a lot of exceptions for the geometry type. WDYT?
Geometry(std::string crs, LogicalType::GeometryEdges::edges edges,
         LogicalType::GeometryEncoding::geometry_encoding encoding,
         std::string metadata)
    : LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNKNOWN),
Suggested change
- : LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNKNOWN),
+ : LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNSIGNED),
SortOrder::UNSIGNED is the default sort order of the BYTE_ARRAY type. Could we just use this, so you don't have to change a line in column_writer.cc? The good thing is that the ColumnIndex of the geometry type can also be generated automatically, though the min/max values are derived from the binary values and are useless. This is the same practice used in the Java PoC impl.
This is life-saving. Now all the special handling of geometry statistics for unknown sort order has gone away.
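For readers following the thread, here is a minimal sketch of what an unsigned byte-wise sort order means for min/max over BYTE_ARRAY values. It is not the code in column_writer.cc or statistics.cc; `UnsignedLess`, `MinMax`, and `ByteWiseMinMax` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: orders two binary values lexicographically,
// comparing each byte as an unsigned value (0x80 sorts after 0x7f).
inline bool UnsignedLess(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<uint8_t>(x) < static_cast<uint8_t>(y);
      });
}

struct MinMax {
  std::string min;
  std::string max;
};

// Hypothetical helper: min/max of a batch of binary values under the
// unsigned ordering. Returns empty strings for an empty batch.
inline MinMax ByteWiseMinMax(const std::vector<std::string>& values) {
  MinMax out;
  if (values.empty()) return out;
  auto range = std::minmax_element(values.begin(), values.end(), UnsignedLess);
  out.min = *range.first;
  out.max = *range.second;
  return out;
}
```

Applied to WKB payloads, this ordering says nothing about spatial extent, which is why the min/max values produced this way are described above as useless for geometry.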
cpp/src/parquet/statistics.cc
Outdated
out.mmax = maxes[3];

if (coverings_.empty()) {
  // Generate coverings from bounding box if coverings is not present
When will coverings_ be empty? Is it the default behavior? I'm not sure if we need to check whether the edges are planar, since a bbox is not accurate for spherical edges. BTW, if we don't have a good implementation for coverings, I think we can just ignore it for now.
This is for generating coverings from the bounding box when assembling the encoded representation of the geometry statistics. I've added a member called generate_covering_ to make it more explicit.
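As an illustration of deriving a covering from a bounding box, the sketch below encodes the box as a little-endian WKB polygon with a single closed ring. It assumes planar edges (as noted above, a bbox is not accurate for spherical edges) and is not the PR's implementation; `AppendUInt32LE`, `AppendDoubleLE`, and `BoundingBoxToWkbPolygon` are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Appends a 32-bit unsigned integer in little-endian byte order.
inline void AppendUInt32LE(std::string* out, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    out->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

// Appends an IEEE 754 double in little-endian byte order.
inline void AppendDoubleLE(std::string* out, double d) {
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof(bits));
  for (int i = 0; i < 8; ++i) {
    out->push_back(static_cast<char>((bits >> (8 * i)) & 0xFF));
  }
}

// Hypothetical helper: WKB polygon covering [xmin, xmax] x [ymin, ymax].
inline std::string BoundingBoxToWkbPolygon(double xmin, double ymin,
                                           double xmax, double ymax) {
  std::string wkb;
  wkb.push_back(0x01);      // byte order marker: little-endian
  AppendUInt32LE(&wkb, 3);  // WKB geometry type 3 = Polygon
  AppendUInt32LE(&wkb, 1);  // one ring
  AppendUInt32LE(&wkb, 5);  // five points; the last repeats the first
  const double ring[5][2] = {
      {xmin, ymin}, {xmax, ymin}, {xmax, ymax}, {xmin, ymax}, {xmin, ymin}};
  for (const auto& p : ring) {
    AppendDoubleLE(&wkb, p[0]);
    AppendDoubleLE(&wkb, p[1]);
  }
  return wkb;
}
```

The result is 93 bytes: a 13-byte header followed by five 16-byte points.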
cpp/src/parquet/statistics.h
Outdated

class GeometryStatisticsImpl;

class PARQUET_EXPORT GeometryStatistics {
Thanks for adding these!
cpp/src/parquet/statistics.h
Outdated
return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
    descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
    has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
Why not directly add const EncodedGeometryStatistics* geometry_statistics = NULLPTR to the end of the existing function signature?
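The shape of this suggestion can be sketched as follows: a trailing pointer parameter with a null default keeps existing call sites compiling without a second overload. The types and the function here are hypothetical stand-ins, not the real Statistics API, and plain nullptr is used in place of Arrow's NULLPTR macro.

```cpp
#include <cstdint>

// Hypothetical stand-in for the encoded geometry statistics type.
struct EncodedGeometryStatsSketch {
  double xmin;
  double xmax;
  double ymin;
  double ymax;
};

// Hypothetical stand-in for the object the factory returns.
struct StatsSketch {
  int64_t num_values;
  int64_t null_count;
  bool has_geometry_stats;
};

// One signature serves both old and new callers: callers that do not
// know about geometry statistics simply omit the last argument.
inline StatsSketch MakeStatsSketch(
    int64_t num_values, int64_t null_count,
    const EncodedGeometryStatsSketch* geometry_statistics = nullptr) {
  StatsSketch out;
  out.num_values = num_values;
  out.null_count = null_count;
  out.has_geometry_stats = (geometry_statistics != nullptr);
  return out;
}
```

An existing call such as MakeStatsSketch(10, 0) is unchanged, while a geometry-aware caller passes the address of its statistics as the third argument.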
…several review comments
… coverings from bounding box when populating the encoded statistics
…and upper-right points
I've changed the min/max statistics of geometry columns to be the WKB representation of lower-left and upper-right corners in the last commit according to apache/iceberg#10981 (comment).
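A rough sketch of what such min/max values could look like, assuming 2D points and little-endian WKB. This is not the code from the commit; `EncodeWkbPointXY`, `CornerMinMax`, and `CornersFromPoints` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: little-endian WKB for a 2D point (21 bytes).
inline std::string EncodeWkbPointXY(double x, double y) {
  std::string wkb;
  wkb.push_back(0x01);      // byte order marker: little-endian
  const uint32_t type = 1;  // WKB geometry type 1 = Point
  for (int i = 0; i < 4; ++i) {
    wkb.push_back(static_cast<char>((type >> (8 * i)) & 0xFF));
  }
  const double coords[2] = {x, y};
  for (double c : coords) {
    uint64_t bits;
    std::memcpy(&bits, &c, sizeof(bits));
    for (int i = 0; i < 8; ++i) {
      wkb.push_back(static_cast<char>((bits >> (8 * i)) & 0xFF));
    }
  }
  return wkb;
}

struct CornerMinMax {
  std::string min;  // WKB of the lower-left corner (xmin, ymin)
  std::string max;  // WKB of the upper-right corner (xmax, ymax)
};

// Hypothetical helper: accumulates a bounding box over (x, y) pairs and
// encodes its two corners. Assumes a non-empty input.
inline CornerMinMax CornersFromPoints(
    const std::vector<std::pair<double, double>>& points) {
  double xmin = std::numeric_limits<double>::infinity();
  double ymin = xmin;
  double xmax = -xmin;
  double ymax = -xmin;
  for (const auto& p : points) {
    xmin = std::min(xmin, p.first);
    xmax = std::max(xmax, p.first);
    ymin = std::min(ymin, p.second);
    ymax = std::max(ymax, p.second);
  }
  CornerMinMax out;
  out.min = EncodeWkbPointXY(xmin, ymin);
  out.max = EncodeWkbPointXY(xmax, ymax);
  return out;
}
```

Note that neither corner has to be one of the input points: for (1, 5) and (3, 2) the corners are (1, 2) and (3, 5).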
Thanks for the quick change! I've left some minor comments.
cpp/src/parquet/column_reader.cc
Outdated
EncodedGeometryStatistics encoded_geometry_stats;
if (stats.__isset.geometry_stats) {
  encoded_geometry_stats = FromThrift(stats.geometry_stats);
}
page_statistics.set_geometry(encoded_geometry_stats);
Suggested change
- EncodedGeometryStatistics encoded_geometry_stats;
- if (stats.__isset.geometry_stats) {
-   encoded_geometry_stats = FromThrift(stats.geometry_stats);
- }
- page_statistics.set_geometry(encoded_geometry_stats);
+ page_statistics.set_geometry(FromThrift(stats.geometry_stats));
cpp/src/parquet/statistics.h
Outdated
return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
    descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
    has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
IMO, we'd better not add another overload. If there is a compelling reason to do so, we can add an ARROW_DEPRECATED macro to the old one.
1. Remove metadata property of geometry logical types 2. Remove covering from geometry statistics
1. geometry statistics moved out of statistics, it is now a field of column metadata 2. geometry statistics is removed from page index
8a50947 to da55a55
Rationale for this change
This is a continuation of #43196
In apache/parquet-format#240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation ( apache/parquet-java#1379 ). This is a PR to explore what an implementation would look like in C++.
What changes are included in this PR?
We are still in the process of completing all the changes needed to integrate geometry logical type support into the C++ implementation.
Are these changes tested?
The tests added only cover very basic use cases. Comprehensive tests will be added in future commits.
Are there any user-facing changes?
Yes! (And will eventually be documented)