From eb0137245643975bf7fc1ab7f901aaaaf500d01b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ra=C3=BAl=20Cumplido?= Date: Thu, 10 Oct 2024 15:16:34 +0200 Subject: [PATCH 01/12] Website: Add blog post for 18.0.0 --- _posts/2024-10-16-18.0.0-release.md | 121 ++++++++++++++++++++++++++++ 1 file changed, 121 insertions(+) create mode 100644 _posts/2024-10-16-18.0.0-release.md diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md new file mode 100644 index 000000000000..34054f6f6358 --- /dev/null +++ b/_posts/2024-10-16-18.0.0-release.md @@ -0,0 +1,121 @@ +--- +layout: post +title: "Apache Arrow 18.0.0 Release" +date: "2024-10-16 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 18.0.0 release. This covers +over 3 months of development work and includes [**XXX resolved issues**][1] +on [**YYY distinct commits**][2] from [**ZZZ distinct contributors**][2]. +See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 17.0.0 release, JJJJJ has been invited to be committer. +No new members have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Linux packages notes + + +## C Data Interface notes + + +## Arrow Flight RPC notes + + +## C++ notes + +For C++ notes refer to the full changelog. + +### Highlights + + +### Acero + + +### Compute + + +### Dataset + + +### Filesystems + + +### GPU + + +### IPC + + +### Parquet + + +### Substrait + + +## C# notes + + +## Java notes + + +## JavaScript notes + + +## Python notes + + +## R notes + +For more on what’s in the 18.0.0 R package, see the [R changelog][4]. 
+ +## Ruby and C GLib notes + +### Ruby + +### C GLib + + +## Rust notes and Go notes + +The Rust and Go projects have moved to separate repositories outside the +main Arrow monorepo. For notes on the latest release of the Rust +implementation, see the latest [Arrow Rust changelog][5]. +For notes on the latest release of the Go implementation, see the latest +[Arrow Go changelog][6] + +[1]: https://github.com/apache/arrow/milestone/62?closed=1 +[2]: {{ site.baseurl }}/release/17.0.0.html#contributors +[3]: {{ site.baseurl }}/release/17.0.0.html#changelog +[4]: {{ site.baseurl }}/docs/r/news/ +[5]: https://github.com/apache/arrow-rs/tags +[6]: https://github.com/apache/arrow-go/tags From 85d166964983683ffccacf595fdf99f9227d1071 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ra=C3=BAl=20Cumplido?= Date: Thu, 10 Oct 2024 15:18:26 +0200 Subject: [PATCH 02/12] Update URLs for 18.0.0 --- _posts/2024-10-16-18.0.0-release.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 34054f6f6358..dbebc09c3480 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -113,9 +113,9 @@ implementation, see the latest [Arrow Rust changelog][5]. 
For notes on the latest release of the Go implementation, see the latest [Arrow Go changelog][6] -[1]: https://github.com/apache/arrow/milestone/62?closed=1 -[2]: {{ site.baseurl }}/release/17.0.0.html#contributors -[3]: {{ site.baseurl }}/release/17.0.0.html#changelog +[1]: https://github.com/apache/arrow/milestone/64?closed=1 +[2]: {{ site.baseurl }}/release/18.0.0.html#contributors +[3]: {{ site.baseurl }}/release/18.0.0.html#changelog [4]: {{ site.baseurl }}/docs/r/news/ [5]: https://github.com/apache/arrow-rs/tags [6]: https://github.com/apache/arrow-go/tags From 0abdbec041377a2f80b453ab489188bc89ccb68f Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 11 Oct 2024 11:34:51 +0900 Subject: [PATCH 03/12] Add Linux packages note. --- _posts/2024-10-16-18.0.0-release.md | 1 + 1 file changed, 1 insertion(+) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index dbebc09c3480..97c98a47e854 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -44,6 +44,7 @@ Thanks for your contributions and participation in the project! ## Linux packages notes +Azure file system is enabled. ## C Data Interface notes From 28ab22b6c7833333e7ac6744ebd1c2f3566bffe0 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 11 Oct 2024 11:43:42 +0900 Subject: [PATCH 04/12] Add GLib notes --- _posts/2024-10-16-18.0.0-release.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 97c98a47e854..ff1e097cc5dc 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -105,6 +105,11 @@ For more on what’s in the 18.0.0 R package, see the [R changelog][4]. 
### C GLib +* Add support for Azure file system: [GH-43738](https://github.com/apache/arrow/issues/43738) +* FlightRPC: Add support for DoPut: [GH-41056](https://github.com/apache/arrow/issues/41056) +* FlightRPC: Add support for timeout: [GH-44178](https://github.com/apache/arrow/issues/44178) +* Parquet: Add support for writing a record batch: [GH-40860](https://github.com/apache/arrow/issues/40860) +* Add support for pull style IPC stream format decoder: [GH-40493](https://github.com/apache/arrow/issues/40493) ## Rust notes and Go notes From 157c76d0866a23c992d4bdde7d623d0c12a3e579 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 11 Oct 2024 11:43:51 +0900 Subject: [PATCH 05/12] Add Ruby notes --- _posts/2024-10-16-18.0.0-release.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index ff1e097cc5dc..53d5ed8a1af4 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -103,6 +103,11 @@ For more on what’s in the 18.0.0 R package, see the [R changelog][4]. ### Ruby +* Add workaround for install failure due to `re2.pc` on Ubuntu 20.04: [GH-41396](https://github.com/apache/arrow/issues/41396) +* Add support for `0` decimal value: [GH-43877](https://github.com/apache/arrow/issues/43877) + +C GLib related improvements are also available in Ruby. 
+ ### C GLib * Add support for Azure file system: [GH-43738](https://github.com/apache/arrow/issues/43738) From eb6dc86207e36098b6e29b8c58005dc08ccf5316 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ra=C3=BAl=20Cumplido?= Date: Fri, 11 Oct 2024 11:53:43 +0200 Subject: [PATCH 06/12] Apply several release notes from review Co-authored-by: David Li Co-authored-by: Rossi Sun Co-authored-by: Curt Hagenlocher --- _posts/2024-10-16-18.0.0-release.md | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 53d5ed8a1af4..298991ba473c 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -51,6 +51,11 @@ Azure file system is enabled. ## Arrow Flight RPC notes +**Flight UCX is deprecated.** We plan to remove this experiment in the next couple of releases. + +The Java implementation now transparently handles compressed Arrow data when reading, instead of requiring explicit configuration. (GH-43469) + +The Ruby bindings now support implementing DoPut on the server. (GH-43814) ## C++ notes @@ -61,6 +66,8 @@ For C++ notes refer to the full changelog. ### Acero +- Enhanced the row-oriented representation by widening the offset type from 32-bit to 64-bit, resolving crashes and data corruption in aggregation and hash join on large datasets due to offset overflow (GH-43495). +- Improved ordered aggregation performance by reducing complexity from `O(n*m)` to `O(n)`, where `n` is the number of rows and `m` the number of segments in the batch (GH-44052). ### Compute @@ -85,9 +92,21 @@ For C++ notes refer to the full changelog. ## C# notes - +- Partial support has been added for LargeBinary, LargeString and LargeList. The column sizes cannot exceed 2 GB in length. (GH-43266). 
+- Changes to Flight support were made for better control and compatibility, and to allow Flight Server to be hosted in pre-Kestrel versions of .NET (GH-43907, GH-43672, GH-41347). +- Support has been added for newly-defined types decimal32 and decimal64 (GH-44271). +- The import of sliced arrays through the C Data interface now works correctly. (GH-43267) ## Java notes +**Java 8 is no longer supported.** (GH-38051) + +**Gandiva may not work in this release.** For details, please see [GH-43576](https://github.com/apache/arrow/issues/43576). + +Basic support for RunEndEncoded was added (GH-39982). The ListView/StringView vector implementations are now more complete, including C Data support (multiple issues). + +Several APIs have been updated to accept `long` for addresses in preparation for FFM/large buffer support (GH-43902). We no longer expose `sun.misc.Unsafe` (GH-43479). We no longer ship the `shaded` flight-core JARs (GH-43217). + +More options were added to the Dataset ScanOptions API (GH-28866). ## JavaScript notes From 0ef7e0401bde28dc9423d5ae41c1cda8aaf46923 Mon Sep 17 00:00:00 2001 From: Antoine Pitrou Date: Mon, 14 Oct 2024 17:49:50 +0200 Subject: [PATCH 07/12] Add C++ and format notes --- _posts/2024-10-16-18.0.0-release.md | 48 +++++++++++++++++++++++++++-- 1 file changed, 46 insertions(+), 2 deletions(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 298991ba473c..96f6581cacaa 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -42,6 +42,12 @@ No new members have joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project! +## Columnar format + +The Arrow columnar format now allows 32-bit and 64-bit decimal data, in +addition to the already existing 128-bit and 256-bit decimal data types +(GH-43956). + ## Linux packages notes Azure file system is enabled. 
@@ -59,10 +65,25 @@ The Ruby bindings now support implementing DoPut on the server. (GH-43814) ## C++ notes -For C++ notes refer to the full changelog. +The default memory pool has changed to mimalloc on all platforms (GH-43254). +Previously, jemalloc was used by default on Linux. Using mimalloc by default +provides a more consistent experience accross different platforms, and +makes configuration easier. It is expected that this might either increase +or decrease performance on user workloads that use the default memory pool; +please benchmark accordingly. Jemalloc can still be selected by setting +the `ARROW_DEFAULT_MEMORY_POOL` environment variable to "jemalloc". -### Highlights +A new class `arrow::ArrayStatistics` has been added to encode basic statistics +about an Arrow array. It provides a source-agnostic representation for statistics +provided by third-party sources such as Parquet files (GH-41909). +The new Decimal32 and Decimal64 types have been made available (GH-43956). + +Several canonical extension types have been implemented: +- the [Opaque](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#opaque) extension type (GH-43454); +- the [8-bit boolean](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#bit-boolean) extension type (GH-17682); +- the [UUID](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#uuid) extension type (GH-15058); +- the [JSON](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#json) extension type (GH-32538). ### Acero @@ -71,21 +92,44 @@ For C++ notes refer to the full changelog. ### Compute +Casting between string-like and string-view-like types has been implemented (GH-42247). ### Dataset ### Filesystems +Writing small files to S3 now uses a single S3 API call instead of three +(GH-40557). Files larger than 5 MB still go through the regular multipart +upload mechanism. 
+ +Background writes are now implemented and enabled by default for the Azure +filesystem, dramatically improving the performance of writing to remote files +(GH-40036). + +Finalization of the S3 filesystem layer should hopefully be more robust (GH-44071). + +### Gandiva + +LLVM 19.1 is now supported (GH-44222). ### GPU ### IPC +The seed corpus used for fuzzing the IPC reader has been improved, hopefully +helping make the IPC reader even more robust against corrupt or malicious +IPC streams (GH-38041). ### Parquet +A new command line utility `parquet-dump-footer` allows dumping the Thrift-encoded +footer metadata of a Parquet file, optionally scrubbing confidential data +(GH-42102). This is part of the effort to collect real-world Parquet metadata +so as to evaluate the efficiency of future improvements to the Parquet format. +Please see https://github.com/apache/parquet-benchmark for instructions to submit +footers representative of your own workloads. ### Substrait From 3bc8b43cf895e8156558b4352ed48aeeb590b36f Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Thu, 24 Oct 2024 11:16:13 +0200 Subject: [PATCH 08/12] Update _posts/2024-10-16-18.0.0-release.md Co-authored-by: Alenka Frim --- _posts/2024-10-16-18.0.0-release.md | 49 ++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 96f6581cacaa..237e1422adbb 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -156,7 +156,54 @@ More options were added to the Dataset ScanOptions API (GH-28866). ## Python notes - +Compatibility notes: +* NumPy required dependency has been removed from pyarrow packaging + [GH-43846](https://github.com/apache/arrow/issues/43846) and has been + made an optional runtime dependency [GH-25118](https://github.com/apache/arrow/issues/25118). 
+* Support for Python 3.8 has been dropped [GH-43518](https://github.com/apache/arrow/issues/43518). +* The unused serialize/deserialize PyArrow C++ functions have been + deprecated [GH-44063](https://github.com/apache/arrow/issues/44063). +* Passing of build flags to setup.py (e.g. `setup.py --with-parquet`) has been + deprecated [GH-43514](https://github.com/apache/arrow/issues/43514). + +New features: +* Non-CPU work has continued with [GH-43973](https://github.com/apache/arrow/issues/43973), + [GH-43728](https://github.com/apache/arrow/issues/43728), [GH-43727](https://github.com/apache/arrow/issues/43727), + [GH-43391](https://github.com/apache/arrow/issues/43391), + [GH-42222](https://github.com/apache/arrow/issues/42222) and + [GH-41665](https://github.com/apache/arrow/issues/41665). +* Arrow C++ ``arrow::dataset::Partitioning::Format`` method has been exposed in + Python [GH-43684](https://github.com/apache/arrow/issues/43684). +* UUID canonical extension type is now supported in Python + [GH-15058](https://github.com/apache/arrow/issues/15058). +* Opaque canonical extension type has been implemented + [GH-43454](https://github.com/apache/arrow/issues/43454). +* ``StructArray.from_arrays`` now accepts a type in addition to names or fields + [GH-42014](https://github.com/apache/arrow/issues/42014). +* New attributes have been added to ``StructType`` in order to access all its fields + [GH-30058](https://github.com/apache/arrow/issues/30058). + +Other improvements: +* Additional work has been done to support the free-threaded build of CPython 3.13: + [GH-44046](https://github.com/apache/arrow/issues/44046), + [GH-44355](https://github.com/apache/arrow/issues/44355) and + [GH-43964](https://github.com/apache/arrow/issues/43964). Umbrella issue + [GH-43536](https://github.com/apache/arrow/issues/43536). +* PyCapsule interface now has precedence over others in pa.schema(..) + [GH-43388](https://github.com/apache/arrow/issues/43388). 
+* Usage of deprecated ``pkg_resources`` in setup.py has been replaced with + ``numpy.get_include()`` [GH-43532](https://github.com/apache/arrow/issues/43532). +* Conversion from Arrow to JAX via dlpack was added to the documentation examples + [GH-44229](https://github.com/apache/arrow/issues/44229). + +Relevant bug fixes: +* ``pyarrow.Table.rename_columns`` did not accept a ``tuple`` of names, + only a ``list`` or ``dict``. This has been fixed + [GH-43588](https://github.com/apache/arrow/issues/43588). +* Python reference handling in UDF implementation has been sanitized + [GH-43487](https://github.com/apache/arrow/issues/43487). +* Files included when building wheels have been cleaned (unnecessary files removed) + [GH-43299](https://github.com/apache/arrow/issues/43299). ## R notes From 419e00b9d4322c80b7a484ca9ce22adad6ebe61a Mon Sep 17 00:00:00 2001 From: Antoine Pitrou Date: Thu, 24 Oct 2024 12:09:29 +0200 Subject: [PATCH 09/12] Improve description of S3 single API call upload --- _posts/2024-10-16-18.0.0-release.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 237e1422adbb..5a20864bfc7a 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -99,8 +99,9 @@ Casting between string-like and string-view-like types has been implemented (GH- ### Filesystems -Writing small files to S3 now uses a single S3 API call instead of three -(GH-40557). Files larger than 5 MB still go through the regular multipart +Writing small files to S3 can use a single S3 API call instead of three, +provided the new option `allow_delayed_open` is enabled (GH-40557). +Files larger than 5 MB still go through the regular multipart upload mechanism. 
Background writes are now implemented and enabled by default for the Azure From cb768a02939126479e16992c21e9ea8a2ac69ae1 Mon Sep 17 00:00:00 2001 From: Antoine Pitrou Date: Thu, 24 Oct 2024 12:09:42 +0200 Subject: [PATCH 10/12] Typo Co-authored-by: Bryce Mecum --- _posts/2024-10-16-18.0.0-release.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 5a20864bfc7a..fe6561936706 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -67,7 +67,7 @@ The Ruby bindings now support implementing DoPut on the server. (GH-43814) The default memory pool has changed to mimalloc on all platforms (GH-43254). Previously, jemalloc was used by default on Linux. Using mimalloc by default -provides a more consistent experience accross different platforms, and +provides a more consistent experience across different platforms, and makes configuration easier. It is expected that this might either increase or decrease performance on user workloads that use the default memory pool; please benchmark accordingly. Jemalloc can still be selected by setting From 86957b9ad4f41b63cfcd4692aa211d1197448d78 Mon Sep 17 00:00:00 2001 From: Antoine Pitrou Date: Thu, 24 Oct 2024 12:10:06 +0200 Subject: [PATCH 11/12] Add link to env var Co-authored-by: Bryce Mecum --- _posts/2024-10-16-18.0.0-release.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index fe6561936706..7262673d8dd2 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -71,7 +71,7 @@ provides a more consistent experience across different platforms, and makes configuration easier. It is expected that this might either increase or decrease performance on user workloads that use the default memory pool; please benchmark accordingly. 
Jemalloc can still be selected by setting -the `ARROW_DEFAULT_MEMORY_POOL` environment variable to "jemalloc". +the [`ARROW_DEFAULT_MEMORY_POOL`](https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL) environment variable to "jemalloc". A new class `arrow::ArrayStatistics` has been added to encode basic statistics about an Arrow array. It provides a source-agnostic representation for statistics From 525cb797fbe9145b8018a281d4a8bf38b2523a46 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ra=C3=BAl=20Cumplido?= Date: Thu, 24 Oct 2024 14:56:50 +0200 Subject: [PATCH 12/12] Update _posts/2024-10-16-18.0.0-release.md Co-authored-by: Bryce Mecum --- _posts/2024-10-16-18.0.0-release.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-10-16-18.0.0-release.md b/_posts/2024-10-16-18.0.0-release.md index 7262673d8dd2..2e32ec7afb32 100644 --- a/_posts/2024-10-16-18.0.0-release.md +++ b/_posts/2024-10-16-18.0.0-release.md @@ -154,7 +154,7 @@ Several APIs have been updated to accept `long` for addresses in preparation for More options were added to the Dataset ScanOptions API (GH-28866). ## JavaScript notes - +- Accessing individual rows in Tables or Structs should now be more performant ([GH-30863](https://github.com/apache/arrow/issues/30863)). ## Python notes Compatibility notes: