Website: Add blog post for 18.0.0 #547

Open
wants to merge 12 commits into
base: main
from
243 changes: 243 additions & 0 deletions _posts/2024-10-16-18.0.0-release.md
@@ -0,0 +1,243 @@
---
layout: post
title: "Apache Arrow 18.0.0 Release"
date: "2024-10-16 00:00:00"
author: pmc
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->


The Apache Arrow team is pleased to announce the 18.0.0 release. This covers
over 3 months of development work and includes [**XXX resolved issues**][1]
on [**YYY distinct commits**][2] from [**ZZZ distinct contributors**][2].
See the [Install Page](https://arrow.apache.org/install/)
to learn how to get the libraries for your platform.

The release notes below are not exhaustive and cover only selected highlights
of the release. Many other bug fixes and improvements have been made; for those,
see the [complete changelog][3].

## Community

Since the 17.0.0 release, JJJJJ has been invited to become a committer.
Member

Should this be updated?

Member Author

Yes, I will update the number of commits and the new committers once we approve RC0.

No new members have joined the Project Management Committee (PMC).

Thanks for your contributions and participation in the project!

## Columnar format

The Arrow columnar format now allows 32-bit and 64-bit decimal data, in
addition to the already existing 128-bit and 256-bit decimal data types
(GH-43956).
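
As a rough illustration of what the narrower widths can hold, the sketch below computes the largest decimal precision that fits in each bit width. It assumes the unscaled value is stored as a signed two's-complement integer; the helper name `max_decimal_precision` is made up for this example and is not part of any Arrow API.

```python
def max_decimal_precision(bit_width):
    """Return the largest digit count d such that every d-digit integer
    fits in a signed two's-complement integer of `bit_width` bits."""
    largest = 2 ** (bit_width - 1) - 1  # largest representable signed value
    digits = 0
    # Grow the digit count while the largest (digits + 1)-digit number still fits.
    while 10 ** (digits + 1) - 1 <= largest:
        digits += 1
    return digits


for width in (32, 64, 128, 256):
    print(f"{width}-bit decimal: up to {max_decimal_precision(width)} digits")
```

Under that assumption, a 32-bit decimal holds up to 9 digits and a 64-bit decimal up to 18, compared with 38 and 76 digits for the 128-bit and 256-bit types.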

## Linux packages notes

The Azure file system is now enabled.
Comment on lines +51 to +53
Member

Does this section really deserve being at the top?

Member

Ah, we can move this because this is not so important in 18.0.0.
This section was important in 17.0.0 because we dropped support for one platform: https://arrow.apache.org/blog/2024/07/16/17.0.0-release/


## C Data Interface notes


## Arrow Flight RPC notes

**Flight UCX is deprecated.** We plan to remove this experiment in the next couple of releases.
Comment on lines +58 to +60
Member

Flight UCX is C++ specific, so this shouldn't be at the top of the release notes. The "Arrow Flight RPC notes" should be for Flight specification changes IMHO.


The Java implementation now transparently handles compressed Arrow data when reading, instead of requiring explicit configuration. (GH-43469)

The Ruby bindings now support implementing DoPut on the server. (GH-43814)

## C++ notes

The default memory pool has changed to mimalloc on all platforms (GH-43254).
Previously, jemalloc was used by default on Linux. Using mimalloc by default
provides a more consistent experience across different platforms, and
makes configuration easier. It is expected that this might either increase
or decrease performance on user workloads that use the default memory pool;
please benchmark accordingly. Jemalloc can still be selected by setting
the [`ARROW_DEFAULT_MEMORY_POOL`](https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL) environment variable to "jemalloc".

A new class `arrow::ArrayStatistics` has been added to encode basic statistics
about an Arrow array. It provides a source-agnostic representation for statistics
provided by third-party sources such as Parquet files (GH-41909).

The new Decimal32 and Decimal64 types have been made available (GH-43956).

Several canonical extension types have been implemented:
- the [Opaque](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#opaque) extension type (GH-43454);
- the [8-bit boolean](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#bit-boolean) extension type (GH-17682);
- the [UUID](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#uuid) extension type (GH-15058);
- the [JSON](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#json) extension type (GH-32538).

### Acero

- Enhanced the row-oriented representation by widening the offset type from 32-bit to 64-bit, resolving crashes and data corruption in aggregation and hash join on large datasets due to offset overflow (GH-43495).
- Improved ordered aggregation performance by reducing complexity from `O(n*m)` to `O(n)`, where `n` is the number of rows and `m` the number of segments in the batch (GH-44052).
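
To see why narrow offsets are a problem on large inputs, here is a minimal sketch of running byte offsets that wrap at a fixed width. It is not Acero's code: the function `row_offsets` and the row sizes are hypothetical, and unsigned wraparound is assumed purely for illustration.

```python
def row_offsets(row_sizes, bits):
    """Running byte offsets for variable-length rows, truncated to `bits`
    bits the way a fixed-width unsigned integer would wrap around."""
    mask = (1 << bits) - 1
    offsets = [0]
    total = 0
    for size in row_sizes:
        total += size
        offsets.append(total & mask)  # keep only the low `bits` bits
    return offsets


# Hypothetical chunks of 1.5 GiB each, standing in for many accumulated rows.
sizes = [3 * 2**29] * 3

narrow = row_offsets(sizes, 32)
wide = row_offsets(sizes, 64)
print(narrow)  # last offset wraps and is smaller than the one before it
print(wide)    # offsets keep increasing
```

Once the accumulated size passes 4 GiB, the 32-bit offsets are no longer monotonic, so a reader following them would land on the wrong bytes. The 64-bit offsets stay correct.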

### Compute

Casting between string-like and string-view-like types has been implemented (GH-42247).

### Dataset


### Filesystems

Writing small files to S3 can use a single S3 API call instead of three,
provided the new option `allow_delayed_open` is enabled (GH-40557).
Files larger than 5 MB still go through the regular multipart
upload mechanism.
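
The difference in request count can be sketched as follows. This is only an illustration of the behaviour described above, not Arrow's implementation: `planned_s3_calls` is a hypothetical helper, "5 MB" is read as 5 MiB, and the handling of a file of exactly that size is an assumption.

```python
# Assumption: the "5 MB" limit is interpreted as 5 MiB.
MULTIPART_THRESHOLD = 5 * 1024 * 1024


def planned_s3_calls(size_bytes, allow_delayed_open):
    """Return the S3 operations a write of `size_bytes` would need
    (hypothetical helper, for illustration only)."""
    if allow_delayed_open and size_bytes <= MULTIPART_THRESHOLD:
        # Small file with delayed open: one request carries the whole object.
        return ["PutObject"]
    # Regular path: start the upload, send one or more parts, then finish it.
    return ["CreateMultipartUpload", "UploadPart", "CompleteMultipartUpload"]


print(planned_s3_calls(1024, allow_delayed_open=True))
print(planned_s3_calls(1024, allow_delayed_open=False))
print(planned_s3_calls(64 * 1024 * 1024, allow_delayed_open=True))
```

For a workload that writes many small objects, the single-request path cuts the number of round trips to S3 by roughly two thirds.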

Background writes are now implemented and enabled by default for the Azure
filesystem, dramatically improving the performance of writing to remote files
(GH-40036).

Finalization of the S3 filesystem layer should now be more robust (GH-44071).

### Gandiva

LLVM 19.1 is now supported (GH-44222).

### GPU


### IPC

The seed corpus used for fuzzing the IPC reader has been improved, hopefully
helping make the IPC reader even more robust against corrupt or malicious
IPC streams (GH-38041).

### Parquet

A new command line utility `parquet-dump-footer` allows dumping the Thrift-encoded
footer metadata of a Parquet file, optionally scrubbing confidential data
(GH-42102). This is part of the effort to collect real-world Parquet metadata
so as to evaluate the efficiency of future improvements to the Parquet format.
Please see https://github.com/apache/parquet-benchmark for instructions to submit
footers representative of your own workloads.

### Substrait


## C# notes

- Partial support has been added for LargeBinary, LargeString and LargeList. The column sizes cannot exceed 2 GB in length (GH-43266).
- Changes to Flight support were made for better control and compatibility, and to allow Flight Server to be hosted in pre-Kestrel versions of .NET (GH-43907, GH-43672, GH-41347).
- Support has been added for the newly defined types decimal32 and decimal64 (GH-44271).
- The import of sliced arrays through the C Data interface now works correctly (GH-43267).

## Java notes

**Java 8 is no longer supported.** (GH-38051)

**Gandiva may not work in this release.** For details, please see [GH-43576](https://github.com/apache/arrow/issues/43576).

Basic support for RunEndEncoded was added (GH-39982). The ListView/StringView vector implementations are now more complete, including C Data support (multiple issues).

Several APIs have been updated to accept `long` for addresses in preparation for FFM/large buffer support (GH-43902). We no longer expose `sun.misc.Unsafe` (GH-43479). We no longer ship the `shaded` flight-core JARs (GH-43217).

More options were added to the Dataset ScanOptions API (GH-28866).

## JavaScript notes
- Accessing individual rows in Tables or Structs should now be more performant ([GH-30863](https://github.com/apache/arrow/issues/30863)).

## Python notes
Compatibility notes:
* The required NumPy dependency has been removed from pyarrow packaging
[GH-43846](https://github.com/apache/arrow/issues/43846), and NumPy has been
made an optional runtime dependency [GH-25118](https://github.com/apache/arrow/issues/25118).
* Support for Python 3.8 has been dropped [GH-43518](https://github.com/apache/arrow/issues/43518).
* The PyArrow C++ serialize/deserialize functions, which are no longer used, have been
deprecated [GH-44063](https://github.com/apache/arrow/issues/44063).
* Passing build flags to setup.py (e.g. `setup.py --with-parquet`) has been
deprecated [GH-43514](https://github.com/apache/arrow/issues/43514).

New features:
* Work on non-CPU support has continued with [GH-43973](https://github.com/apache/arrow/issues/43973),
[GH-43728](https://github.com/apache/arrow/issues/43728), [GH-43727](https://github.com/apache/arrow/issues/43727),
[GH-43391](https://github.com/apache/arrow/issues/43391),
[GH-42222](https://github.com/apache/arrow/issues/42222) and
[GH-41665](https://github.com/apache/arrow/issues/41665).
Comment on lines +171 to +175
Member

I think just enumerating those issue links might not be that informative without more context (for someone reading this, you would have to click on all those links to have an idea what actually changed)

* The Arrow C++ ``arrow::dataset::Partitioning::Format`` method has been exposed in
Python [GH-43684](https://github.com/apache/arrow/issues/43684).
* UUID canonical extension type is now supported in Python
[GH-15058](https://github.com/apache/arrow/issues/15058).
* Opaque canonical extension type has been implemented
[GH-43454](https://github.com/apache/arrow/issues/43454).
* ``StructArray.from_arrays`` now accepts a type in addition to names or fields
[GH-42014](https://github.com/apache/arrow/issues/42014).
* New attributes have been added to ``StructType`` in order to access all its fields
[GH-30058](https://github.com/apache/arrow/issues/30058).

Other improvements:
* Additional work has been done to support the free-threaded build of CPython 3.13:
[GH-44046](https://github.com/apache/arrow/issues/44046),
[GH-44355](https://github.com/apache/arrow/issues/44355) and
[GH-43964](https://github.com/apache/arrow/issues/43964). The umbrella issue is
[GH-43536](https://github.com/apache/arrow/issues/43536).
* The PyCapsule interface now takes precedence over others in ``pa.schema(..)``
[GH-43388](https://github.com/apache/arrow/issues/43388).
* Usage of deprecated ``pkg_resources`` in setup.py has been replaced with
``numpy.get_include()`` [GH-43532](https://github.com/apache/arrow/issues/43532).
* Conversion from Arrow to JAX via DLPack has been added to the documentation examples
[GH-44229](https://github.com/apache/arrow/issues/44229).

Relevant bug fixes:
* ``pyarrow.Table.rename_columns`` now accepts a ``tuple`` of names,
not only a ``list`` or ``dict``
[GH-43588](https://github.com/apache/arrow/issues/43588).
* Python reference handling in UDF implementation has been sanitized
[GH-43487](https://github.com/apache/arrow/issues/43487).
* Files included when building wheels have been cleaned (unnecessary files removed)
[GH-43299](https://github.com/apache/arrow/issues/43299).

## R notes

For more on what’s in the 18.0.0 R package, see the [R changelog][4].
Member Author

@jonkeane @paleolimbot @assignUser can you help with the R notes?

Comment on lines +209 to +211
Member

Suggested change
## R notes
For more on what’s in the 18.0.0 R package, see the [R changelog][4].
## R notes
* User-written R functions that call functions Arrow supports in dataset
queries can now be used in queries too. Previously, only functions that used
arithmetic operators worked.
For example, `time_hours <- function(mins) mins / 60` worked,
but `time_hours_rounded <- function(mins) round(mins / 60)` did not;
now both work. These are automatic translations rather than true user-defined
functions (UDFs); for UDFs, see `register_scalar_function()`. [GH-41223](https://github.com/apache/arrow/issues/41223)
* `mutate()` expressions can now include aggregations, such as `x - mean(x)`. [GH-41350](https://github.com/apache/arrow/issues/41350)
* `summarize()` supports more complex expressions, and correctly handles cases
where column names are reused in expressions. [GH-41223](https://github.com/apache/arrow/issues/41223)
* The `na_matches` argument to the `dplyr::*_join()` functions is now supported.
This argument controls whether `NA` values are considered equal when joining. [GH-41358](https://github.com/apache/arrow/issues/41358)
* R metadata, stored in the Arrow schema to support round-tripping data between
R and Arrow/Parquet, is now serialized and deserialized more strictly.
This makes it safer to load data from files from unknown sources into R data.frames. [GH-41969](https://github.com/apache/arrow/issues/41969)
* The S3 and ZSTD features are now turned on by default for macOS. [GH-42210](https://github.com/apache/arrow/issues/42210)
* Fixed a bug in the implementation of `pull` on grouped datasets; it now
returns the expected column. [GH-43172](https://github.com/apache/arrow/issues/43172)
For full details of what’s in the 18.0.0 R package, see the [R changelog][4].

Member

@jonkeane these are the 17.0.0 changes though right?

Nic and I are working on a NEWS.md patch in apache/arrow#44496 and we can copy that in here in a bit.


## Ruby and C GLib notes

### Ruby

* Add workaround for install failure due to `re2.pc` on Ubuntu 20.04: [GH-41396](https://github.com/apache/arrow/issues/41396)
* Add support for `0` decimal value: [GH-43877](https://github.com/apache/arrow/issues/43877)

C GLib related improvements are also available in Ruby.

### C GLib

* Add support for Azure file system: [GH-43738](https://github.com/apache/arrow/issues/43738)
* FlightRPC: Add support for DoPut: [GH-41056](https://github.com/apache/arrow/issues/41056)
* FlightRPC: Add support for timeout: [GH-44178](https://github.com/apache/arrow/issues/44178)
* Parquet: Add support for writing a record batch: [GH-40860](https://github.com/apache/arrow/issues/40860)
* Add support for pull style IPC stream format decoder: [GH-40493](https://github.com/apache/arrow/issues/40493)

## Rust notes and Go notes

The Rust and Go projects have moved to separate repositories outside the
main Arrow monorepo. For notes on the latest release of the Rust
implementation, see the latest [Arrow Rust changelog][5].
For notes on the latest release of the Go implementation, see the latest
[Arrow Go changelog][6].

[1]: https://github.com/apache/arrow/milestone/64?closed=1
[2]: {{ site.baseurl }}/release/18.0.0.html#contributors
[3]: {{ site.baseurl }}/release/18.0.0.html#changelog
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/apache/arrow-rs/tags
[6]: https://github.com/apache/arrow-go/tags