
Buffer.concat silently produces invalid output when its output size is greater than 4GiB #55422

Open
rotemdan opened this issue Oct 17, 2024 · 14 comments
Labels
buffer, confirmed-bug, help wanted, regression

Comments

@rotemdan

Version

v22.9.0, v23.0.0

Platform

Windows 11 x64

Microsoft Windows NT 10.0.22631.0 x64

Subsystem

Buffer

What steps will reproduce the bug?

const largeBuffer = Buffer.alloc(2 ** 32 + 5) // 4 GiB + 5 bytes
largeBuffer.fill(111)

const result = Buffer.concat([largeBuffer])
console.log(result)

How often does it reproduce? Is there a required condition?

Consistent in v22.9.0 and v23.0.0

What is the expected behavior? Why is that the expected behavior?

All bytes of the buffer returned by Buffer.concat([largeBuffer]) should be identical to the source:

In this example:

111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, ....

What do you see instead?

In the returned buffer, the first 5 bytes are 111 and all following bytes are 0.

111, 111, 111, 111, 111, 0, 0, 0, 0, 0, 0, ....

The console.log(result) output looks like:

<Buffer 6f 6f 6f 6f 6f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ... 4294967251
 more bytes>

Additional information

No response

@targos added the buffer label Oct 17, 2024
@rotemdan changed the title from "Buffer.concat silently produces invalid output when its output size is greater than 4GB" to "Buffer.concat silently produces invalid output when its output size is greater than 4GiB" Oct 17, 2024
@RedYetiDev added the confirmed-bug label Oct 17, 2024
@rotemdan
Author

My current workaround (tested to produce correct results with sizes greater than 4 GiB):

export function concatBuffers(buffers: Buffer[]) {
	// Compute the total size first, so the result can be allocated in one step
	let totalLength = 0

	for (const buffer of buffers) {
		totalLength += buffer.length
	}

	const resultBuffer = Buffer.alloc(totalLength)

	if (totalLength === 0) {
		return resultBuffer
	}

	let writeOffset = 0

	// TypedArray.prototype.set handles offsets beyond the uint32 range correctly
	for (const buffer of buffers) {
		resultBuffer.set(buffer, writeOffset)

		writeOffset += buffer.length
	}

	return resultBuffer
}
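
For reference, a usage sketch (hypothetical sizes; running it needs roughly 8 GiB of free memory):

const a = Buffer.alloc(2 ** 32).fill(111)
const b = Buffer.alloc(5).fill(111)

const joined = concatBuffers([a, b])
console.log(joined.length)        // 4294967301
console.log(joined[2 ** 32 + 1])  // 111 (the buggy Buffer.concat would print 0 here)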

@RedYetiDev
Member

RedYetiDev commented Oct 17, 2024

The issue started in v22.7.0. I'll start bisecting. Maybe #54087?

@RedYetiDev added the regression label Oct 17, 2024
@RedYetiDev
Member

RedYetiDev commented Oct 17, 2024

I've finished bisecting. This was indeed caused by #54087 cc @ronag.

9f8f26eb2ff36f9352dd85643073af876b9d6b46 is the first bad commit
commit 9f8f26eb2ff36f9352dd85643073af876b9d6b46 (HEAD)
Author: Robert Nagy <ronagy@icloud.com>
Date:   Fri Aug 2 11:19:41 2024 +0200

    buffer: use native copy impl
    
    PR-URL: https://github.com/nodejs/node/pull/54087
    Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
    Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
    Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
    Reviewed-By: Daniel Lemire <daniel@lemire.me>

 benchmark/buffers/buffer-copy.js |  6 ------
 lib/buffer.js                    | 11 ++++++-----
 src/node_buffer.cc               | 56 +++++++++++++++++++++++++++-----------------------------
 src/node_external_reference.h    |  9 +++++++++
 4 files changed, 42 insertions(+), 40 deletions(-)

@ronag
Member

ronag commented Oct 21, 2024

Anyone care to open a PR? I think this could be as simple as switching to .set(srcBuffer) (instead of the native method) when the total length exceeds, e.g., 2 GB.
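
A rough sketch of that idea, assuming a JS-side fallback to TypedArray.prototype.set for large totals (the function name and the 2 ** 31 threshold are illustrative, not the actual lib/buffer.js code):

// Illustrative sketch only, not the actual lib/buffer.js implementation
function concatWithFallback(buffers: Buffer[]): Buffer {
	let totalLength = 0
	for (const buffer of buffers) totalLength += buffer.length

	const result = Buffer.allocUnsafe(totalLength)
	const useSet = totalLength >= 2 ** 31 // hypothetical cutoff, safely below the uint32 limit

	let offset = 0
	for (const buffer of buffers) {
		if (useSet) {
			result.set(buffer, offset) // TypedArray path, safe for lengths beyond 4 GiB
		} else {
			buffer.copy(result, offset) // existing native fast path
		}
		offset += buffer.length
	}

	return result
}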

@ronag added the help wanted label Oct 21, 2024
@duncpro

duncpro commented Oct 21, 2024

I reproduced this on macOS.

@ronag I'd like to try and tackle this one.

@MrJithil
Member

> I reproduced this on macOS.
>
> @ronag I'd like to try and tackle this one.

Good luck.

@rotemdan
Author

This call to _copy is possibly the reason:

function _copyActual(source, target, targetStart, sourceStart, sourceEnd) {
  if (sourceEnd - sourceStart > target.byteLength - targetStart)
    sourceEnd = sourceStart + target.byteLength - targetStart;

  let nb = sourceEnd - sourceStart;
  const sourceLen = source.byteLength - sourceStart;
  if (nb > sourceLen)
    nb = sourceLen;

  if (nb <= 0)
    return 0;

  _copy(source, target, targetStart, sourceStart, nb); // <--

  return nb;
}

_copy is imported from the 'buffer' internal binding:

const {
  byteLengthUtf8,
  compare: _compare,
  compareOffset,
  copy: _copy, // <--
  fill: bindingFill,
  isAscii: bindingIsAscii,
  isUtf8: bindingIsUtf8,
  indexOfBuffer,
  indexOfNumber,
  indexOfString,
  swap16: _swap16,
  swap32: _swap32,
  swap64: _swap64,
  kMaxLength,
  kStringMaxLength,
  atob: _atob,
  btoa: _btoa,
} = internalBinding('buffer');

A thorough solution would be to ensure this method correctly handles large sizes, or fails loudly.

Merely working around it by falling back to TypedArray.set would leave open the possibility of a future issue if some other code calls _copy.
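
For illustration, the "or fails" option could be a range check before the binding is called (a hypothetical guard, not actual Node.js code):

// Hypothetical guard, not actual Node.js code: reject ranges the
// uint32-based binding cannot represent, instead of silently wrapping
const kMaxUint32 = 2 ** 32 - 1

function checkedCopy(source: Buffer, target: Buffer,
                     targetStart: number, sourceStart: number, nb: number): number {
	if (targetStart > kMaxUint32 || sourceStart > kMaxUint32 || nb > kMaxUint32) {
		throw new RangeError('copy range does not fit in uint32')
	}

	// Stand-in for the native _copy call, which isn't accessible from user code
	target.set(source.subarray(sourceStart, sourceStart + nb), targetStart)
	return nb
}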

@duncpro
Copy link

duncpro commented Oct 21, 2024

So the root cause of this problem is 32-bit integer overflow in SlowCopy in node_buffer.cc:

const auto target_start = args[2]->Uint32Value(env->context()).ToChecked();
const auto source_start = args[3]->Uint32Value(env->context()).ToChecked();
const auto to_copy = args[4]->Uint32Value(env->context()).ToChecked();

Apparently Uint32Value performs a wrapping (modulo 2^32) conversion, which is why in the example below the target buffer only gets its first 5 bytes filled.

const largeBuffer = Buffer.alloc(2 ** 32 + 5)
largeBuffer.fill(111)

const result = Buffer.concat([largeBuffer])
console.log(result); // 6f 6f 6f 6f 6f 00 00 00 ...
                     // 1  2  3  4  5

Simply replacing Uint32Value with IntegerValue will fix this, barring edge cases I've yet to fully consider.
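
The wrap is ToUint32 (modulo 2^32) semantics and can be reproduced in plain JavaScript:

// 2 ** 32 + 5 reduced modulo 2 ** 32 leaves only the 5 extra bytes
const length = 2 ** 32 + 5       // 4294967301

console.log(length >>> 0)        // 5 (>>> applies the same ToUint32 coercion)
console.log(length % 2 ** 32)    // 5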

@rotemdan
Author

rotemdan commented Oct 21, 2024

I'm not sure what exactly the binding refers to, but I found a candidate method in the C++ code (at node/src/node_buffer.cc) that treats all arguments as Uint32:

// Assume caller has properly validated args.
void SlowCopy(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);

  ArrayBufferViewContents<char> source(args[0]);
  SPREAD_BUFFER_ARG(args[1].As<Object>(), target);

  const auto target_start = args[2]->Uint32Value(env->context()).ToChecked();
  const auto source_start = args[3]->Uint32Value(env->context()).ToChecked();
  const auto to_copy = args[4]->Uint32Value(env->context()).ToChecked();

  memmove(target_data + target_start, source.data() + source_start, to_copy);
  args.GetReturnValue().Set(to_copy);
}

Regardless of whether it's the method used in the binding, using Uint32Value to extract the arguments doesn't seem right.

The method that follows also takes uint32_t arguments:

uint32_t FastCopy(Local<Value> receiver,
                  const v8::FastApiTypedArray<uint8_t>& source,
                  const v8::FastApiTypedArray<uint8_t>& target,
                  uint32_t target_start,
                  uint32_t source_start,
                  uint32_t to_copy) {
  uint8_t* source_data;
  CHECK(source.getStorageIfAligned(&source_data));

  uint8_t* target_data;
  CHECK(target.getStorageIfAligned(&target_data));

  memmove(target_data + target_start, source_data + source_start, to_copy);

  return to_copy;
}

@duncpro

duncpro commented Oct 21, 2024

@rotemdan this is correct

@rotemdan
Author

rotemdan commented Oct 21, 2024

If you search for the string "uint32" in node/src/node_buffer.cc, you'll see that many other methods assume indices fit in uint32 (4 GiB max). Examples I've found:

  • CopyArrayBuffer
  • Fill
  • StringWrite
  • FastByteLengthUtf8
  • SlowIndexOfNumber (assumes the needle is uint32, not the index)
  • FastIndexOfNumber (assumes the needle is uint32, not the index)
  • WriteOneByteString
  • FastWriteString
  • ...

@ronag
Member

ronag commented Oct 21, 2024

I think the fast methods won't get called with anything that doesn't fit into uint32.

@ronag
Member

ronag commented Oct 21, 2024

It's the slow methods that need fixing, I guess. Should we even support 4 GiB+ Buffers? @jasnell

@rotemdan
Author

rotemdan commented Oct 21, 2024

Node.js has already supported large typed arrays (new Uint8Array of 4 GiB or more) and buffers (Buffer.alloc of 4 GiB or more) since version 22 (or earlier? not sure), which I think is great because it opened up many use cases that were limited before (in my case, processing multi-hour audio and loading large machine-learning models).

Fixing the methods in node/src/node_buffer.cc, by itself, isn't really that hard. It's more about ensuring that the code works correctly on the various 32-bit and 64-bit platforms and processor architectures that Node.js currently supports.

As an intermediate solution, you could allow large ArrayBuffers but disallow large Buffer objects, though eventually you'd want Buffer to match the capabilities of ArrayBuffer (unless Buffer is entirely deprecated at some point, or something like that).
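
A minimal sketch of that intermediate guard, with a hypothetical allocation-time check (the constant and function names are illustrative):

// Hypothetical allocation guard, purely illustrative: cap Buffer at the
// uint32 range the native bindings assume, while large ArrayBuffers stay allowed
const kMaxSafeBufferLength = 2 ** 32 - 1

function allocGuarded(size: number): Buffer {
	if (size > kMaxSafeBufferLength) {
		throw new RangeError(
			`a ${size}-byte Buffer exceeds the uint32 range assumed by the native bindings; ` +
			`use an ArrayBuffer or Uint8Array instead`)
	}

	return Buffer.alloc(size)
}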
