Core: Fix ByteBufferInputStream.read() to return -1 at EOF#16167

Open
sachinnn99 wants to merge 2 commits into apache:main from sachinnn99:fix/16127-bytebufferinputstream-read-eof

Conversation

@sachinnn99
Contributor

Fixes #16127.

SingleBufferInputStream.read() and MultiBufferInputStream.read() throw EOFException when the stream is exhausted. This violates the java.io.InputStream contract, which requires the no-arg read() to return -1 at EOF.

The multi-byte read(byte[], int, int) in both classes already correctly returns -1 at EOF — the two overloads were inconsistent.

Changes:

  • SingleBufferInputStream.read(): return -1 instead of throwing EOFException
  • MultiBufferInputStream.read(): return -1 at both EOF entry points instead of throwing EOFException
  • Update testReadByte() to assert -1 return (including idempotency check)

Other EOFException-throwing methods (slice(), sliceBuffers(), skipFully()) are unchanged — they request specific byte counts where EOFException is the correct signal.
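To make the contract concrete, here is a minimal sketch of a single-buffer stream in which both read overloads signal EOF with -1. `SingleBufferSketch` is a hypothetical, simplified stand-in written for illustration; it is not Iceberg's SingleBufferInputStream and omits seeking, slicing, and position tracking.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical stand-in for illustration only; not the Iceberg class.
final class SingleBufferSketch extends InputStream {
  private final ByteBuffer buffer;

  SingleBufferSketch(ByteBuffer buffer) {
    // duplicate() so reading does not move the caller's position
    this.buffer = buffer.duplicate();
  }

  @Override
  public int read() {
    if (!buffer.hasRemaining()) {
      return -1; // EOF is signalled by -1, not by an exception
    }
    return buffer.get() & 0xFF; // mask so the result is in 0..255
  }

  @Override
  public int read(byte[] bytes, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (!buffer.hasRemaining()) {
      return -1; // same EOF signal as the no-arg overload
    }
    int toRead = Math.min(len, buffer.remaining());
    buffer.get(bytes, off, toRead);
    return toRead;
  }

  public static void main(String[] args) {
    SingleBufferSketch in =
        new SingleBufferSketch(ByteBuffer.wrap(new byte[] {(byte) 0xFF, 1}));
    System.out.println(in.read()); // 255
    System.out.println(in.read()); // 1
    System.out.println(in.read()); // -1
    System.out.println(in.read()); // -1 again: EOF is idempotent
  }
}
```

The `& 0xFF` mask matters here: without it, a stored 0xFF byte would come back as -1 and be indistinguishable from EOF.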

@github-actions github-actions Bot added the core label Apr 30, 2026
Contributor

@steveloughran steveloughran left a comment

looks good for read(); looking at read(buffer[], offset, len), it seems ok too.

slice() probably needs attention at the same time

@sachinnn99
Contributor Author

Thanks for the review! I looked at slice() — its EOFException is intentional since it's a "read exactly N bytes" operation (not part of the InputStream contract). Callers expect either the full slice or a failure, so throwing there is the correct behavior.

The fix here is scoped to the no-arg read(), which has a clear contract violation per InputStream.read() javadoc.

Contributor

@Baunsgaard Baunsgaard left a comment

src LGTM.

But I would suggest a few additional test cases, just to ensure future coverage:

  1. invariants after the first -1 is encountered:
  2. an explicit empty case:

we can simply add a utility function to call with any stream to verify:

private static void assertAtEOF(ByteBufferInputStream stream) {
  long pos = stream.getPos();
  assertThat(stream.read()).as("read() at EOF").isEqualTo(-1);
  assertThat(stream.read()).as("read() should keep returning -1 at EOF").isEqualTo(-1);
  assertThat(stream.getPos()).as("Position should not advance past EOF").isEqualTo(pos);
  assertThat(stream.available()).as("available() should be 0 at EOF").isEqualTo(0);
}

For full coverage of all paths, you can add a test that calls assertAtEOF on streams that are empty from the start.

@sachinnn99 sachinnn99 force-pushed the fix/16127-bytebufferinputstream-read-eof branch from 249ad40 to 9ea1f31 Compare May 8, 2026 05:38
@sachinnn99
Contributor Author

Thanks for the suggestions @Baunsgaard! Added in the latest push:

  • assertAtEOF helper matching your snippet
  • Called it after EOF is first encountered in testReadAll, testSmallReads, testPartialBufferReads, and testReadByte
  • Added testEmptyStream covering single empty buffer, multiple empty buffers, and an empty buffer list

Contributor

@Baunsgaard Baunsgaard left a comment

LGTM

@laskoviymishka laskoviymishka left a comment

The contract fix is right: InputStream.read() is supposed to return -1 at EOF, and the old EOFException broke the standard while ((b = in.read()) != -1) pattern. The assertAtEOF helper, with idempotency and position-stability checks, is exactly the right test shape.
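A small sketch of why the throw breaks that idiom, using only the JDK. `ReadLoopSketch` and its `throwingAtEof` stream are hypothetical names that mimic the pre-fix behavior; they are not taken from the Iceberg code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReadLoopSketch {
  // Standard drain loop: relies on read() returning -1 at EOF.
  static int countBytes(InputStream in) throws IOException {
    int count = 0;
    int b;
    while ((b = in.read()) != -1) {
      count++;
    }
    return count;
  }

  // Hypothetical stream that mimics the pre-fix behavior: it throws at EOF.
  static InputStream throwingAtEof(byte[] data) {
    return new InputStream() {
      private int pos = 0;

      @Override
      public int read() throws IOException {
        if (pos >= data.length) {
          throw new EOFException();
        }
        return data[pos++] & 0xFF;
      }
    };
  }

  // True if the drain loop ends normally, false if EOFException escapes it.
  static boolean drainsCleanly(InputStream in) {
    try {
      countBytes(in);
      return true;
    } catch (EOFException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] data = {1, 2, 3};
    System.out.println(drainsCleanly(new ByteArrayInputStream(data))); // true
    System.out.println(drainsCleanly(throwingAtEof(data))); // false
  }
}
```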

One concern before merge: this changes behavior at four existing single-byte read() call sites that don’t guard for -1. The multi-byte overload was already contract-correct, so those callers are fine, but these single-byte paths currently rely on the throw to fail loudly. After this PR, they can silently corrupt data or, in one case, hang.

A brief grep search gave me at least four places:

  1. ValuesAsBytesReader.getByte(): (byte) read() at EOF becomes 0xFF, which can decode as 8 bogus true bits.
  2. BaseVectorizedParquetValuesReader.readUnsignedVarInt() — at EOF, -1 & 0x80 keeps the continuation bit set forever → infinite loop.
  3. readIntLittleEndian() — EOF returns a valid-looking but garbage int.
  4. readIntLittleEndianPaddedOnBitWidth() — same.
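The three failure shapes can be reproduced with a plain JDK stream that returns -1 at EOF. The method bodies below are assumptions about the general shape of such readers, written for illustration; they are not copied from ValuesAsBytesReader or BaseVectorizedParquetValuesReader, and the varint loop carries an iteration cap that the real code would not have, so the demo terminates.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class UnguardedReadSketch {
  // Casting an unchecked read() result: -1 at EOF becomes the byte 0xFF.
  static byte unguardedByte(InputStream in) throws IOException {
    return (byte) in.read();
  }

  // Unguarded little-endian int (assumed shape): four -1 results are
  // combined into an ordinary-looking int instead of failing.
  static int unguardedIntLittleEndian(InputStream in) throws IOException {
    int ch1 = in.read();
    int ch2 = in.read();
    int ch3 = in.read();
    int ch4 = in.read();
    return (ch4 << 24) + (ch3 << 16) + (ch2 << 8) + ch1;
  }

  // Varint-style loop (assumed shape) with a cap added for this demo.
  // Returns true if the continuation bit was still set when the cap was hit.
  static boolean varIntLoopHitsCap(InputStream in, int maxIterations) throws IOException {
    int iterations = 0;
    int b;
    while (((b = in.read()) & 0x80) != 0) {
      if (++iterations >= maxIterations) {
        return true; // without the cap this loop would never exit at EOF
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    InputStream empty = new ByteArrayInputStream(new byte[0]);
    System.out.println(unguardedByte(empty) == (byte) 0xFF); // true
    System.out.println(varIntLoopHitsCap(empty, 1000)); // true
    System.out.println(varIntLoopHitsCap(new ByteArrayInputStream(new byte[] {1}), 1000)); // false
  }
}
```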

In practice, these shouldn’t hit EOF on well-formed Parquet data because decoders know byte counts from page headers. So this is more defense-in-depth than a happy-path regression. But the trade-off is real: a loud failure becomes silent or hanging if corrupted input or an upstream bug reads past EOF.

My preference would be to fix the callers: add a small if (b < 0) throw new EOFException() guard at those four sites. That keeps the contract fix clean, while preserving the old loud-fail behavior where the code expects a byte.
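A sketch of what such a guard could look like. `readOrThrow` is a hypothetical helper name, and the varint decoder is a generic unsigned LEB128-style loop assumed for illustration rather than the actual reader code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class GuardedReadSketch {
  // Hypothetical helper: return one byte (0..255) or fail loudly at EOF.
  static int readOrThrow(InputStream in) throws IOException {
    int b = in.read();
    if (b < 0) {
      throw new EOFException("Unexpected end of stream");
    }
    return b;
  }

  // Generic unsigned varint decode that uses the guard, so a truncated
  // input throws instead of looping on the -1 sentinel.
  static int readUnsignedVarInt(InputStream in) throws IOException {
    int value = 0;
    int shift = 0;
    int b;
    while (((b = readOrThrow(in)) & 0x80) != 0) {
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
    return value | (b << shift);
  }

  public static void main(String[] args) throws IOException {
    byte[] encoded = {(byte) 0xAC, 0x02}; // varint encoding of 300
    System.out.println(readUnsignedVarInt(new ByteArrayInputStream(encoded))); // 300
    try {
      readUnsignedVarInt(new ByteArrayInputStream(new byte[] {(byte) 0x80}));
    } catch (EOFException e) {
      System.out.println("truncated varint fails loudly");
    }
  }
}
```

Putting the check in one helper keeps each call site to a single changed line and leaves the stream itself contract-compliant.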

Keeping the PR as-is is also defensible, but could you call out the choice in the PR description so downstream reviewers understand the behavior change?

Smaller nits left inline.

  public int read() throws IOException {
    if (current == null) {
-     throw new EOFException();
+     return -1;

The contract fix is right, but flagging this for visibility: I found four existing call sites that consume the result of read() without a -1 guard and currently depend on the throw to fail loudly. After this change they fail silently — or hang, in one case:

  • ValuesAsBytesReader.getByte() (parquet) — casts to byte, (byte) -1 = 0xFF → 8 spurious true bits in boolean decoding
  • BaseVectorizedParquetValuesReader.readUnsignedVarInt() (arrow) — -1 & 0x80 != 0 always true → infinite loop
  • BaseVectorizedParquetValuesReader.readIntLittleEndian() and readIntLittleEndianPaddedOnBitWidth() (arrow) — return garbage int instead of throwing

These callers shouldn't reach EOF on well-formed input (Parquet decoders know exact byte counts), so this is defense-in-depth, not a happy-path regression. But the failure mode changes from loud to silent/hanging.

Leaving the call to you — see options A/B/C in the top-level body. If you go with A (fix at callers), happy to look at the follow-up changes. Whatever you decide, could you note it in the PR description so it's visible to other reviewers?

  public int read() throws IOException {
    if (!buffer.hasRemaining()) {
-     throw new EOFException();
+     return -1;

Minor: the import java.io.EOFException; at the top of this file is likely unused now (no throw new EOFException() remains). Same check for MultiBufferInputStream.java and for import java.io.EOFException in TestByteBufferInputStreams.java — the assertThatThrownBy(...EOFException.class) assertion is gone. Please drop the unused imports.


protected abstract void checkOriginalData();

private static void assertAtEOF(ByteBufferInputStream stream) throws IOException {

Nice helper — the idempotency check is exactly the regression catch we want. One small addition to consider: assert read(byte[]) returns -1 too, so the helper pins both overloads at a single site. The multi-byte path already returns -1 today, but bundling it means a future change to either overload trips the same assertion.
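One possible shape for that combined check, sketched with JDK types only so it runs without AssertJ or the Iceberg classes. `EofCheckSketch` and its `check` helper are hypothetical; the real helper would keep the assertThat style and the getPos() check shown earlier in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class EofCheckSketch {
  // JDK-only variant of the helper idea: pins both read overloads at EOF.
  static void assertAtEof(InputStream in) throws IOException {
    check(in.read() == -1, "read() at EOF");
    check(in.read(new byte[8]) == -1, "read(byte[]) at EOF");
    check(in.read() == -1, "read() should keep returning -1 at EOF");
    check(in.available() == 0, "available() should be 0 at EOF");
  }

  private static void check(boolean ok, String message) {
    if (!ok) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) throws IOException {
    assertAtEof(new ByteArrayInputStream(new byte[0]));
    System.out.println("ok");
  }
}
```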

@Test
public void testEmptyStream() throws Exception {
assertAtEOF(ByteBufferInputStream.wrap(ByteBuffer.allocate(0)));
assertAtEOF(ByteBufferInputStream.wrap(ByteBuffer.allocate(0), ByteBuffer.allocate(0)));

Good coverage of the three construction shapes. Worth one more: a MultiBufferInputStream built from a non-empty buffer list, then fully drained. That exercises the nextBuffer() -> return -1 path at line 275, which is a structurally distinct branch from the current == null path at line 266. testReadByte() covers it for one shape, but a tiny dedicated drained-multi-buffer test pins the second branch explicitly.
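To illustrate why these are two distinct branches, here is a simplified sketch of a multi-buffer read(). `MultiBufferSketch` is hypothetical and only assumes the structure described in the comment above (a `current` buffer plus a `nextBuffer()` step); the real MultiBufferInputStream may differ in detail.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for illustration only; not the Iceberg class.
final class MultiBufferSketch extends InputStream {
  private final Iterator<ByteBuffer> iterator;
  private ByteBuffer current; // null once every buffer is exhausted

  MultiBufferSketch(List<ByteBuffer> buffers) {
    this.iterator = buffers.iterator();
    nextBuffer();
  }

  // Advances to the next buffer; returns false when none is left.
  private boolean nextBuffer() {
    if (!iterator.hasNext()) {
      current = null;
      return false;
    }
    current = iterator.next().duplicate();
    return true;
  }

  @Override
  public int read() {
    if (current == null) {
      return -1; // branch 1: already at EOF (empty list, or a repeat read)
    }
    while (!current.hasRemaining()) {
      if (!nextBuffer()) {
        return -1; // branch 2: the last buffer was just drained
      }
    }
    return current.get() & 0xFF;
  }

  public static void main(String[] args) {
    // An empty list only ever reaches branch 1.
    System.out.println(new MultiBufferSketch(Collections.emptyList()).read()); // -1

    // A drained non-empty list reaches branch 2 first, then branch 1.
    MultiBufferSketch in = new MultiBufferSketch(
        Arrays.asList(ByteBuffer.wrap(new byte[] {1}), ByteBuffer.wrap(new byte[] {2})));
    System.out.println(in.read()); // 1
    System.out.println(in.read()); // 2
    System.out.println(in.read()); // -1 via branch 2
    System.out.println(in.read()); // -1 via branch 1
  }
}
```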

@steveloughran
Contributor

great reviewing @laskoviymishka

Add read(byte[]) assertion to assertAtEOF helper to pin both overloads,
and add testDrainedMultiBufferStream to explicitly exercise the
nextBuffer() -> return -1 code path in MultiBufferInputStream.
@sachinnn99
Contributor Author

Thanks for the thorough review @laskoviymishka!

On the 4 call sites: Those methods use org.apache.parquet.bytes.ByteBufferInputStream (Parquet's class), not org.apache.iceberg.io.ByteBufferInputStream (the class modified here). They're in completely separate type hierarchies -- Parquet's extends InputStream directly, while Iceberg's extends SeekableInputStream. You can verify by checking the imports at line 24 in both ValuesAsBytesReader.java and BaseVectorizedParquetValuesReader.java. I also confirmed that no production code outside core/io references Iceberg's ByteBufferInputStream, so there are no callers that relied on the EOFException behavior.

On unused imports: EOFException is still actively used in SingleBufferInputStream (seek(), slice(), sliceBuffers()), MultiBufferInputStream (same methods + a catch clause), and the test file (testWholeSliceBuffers, testSkipFully). The imports are still needed.

On enhancing assertAtEOF: Good suggestion -- added read(byte[]) assertion to the helper.

On the drained multi-buffer test: Added testDrainedMultiBufferStream which creates a multi-buffer stream from non-empty buffers, drains it via read(byte[]), then calls assertAtEOF. This explicitly exercises the nextBuffer() -> return -1 path at line 275.

Development

Successfully merging this pull request may close these issues.

ByteBufferInputStream.read() throws EOFException at EOF instead of returning -1, violating InputStream contract

4 participants