Core: Fix ByteBufferInputStream.read() to return -1 at EOF by sachinnn99 · Pull Request #16167 · apache/iceberg

sachinnn99 · 2026-04-30T04:56:39Z

SingleBufferInputStream.read() and MultiBufferInputStream.read() throw EOFException when the stream is exhausted. This violates the java.io.InputStream contract, which requires the no-arg read() to return -1 at EOF.

The multi-byte read(byte[], int, int) in both classes already correctly returns -1 at EOF — the two overloads were inconsistent.

Changes:

SingleBufferInputStream.read(): return -1 instead of throwing EOFException
MultiBufferInputStream.read(): return -1 at both EOF entry points instead of throwing EOFException
Update testReadByte() to assert -1 return (including idempotency check)

Other EOFException-throwing methods (slice(), sliceBuffers(), skipFully()) are unchanged — they request specific byte counts where EOFException is the correct signal.
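
To make the contract concrete, here is a minimal sketch of a buffer-backed stream whose no-arg read() returns -1 when exhausted. `BufferBackedStreamSketch` and `countBytes` are hypothetical names for illustration only; this is not Iceberg's SingleBufferInputStream, just the same idea reduced to plain java.io and java.nio.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical stand-in for illustration; not the Iceberg class touched by this PR.
class BufferBackedStreamSketch extends InputStream {
  private final ByteBuffer buffer;

  BufferBackedStreamSketch(ByteBuffer buffer) {
    // duplicate() so reading does not move the caller's position
    this.buffer = buffer.duplicate();
  }

  @Override
  public int read() {
    if (!buffer.hasRemaining()) {
      return -1; // EOF signal required by InputStream.read(); no exception
    }
    return buffer.get() & 0xFF; // widen to an unsigned value in 0..255
  }

  // The standard single-byte loop; it only ends cleanly if read() returns -1 at EOF.
  static int countBytes(InputStream in) throws IOException {
    int count = 0;
    while (in.read() != -1) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    InputStream in =
        new BufferBackedStreamSketch(ByteBuffer.wrap(new byte[] {1, 2, (byte) 0xFF}));
    System.out.println(countBytes(in)); // 3: the 0xFF data byte reads as 255, not -1
    System.out.println(in.read()); // -1, and it stays -1 on repeated calls
  }
}
```

Masking with 0xFF matters here: without it, a 0xFF data byte would be indistinguishable from the EOF marker.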

steveloughran

looks good for read(); looking at read(buffer[], offset, len), it seems ok too.

slice() probably needs attention at the same time

sachinnn99 · 2026-04-30T09:23:51Z

Thanks for the review! I looked at slice() — its EOFException is intentional since it's a "read exactly N bytes" operation (not part of the InputStream contract). Callers expect either the full slice or a failure, so throwing there is the correct behavior.

The fix here is scoped to the no-arg read(), which has a clear contract violation per InputStream.read() javadoc.
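
The distinction between the two kinds of call can be sketched as follows. `readExactly` is a hypothetical helper, not Iceberg's slice(); it only illustrates why an "exactly N bytes" operation reasonably throws EOFException while read() reports EOF with -1.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ExactReadSketch {
  // Hypothetical helper: returns exactly `length` bytes or fails with EOFException.
  static byte[] readExactly(InputStream in, int length) throws IOException {
    byte[] out = new byte[length];
    int filled = 0;
    while (filled < length) {
      int n = in.read(out, filled, length - filled);
      if (n < 0) {
        // The caller asked for a specific count, so a short stream is an error.
        throw new EOFException("Needed " + length + " bytes, got " + filled);
      }
      filled += n;
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30});
    System.out.println(readExactly(in, 2).length); // 2
    try {
      readExactly(in, 2); // only one byte is left
    } catch (EOFException e) {
      System.out.println("EOFException: " + e.getMessage());
    }
  }
}
```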

Baunsgaard

src LGTM.

But I would suggest a few additional test cases, just to ensure future coverage:

invariants after the first -1 is encountered:
an explicit empty case:

we can simply add a utility function to call with any stream to verify:

private static void assertAtEOF(ByteBufferInputStream stream) {
  long pos = stream.getPos();
  assertThat(stream.read()).as("read() at EOF").isEqualTo(-1);
  assertThat(stream.read()).as("read() should keep returning -1 at EOF").isEqualTo(-1);
  assertThat(stream.getPos()).as("Position should not advance past EOF").isEqualTo(pos);
  assertThat(stream.available()).as("available() should be 0 at EOF").isEqualTo(0);
}

To get full coverage of all paths, you can add a test that calls assertAtEOF on streams that are empty from the beginning.
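
For readers without the Iceberg test classes at hand, the same invariants can be sketched against plain java.io.InputStream. This is an assumption-laden illustration: `checkAtEof` is a hypothetical name, ByteArrayInputStream stands in for ByteBufferInputStream, and the getPos() check is omitted because InputStream has no such method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class EofInvariantSketch {
  // Hypothetical analogue of the assertAtEOF idea, without AssertJ or getPos().
  static void checkAtEof(InputStream in) throws IOException {
    for (int attempt = 1; attempt <= 2; attempt++) {
      int b = in.read();
      if (b != -1) {
        throw new AssertionError("read() attempt " + attempt + " returned " + b);
      }
    }
    if (in.available() != 0) {
      throw new AssertionError("available() should be 0 at EOF");
    }
  }

  public static void main(String[] args) throws IOException {
    // Stream that is empty from the beginning.
    checkAtEof(new ByteArrayInputStream(new byte[0]));

    // Stream that becomes empty after being drained.
    InputStream drained = new ByteArrayInputStream(new byte[] {1, 2});
    drained.read();
    drained.read();
    checkAtEof(drained);
    System.out.println("EOF invariants hold");
  }
}
```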

sachinnn99 · 2026-05-08T05:39:13Z

Thanks for the suggestions @Baunsgaard! Added in the latest push:

assertAtEOF helper matching your snippet
Called it after EOF is first encountered in testReadAll, testSmallReads, testPartialBufferReads, and testReadByte
Added testEmptyStream covering single empty buffer, multiple empty buffers, and an empty buffer list

Baunsgaard

LGTM

laskoviymishka

The contract fix is right: InputStream.read() is supposed to return -1 at EOF, and the old EOFException broke the standard while ((b = in.read()) != -1) pattern. The assertAtEOF helper, with idempotency and position-stability checks, is exactly the right test shape.

One concern before merge: this changes behavior at four existing single-byte read() call sites that don’t guard for -1. The multi-byte overload was already contract-correct, so those callers are fine, but these single-byte paths currently rely on the throw to fail loudly. After this PR, they can silently corrupt data or, in one case, hang.

A brief grep search gives me at least four places:

ValuesAsBytesReader.getByte() — (byte) read() at EOF becomes 0xFF, which can decode as 8 bogus true bits.
BaseVectorizedParquetValuesReader.readUnsignedVarInt() — at EOF, -1 & 0x80 keeps the continuation bit set forever → infinite loop.
readIntLittleEndian() — EOF returns a valid-looking but garbage int.
readIntLittleEndianPaddedOnBitWidth() — same.

In practice, these shouldn’t hit EOF on well-formed Parquet data because decoders know byte counts from page headers. So this is more defense-in-depth than a happy-path regression. But the trade-off is real: a loud failure becomes silent or hanging if corrupted input or an upstream bug reads past EOF.

My preference would be to fix the callers: add a small if (b < 0) throw new EOFException() guard at those four sites. That keeps the contract fix clean, while preserving the old loud-fail behavior where the code expects a byte.
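
A minimal sketch of that guard, applied to a varint reader of the kind described above. The names `readOrThrow` and `readUnsignedVarInt` here are illustrative and this is not the actual Parquet or Iceberg code; it assumes the usual 7-bits-per-byte encoding with the high bit as a continuation flag.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class GuardedReadSketch {
  // Hypothetical guard: turn the -1 EOF marker back into a loud failure.
  static int readOrThrow(InputStream in) throws IOException {
    int b = in.read();
    if (b < 0) {
      throw new EOFException("Unexpected end of stream");
    }
    return b;
  }

  // Unsigned varint: 7 payload bits per byte, high bit set means "more bytes follow".
  static int readUnsignedVarInt(InputStream in) throws IOException {
    int value = 0;
    int shift = 0;
    int b;
    while (((b = readOrThrow(in)) & 0x80) != 0) {
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
    return value | (b << shift);
  }

  public static void main(String[] args) throws IOException {
    // Why an unguarded -1 is dangerous:
    System.out.println((byte) -1 == (byte) 0xFF); // true: EOF looks like a data byte
    System.out.println((-1 & 0x80) != 0); // true: the continuation bit never clears

    // 300 encodes as 0xAC 0x02.
    InputStream ok = new ByteArrayInputStream(new byte[] {(byte) 0xAC, 0x02});
    System.out.println(readUnsignedVarInt(ok)); // 300

    // Truncated input: continuation bit set, but no byte follows.
    InputStream truncated = new ByteArrayInputStream(new byte[] {(byte) 0xAC});
    try {
      readUnsignedVarInt(truncated);
    } catch (EOFException e) {
      System.out.println("EOFException instead of an infinite loop");
    }
  }
}
```

Routing every single-byte read through one helper keeps the guard in a single place rather than repeating the check at each call site.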

Keeping the PR as-is is also defensible, but could you call out the choice in the PR description so downstream reviewers understand the behavior change?

Smaller nits left inline.

laskoviymishka · 2026-05-15T16:40:47Z

  public int read() throws IOException {
    if (current == null) {
-      throw new EOFException();
+      return -1;


The contract fix is right, but flagging this for visibility: I found four existing call sites that consume the result of read() without a -1 guard and currently depend on the throw to fail loudly. After this change they fail silently — or hang, in one case:

ValuesAsBytesReader.getByte() (parquet): casts to byte, so (byte) -1 = 0xFF → 8 spurious true bits in boolean decoding

BaseVectorizedParquetValuesReader.readUnsignedVarInt() (arrow): (-1 & 0x80) != 0 is always true → infinite loop

BaseVectorizedParquetValuesReader.readIntLittleEndian() and readIntLittleEndianPaddedOnBitWidth() (arrow): return a garbage int instead of throwing

These callers shouldn't reach EOF on well-formed input (Parquet decoders know exact byte counts), so this is defense-in-depth, not a happy-path regression. But the failure mode changes from loud to silent/hanging.

Leaving the call to you — see options A/B/C in the top-level body. If you go with A (fix at callers), happy to look at the follow-up changes. Whatever you decide, could you note it in the PR description so it's visible to other reviewers?

laskoviymishka · 2026-05-15T16:40:47Z

  public int read() throws IOException {
    if (!buffer.hasRemaining()) {
-      throw new EOFException();
+      return -1;


Minor: the import java.io.EOFException; at the top of this file is likely unused now (no throw new EOFException() remains). Same check for MultiBufferInputStream.java and for import java.io.EOFException in TestByteBufferInputStreams.java — the assertThatThrownBy(...EOFException.class) assertion is gone. Please drop the unused imports.

laskoviymishka · 2026-05-15T16:40:48Z


  protected abstract void checkOriginalData();

+  private static void assertAtEOF(ByteBufferInputStream stream) throws IOException {


Nice helper — the idempotency check is exactly the regression catch we want. One small addition to consider: assert read(byte[]) returns -1 too, so the helper pins both overloads at a single site. The multi-byte path already returns -1 today, but bundling it means a future change to either overload trips the same assertion.

laskoviymishka · 2026-05-15T16:40:48Z

+  @Test
+  public void testEmptyStream() throws Exception {
+    assertAtEOF(ByteBufferInputStream.wrap(ByteBuffer.allocate(0)));
+    assertAtEOF(ByteBufferInputStream.wrap(ByteBuffer.allocate(0), ByteBuffer.allocate(0)));


Good coverage of the three construction shapes. Worth one more: a MultiBufferInputStream built from a non-empty buffer list, then fully drained — that exercises the nextBuffer() → return -1 path at line 275, which is a structurally distinct branch from the current == null path at line 266. testReadByte() covers it for one shape, but a tiny dedicated drained-multi-buffer test pins the second branch explicitly.

steveloughran · 2026-05-15T19:37:23Z

great reviewing @laskoviymishka

Add read(byte[]) assertion to assertAtEOF helper to pin both overloads, and add testDrainedMultiBufferStream to explicitly exercise the nextBuffer() -> return -1 code path in MultiBufferInputStream.

sachinnn99 · 2026-05-16T09:43:49Z

Thanks for the thorough review @laskoviymishka!

On the 4 call sites: Those methods use org.apache.parquet.bytes.ByteBufferInputStream (Parquet's class), not org.apache.iceberg.io.ByteBufferInputStream (the class modified here). They're in completely separate type hierarchies -- Parquet's extends InputStream directly, while Iceberg's extends SeekableInputStream. You can verify by checking the imports at line 24 in both ValuesAsBytesReader.java and BaseVectorizedParquetValuesReader.java. I also confirmed that no production code outside core/io references Iceberg's ByteBufferInputStream, so there are no callers that relied on the EOFException behavior.

On unused imports: EOFException is still actively used in SingleBufferInputStream (seek(), slice(), sliceBuffers()), MultiBufferInputStream (same methods + a catch clause), and the test file (testWholeSliceBuffers, testSkipFully). The imports are still needed.

On enhancing assertAtEOF: Good suggestion -- added read(byte[]) assertion to the helper.

On the drained multi-buffer test: Added testDrainedMultiBufferStream which creates a multi-buffer stream from non-empty buffers, drains it via read(byte[]), then calls assertAtEOF. This explicitly exercises the nextBuffer() -> return -1 path at line 275.

github-actions Bot added the core label Apr 30, 2026

steveloughran reviewed Apr 30, 2026

Baunsgaard reviewed May 7, 2026

Core: Fix ByteBufferInputStream.read() to return -1 at EOF instead of…

9ea1f31

… throwing EOFException (apache#16127)

sachinnn99 force-pushed the fix/16127-bytebufferinputstream-read-eof branch from 249ad40 to 9ea1f31 Compare May 8, 2026 05:38

Baunsgaard approved these changes May 8, 2026

laskoviymishka requested changes May 15, 2026

Core: Enhance EOF test coverage for ByteBufferInputStream

4f64035

Add read(byte[]) assertion to assertAtEOF helper to pin both overloads, and add testDrainedMultiBufferStream to explicitly exercise the nextBuffer() -> return -1 code path in MultiBufferInputStream.

sachinnn99 wants to merge 2 commits into apache:main from sachinnn99:fix/16127-bytebufferinputstream-read-eof
