Core: Fix ByteBufferInputStream.read() to return -1 at EOF#16167
Core: Fix ByteBufferInputStream.read() to return -1 at EOF#16167sachinnn99 wants to merge 2 commits into
Conversation
steveloughran
left a comment
There was a problem hiding this comment.
looks good for read(); looking at read(buffer[], offset, len), it seems ok too.
slice() probably needs attention at the same time
|
Thanks for the review! I looked at The fix here is scoped to the no-arg |
Baunsgaard
left a comment
There was a problem hiding this comment.
src LGTM.
But i would suggest a few additional test cases, just to ensure future coverage:
- invariants after the first -1 is encountered:
- an explicit empty case:
we can simply add a utility function to call with any stream to verify:
private static void assertAtEOF(ByteBufferInputStream stream) {
long pos = stream.getPos();
assertThat(stream.read()).as("read() at EOF").isEqualTo(-1);
assertThat(stream.read()).as("read() should keep returning -1 at EOF").isEqualTo(-1);
assertThat(stream.getPos()).as("Position should not advance past EOF").isEqualTo(pos);
assertThat(stream.available()).as("available() should be 0 at EOF").isEqualTo(0);
}
To make full coverage of all paths you can add an additional test that calls the assertAtEOF with empty streams that are empty from the beginning.
… throwing EOFException (apache#16127)
249ad40 to
9ea1f31
Compare
|
Thanks for the suggestions @Baunsgaard! Added in the latest push:
|
laskoviymishka
left a comment
There was a problem hiding this comment.
The contract fix is right: InputStream.read() is supposed to return -1 at EOF, and the old EOFException broke the standard while ((b = in.read()) != -1) pattern. The assertAtEOF helper, with idempotency and position-stability checks, is exactly the right test shape.
One concern before merge: this changes behavior at four existing single-byte read() call sites that don’t guard for -1. The multi-byte overload was already contract-correct, so those callers are fine, but these single-byte paths currently rely on the throw to fail loudly. After this PR, they can silently corrupt data or, in one case, hang.
Brief grep-search give me at least 4 places:
ValuesAsBytesReader.getByte()—(byte) read()at EOF becomes0xFF, which can decode as 8 bogustruebits.BaseVectorizedParquetValuesReader.readUnsignedVarInt()— at EOF,-1 & 0x80keeps the continuation bit set forever → infinite loop.readIntLittleEndian()— EOF returns a valid-looking but garbage int.readIntLittleEndianPaddedOnBitWidth()— same.
In practice, these shouldn’t hit EOF on well-formed Parquet data because decoders know byte counts from page headers. So this is more defense-in-depth than a happy-path regression. But the trade-off is real: a loud failure becomes silent or hanging if corrupted input or an upstream bug reads past EOF.
My preference would be to fix the callers: add a small if (b < 0) throw new EOFException() guard at those four sites. That keeps the contract fix clean, while preserving the old loud-fail behavior where the code expects a byte.
Keeping the PR as-is is also defensible, but could you call out the choice in the PR description so downstream reviewers understand the behavior change?
Smaller nits left inline.
| public int read() throws IOException { | ||
| if (current == null) { | ||
| throw new EOFException(); | ||
| return -1; |
There was a problem hiding this comment.
The contract fix is right, but flagging this for visibility: I found four existing call sites that consume the result of read() without a -1 guard and currently depend on the throw to fail loudly. After this change they fail silently — or hang, in one case:
ValuesAsBytesReader.getByte()(parquet) — casts to byte,(byte) -1 = 0xFF→ 8 spurious true bits in boolean decodingBaseVectorizedParquetValuesReader.readUnsignedVarInt()(arrow) —-1 & 0x80 != 0always true → infinite loopBaseVectorizedParquetValuesReader.readIntLittleEndian()andreadIntLittleEndianPaddedOnBitWidth()(arrow) — return garbage int instead of throwing
These callers shouldn't reach EOF on well-formed input (Parquet decoders know exact byte counts), so this is defense-in-depth, not a happy-path regression. But the failure mode changes from loud to silent/hanging.
Leaving the call to you — see options A/B/C in the top-level body. If you go with A (fix at callers), happy to look at the follow-up changes. Whatever you decide, could you note it in the PR description so it's visible to other reviewers?
| public int read() throws IOException { | ||
| if (!buffer.hasRemaining()) { | ||
| throw new EOFException(); | ||
| return -1; |
There was a problem hiding this comment.
Minor: the import java.io.EOFException; at the top of this file is likely unused now (no throw new EOFException() remains). Same check for MultiBufferInputStream.java and for import java.io.EOFException in TestByteBufferInputStreams.java — the assertThatThrownBy(...EOFException.class) assertion is gone. Please drop the unused imports.
|
|
||
| protected abstract void checkOriginalData(); | ||
|
|
||
| private static void assertAtEOF(ByteBufferInputStream stream) throws IOException { |
There was a problem hiding this comment.
Nice helper — the idempotency check is exactly the regression catch we want. One small addition to consider: assert read(byte[]) returns -1 too, so the helper pins both overloads at a single site. The multi-byte path already returns -1 today, but bundling it means a future change to either overload trips the same assertion.
| @Test | ||
| public void testEmptyStream() throws Exception { | ||
| assertAtEOF(ByteBufferInputStream.wrap(ByteBuffer.allocate(0))); | ||
| assertAtEOF(ByteBufferInputStream.wrap(ByteBuffer.allocate(0), ByteBuffer.allocate(0))); |
There was a problem hiding this comment.
Good coverage of the three construction shapes. Worth one more: a MultiBufferInputStream built from a non-empty buffer list, then fully drained — that exercises the nextBuffer() → return -1 path at line 275, which is a structurally distinct branch from the current == null path at line 266. testReadByte() covers it for one shape, but a tiny dedicated drained-multi-buffer test pins the second branch explicitly.
|
great reviewing @laskoviymishka |
Add read(byte[]) assertion to assertAtEOF helper to pin both overloads, and add testDrainedMultiBufferStream to explicitly exercise the nextBuffer() -> return -1 code path in MultiBufferInputStream.
|
Thanks for the thorough review @laskoviymishka! On the 4 call sites: Those methods use On unused imports: On enhancing On the drained multi-buffer test: Added |
Fixes #16127.
SingleBufferInputStream.read()andMultiBufferInputStream.read()throwEOFExceptionwhen the stream is exhausted. This violates thejava.io.InputStreamcontract, which requires the no-argread()to return-1at EOF.The multi-byte
read(byte[], int, int)in both classes already correctly returns-1at EOF — the two overloads were inconsistent.Changes:
SingleBufferInputStream.read(): return-1instead of throwingEOFExceptionMultiBufferInputStream.read(): return-1at both EOF entry points instead of throwingEOFExceptiontestReadByte()to assert-1return (including idempotency check)Other
EOFException-throwing methods (slice(),sliceBuffers(),skipFully()) are unchanged — they request specific byte counts whereEOFExceptionis the correct signal.