feat(tika): support RAR5 archives via 7-Zip-JBinding#1176
ksaurabhAparavi wants to merge 1 commit into
📝 Walkthrough
This PR replaces Tika's JUnRAR-based RAR parser with a new 7-Zip-JBinding implementation that supports both RAR4 and RAR5 archives. The changes add a RarSevenZipParser class, Maven dependencies, JVM-level library initialization, and configuration wiring to swap the parser into Tika's configuration.
Changes
RAR5 Archive Support via 7-Zip-JBinding
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java`:
- Around line 435-439: ConfigBuilder.getConfig() currently unconditionally
removes org.apache.tika.parser.pkg.RarParser and adds
com.rocketride.tika_api.parsers.rar.RarSevenZipParser even when 7-Zip failed to
initialize; add a readiness getter on TikaApi (e.g., TikaApi.isSevenZipReady()
set from SevenZip.isInitializedSuccessfully() in TikaApi.init() after calling
SevenZip.initSevenZipFromPlatformJAR()) and then change
ConfigBuilder.getConfig() to only call removeParser(doc,
"org.apache.tika.parser.pkg.RarParser") and findOrAddParser(doc,
"com.rocketride.tika_api.parsers.rar.RarSevenZipParser") when that readiness
getter returns true so the original RarParser remains as a fallback when 7-Zip
isn’t initialized.
In
`@packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/parsers/rar/RarSevenZipParser.java`:
- Around line 156-172: The RarSevenZipParser currently writes extracted data to
a temp file without any size limit; modify the extraction block around
tmp.createTemporaryFile()/new FileOutputStream(entryFile) and the
item.extractSlow(...) ISequentialOutStream.write implementation to enforce a
MAX_ENTRY_UNPACKED_BYTES guard (define a sensible constant), track cumulative
bytesWritten as chunks are written, and if the limit would be exceeded: stop
extraction by throwing a SevenZipException (or otherwise aborting), close and
delete entryFile, log a warning including name and bytesWritten, and return
early so the oversized entry is not persisted; ensure proper resource cleanup in
the try-with-resources and that the existing ExtractOperationResult check still
runs for normal failures.
In `@packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java`:
- Around line 321-327: The TikaApi.init() block currently catches Throwable when
initializing SevenZip which hides JVM-fatal Errors; change the catch to only
handle non-fatal exceptions by catching Exception (or specific exceptions) from
SevenZip.initSevenZipFromPlatformJAR(), and do not swallow Errors—either let
Errors propagate or rethrow them (e.g., catch Exception e and call
logger.log(..., e)); update the catch around
SevenZip.initSevenZipFromPlatformJAR() / SevenZip.isInitializedSuccessfully() to
reference SevenZip and logger so only recoverable failures are logged while JVM
Errors are not suppressed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 535ef594-b1c1-4228-999e-6650b9f223e1
📒 Files selected for processing (4)
packages/tika/lib/tika/pom-template.xml
packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/parsers/rar/RarSevenZipParser.java
// Replace Tika's junrar-backed RarParser (RAR4-only; throws on RAR5) with our
// 7-Zip-JBinding-backed parser, which handles both RAR4 and RAR5.
removeParser(doc, "org.apache.tika.parser.pkg.RarParser");
findOrAddParser(doc, "com.rocketride.tika_api.parsers.rar.RarSevenZipParser");
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "class ConfigBuilder|removeParser\\(doc, \"org\\.apache\\.tika\\.parser\\.pkg\\.RarParser\"|RarSevenZipParser|isSevenZipReady|SevenZip|7-zip" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
rg -n "TikaApi\\.isSevenZipReady|isSevenZipReady|SevenZip|native|init failure|RarSevenZipParser|RarParser" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java || true
ls -R packages/tika/lib/tika/src/main/java/com/rocketride/tika_api | head
Repository: rocketride-org/rocketride-server
Length of output: 6664
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '280,360p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
sed -n '400,470p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
fd -a "TikaApi.java" packages/tika/lib/tika/src/main/java -t f || true
rg -n "TikaApi" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api || true
Repository: rocketride-org/rocketride-server
Length of output: 9467
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find init / sevenzip init code region
sed -n '250,380p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
# 2) Find ConfigBuilder.getConfig() call sites
rg -n "ConfigBuilder\\.getConfig\\(|new ConfigBuilder\\(|getConfig\\(\\)" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java packages/tika/lib/tika/src/main/java/com/rocketride/tika_api -S || true
# 3) Inspect relevant part of TikaApi where it wires Tika configuration / extraction startup
rg -n "getConfig\\(|TikaConfig\\b|tika-config\\.xml|ConfigBuilder" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
sed -n '580,760p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
# 4) Inspect RarSevenZipParser to see how it depends on 7-zip binding init
sed -n '1,200p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/parsers/rar/RarSevenZipParser.java
Repository: rocketride-org/rocketride-server
Length of output: 19086
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "isSevenZipReady|sevenZipReady|SevenZip.*ready|initSevenZip|isInitializedSuccessfully\\(|initializedSuccessfully" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
rg -n "RarSevenZipParser|RarParser" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api -S
rg -n "catch \\(.*TikaException|catch \\(.*Exception\\)" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
rg -n "parse\\(|tikaConfig|tika.*config|new TikaConfig|tika.*parse" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
Repository: rocketride-org/rocketride-server
Length of output: 1321
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '80,220p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/EmbeddedContentExtractor.java
rg -n "ConfigBuilder\\.getConfig\\(" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api -S
sed -n '300,380p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/EmbeddedContentExtractor.java 2>/dev/null || true
Repository: rocketride-org/rocketride-server
Length of output: 5973
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "SevenZip\\b" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
sed -n '1,80p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
Repository: rocketride-org/rocketride-server
Length of output: 58
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# locate helper methods + tika-config.xml usage
rg -n "removeParser\\(|findOrAddParser\\(|tika-config\\.xml" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
# inspect helper implementations
sed -n '360,520p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
# find shipped config file in repo (if present) or template
fd "tika-config.xml" . -t f || true
rg -n "RarParser|RarSevenZipParser|rar" -S . --glob "tika-config.xml" || true
Repository: rocketride-org/rocketride-server
Length of output: 4601
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect helper methods in ConfigBuilder
sed -n '120,280p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
# Inspect shipped tika-config.xml around RarParser / rar parser entries
rg -n "RarParser|RarSevenZipParser|rar" packages/tika/lib/tika/tika-config.xml
# show a narrower window around the matches
rg -n -o --context 5 "RarParser|RarSevenZipParser" packages/tika/lib/tika/tika-config.xml | head -n 200
Repository: rocketride-org/rocketride-server
Length of output: 5076
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find defaultParserName and surrounding constants in ConfigBuilder
rg -n "defaultParserName|defaultParser\\b" packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
sed -n '1,120p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
# Show tika-config.xml sections near RarParser / RarSevenZipParser / default parser excludes
rg -n "RarParser|RarSevenZipParser|parser-exclude|AutoDetectParser|defaultParser" packages/tika/lib/tika/tika-config.xml
# Print a limited window around any RarParser mentions
rg -n "RarParser|RarSevenZipParser" packages/tika/lib/tika/tika-config.xml | head -n 50
# If large, show top 250 lines to locate defaults, then windows around matches
head -n 260 packages/tika/lib/tika/tika-config.xml
# Narrow windows around the matches (use line numbers from rg)
matches=$(rg -n "RarParser|RarSevenZipParser" packages/tika/lib/tika/tika-config.xml | cut -d: -f1 | head -n 20)
if [ -n "${matches:-}" ]; then
for ln in $matches; do
start=$((ln-40)); end=$((ln+40));
echo "---- tika-config.xml:$ln ----";
sed -n "${start},${end}p" packages/tika/lib/tika/tika-config.xml;
done
fi
Repository: rocketride-org/rocketride-server
Length of output: 4253
🏁 Script executed:
#!/bin/bash
set -euo pipefail
XML="packages/tika/lib/tika/tika-config.xml"
# Search for any RAR-related parser names/classes or type hints
rg -n -i "rar" "$XML" || true
# Also search for "parser-exclude" blocks near the top to see what default parser excludes exist
# Print the first ~220 lines to include the <parsers> section and any excludes
head -n 260 "$XML"
# If there are rar-related matches, print a bit more around them
matches=$(rg -n -i "rar" "$XML" | cut -d: -f1 | head -n 20 || true)
if [ -n "${matches:-}" ]; then
for ln in $matches; do
start=$((ln-60)); end=$((ln+60));
echo "---- $XML:$ln ----"
sed -n "${start},${end}p" "$XML"
done
fi
Repository: rocketride-org/rocketride-server
Length of output: 2216
Gate RAR parser replacement on successful 7-Zip-JBinding initialization
TikaApi.init() logs and continues when SevenZip.initSevenZipFromPlatformJAR() fails, but ConfigBuilder.getConfig() always removes org.apache.tika.parser.pkg.RarParser and installs com.rocketride.tika_api.parsers.rar.RarSevenZipParser. If the 7-Zip native init did not succeed, RAR parsing is left with no working fallback. See ConfigBuilder.java:437-438 and TikaApi.init().
Suggested direction
- removeParser(doc, "org.apache.tika.parser.pkg.RarParser");
- findOrAddParser(doc, "com.rocketride.tika_api.parsers.rar.RarSevenZipParser");
+ if (TikaApi.isSevenZipReady()) {
+ removeParser(doc, "org.apache.tika.parser.pkg.RarParser");
+ findOrAddParser(doc, "com.rocketride.tika_api.parsers.rar.RarSevenZipParser");
+ } else {
+ // keep upstream parser as fallback when native 7-Zip init is unavailable
+ removeParser(doc, "com.rocketride.tika_api.parsers.rar.RarSevenZipParser");
+ findOrAddParser(doc, "org.apache.tika.parser.pkg.RarParser");
+ }
(Requires adding a small readiness flag/getter in TikaApi based on SevenZip.isInitializedSuccessfully() after the init attempt.)
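The gating described in this comment can be sketched in isolation. This is a minimal illustration only: `RarParserSelector` and its `select` method are hypothetical, the parser list is modeled as plain class-name strings rather than the real tika-config.xml document, and the `BooleanSupplier` stands in for the proposed `TikaApi.isSevenZipReady()` getter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the gating logic; not the actual ConfigBuilder code.
final class RarParserSelector {
    static final String UPSTREAM = "org.apache.tika.parser.pkg.RarParser";
    static final String SEVEN_ZIP = "com.rocketride.tika_api.parsers.rar.RarSevenZipParser";

    private RarParserSelector() {}

    // Returns a new parser list containing exactly one of the two RAR parsers,
    // chosen by the readiness probe; all other entries keep their order.
    static List<String> select(List<String> parsers, BooleanSupplier sevenZipReady) {
        String keep = sevenZipReady.getAsBoolean() ? SEVEN_ZIP : UPSTREAM;
        String drop = keep.equals(SEVEN_ZIP) ? UPSTREAM : SEVEN_ZIP;
        List<String> result = new ArrayList<>(parsers);
        result.remove(drop);            // plays the role of removeParser(doc, ...)
        if (!result.contains(keep)) {   // plays the role of findOrAddParser(doc, ...)
            result.add(keep);
        }
        return result;
    }
}
```

Passing the readiness probe in as a parameter keeps the selection testable without loading the native library; in the real code the probe would presumably read the flag set in TikaApi.init().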
File entryFile = tmp.createTemporaryFile();
try (OutputStream fos = new FileOutputStream(entryFile)) {
    ExtractOperationResult result = item.extractSlow(new ISequentialOutStream() {
        @Override
        public int write(byte[] data) throws SevenZipException {
            try {
                fos.write(data);
            } catch (IOException e) {
                throw new SevenZipException(e);
            }
            return data.length;
        }
    });
    if (result != ExtractOperationResult.OK) {
        logger.log(Level.WARNING, "RAR entry extraction returned " + result + " for " + name);
        return;
    }
Add a decompressed-size guard before writing archive entries to disk.
Line 156 currently writes each entry fully to temp storage without a maximum bound. A crafted RAR can exhaust disk space and stall parsing workers.
Suggested fix
public class RarSevenZipParser implements Parser {
+ private static final long MAX_ENTRY_BYTES = 512L * 1024 * 1024; // make configurable if possible
@@
- File entryFile = tmp.createTemporaryFile();
+ if (size != null && size > MAX_ENTRY_BYTES) {
+ logger.log(Level.WARNING, "Skipping oversized RAR entry: " + name + " (" + size + " bytes)");
+ return;
+ }
+ File entryFile = tmp.createTemporaryFile();
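The suggested pre-check relies on a declared size that may be null, so a guard that counts bytes as they are written is a useful complement. Below is a minimal sketch of such a guard using only java.io; `BoundedOutputStream` is a hypothetical helper, not part of this PR or of 7-Zip-JBinding, and the limit value is left to the caller.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper: caps the number of bytes forwarded to the wrapped stream.
final class BoundedOutputStream extends FilterOutputStream {
    private final long maxBytes;
    private long bytesWritten;

    BoundedOutputStream(OutputStream out, long maxBytes) {
        super(out);
        this.maxBytes = maxBytes;
    }

    long getBytesWritten() {
        return bytesWritten;
    }

    @Override
    public void write(int b) throws IOException {
        ensureCapacity(1);
        out.write(b);
        bytesWritten++;
    }

    @Override
    public void write(byte[] data, int off, int len) throws IOException {
        ensureCapacity(len);
        out.write(data, off, len);
        bytesWritten += len;
    }

    // Rejects a chunk before any of it is written if it would cross the limit.
    private void ensureCapacity(int len) throws IOException {
        if (len > maxBytes - bytesWritten) {
            throw new IOException("entry exceeds limit of " + maxBytes
                    + " bytes after " + bytesWritten + " bytes written");
        }
    }
}
```

If the FileOutputStream in the snippet above were wrapped this way, the existing catch (IOException e) branch would rethrow the failure as a SevenZipException and stop the extraction; the caller would still need to delete entryFile and log the entry name.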
try {
    if (!SevenZip.isInitializedSuccessfully()) {
        SevenZip.initSevenZipFromPlatformJAR();
        logger.log(Level.INFO, "7-Zip-JBinding initialized: " + SevenZip.getSevenZipVersion().version);
    }
} catch (Throwable t) {
    logger.log(Level.WARNING, "Failed to initialize 7-Zip-JBinding; RAR parsing will fail", t);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify broad Throwable catches on startup paths and related parser wiring.
rg -n 'catch\s*\(\s*Throwable\s+' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
rg -n 'RarParser|RarSevenZipParser|initSevenZipFromPlatformJAR|isInitializedSuccessfully' \
packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java \
packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/ConfigBuilder.java
Repository: rocketride-org/rocketride-server
Length of output: 987
🏁 Script executed:
#!/bin/bash
# Inspect surrounding code for the Throwable catch to assess whether it's truly initialization-only and what exceptions SevenZip calls may throw.
sed -n '280,360p' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api/TikaApi.java
# Check if there are any other Throwable catches in this file/package.
rg -n 'catch\s*\(\s*Throwable\s+' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api
Repository: rocketride-org/rocketride-server
Length of output: 2786
🏁 Script executed:
#!/bin/bash
# Find any other Throwable catches in the same area.
rg -n 'catch\s*\(\s*Throwable\s+\w+\s*\)' packages/tika/lib/tika/src/main/java/com/rocketride/tika_api
# Locate the SevenZip implementation and inspect init method.
fd -a "SevenZip.java" packages/tika/lib/tika/src/main/java | head
rg -n "class SevenZip|initSevenZipFromPlatformJAR|isInitializedSuccessfully" packages/tika/lib/tika/src/main/java -S
Repository: rocketride-org/rocketride-server
Length of output: 428
🏁 Script executed:
#!/bin/bash
# Locate SevenZip implementation and inspect init method.
rg -n "initSevenZipFromPlatformJAR" packages/tika/lib/tika/src/main/java
rg -n "class SevenZip|isInitializedSuccessfully" packages/tika/lib/tika/src/main/java
Repository: rocketride-org/rocketride-server
Length of output: 324
🏁 Script executed:
#!/bin/bash
# Find where SevenZip is defined in the repository.
fd -a -t f "SevenZip.java" . || true
rg -n "class\s+SevenZip\b" . -S || true
rg -n "isInitializedSuccessfully" . -S || true
rg -n "initSevenZipFromPlatformJAR" . -S || true
Repository: rocketride-org/rocketride-server
Length of output: 328
Avoid catching Throwable during 7-Zip-JBinding initialization in TikaApi.init() (TikaApi.java:326)
Catching Throwable suppresses JVM-fatal Errors (e.g., OutOfMemoryError) and still allows initialization to proceed, leaving the subsystem potentially in an unsafe state.
Suggested fix
- } catch (Throwable t) {
+ } catch (Exception | UnsatisfiedLinkError t) {
logger.log(Level.WARNING, "Failed to initialize 7-Zip-JBinding; RAR parsing will fail", t);
}
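The effect of narrowing the catch can be shown without the native library. In this sketch `NativeInit` and `Initializer` are hypothetical names; the `Initializer` argument stands in for the call to SevenZip.initSevenZipFromPlatformJAR(), and the flag corresponds to the readiness getter proposed in the first comment.

```java
// Hypothetical wrapper around a native-library initializer.
final class NativeInit {
    // Functional interface that allows checked exceptions, as native
    // initializers typically declare them.
    interface Initializer {
        void run() throws Exception;
    }

    private static volatile boolean ready;

    private NativeInit() {}

    static boolean isReady() {
        return ready;
    }

    static boolean tryInit(Initializer initializer) {
        try {
            initializer.run();
            ready = true;
        } catch (Exception | UnsatisfiedLinkError e) {
            // Recoverable: the library is missing or failed to load.
            ready = false;
        }
        // Any other Error (for example OutOfMemoryError) is not caught here
        // and propagates to the caller.
        return ready;
    }
}
```

With this shape, a failed load leaves the flag false so the configuration step can keep the upstream parser, while JVM-fatal errors still reach the caller instead of being logged and ignored.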
Tika's junrar-backed RarParser throws on RAR v5 (the default WinRAR format since 2013). Add net.sf.sevenzipjbinding plus a RarSevenZipParser that auto-detects RAR4/RAR5 and routes entries through the configured EmbeddedDocumentExtractor; ConfigBuilder swaps the parser and TikaApi initializes the native binding once per JVM. Fixes rocketride-org#1163
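The commit message says the new parser auto-detects RAR4/RAR5 but does not show how. As an illustration only, the two formats can be told apart by their published leading signature bytes; `RarSignature` below is a hypothetical helper and is not claimed to be what the PR or 7-Zip-JBinding does internally. It also assumes the signature sits at offset 0, which does not hold for self-extracting archives.

```java
import java.util.Arrays;

// Hypothetical helper: classifies a buffer by its leading RAR signature.
final class RarSignature {
    enum Version { RAR4, RAR5, NOT_RAR }

    // "Rar!" 0x1A 0x07 0x00 marks RAR 1.5 through 4.x archives.
    private static final byte[] RAR4 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00};
    // "Rar!" 0x1A 0x07 0x01 0x00 marks RAR 5 archives.
    private static final byte[] RAR5 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00};

    private RarSignature() {}

    static Version detect(byte[] header) {
        if (startsWith(header, RAR5)) {
            return Version.RAR5;
        }
        if (startsWith(header, RAR4)) {
            return Version.RAR4;
        }
        return Version.NOT_RAR;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```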
c3665c8 to 4caed72
Summary
- Tika's junrar-backed RarParser throws on RAR v5 (the default WinRAR format since 2013), so modern RAR files fail to parse.
- Add net.sf.sevenzipjbinding plus a RarSevenZipParser that auto-detects RAR4/RAR5 and routes entries through the configured EmbeddedDocumentExtractor; ConfigBuilder swaps the parser and TikaApi initializes the native binding once per JVM.
Testing
- ./builder test: relying on GitHub Actions; not runnable in the contributor's local shell (engine build / Maven / torch unavailable). Static checks (compile, no conflict markers) pass.
Linked Issue
Fixes #1163