Problem
FileSignatureDetector.java contains a hardcoded Map<String, String> MIME_LABELS with 47 entries mapping Tika MIME type strings to short badge labels (e.g. "application/pdf" → "PDF", "video/x-matroska" → "MKV"). Any new file type detected by Tika but not in this map falls through to a subtype-parsing fallback.
The fallback logic is already quite good:
    String subtype = mime.substring(mime.indexOf('/') + 1);
    subtype = subtype.replaceAll("^(x-|vnd\\.)", "").toUpperCase();
    label = subtype.length() <= 12 ? subtype : null;
However, the explicit map overrides are still needed for cases where the subtype is not a usable label, either because it is too long or because it is not the name users expect (e.g. "vnd.openxmlformats-officedocument.wordprocessingml.document" → "DOCX").
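To make the limits of the fallback concrete, here is a minimal sketch that runs the same three steps on the example types mentioned above. The class name FallbackLabelDemo and the method name fallbackLabel are hypothetical and exist only for this illustration. The one deliberate difference from the quoted snippet is Locale.ROOT, added so the upper-casing does not depend on the JVM's default locale.

```java
import java.util.Locale;

final class FallbackLabelDemo {

    // Mirrors the fallback quoted above: take the part after '/', strip one
    // leading "x-" or "vnd." prefix, upper-case it, and return null when the
    // result is longer than 12 characters.
    static String fallbackLabel(String mime) {
        String subtype = mime.substring(mime.indexOf('/') + 1);
        subtype = subtype.replaceAll("^(x-|vnd\\.)", "").toUpperCase(Locale.ROOT);
        return subtype.length() <= 12 ? subtype : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "application/pdf",
            "video/x-matroska",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
        };
        for (String mime : samples) {
            System.out.println(mime + " -> " + fallbackLabel(mime));
        }
    }
}
```

With these inputs the fallback alone yields PDF for application/pdf, MATROSKA rather than MKV for video/x-matroska, and null for the Word document type, which is why the explicit overrides exist today.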
Proposed Solution
Replace the map with a two-step approach using Tika's own registry:
- Try MimeType.getExtension() from Tika's MimeTypes registry: the primary extension is typically the canonical short label (.pdf → "PDF", .docx → "DOCX", .mkv → "MKV").
- Fall back to the subtype-stripping logic already in place for any type Tika does not have an extension for.
    import java.util.Locale;
    import org.apache.tika.mime.MimeType;
    import org.apache.tika.mime.MimeTypeException;
    import org.apache.tika.mime.MimeTypes;

    private static final MimeTypes TIKA_MIME_REPO = MimeTypes.getDefaultMimeTypes();

    public static String labelFromMime(String mime) {
        // Try Tika's registered extension first
        try {
            MimeType mt = TIKA_MIME_REPO.forName(mime);
            String ext = mt.getExtension(); // e.g. ".pdf", ".docx"; empty if none is registered
            if (ext != null && !ext.isEmpty()) {
                return ext.substring(1).toUpperCase(Locale.ROOT); // strip leading dot
            }
        } catch (MimeTypeException ignored) {
            // Not a valid MIME type name for Tika; fall through to the subtype fallback
        }
        // Fallback: derive from subtype string
        String subtype = mime.contains("/") ? mime.substring(mime.indexOf('/') + 1) : mime;
        subtype = subtype.replaceAll("^(x-|vnd\\.)", "").toUpperCase(Locale.ROOT);
        return subtype.length() <= 12 ? subtype : null;
    }
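This approach works for any MIME type that has an extension registered in Tika, not just the 47 currently listed. Types without a registered extension still go through the subtype fallback.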
Note on quality
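Tika's getExtension() returns the preferred file extension registered for each type, so application/vnd.openxmlformats-officedocument.wordprocessingml.document returns .docx. For most audio/video/image types the extension matches the common badge name, but the preferred extension is not guaranteed to be the most familiar one for every type, so the labels for the 47 existing entries should be compared with the new output before the map is removed. Where they agree, the explicit map no longer needs to be maintained.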
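One way to check the quality claim before deleting the map is to diff the old labels against the new function. The sketch below is only an illustration: LabelRegressionCheck, diff, and the stub labeler are hypothetical names, and the stub stands in for labelFromMime so the example runs without Tika on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

final class LabelRegressionCheck {

    // Returns every MIME type whose new label differs from the old one,
    // with the change formatted as "old -> new" for review.
    static Map<String, String> diff(Map<String, String> oldLabels,
                                    Function<String, String> newLabeler) {
        Map<String, String> changed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : oldLabels.entrySet()) {
            String fresh = newLabeler.apply(e.getKey());
            if (!Objects.equals(e.getValue(), fresh)) {
                changed.put(e.getKey(), e.getValue() + " -> " + fresh);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Two entries taken from the examples in this issue.
        Map<String, String> oldLabels = new LinkedHashMap<>();
        oldLabels.put("application/pdf", "PDF");
        oldLabels.put("video/x-matroska", "MKV");

        // Hypothetical stand-in labeler; the real check would pass
        // FileSignatureDetector::labelFromMime instead.
        Function<String, String> stub = mime ->
                mime.equals("application/pdf") ? "PDF" : "MATROSKA";

        diff(oldLabels, stub).forEach((mime, change) ->
                System.out.println(mime + ": " + change));
    }
}
```

Running the same diff over the full MIME_LABELS map with the real labelFromMime would list exactly the entries whose badge text changes, which can then be accepted or kept as a small override set.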
Files to Change
backend/src/main/java/com/tracepcap/analysis/service/FileSignatureDetector.java
Acceptance Criteria
- MIME_LABELS static map removed
- MimeType.getExtension() with subtype fallback