Replace MIME_LABELS map with Tika-derived display strings in FileSignatureDetector #153

@NotYuSheng

Problem

FileSignatureDetector.java contains a hardcoded Map<String, String> MIME_LABELS with 47 entries that map Tika MIME type strings to short badge labels (e.g. "application/pdf" → "PDF", "video/x-matroska" → "MKV"). Any file type that Tika detects but that is missing from this map falls through to a subtype-parsing fallback.

The fallback already handles short, simple subtypes well:

String subtype = mime.substring(mime.indexOf('/') + 1);
subtype = subtype.replaceAll("^(x-|vnd\\.)", "").toUpperCase();
label = subtype.length() <= 12 ? subtype : null;

However, the explicit map overrides are still needed where the subtype is not a usable label (e.g. "vnd.openxmlformats-officedocument.wordprocessingml.document" → "DOCX").
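
To make the limits of the fallback concrete, here is a minimal standalone sketch that applies the same three steps to the MIME strings mentioned in this issue. The class name SubtypeFallbackDemo and the method name fallbackLabel are hypothetical, and Locale.ROOT is added here only so that uppercasing does not depend on the JVM's default locale.

```java
import java.util.Locale;

// Hypothetical demo class, not part of the codebase.
class SubtypeFallbackDemo {

    // Mirrors the fallback quoted above: take the subtype, strip a leading
    // "x-" or "vnd.", uppercase it, and reject anything longer than 12 chars.
    static String fallbackLabel(String mime) {
        String subtype = mime.substring(mime.indexOf('/') + 1);
        subtype = subtype.replaceAll("^(x-|vnd\\.)", "").toUpperCase(Locale.ROOT);
        return subtype.length() <= 12 ? subtype : null;
    }

    public static void main(String[] args) {
        // Short subtype: already a usable label.
        System.out.println(fallbackLabel("application/pdf"));
        // Prefix is stripped, but the result is "MATROSKA", not the "MKV" badge.
        System.out.println(fallbackLabel("video/x-matroska"));
        // Far longer than 12 characters, so the fallback yields null.
        System.out.println(fallbackLabel(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"));
    }
}
```

The second and third cases are the ones the explicit map currently has to cover.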

Proposed Solution

Replace the map with a two-step approach using Tika's own registry:

  1. Try MimeType.getExtension() from Tika's MimeTypes registry. The primary extension is typically the canonical short label (.pdf → "PDF", .docx → "DOCX", .mkv → "MKV").
  2. Fall back to the subtype-stripping logic already in place for any type for which Tika has no registered extension.

// Imports needed in FileSignatureDetector.java
import java.util.Locale;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.mime.MimeTypes;

private static final MimeTypes TIKA_MIME_REPO = MimeTypes.getDefaultMimeTypes();

public static String labelFromMime(String mime) {
    // Try Tika's registered extension first
    try {
        MimeType mt = TIKA_MIME_REPO.forName(mime);
        String ext = mt.getExtension(); // e.g. ".pdf", ".docx"; empty if none is registered
        if (ext != null && !ext.isEmpty()) {
            return ext.substring(1).toUpperCase(Locale.ROOT); // strip leading dot
        }
    } catch (MimeTypeException ignored) {
        // Invalid MIME type name: fall through to the subtype fallback
    }

    // Fallback: derive from subtype string
    String subtype = mime.contains("/") ? mime.substring(mime.indexOf('/') + 1) : mime;
    subtype = subtype.replaceAll("^(x-|vnd\\.)", "").toUpperCase(Locale.ROOT);
    return subtype.length() <= 12 ? subtype : null; // null if too long to be a usable label
}

This approach works for any MIME type for which Tika has a registered extension, not just the 47 currently listed.

Note on quality

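Tika's getExtension() returns the preferred file extension registered for each type, so application/vnd.openxmlformats-officedocument.wordprocessingml.document returns .docx. For audio, video, and image types the extension usually matches the common badge name, but because it is whichever extension Tika lists first for the type, the known types in the acceptance criteria should be checked against the current map. Where the labels match, the explicit map no longer needs to be maintained.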

Files to Change

  • backend/src/main/java/com/tracepcap/analysis/service/FileSignatureDetector.java

Acceptance Criteria

  • MIME_LABELS static map removed
  • Label derived from MimeType.getExtension() with subtype fallback
  • Existing known types (PDF, ZIP, DOCX, MP3, MKV, ELF) produce the same badge labels
  • Previously unlabelled Tika types now receive a label where Tika has an extension
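
The "same badge labels" criterion could be checked with a small comparison helper along these lines. This is only a sketch: BadgeLabelRegressionCheck and mismatches are hypothetical names, the expected pairs are the three spelled out in this issue, and the labeler is passed in as a function so the example runs without Tika on the classpath. In a real test the stand-in would be replaced by a reference to labelFromMime.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical helper, not part of the codebase.
class BadgeLabelRegressionCheck {

    // Returns one message per MIME type whose label differs from the expected one.
    static List<String> mismatches(Map<String, String> expected,
                                   Function<String, String> labeler) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String actual = labeler.apply(e.getKey());
            if (!Objects.equals(e.getValue(), actual)) {
                out.add(e.getKey() + ": expected " + e.getValue() + ", got " + actual);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Expected labels taken from the examples in this issue.
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("application/pdf", "PDF");
        expected.put("video/x-matroska", "MKV");
        expected.put(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "DOCX");

        // Stand-in labeler (uppercased subtype only); swap in the real method to test it.
        Function<String, String> standIn =
            mime -> mime.substring(mime.indexOf('/') + 1).toUpperCase(Locale.ROOT);

        mismatches(expected, standIn).forEach(System.out::println);
    }
}
```

An empty result would mean every listed type keeps its current label; extending the expected map with the remaining entries of MIME_LABELS before removing it would cover the rest.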

Labels

enhancement