@iachimoe

A previous change for the same JIRA sets INTERNAL_PATH from the name stored in a gzip file's metadata. However, many gzip files don't carry this name, and other compression formats such as bz2 won't have it either. This PR does two things: (1) it takes RESOURCE_NAME from the gzip metadata when a name is present (a change from existing behaviour), and (2) when no name is available, either because the gzip file lacks one or because another format such as bzip2 is used, it sets INTERNAL_PATH to the same value as RESOURCE_NAME.
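To make the two cases concrete, here is a minimal sketch in plain Java. It is not the code in this PR and does not use Tika's API. `readStoredName` parses the optional FNAME field of a gzip header as laid out in RFC 1952, and `describe` applies the fallback described above. The class name, the method names, the `fallbackName` parameter, and the plain string keys `"RESOURCE_NAME"` and `"INTERNAL_PATH"` are all assumptions chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of Tika.
final class GzipNameSketch {
    private static final int FEXTRA = 0x04; // header flag: extra field present
    private static final int FNAME = 0x08;  // header flag: original file name present

    /** Returns the file name stored in a gzip header, or empty if there is none. */
    static Optional<String> readStoredName(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] fixed = new byte[10]; // ID1, ID2, CM, FLG, MTIME(4), XFL, OS
        data.readFully(fixed);
        if ((fixed[0] & 0xff) != 0x1f || (fixed[1] & 0xff) != 0x8b) {
            return Optional.empty(); // not gzip (for example bzip2): no stored name
        }
        int flags = fixed[3] & 0xff;
        if ((flags & FEXTRA) != 0) {
            // XLEN is little-endian; skip the extra field that precedes the name.
            int xlen = data.readUnsignedByte() | (data.readUnsignedByte() << 8);
            data.readFully(new byte[xlen]);
        }
        if ((flags & FNAME) == 0) {
            return Optional.empty(); // valid gzip, but written without a name
        }
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        int b;
        while ((b = data.read()) > 0) { // the name is zero-terminated
            name.write(b);
        }
        if (b < 0) {
            throw new EOFException("gzip header ended inside the file name");
        }
        return Optional.of(new String(name.toByteArray(), StandardCharsets.ISO_8859_1));
    }

    /**
     * Applies the behaviour described in the PR text. fallbackName stands in for
     * whatever name the caller would otherwise use (an assumption of this sketch).
     */
    static Map<String, String> describe(Optional<String> storedName, String fallbackName) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String resourceName = storedName.orElse(fallbackName);
        metadata.put("RESOURCE_NAME", resourceName);
        // With a stored name this is the stored name; without one it mirrors RESOURCE_NAME.
        metadata.put("INTERNAL_PATH", storedName.orElse(resourceName));
        return metadata;
    }
}
```

Under these assumptions, a stream produced by `java.util.zip.GZIPOutputStream` yields an empty result from `readStoredName`, because that class writes a header with no FNAME field, so `describe` falls back to the caller-supplied name for both keys.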

public void testTarballWithoutGzipNameMetadata() throws Exception {
    List<Metadata> list = getRecursiveMetadata("test-documents-no-name-metadata.tgz");
    Metadata last = list.get(list.size() - 1);
    String internalPath = last.get(TikaCoreProperties.INTERNAL_PATH);
Contributor


The point of internal_path is to store what the file contained about the internal path of a resource. This metadata field should tell the user "this was the path that was literally stored in the container file. Tika did no guesswork here".
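One possible reading of this comment, sketched in plain Java rather than with Tika's API: the internal path is recorded only when the container literally stored a name, and is left unset otherwise. The class name, method name, `fallbackName` parameter, and the string keys are assumptions made for illustration, not names taken from the codebase.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of Tika.
final class StrictInternalPathSketch {
    /**
     * storedName is the name found inside the container, if any.
     * fallbackName is whatever the caller would otherwise use (an assumption).
     */
    static Map<String, String> describe(Optional<String> storedName, String fallbackName) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("RESOURCE_NAME", storedName.orElse(fallbackName));
        // No guesswork: the key is absent unless the container stored a name.
        storedName.ifPresent(name -> metadata.put("INTERNAL_PATH", name));
        return metadata;
    }
}
```

With this shape, a consumer can tell "the container stored this path" apart from "no path was stored" by checking whether the key is present.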
