Skip to content

Refactor library shape#159

Draft
markpbaggett wants to merge 8 commits into
mainfrom
refactor-library-shape
Draft

Refactor library shape#159
markpbaggett wants to merge 8 commits into
mainfrom
refactor-library-shape

Conversation

@markpbaggett

Copy link
Copy Markdown
Collaborator

No description provided.

response.raise_for_status() # Raise an error for bad status codes

# Parse the XML content
djfu = ET.fromstring(response.content)

Check failure

Code scanning / CodeQL

XML internal entity expansion High

XML parsing depends on a
user-provided value
without guarding against uncontrolled entity expansion.
XML parsing depends on a
user-provided value
without guarding against uncontrolled entity expansion.

Copilot Autofix

AI 4 months ago

In general, to fix XML internal entity expansion issues in Python, avoid parsing untrusted XML with xml.etree.ElementTree or other parsers that expand entities by default. Instead, use a hardened XML library such as defusedxml, which is API‑compatible with xml.etree.ElementTree but disables dangerous constructs (including entity expansion) and mitigates XML bombs and XXE by design.

For this specific case, the best fix with minimal functional change is:

  1. Replace the import import xml.etree.ElementTree as ET with import defusedxml.ElementTree as ET in iiify/resolver/__init__.py. defusedxml.ElementTree provides fromstring, findall, etc., with the same interface, so no other code changes are required in create_annotations.
  2. Keep the rest of the logic intact: still call ET.fromstring(response.content) and then use .findall on the parsed tree, but now via the hardened parser.

Concrete changes:

  • File iiify/resolver/__init__.py:
    • At the top of the file, change the xml.etree.ElementTree import to defusedxml.ElementTree aliased as ET.
    • No changes to the create_annotations function body are strictly required, since the API is compatible. The existing except ET.ParseError will continue to work with defusedxml.ElementTree.ParseError.

No other files need editing for this fix.

Suggested changeset 2
iiify/resolver/__init__.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/iiify/resolver/__init__.py b/iiify/resolver/__init__.py
--- a/iiify/resolver/__init__.py
+++ b/iiify/resolver/__init__.py
@@ -3,7 +3,7 @@
 import os
 import re
 import requests
-import xml.etree.ElementTree as ET
+import defusedxml.ElementTree as ET
 from datetime import timedelta
 from urllib.parse import quote
 
EOF
@@ -3,7 +3,7 @@
import os
import re
import requests
import xml.etree.ElementTree as ET
import defusedxml.ElementTree as ET
from datetime import timedelta
from urllib.parse import quote

requirements.txt
Outside changed files

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/requirements.txt b/requirements.txt
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,5 @@
 beautifulsoup4==4.13.3
+defusedxml==0.7.1
 blinker==1.9.0
 cachelib==0.9.0
 certifi==2024.8.30
EOF
@@ -1,4 +1,5 @@
beautifulsoup4==4.13.3
defusedxml==0.7.1
blinker==1.9.0
cachelib==0.9.0
certifi==2024.8.30
This fix introduces these dependencies
Package Version Security advisories
defusedxml (pypi) 0.7.1 None
Copilot is powered by AI and may make mistakes. Always verify output.

# Parse the XML content
djfu = ET.fromstring(response.content)
page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]

Check failure

Code scanning / CodeQL

XPath query built from user-controlled sources Critical

XPath expression depends on a
user-provided value
.

Copilot Autofix

AI 4 months ago

In general, to fix XPath-injection issues you should avoid interpolating user-controlled data directly into XPath strings. Either (1) use parameterized/variable-based APIs if available, or (2) keep the XPath static and apply any indexing or filtering on the returned nodes in application code, after validating or constraining user input.

Here, we don’t actually need canvas_no to be part of the XPath. We want “the N‑th <OBJECT> element”, which can be achieved by selecting all OBJECT nodes with a static XPath .//OBJECT and then indexing the Python list with canvas_no - 1. That keeps the XPath expression constant and removes the tainted value from the query, while preserving the current behavior. Concretely, in iiify/resolver/__init__.py inside create_annotations, change:

page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]

to:

objects = djfu.findall(".//OBJECT")
page_index = canvas_no - 1
page = objects[page_index]

Optionally, we can add a small bounds check on page_index to avoid IndexError and raise a controlled error if the page doesn’t exist; this doesn’t change the “successful” behavior but makes error handling more explicit and robust. No new imports are needed; we only adjust how we select the page element.

Suggested changeset 1
iiify/resolver/__init__.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/iiify/resolver/__init__.py b/iiify/resolver/__init__.py
--- a/iiify/resolver/__init__.py
+++ b/iiify/resolver/__init__.py
@@ -121,7 +121,11 @@
 
         # Parse the XML content
         djfu = ET.fromstring(response.content)
-        page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]
+        objects = djfu.findall(".//OBJECT")
+        page_index = canvas_no - 1
+        if page_index < 0 or page_index >= len(objects):
+            raise ValueError(f"Requested canvas number {canvas_no} is out of range")
+        page = objects[page_index]
         words = page.findall(".//WORD")
         count = 1
         for word in words:
EOF
@@ -121,7 +121,11 @@

# Parse the XML content
djfu = ET.fromstring(response.content)
page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]
objects = djfu.findall(".//OBJECT")
page_index = canvas_no - 1
if page_index < 0 or page_index >= len(objects):
raise ValueError(f"Requested canvas number {canvas_no} is out of range")
page = objects[page_index]
words = page.findall(".//WORD")
count = 1
for word in words:
Copilot is powered by AI and may make mistakes. Always verify output.
Comment thread iiify/resolver/utils.py
# ...
raise ValueError("This resource has restricted access")

if not os.path.exists(path):

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a
user-provided value
.
This path depends on a
user-provided value
.

Copilot Autofix

AI 4 months ago

In general, the fix is to ensure that any filesystem path derived from untrusted input is normalized and then verified to be within an intended root directory. For this code, we should treat media_root as the safe root. We should construct path based on the normalized identifier (and any embedded filepath component), using os.path.normpath and then verifying that the final path is inside media_root. If the check fails, we should reject the request with an error instead of reading/writing that path.

Concretely, in ia_resolver in iiify/resolver/utils.py:

  • Move the construction of path until after we’ve split identifier and filepath.
  • Add normalization for identifier (the part before $) and filepath (the part after $, if present). Strip any leading path separators from filepath so an attacker cannot inject an absolute path.
  • Rebuild path using os.path.join(media_root, clean_identifier_with_suffix) and normalize that full path with os.path.normpath.
  • Check that path is still within media_root using a robust prefix check (for example, via os.path.commonpath), raising ValueError if not.
  • Leave the rest of the logic (metadata lookup, download, file writing, and return value) unchanged so existing functionality continues to work.

We only need changes in iiify/resolver/utils.py. No new imports are necessary beyond os, which is already imported.

Suggested changeset 1
iiify/resolver/utils.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/iiify/resolver/utils.py b/iiify/resolver/utils.py
--- a/iiify/resolver/utils.py
+++ b/iiify/resolver/utils.py
@@ -125,8 +125,6 @@
     opposed to a remote storage host, like Internet Archive) and first
     fetches it, if it doesn't exist on disk..
     """
-    path = os.path.join(media_root, identifier)
-
     leaf = None
     if "$" not in identifier:
         filepath = None
@@ -136,6 +134,20 @@
         if os.sep not in filepath:
             leaf = filepath
 
+    # Normalize and constrain identifier part to stay under media_root
+    identifier = os.path.normpath(identifier)
+    # Disallow attempts to traverse upwards from the base
+    if identifier.startswith(os.pardir + os.sep) or identifier == os.pardir:
+        raise ValueError("Invalid identifier path")
+
+    # Build the on-disk path, including any document/leaf suffix encoded in identifier
+    path = os.path.normpath(os.path.join(media_root, identifier))
+
+    # Ensure the resolved path is within media_root
+    media_root_norm = os.path.normpath(media_root)
+    if os.path.commonpath([media_root_norm, path]) != media_root_norm:
+        raise ValueError("Invalid path outside media_root")
+
     identifier, document = identifier.split(":", 1) if ":" in identifier else (identifier, None)
 
     metadata = requests.get('%s/metadata/%s' % (ARCHIVE, identifier)).json()
EOF
@@ -125,8 +125,6 @@
opposed to a remote storage host, like Internet Archive) and first
fetches it, if it doesn't exist on disk..
"""
path = os.path.join(media_root, identifier)

leaf = None
if "$" not in identifier:
filepath = None
@@ -136,6 +134,20 @@
if os.sep not in filepath:
leaf = filepath

# Normalize and constrain identifier part to stay under media_root
identifier = os.path.normpath(identifier)
# Disallow attempts to traverse upwards from the base
if identifier.startswith(os.pardir + os.sep) or identifier == os.pardir:
raise ValueError("Invalid identifier path")

# Build the on-disk path, including any document/leaf suffix encoded in identifier
path = os.path.normpath(os.path.join(media_root, identifier))

# Ensure the resolved path is within media_root
media_root_norm = os.path.normpath(media_root)
if os.path.commonpath([media_root_norm, path]) != media_root_norm:
raise ValueError("Invalid path outside media_root")

identifier, document = identifier.split(":", 1) if ":" in identifier else (identifier, None)

metadata = requests.get('%s/metadata/%s' % (ARCHIVE, identifier)).json()
Copilot is powered by AI and may make mistakes. Always verify output.
Comment thread iiify/resolver/utils.py
url = '%s/download/%s/page/leaf%s' % (ARCHIVE, identifierpath, leaf)
r = requests.get(url)
if r:
with open(path, 'wb') as rc:

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a
user-provided value
.
This path depends on a
user-provided value
.

Copilot Autofix

AI 4 months ago

In general, to fix this type of issue we must ensure that any filesystem path derived from untrusted input is constrained to a safe root directory. The usual approach is to normalize the candidate path (for example with os.path.normpath), then verify that the normalized absolute path is inside a configured base directory (here media_root); if not, reject the request. This prevents directory traversal via .., absolute paths, or embedded separators. When the identifier contains an embedded subpath (filepath) that is used for remote fetching only, we should still ensure we aren’t accidentally creating unsafe local paths from it.

For this specific code, the simplest robust fix without changing behavior is:

  • Compute the cache file path in a safe way, based solely on a sanitized version of identifier.
    • Use os.path.abspath + os.path.join + os.path.normpath to derive an absolute path based on media_root and identifier.
    • Ensure that this absolute path starts with the absolute media_root path plus a path separator boundary, or is exactly equal to media_root. If the check fails, raise a ValueError (or similar) so the caller returns 404, matching existing error handling.
  • Do this before os.path.exists(path) and before open(path, 'wb').

We should not alter how identifier and filepath are used in the remote URL (itempath and identifierpath), as that would affect functional behavior with Archive.org; the vulnerability we care about is the local path used in open. Also, we should not assume anything new about project structure beyond using the standard library (os), which is already imported in this module. The changes are confined to iiify/resolver/utils.py around the computation of path inside ia_resolver.

Concretely:

  • In ia_resolver, replace the single line path = os.path.join(media_root, identifier) with:
    • Computation of safe_root = os.path.abspath(media_root)
    • Computation of candidate_path = os.path.abspath(os.path.normpath(os.path.join(safe_root, identifier)))
    • A containment check; if candidate_path is not under safe_root, raise ValueError.
    • Assign path = candidate_path.
  • No new imports or helpers are required; os is already imported.
  • iiify/app.py does not need changes for this specific path-based issue.
Suggested changeset 1
iiify/resolver/utils.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/iiify/resolver/utils.py b/iiify/resolver/utils.py
--- a/iiify/resolver/utils.py
+++ b/iiify/resolver/utils.py
@@ -125,7 +125,12 @@
     opposed to a remote storage host, like Internet Archive) and first
     fetches it, if it doesn't exist on disk..
     """
-    path = os.path.join(media_root, identifier)
+    # Construct a cache path under media_root and ensure it cannot escape
+    safe_root = os.path.abspath(media_root)
+    candidate_path = os.path.abspath(os.path.normpath(os.path.join(safe_root, identifier)))
+    if not (candidate_path == safe_root or candidate_path.startswith(safe_root + os.sep)):
+        raise ValueError("Invalid identifier path")
+    path = candidate_path
 
     leaf = None
     if "$" not in identifier:
EOF
@@ -125,7 +125,12 @@
opposed to a remote storage host, like Internet Archive) and first
fetches it, if it doesn't exist on disk..
"""
path = os.path.join(media_root, identifier)
# Construct a cache path under media_root and ensure it cannot escape
safe_root = os.path.abspath(media_root)
candidate_path = os.path.abspath(os.path.normpath(os.path.join(safe_root, identifier)))
if not (candidate_path == safe_root or candidate_path.startswith(safe_root + os.sep)):
raise ValueError("Invalid identifier path")
path = candidate_path

leaf = None
if "$" not in identifier:
Copilot is powered by AI and may make mistakes. Always verify output.
@glenrobson

Copy link
Copy Markdown
Collaborator

Just noting down some comments before I forget them.

  1. What do we do with mixed content manifests e.g. https://archive.org/download/2025-highland-house-walkthrough-ma which has a "mediatype": "movies" but contains both images and video
  2. What do we do with v2 manifest generation alongside v3? Note v4 is coming soon too.

I wonder if we should have a manifest 'class' which builds the manifest details like metadata, seeAlso etc which are common for all and then have 'container' classes for image, audio and video and maybe texts which handle creating the canvases....

We would need some sort of controller or builder class to bring the manifest and canvas classes together...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants