Refactor library shape by markpbaggett · Pull Request #159 · internetarchive/iiif

markpbaggett · 2026-03-05T22:00:26Z

No description provided.

Not sure I like it this way. Should we do these as props?

I considered doing these as static methods on the base class but I think it's easier to have tests and such if we do it this way.

+        response.raise_for_status()  # Raise an error for bad status codes
+
+        # Parse the XML content
+        djfu = ET.fromstring(response.content)


In general, to fix XML internal entity expansion issues in Python, avoid parsing untrusted XML with xml.etree.ElementTree or other parsers that expand entities by default. Instead, use a hardened XML library such as defusedxml, whose defusedxml.ElementTree module mirrors the parsing functions of xml.etree.ElementTree but rejects dangerous constructs (including entity declarations), mitigating XML bombs and XXE by design.

For this specific case, the best fix with minimal functional change is:

Replace the statement import xml.etree.ElementTree as ET with import defusedxml.ElementTree as ET in iiify/resolver/__init__.py. defusedxml.ElementTree provides fromstring with the same interface, and the elements it returns are ordinary ElementTree elements that support findall, so no other code changes are required in create_annotations.

Keep the rest of the logic intact: still call ET.fromstring(response.content) and then use .findall on the parsed tree, but now via the hardened parser.

Concrete changes:

File iiify/resolver/__init__.py:

At the top of the file, change the xml.etree.ElementTree import to defusedxml.ElementTree aliased as ET.

No changes to the create_annotations function body are strictly required, since the API is compatible. The existing except ET.ParseError will continue to work with defusedxml.ElementTree.ParseError.

No other source files need editing for this fix, though defusedxml has to be added to the project's dependencies.
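As a rough illustration of the swap described above, the sketch below parses a small payload through the ET alias and handles malformed input with the same except ET.ParseError pattern. The helper name parse_djvu_xml and the sample payload are hypothetical, and the fallback import exists only so the snippet runs where defusedxml is not installed; the fix itself relies on defusedxml being present.

```python
try:
    # Hardened parser: same fromstring() entry point, but entity
    # declarations are rejected instead of expanded.
    import defusedxml.ElementTree as ET
except ImportError:
    # Fallback for illustration only; this is the parser the fix replaces.
    import xml.etree.ElementTree as ET


def parse_djvu_xml(content: bytes):
    """Return the parsed root element, or None if the XML is malformed.

    Hypothetical helper; not part of the PR.
    """
    try:
        return ET.fromstring(content)
    except ET.ParseError:
        return None


sample = b"<DjVuXML><BODY><OBJECT><WORD>hello</WORD></OBJECT></BODY></DjVuXML>"
root = parse_djvu_xml(sample)
words = [w.text for w in root.findall(".//WORD")]

# Truncated XML goes through the ParseError branch.
broken = parse_djvu_xml(b"<DjVuXML><BODY>")
```

With defusedxml installed, a payload that declares entities raises an exception derived from ValueError instead of being expanded, so callers may want to handle that case as well.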

+
+        # Parse the XML content
+        djfu = ET.fromstring(response.content)
+        page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]


In general, to fix XPath-injection issues you should avoid interpolating user-controlled data directly into XPath strings. Either (1) use parameterized/variable-based APIs if available, or (2) keep the XPath static and apply any indexing or filtering on the returned nodes in application code, after validating or constraining user input.

Here, we don’t actually need canvas_no to be part of the XPath. We want “the N‑th <OBJECT> element”, which can be achieved by selecting all OBJECT nodes with a static XPath .//OBJECT and then indexing the Python list with canvas_no - 1. That keeps the XPath expression constant and removes the tainted value from the query. The result matches the current behavior as long as every OBJECT element shares one parent, because the positional predicate in .//OBJECT[N] counts position among siblings rather than across the whole document. Concretely, in iiify/resolver/__init__.py inside create_annotations, change:

page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]

to:

objects = djfu.findall(".//OBJECT")
page_index = canvas_no - 1
page = objects[page_index]

Optionally, we can add a small bounds check on page_index to avoid IndexError and raise a controlled error if the page doesn’t exist; this doesn’t change the “successful” behavior but makes error handling more explicit and robust. No new imports are needed; we only adjust how we select the page element.
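A small sketch of the static-XPath approach with the optional bounds check, run against a made-up document that has OBJECT elements under two parents. The sample XML and the helper name select_page are assumptions for illustration; the standard-library parser is used here only because the input is a trusted literal.

```python
import xml.etree.ElementTree as ET  # trusted literal below, not untrusted input

# Hypothetical document: OBJECT elements spread across two parents.
DOC = """<DjVuXML>
  <BODY><OBJECT id="a"/><OBJECT id="b"/></BODY>
  <BODY><OBJECT id="c"/><OBJECT id="d"/></BODY>
</DjVuXML>"""


def select_page(root, canvas_no):
    """Return the canvas_no-th OBJECT in document order (1-based)."""
    objects = root.findall(".//OBJECT")  # static XPath, nothing interpolated
    page_index = int(canvas_no) - 1
    if page_index < 0 or page_index >= len(objects):
        raise ValueError(f"Requested canvas number {canvas_no} is out of range")
    return objects[page_index]


root = ET.fromstring(DOC)

# Indexing the flat list counts across the whole document.
by_index = select_page(root, 3).get("id")

# The positional predicate counts among siblings under each parent.
by_predicate = [o.get("id") for o in root.findall(".//OBJECT[2]")]
```

On this sample the two approaches disagree once the index passes the first parent's children, which is worth checking against real DjVu XML before relying on identical behavior.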

+            # ...
+            raise ValueError("This resource has restricted access")
+
+    if not os.path.exists(path):


In general, the fix is to ensure that any filesystem path derived from untrusted input is normalized and then verified to be within an intended root directory. For this code, we should treat media_root as the safe root. We should construct path based on the normalized identifier (and any embedded filepath component), using os.path.normpath and then verifying that the final path is inside media_root. If the check fails, we should reject the request with an error instead of reading/writing that path.

Concretely, in ia_resolver in iiify/resolver/utils.py:

Move the construction of path until after we’ve split identifier and filepath.

Add normalization for identifier (the part before $) and filepath (the part after $, if present). Strip any leading path separators from filepath so an attacker cannot inject an absolute path.

Rebuild path using os.path.join(media_root, clean_identifier_with_suffix) and normalize that full path with os.path.normpath.

Check that path is still within media_root using a robust prefix check (for example, via os.path.commonpath), raising ValueError if not.

Leave the rest of the logic (metadata lookup, download, file writing, and return value) unchanged so existing functionality continues to work.

We only need changes in iiify/resolver/utils.py. No new imports are necessary beyond os, which is already imported.
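The steps above could look roughly like the following. This is a sketch under assumptions, not the code in the PR: the function name resolve_cache_path is hypothetical, and it only covers path construction, not the metadata lookup or download.

```python
import os


def resolve_cache_path(media_root: str, identifier: str) -> str:
    """Map an identifier (optionally 'item$file/path') to a path under media_root.

    Hypothetical helper illustrating the normalize-then-verify pattern.
    """
    root = os.path.abspath(media_root)
    item, dollar, filepath = identifier.partition("$")
    # Strip leading separators so the file part cannot be absolute.
    filepath = filepath.lstrip("/" + os.sep)
    cleaned = item + dollar + filepath
    # abspath() also normalizes, collapsing any ".." segments.
    candidate = os.path.abspath(os.path.join(root, cleaned))
    # commonpath compares whole path components, not raw string prefixes.
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError("Invalid path outside media_root")
    return candidate
```

An identifier such as ../secrets resolves outside the root and is rejected with ValueError, while a plain item name maps to a file directly under media_root.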

+            url = '%s/download/%s/page/leaf%s' % (ARCHIVE, identifierpath, leaf)
+            r = requests.get(url)
+        if r:
+            with open(path, 'wb') as rc:


In general, to fix this type of issue we must ensure that any filesystem path derived from untrusted input is constrained to a safe root directory. The usual approach is to normalize the candidate path (for example with os.path.normpath), then verify that the normalized absolute path is inside a configured base directory (here media_root); if not, reject the request. This prevents directory traversal via .., absolute paths, or embedded separators. When the identifier contains an embedded subpath (filepath) that is used for remote fetching only, we should still ensure we aren’t accidentally creating unsafe local paths from it.

For this specific code, the simplest robust fix without changing behavior is:

Compute the cache file path in a safe way, based solely on a sanitized version of identifier.

Use os.path.abspath + os.path.join + os.path.normpath to derive an absolute path based on media_root and identifier.

Ensure that this absolute path starts with the absolute media_root path plus a path separator boundary, or is exactly equal to media_root. If the check fails, raise a ValueError (or similar) so the caller returns 404, matching existing error handling.

Do this before os.path.exists(path) and before open(path, 'wb').

We should not alter how identifier and filepath are used in the remote URL (itempath and identifierpath), as that would affect functional behavior with Archive.org; the vulnerability we care about is the local path used in open. Also, we should not assume anything new about project structure beyond using the standard library (os), which is already imported in this module. The changes are confined to iiify/resolver/utils.py around the computation of path inside ia_resolver.

Concretely:

In ia_resolver, replace the single line path = os.path.join(media_root, identifier) with:

Computation of safe_root = os.path.abspath(media_root)

Computation of candidate_path = os.path.abspath(os.path.normpath(os.path.join(safe_root, identifier)))

A containment check; if candidate_path is not under safe_root, raise ValueError.

Assign path = candidate_path.

No new imports or helpers are required; os is already imported.

iiify/app.py does not need changes for this specific path-based issue.
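The separator boundary in the containment check matters because a bare prefix test also accepts sibling directories whose names merely start with the root's name. A minimal sketch, with a hypothetical helper is_within and made-up directory names:

```python
import os


def is_within(root: str, candidate: str) -> bool:
    """True if candidate is root itself or lies below it (hypothetical helper)."""
    safe_root = os.path.abspath(root)
    path = os.path.abspath(candidate)
    return path == safe_root or path.startswith(safe_root + os.sep)


media_root = os.path.abspath("media")

# "../media_evil/x" resolves to a sibling directory of media_root.
sibling = os.path.abspath(os.path.join(media_root, "..", "media_evil", "x"))

naive = sibling.startswith(media_root)   # string prefix only, no boundary
strict = is_within(media_root, sibling)  # prefix plus separator boundary
```

Here the naive prefix test accepts the sibling path while the boundary-aware check rejects it.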

glenrobson · 2026-03-19T22:32:34Z

Just noting down some comments before I forget them.

What do we do with mixed-content manifests, e.g. https://archive.org/download/2025-highland-house-walkthrough-ma, which has "mediatype": "movies" but contains both images and video?
What do we do with v2 manifest generation alongside v3? Note v4 is coming soon too.

I wonder if we should have a manifest 'class' which builds the manifest details like metadata, seeAlso etc which are common for all and then have 'container' classes for image, audio and video and maybe texts which handle creating the canvases....

We would need some sort of controller or builder class to bring the manifest and canvas classes together...
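One possible reading of this idea, sketched with entirely hypothetical class names (ManifestCore, CanvasContainer, ManifestBuilder) and a made-up input format of (file name, kind) pairs. It is not the design in this PR. Containers are chosen per file rather than per item mediatype, which is one way mixed-content items could be handled.

```python
from dataclasses import dataclass


@dataclass
class ManifestCore:
    """Hypothetical holder for the parts every manifest shares."""
    uri: str
    label: str

    def to_dict(self) -> dict:
        return {"id": self.uri, "type": "Manifest",
                "label": {"en": [self.label]}, "items": []}


class CanvasContainer:
    """Hypothetical base class: one subclass per kind of content."""
    kind = "unknown"

    def canvas(self, uri: str, file_name: str) -> dict:
        # A real container would also attach the media to the canvas.
        return {"id": uri, "type": "Canvas", "label": {"en": [file_name]}}


class ImageContainer(CanvasContainer):
    kind = "image"


class VideoContainer(CanvasContainer):
    kind = "video"


class ManifestBuilder:
    """Hypothetical controller joining the shared core with the containers."""

    def __init__(self, core, containers):
        self.core = core
        self.containers = {c.kind: c for c in containers}

    def build(self, files) -> dict:
        manifest = self.core.to_dict()
        for name, kind in files:
            container = self.containers.get(kind)
            if container is None:
                continue  # no container registered for this kind
            number = len(manifest["items"]) + 1
            uri = f"{self.core.uri}/canvas/{number}"
            manifest["items"].append(container.canvas(uri, name))
        return manifest


builder = ManifestBuilder(ManifestCore("https://example.org/m1", "Demo"),
                          [ImageContainer(), VideoContainer()])
manifest = builder.build([("a.jpg", "image"), ("b.mp4", "video"), ("c.txt", "text")])
```

Supporting another manifest version would then presumably mean swapping the core and container implementations while keeping the builder, though that is untested here.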

markpbaggett added 7 commits March 5, 2026 15:31

Add anything that doesn't belong to a mediatype to api bounds.

7b00c9e

Add all constants to a shared file.

742633e

Not sure I like it this way. Should we do these as props?

Make a base for all types.

3c79566

Establish helpers for any stateless methods.

9cd8b93

I considered doing these as static methods on the base class but I think it's easier to have tests and such if we do it this way.

Add all non-Manifest stateless functions to utils.

6d88943

Establish Type Classes.

c6e2f50

Remove unneeded.

43c6d90

github-advanced-security AI found potential problems Mar 5, 2026

Remove original file.

0bc3e28

@@ -3,7 +3,7 @@
             import os
             import re
             import requests
-            import xml.etree.ElementTree as ET
+            import defusedxml.ElementTree as ET
             from datetime import timedelta
             from urllib.parse import quote

@@ -1,4 +1,5 @@
             beautifulsoup4==4.13.3
+            defusedxml==0.7.1
             blinker==1.9.0
             cachelib==0.9.0
             certifi==2024.8.30

Package	Version	Security advisories
defusedxml (pypi)	0.7.1	None

@@ -125,8 +125,6 @@
                 opposed to a remote storage host, like Internet Archive) and first
                 fetches it, if it doesn't exist on disk..
                 """
-                path = os.path.join(media_root, identifier)
                 leaf = None
                 if "$" not in identifier:
                     filepath = None
@@ -136,6 +134,20 @@
                     if os.sep not in filepath:
                         leaf = filepath
+                # Normalize and constrain identifier part to stay under media_root
+                identifier = os.path.normpath(identifier)
+                # Disallow attempts to traverse upwards from the base
+                if identifier.startswith(os.pardir + os.sep) or identifier == os.pardir:
+                    raise ValueError("Invalid identifier path")
+                # Build the on-disk path, including any document/leaf suffix encoded in identifier
+                path = os.path.normpath(os.path.join(media_root, identifier))
+                # Ensure the resolved path is within media_root
+                media_root_norm = os.path.normpath(media_root)
+                if os.path.commonpath([media_root_norm, path]) != media_root_norm:
+                    raise ValueError("Invalid path outside media_root")
                 identifier, document = identifier.split(":", 1) if ":" in identifier else (identifier, None)
                 metadata = requests.get('%s/metadata/%s' % (ARCHIVE, identifier)).json()

@@ -125,7 +125,12 @@
                 opposed to a remote storage host, like Internet Archive) and first
                 fetches it, if it doesn't exist on disk..
                 """
-                path = os.path.join(media_root, identifier)
+                # Construct a cache path under media_root and ensure it cannot escape
+                safe_root = os.path.abspath(media_root)
+                candidate_path = os.path.abspath(os.path.normpath(os.path.join(safe_root, identifier)))
+                if not (candidate_path == safe_root or candidate_path.startswith(safe_root + os.sep)):
+                    raise ValueError("Invalid identifier path")
+                path = candidate_path
                 leaf = None
                 if "$" not in identifier:

markpbaggett wants to merge 8 commits into main from refactor-library-shape

3 participants

@@ -121,7 +121,11 @@
                     # Parse the XML content
                     djfu = ET.fromstring(response.content)
-                    page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]
+                    objects = djfu.findall(".//OBJECT")
+                    page_index = canvas_no - 1
+                    if page_index < 0 or page_index >= len(objects):
+                        raise ValueError(f"Requested canvas number {canvas_no} is out of range")
+                    page = objects[page_index]
                     words = page.findall(".//WORD")
                     count = 1
                     for word in words:
