Refactor library shape#159
Conversation
Not sure I like it this way. Should we do these as props?
I considered doing these as static methods on the base class but I think it's easier to have tests and such if we do it this way.
| response.raise_for_status() # Raise an error for bad status codes | ||
|
|
||
| # Parse the XML content | ||
| djfu = ET.fromstring(response.content) |
Check failure
Code scanning / CodeQL
XML internal entity expansion High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 4 months ago
In general, to fix XML internal entity expansion issues in Python, avoid parsing untrusted XML with xml.etree.ElementTree or other parsers that expand entities by default. Instead, use a hardened XML library such as defusedxml, which is API‑compatible with xml.etree.ElementTree but disables dangerous constructs (including entity expansion) and mitigates XML bombs and XXE by design.
For this specific case, the best fix with minimal functional change is:
- Replace the import
import xml.etree.ElementTree as ETwithimport defusedxml.ElementTree as ETiniiify/resolver/__init__.py.defusedxml.ElementTreeprovidesfromstring,findall, etc., with the same interface, so no other code changes are required increate_annotations. - Keep the rest of the logic intact: still call
ET.fromstring(response.content)and then use.findallon the parsed tree, but now via the hardened parser.
Concrete changes:
- File
iiify/resolver/__init__.py:- At the top of the file, change the
xml.etree.ElementTreeimport todefusedxml.ElementTreealiased asET. - No changes to the
create_annotationsfunction body are strictly required, since the API is compatible. The existingexcept ET.ParseErrorwill continue to work withdefusedxml.ElementTree.ParseError.
- At the top of the file, change the
No other files need editing for this fix.
| @@ -3,7 +3,7 @@ | ||
| import os | ||
| import re | ||
| import requests | ||
| import xml.etree.ElementTree as ET | ||
| import defusedxml.ElementTree as ET | ||
| from datetime import timedelta | ||
| from urllib.parse import quote | ||
|
|
| @@ -1,4 +1,5 @@ | ||
| beautifulsoup4==4.13.3 | ||
| defusedxml==0.7.1 | ||
| blinker==1.9.0 | ||
| cachelib==0.9.0 | ||
| certifi==2024.8.30 |
| Package | Version | Security advisories |
| defusedxml (pypi) | 0.7.1 | None |
|
|
||
| # Parse the XML content | ||
| djfu = ET.fromstring(response.content) | ||
| page = djfu.findall(f".//OBJECT[{canvas_no}]")[0] |
Check failure
Code scanning / CodeQL
XPath query built from user-controlled sources Critical
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 4 months ago
In general, to fix XPath-injection issues you should avoid interpolating user-controlled data directly into XPath strings. Either (1) use parameterized/variable-based APIs if available, or (2) keep the XPath static and apply any indexing or filtering on the returned nodes in application code, after validating or constraining user input.
Here, we don’t actually need canvas_no to be part of the XPath. We want “the N‑th <OBJECT> element”, which can be achieved by selecting all OBJECT nodes with a static XPath .//OBJECT and then indexing the Python list with canvas_no - 1. That keeps the XPath expression constant and removes the tainted value from the query, while preserving the current behavior. Concretely, in iiify/resolver/__init__.py inside create_annotations, change:
page = djfu.findall(f".//OBJECT[{canvas_no}]")[0]to:
objects = djfu.findall(".//OBJECT")
page_index = canvas_no - 1
page = objects[page_index]Optionally, we can add a small bounds check on page_index to avoid IndexError and raise a controlled error if the page doesn’t exist; this doesn’t change the “successful” behavior but makes error handling more explicit and robust. No new imports are needed; we only adjust how we select the page element.
| @@ -121,7 +121,11 @@ | ||
|
|
||
| # Parse the XML content | ||
| djfu = ET.fromstring(response.content) | ||
| page = djfu.findall(f".//OBJECT[{canvas_no}]")[0] | ||
| objects = djfu.findall(".//OBJECT") | ||
| page_index = canvas_no - 1 | ||
| if page_index < 0 or page_index >= len(objects): | ||
| raise ValueError(f"Requested canvas number {canvas_no} is out of range") | ||
| page = objects[page_index] | ||
| words = page.findall(".//WORD") | ||
| count = 1 | ||
| for word in words: |
| # ... | ||
| raise ValueError("This resource has restricted access") | ||
|
|
||
| if not os.path.exists(path): |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 4 months ago
In general, the fix is to ensure that any filesystem path derived from untrusted input is normalized and then verified to be within an intended root directory. For this code, we should treat media_root as the safe root. We should construct path based on the normalized identifier (and any embedded filepath component), using os.path.normpath and then verifying that the final path is inside media_root. If the check fails, we should reject the request with an error instead of reading/writing that path.
Concretely, in ia_resolver in iiify/resolver/utils.py:
- Move the construction of
pathuntil after we’ve splitidentifierandfilepath. - Add normalization for
identifier(the part before$) andfilepath(the part after$, if present). Strip any leading path separators fromfilepathso an attacker cannot inject an absolute path. - Rebuild
pathusingos.path.join(media_root, clean_identifier_with_suffix)and normalize that full path withos.path.normpath. - Check that
pathis still withinmedia_rootusing a robust prefix check (for example, viaos.path.commonpath), raisingValueErrorif not. - Leave the rest of the logic (metadata lookup, download, file writing, and return value) unchanged so existing functionality continues to work.
We only need changes in iiify/resolver/utils.py. No new imports are necessary beyond os, which is already imported.
| @@ -125,8 +125,6 @@ | ||
| opposed to a remote storage host, like Internet Archive) and first | ||
| fetches it, if it doesn't exist on disk.. | ||
| """ | ||
| path = os.path.join(media_root, identifier) | ||
|
|
||
| leaf = None | ||
| if "$" not in identifier: | ||
| filepath = None | ||
| @@ -136,6 +134,20 @@ | ||
| if os.sep not in filepath: | ||
| leaf = filepath | ||
|
|
||
| # Normalize and constrain identifier part to stay under media_root | ||
| identifier = os.path.normpath(identifier) | ||
| # Disallow attempts to traverse upwards from the base | ||
| if identifier.startswith(os.pardir + os.sep) or identifier == os.pardir: | ||
| raise ValueError("Invalid identifier path") | ||
|
|
||
| # Build the on-disk path, including any document/leaf suffix encoded in identifier | ||
| path = os.path.normpath(os.path.join(media_root, identifier)) | ||
|
|
||
| # Ensure the resolved path is within media_root | ||
| media_root_norm = os.path.normpath(media_root) | ||
| if os.path.commonpath([media_root_norm, path]) != media_root_norm: | ||
| raise ValueError("Invalid path outside media_root") | ||
|
|
||
| identifier, document = identifier.split(":", 1) if ":" in identifier else (identifier, None) | ||
|
|
||
| metadata = requests.get('%s/metadata/%s' % (ARCHIVE, identifier)).json() |
| url = '%s/download/%s/page/leaf%s' % (ARCHIVE, identifierpath, leaf) | ||
| r = requests.get(url) | ||
| if r: | ||
| with open(path, 'wb') as rc: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 4 months ago
In general, to fix this type of issue we must ensure that any filesystem path derived from untrusted input is constrained to a safe root directory. The usual approach is to normalize the candidate path (for example with os.path.normpath), then verify that the normalized absolute path is inside a configured base directory (here media_root); if not, reject the request. This prevents directory traversal via .., absolute paths, or embedded separators. When the identifier contains an embedded subpath (filepath) that is used for remote fetching only, we should still ensure we aren’t accidentally creating unsafe local paths from it.
For this specific code, the simplest robust fix without changing behavior is:
- Compute the cache file path in a safe way, based solely on a sanitized version of
identifier.- Use
os.path.abspath+os.path.join+os.path.normpathto derive an absolutepathbased onmedia_rootandidentifier. - Ensure that this absolute
pathstarts with the absolutemedia_rootpath plus a path separator boundary, or is exactly equal tomedia_root. If the check fails, raise aValueError(or similar) so the caller returns 404, matching existing error handling.
- Use
- Do this before
os.path.exists(path)and beforeopen(path, 'wb').
We should not alter how identifier and filepath are used in the remote URL (itempath and identifierpath), as that would affect functional behavior with Archive.org; the vulnerability we care about is the local path used in open. Also, we should not assume anything new about project structure beyond using the standard library (os), which is already imported in this module. The changes are confined to iiify/resolver/utils.py around the computation of path inside ia_resolver.
Concretely:
- In
ia_resolver, replace the single linepath = os.path.join(media_root, identifier)with:- Computation of
safe_root = os.path.abspath(media_root) - Computation of
candidate_path = os.path.abspath(os.path.normpath(os.path.join(safe_root, identifier))) - A containment check; if
candidate_pathis not undersafe_root, raiseValueError. - Assign
path = candidate_path.
- Computation of
- No new imports or helpers are required;
osis already imported. iiify/app.pydoes not need changes for this specific path-based issue.
| @@ -125,7 +125,12 @@ | ||
| opposed to a remote storage host, like Internet Archive) and first | ||
| fetches it, if it doesn't exist on disk.. | ||
| """ | ||
| path = os.path.join(media_root, identifier) | ||
| # Construct a cache path under media_root and ensure it cannot escape | ||
| safe_root = os.path.abspath(media_root) | ||
| candidate_path = os.path.abspath(os.path.normpath(os.path.join(safe_root, identifier))) | ||
| if not (candidate_path == safe_root or candidate_path.startswith(safe_root + os.sep)): | ||
| raise ValueError("Invalid identifier path") | ||
| path = candidate_path | ||
|
|
||
| leaf = None | ||
| if "$" not in identifier: |
|
Just noting down some comments before I forget them.
I wonder if we should have a manifest 'class' which builds the manifest details like metadata, seeAlso etc which are common for all and then have 'container' classes for image, audio and video and maybe texts which handle creating the canvases.... We would need some sort of controller or builder class to bring the manifest and canvas classes together... |
No description provided.