@copilot analyze, explain the cause and suggest a plan to fix the failing test test_pickle_core which fails for Python 3.10 on Windows - writing your insights and recommendations h...
Analysis of test_bad_cache_file[True] failure on Python 3.10 Windows
This is a pre-existing issue unrelated to the smoke-test changes (none of the files I modified touch tests/pickle_tests/test_pickle_core.py or src/cachier/cores/pickle.py).
Root cause
Where it fails: src/cachier/cores/pickle.py, line 148 — _clear_all_cache_files():
    def _clear_all_cache_files(self) -> None:
        path, name = os.path.split(self.cache_fpath)
        for subpath in os.listdir(path):
            if subpath.startswith(f"{name}_"):
                os.remove(os.path.join(path, subpath))  # <-- PermissionError here on Windows
Error: PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
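Why it happens on Windows only: WinError 32 is a sharing violation. Windows refuses to delete a file while another handle to it is open without delete sharing, whereas Linux/macOS allow an open file to be unlinked and treat file locks as advisory. portalocker locks an open file handle (via LockFileEx on Windows), so while any thread holds a lock on a file, the file is also open, and os.remove() on that file raises PermissionError.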
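The platform difference can be reproduced without cachier or portalocker. The following is a minimal sketch, assuming only that some other code still has the file open; the helper name `remove_is_blocked_while_open` is hypothetical and not part of the code base. It is expected to return True on Windows and False on Linux/macOS.

```python
import os
import tempfile


def remove_is_blocked_while_open() -> bool:
    """Return True if os.remove() fails on a file that is still open."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    blocked = False
    with open(path, "rb"):  # keep a handle open, as a lock holder would
        try:
            os.remove(path)
        except PermissionError:  # WinError 32 on Windows
            blocked = True
    if os.path.exists(path):
        os.remove(path)  # the handle is closed now, so cleanup succeeds
    return blocked


print(remove_is_blocked_while_open())
```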
The sequence that triggers it
- test_bad_cache_file[True] loops through sleep_times * 2, calling _helper_bad_cache_file() each iteration.
- Each call to _helper_bad_cache_file() starts two daemon threads and joins them with timeout=2.
- _bad_cache (the underlying function) sleeps for 1 second and internally holds a portalocker exclusive lock on the separate-files cache file while computing.
- If thread1.join(timeout=2) returns due to timeout before thread1 finishes, thread1 is still running and still holds the portalocker lock.
- The next iteration immediately calls _bad_cache_decorated.clear_cache() → _clear_all_cache_files() → os.remove(), which fails because thread1 from the previous iteration is still holding the lock.
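The key step in this sequence is that join(timeout=...) only stops waiting; it does not stop the thread. A small sketch of that behaviour, with shortened timings and a hypothetical `slow_worker` standing in for the cached call that holds the lock:

```python
import threading
import time


def slow_worker(done: threading.Event) -> None:
    time.sleep(0.3)  # stands in for the slow cached call holding the lock
    done.set()


done = threading.Event()
worker = threading.Thread(target=slow_worker, args=(done,), daemon=True)
worker.start()

# join() returns once the timeout expires; the thread itself keeps running.
worker.join(timeout=0.05)
still_running = worker.is_alive()

# Joining without a timeout guarantees the worker has finished.
worker.join()
finished = done.is_set() and not worker.is_alive()
```

Checking is_alive() after a timed join is the only way to tell whether the join returned because the thread finished or because the timeout expired.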
Suggested fix plan
The fix belongs in _clear_all_cache_files() in src/cachier/cores/pickle.py. The os.remove() call should be made resilient to transient Windows file-lock errors by adding a small retry loop:
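    import time  # required for the back-off sleep below

    def _clear_all_cache_files(self) -> None:
        path, name = os.path.split(self.cache_fpath)
        for subpath in os.listdir(path):
            if subpath.startswith(f"{name}_"):
                fpath = os.path.join(path, subpath)
                # Retry a few times: on Windows the file may still be held by another thread.
                for attempt in range(3):
                    try:
                        os.remove(fpath)
                        break
                    except PermissionError:
                        # Back off 0.1s, then 0.2s; a third failure is ignored and the file stays in place.
                        if attempt < 2:
                            time.sleep(0.1 * (attempt + 1))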
This is the right layer to fix because clear_cache() is a public API that users also call, so it should be resilient to this Windows behaviour in production too — not just in tests.
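The retry logic can also be factored into a standalone helper so it can be tested without a real locked file. This is only a sketch under assumptions: `remove_with_retry` and its `remove` and `sleep` parameters are hypothetical names introduced here for illustration, not existing cachier API. Unlike the loop above, it reports whether the deletion succeeded instead of ignoring the final failure.

```python
import os
import time
from typing import Callable


def remove_with_retry(
    fpath: str,
    attempts: int = 3,
    base_delay: float = 0.1,
    remove: Callable[[str], None] = os.remove,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to delete fpath, retrying on PermissionError; return True on success."""
    for attempt in range(attempts):
        try:
            remove(fpath)
            return True
        except PermissionError:
            if attempt < attempts - 1:
                # Linear back-off: base_delay, then 2 * base_delay, and so on.
                sleep(base_delay * (attempt + 1))
    return False  # still locked after the last attempt; the caller decides what to do
```

Injecting `remove` and `sleep` lets a unit test simulate a transient lock with a fake that raises PermissionError twice and then succeeds, which avoids depending on Windows or on real timing.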
Originally posted by @Copilot in #345 (comment)