Skip to content

Bug fix for PMSF+I combinations#168

Merged
bqminh merged 2 commits into
iqtree:masterfrom
StefanFlaumberg:bugfix
May 26, 2026
Merged

Bug fix for PMSF+I combinations#168
bqminh merged 2 commits into
iqtree:masterfrom
StefanFlaumberg:bugfix

Conversation

@StefanFlaumberg
Copy link
Copy Markdown
Contributor

This pull request fixes #58 by using site-specific state frequencies to compute ptn_invar for invariant patterns in the PhyloTree::computePtnInvar() function under the PMSF model.

@StefanFlaumberg
Copy link
Copy Markdown
Contributor Author

Two important notes:

  1. The PhyloTree::computePtnInvar() function still fails for PoMo polymorphic constant patterns (such as 0,0,1,1 0,0,1,1 0,0,1,1 0,0,1,1 for 4 species). Despite being constant, these patterns should be treated as variant patterns, since being polymorphic implies non-zero substitution rate. This is what both the old and the new comments say in the function. However, the Alignment::computeConst() function treats such patterns as constant, and hence they end up in the else { ASSERT(0); } case of PhyloTree::computePtnInvar().
    I haven't touched this issue here, because I'd rather address it in a comprehensive refactoring of how the Alignment class treats states (WIP for now).
  2. The PhyloTree::computePtnInvar() function treats the unobserved patterns (+ASC) as gap-only/maximum uncertainty patterns, assigning them lh = 1 under the invariant category (they are among dummy patterns getting ptn_invar[ptn] = 1 * p_invar). Is this the correct behaviour?
    Anyway, this is also a more complex issue, since Alignment::computeConst() is never called for the unobserved patterns and the behaviour of PhyloTree::computePtnInvar() should be consistent with that of phylokernelnew.h.

@bqminh
Copy link
Copy Markdown
Member

bqminh commented May 26, 2026

Claude AI review:

Code Review — PR #168 "Bug fix for PMSF+I combinations"

Author: @StefanFlaumberg · 1 commit, +71/-51, 3 files · mergeable: True / unstable · 5/7 CI green, 2 platforms failing

Overview

Fixes issue #58: under PMSF (or any site-specific-frequency model) with +I, the proportion of invariable sites was estimated as 0. Root cause: PhyloTree::computePtnInvar() (tree/phylotreesse.cpp:548) computed each invariant pattern's contribution using a single global state-frequency vector instead of the per-pattern site-specific frequencies. The fix queries the per-pattern frequencies inside the loop when model->isSiteSpecificModel() is true, plus adds a missing ModelSet::getStateFrequency override so ModelSet (the SSF model) can hand out submodel/pattern-specific freqs.

Correctness analysis

The fix itself — correct

  • tree/phylotreesse.cpp:573 adds if (model->isSiteSpecificModel()) model->getStateFrequency(state_freq, ptn); inside the per-pattern loop. This brings PMSF's per-pattern frequencies into the ptn_invar calculation. Without it, ptn_invar[ptn] = p_invar * state_freq[const_char] was using the +F averaged frequencies for every pattern ‚Üí pinv likelihood derivative was effectively flat and the optimiser parked it at zero. The math now matches what's expected for site-heterogeneous frequency models.
  • model/modelset.cpp:128 (ModelSet::getStateFrequency(buf, mixture)):
    • mixture >= 0 ‚Üí returns at(mixture)'s freq (per-pattern submodel)
    • mixture == -1 ‚Üí falls through to ModelMarkov::getStateFrequency (the +F average)
      This mirrors ModelMixture::getStateFrequency (model/modelmixture.cpp:3806) and is internally consistent.
  • The outer model->getMutationModel()->getStateFrequency(state_freq, -1) call at phylotreesse.cpp:572 is preserved, so for non-SSF models nothing changes ‚Äî the per-pattern call inside the loop is gated on isSiteSpecificModel(). ‚úì
  • Comment cleanup, variable renaming (cstate ‚Üí astate for the ambiguity-decoded state), and tighter loop scoping are pure readability ‚Äî no behavioural change.

One latent behavioural change worth flagging

ModelSet::getStateFrequency is new, with default mixture = 0. Before this PR, calling modelset_ptr->getStateFrequency(buf) (no mixture arg) fell through to ModelMarkov::getStateFrequency, which ignores mixture and returns the averaged state_freq member. After this PR, the same call dispatches to the new override and returns submodel 0's frequency, i.e. the first pattern's per-site frequencies — not the average.

That matches the existing ModelMixture convention (same default-arg footgun there), so the PR isn't introducing a new inconsistency, but it is a silent behaviour change for any in-tree caller that does m->getStateFrequency(buf) on a ModelSet object without specifying mixture. I found ~20 such call sites:

  • main/phyloanalysis.cpp:656, 698, 4413, 4441
  • model/partitionmodel.cpp:433
  • model/modelfactory.cpp:653
  • simulator/alisimulator.cpp:660, 691
  • tree/iqtree.cpp:2026
  • tree/phylokernelnonrev.cpp:656

For most of these the underlying model is unlikely to be ModelSet, but it's worth a quick audit — especially the simulator (alisimulator.cpp:660, 691), which would now produce simulations using pattern-0 frequencies if a user runs --alisim with a PMSF model.

Suggested mitigation (one-line): either change the default arg to int mixture = -1 so the no-arg call still gets the averaged behaviour, or audit the 20 call sites and add explicit -1 where averaged is intended. The former is safer and consistent with the comment in the new function: // default: return the +F freqs across all patterns.

CI

5 of 7 platforms passing. Failing:

  • Mac OS x86-64
  • Linux x86-64 (22, latest, true)

The failures are not platforms I'd expect to be sensitive to this change (Mac/Linux x86_64 should behave the same as ARM64, which passes). Possibilities:

  • Test fixture comparing logged pinv value: the bugfix legitimately changes the optimal pinv from 0.0 to something positive for any PMSF+I test case, which is exactly the intent. If a regression test was pinning the buggy value, the test needs updating, not the code.
  • Spurious infrastructure failure.

Please pull the CI logs and confirm whether the diff in any failing test is just pinv: 0 → 0.something and a corresponding logL improvement — if so, that's evidence the fix works and the baseline needs refreshing.

Style / minor

  • int state_freq[nstates] at phylotreesse.cpp:566 is a VLA (Clang/GCC extension). Same as the old code ‚Äî no regression. If you eventually switch to std::vector<double> to silence the VLA warning, do it project-wide; don't single this out.
  • ASSERT(astate <= 14) for DNA ambiguity (line 597) and ASSERT(astate <= 2) for protein (line 605) are good defensive guards that match the IUPAC encoding.

Verdict

Approve-with-comments. The fix is correct, well-scoped, and addresses the reported bug at its root. The only thing I'd ask before merging:

  1. Either change the default for ModelSet::getStateFrequency from mixture = 0 to mixture = -1, or confirm by audit that none of the 20 no-arg callers above behave badly on a ModelSet input (especially the AliSim path).
  2. Investigate the 2 CI failures — most likely a pinned-value test that needs updating because the fix legitimately produces a different (correct) pinv.

@bqminh bqminh merged commit 5f15af0 into iqtree:master May 26, 2026
2 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PMSF Step 1: '+I' Proportion of Invariable Sites Not Correctly Estimated in IQ-TREE v3.0.1

2 participants