Skip to content

Alignment refactoring and bug fixing#160

Merged
bqminh merged 5 commits into
iqtree:masterfrom
StefanFlaumberg:fixalignment
May 14, 2026
Merged

Alignment refactoring and bug fixing#160
bqminh merged 5 commits into
iqtree:masterfrom
StefanFlaumberg:fixalignment

Conversation

@StefanFlaumberg
Copy link
Copy Markdown
Contributor

This PR introduces multiple bug fixes and design optimizations concerning the alignment classes.

Commit 81adb3a:

  • DESIGN: site_state_freq ---> ptn_state_freq, remove site_model to have more straightforward handling of site-specific frequencies.
  • DESIGN: Rewrite Alignment::readSiteStateFreq() to read the -fs file in a more controlled way.
  • BUGFIX: The design changes enable using the -fs files with missing (defaulted) site entries (was allowed before, but didn't work).
  • BUGFIX: Add a check to properly copy nullptr (default) arrays of ptn_state_freq in Alignment::createBootstrapAlignment().
  • BUGFIX: Set the arrays of ptn_state_freq to nullptr after delete[] in the ~Alignment() destructor to avoid double delete.

Commit f6017bc:

  • BUGFIX: In Alignment::extractSiteID(): add an error for empty input, use 0 and INT_MAX instead of HUGE_VALL as bound checks in the convert_site_range() static function as positions are eventually converted to int, fix the check for extracting from CODON (the length of each range should be a multiple of 3, not only the total length of the ranges).
  • DESIGN: Make getNSite() const, remove the expected_num_sites member and all related code from Alignment. The member was used to set fake lengths for empty alignments in AliSim. Now AliSimulator::initializeIQTreeFromTreeFile() fills these empty alignments with sites of a fake pattern. With almost no overhead, this makes the behavior of Alignment more consistent, as functions using the return of getNSite() to work with sites will now work properly with the sites of the fake pattern instead of crashing on empty alignments.
  • DESIGN: Move tree->aln->computeUnknownState(); from AliSimulator::initializeModel() to AliSimulator::initializeAlignment(). This is more logical and is required by the previous DESIGN point.
  • BUGFIX: Add a check in runAliSim() to ensure that either all partitions refer to some alignment file(s) (then run with Inference Mode) or none do (then run without Inference Mode). Running without Inference Mode in the mixed cased was a bad decision and is not allowed anymore, since in this case the AliSimulator::initializeIQTreeFromTreeFile() is run in which partitions with alignments were first initialized by from their respective alignments and then were reinitialized (unsafely, with memory leaks for the CODON case) by AliSimulator::initializeAlignment().
    In practice, the mixed case likely corresponds to the user intending to run in Inference Mode and forgetting to specify alignment files for some partitions -- now an informative error is printed in this case.
  • BUGFIX: Add a check in retrieveAncestralSequenceFromInputFile() to ensure that user-provided ancestral sequence is used only when all partitions come from the same file and use the same sequence type and number of states. The rationale is that partitions coming from different alignment files use position_spec ranges relative to the sequences of their respective files, these ranges cannot be properly used to extract regions from a common ancestral sequence.

Commit ab5b4eb:

  • DESIGN: Make addition of new patterns safer. Now Alignment::addPattern() expects that the pattern's frequency is preset and adds sites of the pattern according to its frequency to the end of the alignment. The sequential addition provides a cleaner design ensuring that each added site is assigned to a pattern (i.e. that there are no garbage values in site_pattern). Alignment::computeConst() is not called by the function and thus can be called anywhere; currently it is called only by Alignment::updateConstPatterns() after adding all patterns. Pattern::frequency is now set to 1 by default.
    This entails significant rewriting for Alignment::readCountsFormat() and SuperAlignment::concatenateAlignments(). But the code is simplified and the logic remains: in readCountsFormat() the sites now are added sequentially and dummy unknown states get rewritten at the end, in concatenateAlignments() (adds little overhead, but much clarity).
  • DESIGN: Both buildPattern() and convertToCodonOrAA() now use the Alignment::getCodonStateTypeFromSites() function, extended to handle the NT2AA case. A stop codon is treated as a gap by the function, as it originally was done in buildPattern().
  • DESIGN: Make the alignment-copying functions (extractSubAlignment(), extractPatterns(), extractSites(), etc.) actually allocate and return a new Alignment *. To unify their design, the Alignment::initAlignmentCopy() function is added. The rationale is that the old, alignment-mutating versions of the functions could be called on already initialized alignments causing unexpected behaviour.  
  • DESIGN: Add the group member to Pattern: by default group = -1, equal patterns have equal group and equal StateVector. This enables discerning compositionally identical patterns to regroup sites and add sites with site-specific parameters in a straightforward manner.
  • BUGFIX: Fix PMSF might not work correctly in the presence of identical seqs and a potential solution to it #139 by making the alignment-copying functions copy the arrays of ptn_state_freq. IMPORTANT: a copied pattern must inherit group of the original pattern to ensure that identical sites with different frequency profiles do not get merged.
  • BUGFIX: Now Alignment::extractSites() decides itself whether to convert the spec ranges to CODON/PROTEIN or not by checking this->genetic_code (it is not a nullptr iff the alignment is CODON or NT2AA). In particular, in SuperAlignment::readPartitionNexus() this fixes the case of applying the nt2aa bool flag inferred for global input_aln to partitions which may use their own alignment files of different data types.
  • BUGFIX: Fix a memory leak when using Alignment::removeAndFillUpGappySites() in phylotesting.cpp.

Commit 3fdf1e3:

  • DESIGN: Make the logic flow more explicit in the SuperAlignment::readPartition*() functions, merge functions readPartitionDir() and readPartitionList() into readPartitionFiles().
  • BUGFIX: Add a convertToCodonOrAA() call to readPartitionRaxml() for the case when the input is DNA and a partition is set to CODON.  
  • BUGFIX: Fix that model -m <model> was not set to partitions passed as -S <dir>/<file1,file2,...> (model vs model_name confusion in readPartitionDir() and readPartitionList()).
  • BUGFIX: Fix that model -m <model> was not set to an alignment concatenated from files -s <dir>/<file1,file2,...> (now setting aln->model_name in createAlignment()).

@StefanFlaumberg
Copy link
Copy Markdown
Contributor Author

Dear @bqminh,
The fixes haven't turned out to be short and simple, unfortunately. Fixing one bug required making design changes, those uncovered other bugs which prompted further design changes, and so on. In fact, this is a conservative take. There are still a plenty of poor architectural solutions in the alignment code, and fixing them could significantly simplify the code and make the logic more explicit and error-proof. I'm preparing further changes to propose as another PR once this one is merged.

The current PR has a couple of conflicts with the pending PRs by Huaiyan and Nhan, since the latter use the old function signatures, but these are straightforward to solve.

@bqminh
Copy link
Copy Markdown
Member

bqminh commented May 14, 2026

Review by Claude Code:

Code Review — PR #160 "Alignment refactoring and bug fixing"

Author: @StefanFlaumberg · Base: master · Branch: fixalignment · 28 files · +1758 / -2452 · 4 commits · mergeable: yes, CI: unstable

Overview

Large refactor of the alignment subsystem: consolidates site_state_freq/site_model into a single ptn_state_freq, rewrites alignment-copying helpers to return new objects (no more in-place mutation), introduces Pattern::group to disambiguate compositionally-identical patterns, and merges the partition-reading entry points. The structural cleanup is genuine and several real bugs (memory leaks, missing deep copies, -m flag not propagating to partitions) are fixed.

Correctness concerns (severity-ordered)

  1. Likely double-free in readSiteStateFreq after pattern regrouping (alignment/alignment.cpp:6166-6219). When two patterns end up sharing the same models[k] pointer, that double* gets pushed into ptn_state_freq twice; the destructor then delete[]s it twice. The new "null-after-delete" guard in ~Alignment() doesn't help because the pointers are at different indices.
  2. extractSiteID bool parameter receives char * (alignment/alignment.h:148). Every call site (alignment.cpp:3766, alignment.cpp:6110, main/alisim.cpp:609,635, simulator/alisimulator.cpp:147) passes genetic_code — a char * — which implicit-converts to bool. It only "works" because non-null implies CODON today. Latent footgun; make the parameter explicit.
  3. extractSubAlignment now returns nullptr for empty result (alignment.cpp:3679-3683), but callers like tree/phylotree.cpp:5616 (collapseStableClade) call setAlignment(pruned_aln) without a null check. New invariant, not documented.
  4. Pattern::group is in operator== but not in the hash (alignment.h:74-80, pattern.h:55-62). Functionally correct, but for SSF alignments many sites share the state vector with different group — hashmap inserts degrade to O(n). One-line fix to mix group into the hash.
  5. addPattern() now appends — site-order changes in extractPatterns/extractPatternFreqs (alignment.cpp:3693-3736). Resulting layout is pattern-major, not site-major. Verify optimizePatternRates (tree/phylotree.cpp:2566) and ratemeyerhaeseler.cpp:199 don't depend on the original order.
  6. computeConst decoupling risk in the rewritten extractPatterns/extractPatternFreqs (alignment.cpp:3693, 3713): they only call countConstSites(), which trusts Pattern::flag from the source alignment. If the source hasn't had computeConst run yet, you silently get 0 const sites. An ASSERT or unconditional updateConstPatterns() is warranted.
  7. getCodonStateTypeFromSites now outWarnings per stop codon (alignment.cpp:2089-2099), whereas the old buildPattern printed an Info: line. CI scripts that grep WARNING could flip.

Behavioral changes not flagged in the PR body

  • Alignment() always sets name = "Noname" (alignment.cpp:13). Code that checks aln->name.empty() will now be false.
  • extractSiteID now throws on empty input (was silent no-op) (alignment.cpp:3993).
  • Pattern::frequency default is now 1, not 0 (pattern.h:32) ‚Äî silently masks any caller that intended freq=0.
  • regroupSitePattern signature dropped the groups arg ‚Äî breaks out-of-tree forks.
  • runAliSim mixed-files case is now a hard error (main/alisim.cpp:36-42); scripts that mixed file + empty partitions to disable inference mode will break.
  • Alignment::extractSites now consults this->genetic_code to auto-decide CODON/PROTEIN; previously the caller drove this.
  • Several OBSOLETE blocks parked in comments (alignment.cpp:4317-4346, superalignment.cpp:153-160) ‚Äî should be deleted, not commented out.

Style / maintainability

  • C++17 init-statement-in-if used in superalignment.cpp:330, 455, 623 ‚Äî fine if -std=c++17 is set, just confirm CMake.
  • static_cast<CharSet&>(*aln) = CharSet() in concatenateAlignments (superalignment.cpp:1707) is brittle slice-assignment; prefer an explicit reset helper.

Test coverage

  • No new tests. No unit tests exist for alignment/ in this repo; cmaple/unittest/ covers a different subsystem.
  • test_scripts/test_iqtree.sh is an integration harness ‚Äî untouched in this PR. For a refactor that rewrites every alignment-reading path (Nexus, FASTA, Phylip, CountsFormat, partition-Raxml, partition-Nexus, AliSim, SSF), this is the single biggest risk in the PR.

Verdict

Request changes. Real bugs are fixed and the design is healthier, but three items block merge: (a) the potential double-free in readSiteStateFreq (alignment.cpp:6166-6219), (b) the bool-takes-char* regression in extractSiteID (alignment.h:148), and (c) the absence of any new test for a 4000-line refactor of IO read paths. Items (a) and (b) need code changes; (c) at minimum needs a smoke run added to test_scripts/test_iqtree.sh.

Copy link
Copy Markdown
Member

@bqminh bqminh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude is wrong. all good now.

@bqminh bqminh merged commit 8458845 into iqtree:master May 14, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PMSF might not work correctly in the presence of identical seqs and a potential solution to it

2 participants