Alignment refactoring and bug fixing#160
Conversation
…r Alignment::readSiteStateFreq()
…expected_num_sites in AliSim
|
Dear @bqminh, The current PR has a couple of conflicts with the pending PRs by Huaiyan and Nhan, since the latter use the old function signatures, but these are straightforward to solve. |
|
Review by Claude Code: Code Review — PR #160 "Alignment refactoring and bug fixing"Author: @StefanFlaumberg · Base: OverviewLarge refactor of the alignment subsystem: consolidates Correctness concerns (severity-ordered)
Behavioral changes not flagged in the PR body
Style / maintainability
Test coverage
VerdictRequest changes. Real bugs are fixed and the design is healthier, but three items block merge: (a) the potential double-free in |
bqminh
left a comment
There was a problem hiding this comment.
Claude is wrong. all good now.
This PR introduces multiple bug fixes and design optimizations concerning the alignment classes.
Commit 81adb3a:
site_state_freq--->ptn_state_freq, removesite_modelto have more straightforward handling of site-specific frequencies.Alignment::readSiteStateFreq()to read the-fsfile in a more controlled way.-fsfiles with missing (defaulted) site entries (was allowed before, but didn't work).ptn_state_freqinAlignment::createBootstrapAlignment().ptn_state_freqto nullptr after delete[] in the~Alignment()destructor to avoid double delete.Commit f6017bc:
Alignment::extractSiteID(): add an error for empty input, use0andINT_MAXinstead ofHUGE_VALLas bound checks in theconvert_site_range()static function as positions are eventually converted toint, fix the check for extracting from CODON (the length of each range should be a multiple of 3, not only the total length of the ranges).getNSite()const, remove theexpected_num_sitesmember and all related code fromAlignment. The member was used to set fake lengths for empty alignments in AliSim. NowAliSimulator::initializeIQTreeFromTreeFile()fills these empty alignments with sites of a fake pattern. With almost no overhead, this makes the behavior ofAlignmentmore consistent, as functions using the return ofgetNSite()to work with sites will now work properly with the sites of the fake pattern instead of crashing on empty alignments.tree->aln->computeUnknownState();fromAliSimulator::initializeModel()toAliSimulator::initializeAlignment(). This is more logical and is required by the previous DESIGN point.runAliSim()to ensure that either all partitions refer to some alignment file(s) (then run with Inference Mode) or none do (then run without Inference Mode). Running without Inference Mode in the mixed cased was a bad decision and is not allowed anymore, since in this case theAliSimulator::initializeIQTreeFromTreeFile()is run in which partitions with alignments were first initialized by from their respective alignments and then were reinitialized (unsafely, with memory leaks for the CODON case) byAliSimulator::initializeAlignment().In practice, the mixed case likely corresponds to the user intending to run in Inference Mode and forgetting to specify alignment files for some partitions -- now an informative error is printed in this case.
retrieveAncestralSequenceFromInputFile()to ensure that user-provided ancestral sequence is used only when all partitions come from the same file and use the same sequence type and number of states. The rationale is that partitions coming from different alignment files useposition_specranges relative to the sequences of their respective files, these ranges cannot be properly used to extract regions from a common ancestral sequence.Commit ab5b4eb:
Alignment::addPattern()expects that the pattern's frequency is preset and adds sites of the pattern according to its frequency to the end of the alignment. The sequential addition provides a cleaner design ensuring that each added site is assigned to a pattern (i.e. that there are no garbage values insite_pattern).Alignment::computeConst()is not called by the function and thus can be called anywhere; currently it is called only byAlignment::updateConstPatterns()after adding all patterns.Pattern::frequencyis now set to 1 by default.This entails significant rewriting for
Alignment::readCountsFormat()andSuperAlignment::concatenateAlignments(). But the code is simplified and the logic remains: inreadCountsFormat()the sites now are added sequentially and dummy unknown states get rewritten at the end, inconcatenateAlignments()(adds little overhead, but much clarity).buildPattern()andconvertToCodonOrAA()now use theAlignment::getCodonStateTypeFromSites()function, extended to handle the NT2AA case. A stop codon is treated as a gap by the function, as it originally was done inbuildPattern().extractSubAlignment(),extractPatterns(),extractSites(), etc.) actually allocate and return a newAlignment *. To unify their design, theAlignment::initAlignmentCopy()function is added. The rationale is that the old, alignment-mutating versions of the functions could be called on already initialized alignments causing unexpected behaviour.groupmember toPattern: by defaultgroup = -1, equal patterns have equalgroupand equalStateVector. This enables discerning compositionally identical patterns to regroup sites and add sites with site-specific parameters in a straightforward manner.ptn_state_freq. IMPORTANT: a copied pattern must inheritgroupof the original pattern to ensure that identical sites with different frequency profiles do not get merged.Alignment::extractSites()decides itself whether to convert thespecranges to CODON/PROTEIN or not by checkingthis->genetic_code(it is not a nullptr iff the alignment is CODON or NT2AA). In particular, inSuperAlignment::readPartitionNexus()this fixes the case of applying thent2aabool flag inferred for globalinput_alnto partitions which may use their own alignment files of different data types.Alignment::removeAndFillUpGappySites()in phylotesting.cpp.Commit 3fdf1e3:
SuperAlignment::readPartition*()functions, merge functionsreadPartitionDir()andreadPartitionList()intoreadPartitionFiles().convertToCodonOrAA()call toreadPartitionRaxml()for the case when the input is DNA and a partition is set to CODON.-m <model>was not set to partitions passed as-S <dir>/<file1,file2,...>(modelvsmodel_nameconfusion inreadPartitionDir()andreadPartitionList()).-m <model>was not set to an alignment concatenated from files-s <dir>/<file1,file2,...>(now settingaln->model_nameincreateAlignment()).