Skip to content

Introduce --site(-float)-weights options and implement Bayesian Bootstrap#157

Open
trongnhanuit wants to merge 36 commits into
iqtree:masterfrom
trongnhanuit:site_weights_claude
Open

Introduce --site(-float)-weights options and implement Bayesian Bootstrap#157
trongnhanuit wants to merge 36 commits into
iqtree:masterfrom
trongnhanuit:site_weights_claude

Conversation

@trongnhanuit
Copy link
Copy Markdown
Member

This pull request includes:

  • Introducing the two options: --site-weights and --site-float-weights to specify site-specific (integer/floating) weights.
  • Implementing the Bayesian Bootstrap.

…outputs integer weights for the standard bootstrap and floating-point weights for the Bayesian bootstrap
# Conflicts:
#	.gitignore
#	alignment/alignment.cpp
#	utils/tools.cpp
…branches in bootstrap trees only (not the ML tree). This option is automatically set at True for Bayesian Bootstrap
…length for collapsing near-zero branches when computing Bayesian Bootstrap
@bqminh
Copy link
Copy Markdown
Member

bqminh commented May 7, 2026

Some issues below. I won't merge this yet into v3.1.2 but aiming for v3.1.3 once you solved these.

Summary

Adds --site-weights (integer), --site-float-weights (float), and -bsam BAYES (Bayesian bootstrap via Dirichlet(1,...,1) weights). Also adds -czbb to collapse near-zero branches in bootstrap trees only, and -wbsw to output per-site bootstrap weights.


Issues Found

1. getNSite() returns double as size_t (alignment.h:441-443)
total_site_float_weight is a double, but getNSite() returns size_t. Returning a truncated float as an integer is fragile — the comment on line 672 in the BAYES block even warns about this. Any caller doing for (i = 0; i < getNSite(); i++) will silently truncate.

2. Unused variable pos (alignment.cpp ~line 930)

size_t pos = 0;
weight = convert_double(token.c_str());

pos is declared but never used — compiler warning.

3. integrateSiteSpecWeights called before countConstSite but after patterns are built
The integer weight expansion duplicates site_pattern entries and bumps pattern.frequency, which is correct. But if the alignment was loaded from NEXUS (NxsDataBlock constructor, line 1092), the site weights are applied after the alignment is fully built but countConstSite() runs immediately after — so this path looks correct. Good.

4. frac_const_sites / frac_invariant_sites restoration is a workaround, not a fix (alignment.cpp ~line 4753)
Overwriting these from the source alignment after countConstSite() is fragile. If the source alignment itself had Bayesian bootstrap applied (nested bootstrap), this breaks. A cleaner approach: skip countConstSite() for Bayesian bootstrap, or pass a flag to compute fractions from pattern_weight[] instead of integer frequencies.

5. computePtnFreq checks params->bootstrap_spec globally (phylotreesse.cpp:543-545)

const bool bayes_boot = (params->num_bootstrap_samples > 0 && 
                         params->bootstrap_spec &&
                         strncasecmp(params->bootstrap_spec, "BAYES", 5) == 0);

This checks global params, not whether this specific alignment is a bootstrap replicate. During standard tree search (before bootstrap), this condition is false, but if someone runs -b 100 -bsam BAYES, the main tree also hits this path because params are global. The check should be on aln->pattern_weight.size() == nptn alone, without the global params check.

6. readSiteSpecFloatWeights is called lazily from computePtnFreq (phylotreesse.cpp:551-553)
Reading a file inside a likelihood computation function is a side effect that will surprise anyone. It should be called once during alignment initialization, like integrateSiteSpecWeights is.

7. collapse_zero_branch_boot defaults to true for -bsam BAYES (tools.cpp:3524)
This is set automatically without the user asking for it. The -czbb flag exists separately, but -bsam BAYES forces it on. This should either be documented prominently or not be auto-enabled.

8. pars_scale parsing (alignment.cpp ~line 4641)

int pars_scale = (spec[5] == ':') ? max(1, convert_int(spec + 6)) : 10;

If spec is "BAYES:" (colon but no number), convert_int("") will either throw or return 0, then max(1, 0) = 1. The validation in tools.cpp catches !isdigit(s[6]), but convert_int is called here without that guard — a TOCTOU gap if the validation is ever bypassed.

9. Duplicated weight-writing code
The "write site weights to file" logic is copy-pasted ~4 times (standard boot in phyloanalysis.cpp, UFBoot standard in iqtree.cpp, UFBoot Bayesian in iqtree.cpp, partition in superalignment.cpp). Should be a helper function.

10. pattern_weight name collision risk
Adding DoubleVector pattern_weight as a new member alongside the existing Pattern::frequency (integer) creates two parallel weight systems. The name pattern_weight is very generic and could easily be confused with ptn_freq in PhyloTree. Consider a more specific name like pattern_float_freq or dirichlet_pattern_weight.


Minor Issues

  • Debug commented-out code left in phylotreesse.cpp (lines 562-566) ‚Äî should be removed before merge
  • Whitespace-only change removing a blank line in phylotree.h ‚Äî unnecessary noise
  • .gitignore change (softwipe_build ‚Üí softwipe_build/) is unrelated to this PR

Verdict

The core Bayesian bootstrap implementation (Dirichlet sampling, weight integration) is correct. The main concerns are: (1) getNSite() silently truncating a float, (2) file I/O inside computePtnFreq, (3) duplicated weight-writing code, and (4) the global params check in computePtnFreq affecting non-bootstrap trees. These should be addressed before merge.

# Conflicts:
#	alignment/alignment.cpp
#	alignment/alignment.h
@StefanFlaumberg
Copy link
Copy Markdown
Contributor

Dear Nhan @trongnhanuit,

I haven't dug deeply into your changes, but I'm pretty sure that your "Merge master" commit is going to significantly mess up bootstrapping, unless you:

  • (Critical!) in Alignment::createBootstrapAlignment(), remove the following chunk:
    // For Bayesian bootstrap we keep the original site→pattern mapping unchanged;
    // all other modes start with site_pattern filled with -1 (unassigned).
    if (spec && strncasecmp(spec, "BAYES", 5) == 0)
        site_pattern = aln->site_pattern;
    else
        site_pattern.resize(aln->getNSite(), -1);

Due to my recent changes, site_pattern and pattern_index are fully managed by the addPattern() function. One should not modify these members by hand, unless adding patterns with push_back() for some reason. If using addPattern(), the pattern must have its frequency set (used to auto-fill site_pattern) and these members must initially be empty -- that's why the else case is incorrect.
The if case is redundant, since you refill site_pattern in the "BAYES" case anyway.

  • (Optional) in Alignment::createBootstrapAlignment() in the "BAYES" case, substitute this:
            push_back(pat);
            pattern_index[pat] = ptn;

with this:

            addPattern(pat);   

and remove the following chunk:

        // Rebuild site_pattern to be consistent with scaled integer frequencies so that:
        // (a) PLL's concatenateAlignments assertion holds for partition models, and
        // (b) parsimony correctly uses Dirichlet-weighted frequencies.
        site_pattern.clear();
        for (size_t ptn = 0; ptn < nptn; ++ptn)
            for (int rep = 0; rep < at(ptn).frequency; ++rep)
                site_pattern.push_back(ptn);

Nothing deeply wrong here, but since you've set pattern's frequency before adding the pattern, by using addPattern() you would automatically fill pattern_index and site_pattern correctly, as I mentioned above. The code becomes much simpler.

  • (Critical!) in SuperAlignment::createBootstrapAlignment() in the if block, substitute this:
        taxa_index = super_aln->taxa_index;
        countConstSites();

with this:

        seq_names = super_aln->seq_names;
        taxa_index = super_aln->taxa_index;
        buildPattern();

The reason is clear: the bootstrap alignment is a new alignment, it doesn't have seq_names set or patterns built yet.

Actually, I was expecting that your pull request would be merged before mine, so that it would be my headache to resolve merging conflicts, since you've introduced far fewer changes than I did. So I'll be happy to help if needed!

Best,
Stefan

@trongnhanuit
Copy link
Copy Markdown
Member Author

Hi @StefanFlaumberg ,

Thanks a lot for checking this and for the very helpful information. I was halfway through merging your changes and revising my code accordingly yesterday, and I planned to continue working on it today. Your message came at the right time and will save me a lot of time and effort. I really appreciate it.

I’ll let you know once I’ve finished the revision.

Many thanks,
Nhan

@trongnhanuit
Copy link
Copy Markdown
Member Author

Hi @StefanFlaumberg,
I've completed the merging. If you find any issues, please let me know. Many thanks for your help and contributions.

Cheers,
Nhan

@StefanFlaumberg
Copy link
Copy Markdown
Contributor

Hi Nhan @trongnhanuit,

From what I see, now the changes do not interfere with anything in standard analyses (i.e. when run without the new options) -- thanks for applying the amendments!

Regarding the new options themselves, there clearly exist some problems: site weights should be banned/made compatible with partitions and AliSim, site weight members should be properly copied by alignment copying functions, the logic of when to use raw site number and when to use weight-corrected site number is likely not well-considered. But since this seems to be only a test version yet and the code may still evolve, it might be easier to solve these problems later, based on more mature code.
For that matter, it seems to me that the code logic can be significantly simplified first:
I'm not sure that modifying the real pattern frequencies and, accordingly, site_pattern is the right way. We could store both integer and float site weights in DoubleVector pattern_weight only. Then for ML, PhyloTree::computePtnFreq() would use pattern_weight to set ptn_freq for the downstream analysis. With correct normalization (I guess, something like ptn_freq[ptn] = pattern_weight[ptn] * aln->getNSite() / sum_pattern_weight), it would be possible to use getNSite() directly in all cases (without ad hoc corrections, such as in rategammainvar.cpp). For parsimony, we could convey the weights to the downstream analysis by setting the corresponding frequencies to the patterns of ordered_pattern in Alignment::orderPatternByNumChars(). Note, that ordered_pattern is used only for parsimony, so changing the frequencies here would not affect anything else.
This way, we could decouple pattern weights from pattern frequencies everywhere except for likelihood and parsimony computations. This would make code much more simple and robust. The price would likely be incompatibility with PLL, but from the user point of view it's almost nothing.

It is easier to put this all in code, than in words, but I hope my logic is clear. This is only my superficial analysis, so maybe my suggestions are nonsense. What do you think? I'm interested in proceeding the discussion.

Best regards,
Stefan

@bqminh
Copy link
Copy Markdown
Member

bqminh commented May 23, 2026

Thanks for feedback!
@trongnhanuit what do you think? It's probably best if @StefanFlaumberg makes the changes and we'll review them.

This option will be quite useful for the community, both users and developers (I'm aware of several asking about this)

@trongnhanuit
Copy link
Copy Markdown
Member Author

trongnhanuit commented May 25, 2026

Hi @StefanFlaumberg and @bqminh,

Stefan’s comments sound reasonable to me, and I could directly apply some of them (e.g., banning site weights for AliSim, properly copying site weights in alignment copy functions, and improving the logic for using raw versus user-specified site weights). However, I am a bit concerned about simplifying the code logic by storing both integer and floating-point site weights in pattern_weight, given the current design of the two options:

  • --site-weights: specifies site-specific weights (integers), which are used throughout the entire phylogenetic inference process, from building parsimony trees to ML tree search.
  • --site-float-weights: specifies site-specific weights (floating-point numbers), which are only used for likelihood maximizations.

These two options can be used simultaneously, where the integer weights are applied throughout the analysis except during likelihood maximizations, which instead use the floating-point weights.

In addition, as Stefan mentioned, the proposed approach is not compatible with PLL. Therefore, to fully apply all the proposed changes, we may need to consider:

  1. Change the design so that users can specify only one of --site-weights or --site-float-weights.
  2. Ignore site-specific weights in building parsimony trees or force a switch from PLL to the IQ-TREE parsimony kernel - however, this requires further code modifications.

What do you think, @bqminh and @StefanFlaumberg?

@bqminh
Copy link
Copy Markdown
Member

bqminh commented May 25, 2026

It'd be good to have both options working together in a single run, they are not mutually exclusive. If both options are specified, then --site-float-weights will be used for likelihood calculation, whereas --site-weights are used for others like parsimony calculation.

I'd recommend to turn off PLL with --site-weights, just switch to our own parsimony kernel, like with -t PARS option, if that simplifies things.

@StefanFlaumberg
Copy link
Copy Markdown
Contributor

@trongnhanuit,
I didn't realize we were considering simultaneous usage of these options, thanks for the clarification!

There are two core ideas that should apply whatever logic we choose to follow:

  1. The sum of ptn_freq must equal the number of sites. Otherwise, we would have to reconsider throughout the code for each call of the getNSite() function whether we should use the real site number or the sum of ptn_freq. This would be quite hard to implement correctly and might complicate the implementation of other features in the future. Since the site weight options are simple add-ons, they should not affect the code this dramatically. The idea can be implemented just by using the appropriate normalization.
  2. Assigning different float weights to the sites of a pattern must split the pattern. This is important for the alignment-copying functions, which have to copy alignments on a per-site basis. A site is copied by adding the site's pattern with its frequency set to 1. This trick is only possible if all sites of a pattern are totally identical by all parameters. I implement this idea for filling ptn_state_freq by calling Alignment::regroupSitePattern() in the Alignment::readSiteStateFreq() function. Things must be similar for pattern weights. In fact, this simplifies the downstream logic -- after regrouping, the input weights are not properties of sites, but of patterns (each distinct weight has its own pattern now).

The issue is that Nhan's idea:

These two options can be used simultaneously, where the integer weights are applied throughout the analysis except during likelihood maximizations, which instead use the floating-point weights.

is incompatible with my idea 1. If "the integer weights are applied throughout the analysis" means adding site repeats to the alignment, then it increases the number of sites. This, in turn, should affect ptn_freq if we want the sum of ptn_freq to equal the number of sites. It is possible to make ptn_freq elements be proportional to the input site float weights, but the alignment likelihood would still be affected by the integer weights.

I can envision the following 3 directions:

  1. Mostly keep up with the current design. Integer weights are interpreted as site multiplicities, setting the weights means adding copies of the existing sites to the alignment. This by itself decreases the alignment likelihood. If float weights are given, ptn_freq should follow the float weights, but the sum of ptn_freq should equal the number of sites, so it would depend on the integer weights as well. Both options can be used simultaneously, no problems with PLL.
    However, I am not sure that adding new sites to an alignment is a good idea if we really only want to add weights.
  2. Do not add site copies to the alignment, but keep both options. Use integer weights for parsimony by setting the appropriate pattern frequencies in the Alignment::orderPatternByNumChars() function. Use float weights for setting ptn_freq. The sum of ptn_freq should equal the number of sites, but there'd be no dependency on integer weights.
    This is a simpler design. The cost is that we would have to fall back to the non-PLL parsimony kernel or implement integer weights separately for PLL. But many other options in IQ-Tree don't work with PLL as well...
  3. Treat integer weights just as float weights. This is what I originally proposed in the previous comment. Then setting weights would similarly affect parsimony (through setting pattern frequencies to the rounded weights in Alignment::orderPatternByNumChars()) and likelihood (through setting ptn_freq to the normalized weights).
    This is the simplest design. It uses a single option for both integer and float weights. However, it would have to use the non-PLL parsimony kernel.

I still think that direction 3 is the best solution because it doesn't changes the alignment (in contrast to 1), it treats parsimony and likelihood in the same way (in contrast to 1 and 2), and it gives weights a single universal meaning (much easier to explain and use than in the directions 1 and 2!).

It might be easier for me to implement the pattern regrouping and weight copying features myself, as I've recently done this for ptn_state_freq; moreover I've spotted a couple of other inconsistencies in the merging. I could make a pull request to Nhan's fork-branch if needed.
However, I think the above directions should be discussed first.

@trongnhanuit
Copy link
Copy Markdown
Member Author

Hi @StefanFlaumberg,
Thanks a lot for your thoughtful consideration.
At the beginning, we chose direction 1 because it is compatible with the PLL (but at the cost of needing to extend the alignment).

If allowing switching from PLL to IQ-TREE's parsimony kernel, we can consider directions 2 and 3.
Regarding direction 3 (using a single option for site weights), we actually allow small floating weights (e.g., 0.01). Directly rounding these weights will lead to many zero frequencies. More importantly, we decided to have two options for flexibility: users can apply their own scaling scheme, e.g., max(1, float_w*100), or use their own ways to generate integer weights for building parsimony trees, and then use a separate set of floating weights for likelihood calculations. With that, direction 2 seems more appropriate.

Regarding direction 2, if I understand correctly, it keeps the code for "--site-float-weights" from direction 1 (but normalizing ptn_freq sums to the number of sites). It will change the code for "--site-weights" so that we don't extend the alignment, but we need to switch to IQ-TREE's parsimony kernel. Am I right? If yes, would you mind making the changes, and we'll review them.

Btw, if you find any inconsistencies in the merging, please make a pull request to my fork-branch. Highly appreciate that.

Cheers,
Nhan

@StefanFlaumberg
Copy link
Copy Markdown
Contributor

@trongnhanuit,

Regarding direction 2, if I understand correctly, it keeps the code for "--site-float-weights" from direction 1 (but normalizing ptn_freq sums to the number of sites). It will change the code for "--site-weights" so that we don't extend the alignment, but we need to switch to IQ-TREE's parsimony kernel. Am I right?

Yes, correct. Once we consider the following details, I will try to implement it. Within a week, I shall provide a pull request or shall report if insurmountable problems appear.

The implementation details:
In your current implementation integer weights alone are applied both to parsimony and likelihood, float weights alone are applied only to likelihood, and the combined usage applies integer weights only to parsimony and float weights only to likelihood. In contrast, the Bayesian bootstrap float weights are applied both to parsimony and likelihood (much like integer weights despite being float!).
The float weights overriding integer weights is tricky and might be somewhat confusing. If we choose not to modify the alignment, we can follow this design (let's call it direction 2 originial) or consider adopting more user-friendly designs:

  • direction 2 simplified: Use two compatible options. Use --site-pars-weights to pass integer weights applied to parsimony only. Use --site-lh-weights to pass float weights applied to likelihood only. The Bayesian bootstrap float weights are used in the same way they are used now.
    This provides the maximal usage flexibility by allowing independent tuning of parsimony and likelihood weights. To use a set of integer weights throughout the entire run, one would have just to pass the same file to the both options: --site-pars-weights <int_weights.txt> --site-lh-weights <int_weights.txt>.
  • direction 3: Use a single option --site-weights to provide float (or integer) weights. The weights are rounded towards the closest integer for parsimony and are used as is for likelihood. With the Bayesian bootstrap float weights, I would also use rounding towards the closest integer for parsimony.
    This provides the maximal simplicity: a single option sets weights for the entire run, the weights for parsimony and likelihood are predictably correlated, no need for users to manually round the weights for parsimony. And using a single set of weights significantly simplifies the code and makes it more robust!

I think either of these variants makes usage much more understandable and predictable than direction 2 original or the current version. Since I like direction 3 the most, I will put a couple of additional arguments for it:
Why rounding towards the closest integer is fine:
Since parsimony costs are additive, parsimony should work alright with zero pattern frequencies (hope I'm right, it's the key premise). Then, if we normalize site weights so that they sum to the number of sites, rounding towards the closest integer becomes sensible. What could be another solution anyway? If the normalized weight of a site is 0.01, the site surely should just be excluded from the parsimony analysis. The same applies to the Bayesian bootstrap site weights; what's the rationale behind using max(1, round(weight)) in the current version?
Why using a single set of weights is fine:
In practice, I can imagine none of the usage cases where one would want to use totally different sets of site weights for parsimony and likelihood, or would want to use site weights only for parsimony, or would want to use site weights only for likelihood. They all look unrealistic. Not much more realistic is the need for using a custom weight-rounding scheme for parsimony -- the parsimony analysis is just a heuristic step to build the starting tree, so fine-tuning the weight-rounding scheme has likely almost no impact on the subsequent ML analysis results (unless one uses some very weird rounding).
Providing more flexibility is great, but not when it complicates the code in favour of some unrealistic usage cases.

I hope we could choose the best design before applying any actual changes. What do you think of this?

Best regards,
Stefan

@trongnhanuit
Copy link
Copy Markdown
Member Author

Hi @StefanFlaumberg,

Thank you so much for your willingness to make changes and your thoughtful consideration.

Regarding the Bayesian bootstrap, we want to improve the support for short branches that contain a very limited number of substitutions. So, we use max(1, round(float_weight × scale_factor)) (default scale_factor = 10) rather than the simpler round(weight) for two reasons. First, the max(1, ...) floor ensures that no site is completely excluded from parsimony - even a site with a very small float weight still carries a real substitution signal, and silencing it entirely could bias the starting tree. Second, the scale factor preserves relative distinctions between small weights (e.g., 0.9 -> 9, 1.4 -> 14 with scale_factor = 10). The preliminary results are good and similar to the PhyML implementation, so we should keep the current Bayesian bootstrap behavior.

Regarding the two site weight options, we insist on keeping them separate for flexibility. In particular, in another project, we only want to apply float weights for likelihood calculations while preserving all other steps unchanged. So direction 2 seems more appropriate than direction 3.

I'll leave it to @bqminh to decide the directions. To save your time, here is a summary, Minh.

In the current implementation, integer weights alone are applied both to parsimony and likelihood; float weights alone are applied only to likelihood; the combined usage applies integer weights to parsimony and float weights to likelihood. The current implementation is compatible with PLL, but at the cost of extending the alignment. Stefan is happy to update the implementation to switch to IQ-TREE's parsimony kernel, removing the need to extend the alignment. Three potential directions:

  • Direction 2 original: Keeps the current behavior of both options. Switches to IQ-TREE's parsimony kernel and removes alignment extension. Integer weights alone -> parsimony and likelihood. Float weights alone -> likelihood only. Combined → integer weights for parsimony, float weights for likelihood.
  • Direction 2 simplified: Renames and clarifies the two options. Use --site-pars-weights for integer weights applied to parsimony only, and --site-lh-weights for float weights applied to likelihood only. This removes the current confusion where integer weights behave differently depending on whether float weights are also provided.
  • Direction 3: Merges the two options into a single --site-weights accepting float or integer weights. Weights are rounded to the nearest integer round(weight) for parsimony and used as-is for likelihood. This will also change the current Bayesian bootstrap behavior (replacing the current max(1, round(weight × scale_factor)) with plain rounding of normalized weights). The advantages are maximal simplicity in code and usage. The concerns are: (1) rounding can produce zero frequencies for small weights and lose distinctions between similar weights (e.g., 0.9 and 1.4 both round to 1) unless a fixed scaling scheme is applied internally; (2) it is not possible to apply weights to parsimony alone or to likelihood alone; (3) it changes the existing Bayesian bootstrap behavior.

For me, either Direction 2 original or Direction 2 simplified is more favorable given the current designed features. Direction 3 is a simpler option if we are willing to change the designed features and the Bayesian bootstrap behavior.

@StefanFlaumberg, pls correct me if anything is incorrect or unclear.

Thanks a lot @StefanFlaumberg and @bqminh.

Cheers,
Nhan

@StefanFlaumberg
Copy link
Copy Markdown
Contributor

Hi @trongnhanuit, thanks for the summary! Everything's accurate here.
I'd like to add only a small clarifying note:
For Direction 3 we can implement any weight-rounding behaviour we want and so can keep the Bayesian bootstrap intact as well. Using round(weight) for general parsimony and round(weight * scale_factor) for Bayesian bootstrap parsimony just seems to align well with the central idea that the weights used for parsimony and for likelihood are expected to be highly correlated in 99% of real life usage cases. It is this idea that defines Direction 3, whereas the exact implementation of weight-rounding for parsimony can be considered separately. Therefore, concerns 1 and 3 can be further discussed and solved and I suspect concern 2 to be irrelevant for most usage cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants