Site concordance factor values are radically different between TreeSearch and IQ-TREE #175

HS6986 · 2025-02-06T11:14:45Z


Dear Professor Smith,

I have been analyzing multiple morphological datasets with TreeSearch. When I estimate the support for each clade with QuartetConcordance(), most clades almost always show very high support (mostly >0.9). (I understand that QuartetConcordance() is equivalent to the parsimony-based site concordance factor in IQ-TREE (Minh et al., 2020), and differs from what Lanfear & Hahn (2024) call the quartet concordance factor. Is that correct?)

I thought this might be a problem, so I also calculated (parsimony-based) site concordance factor values in IQ-TREE, which were quite different from those in TreeSearch and generally showed lower values. Is this because TreeSearch's site concordance factor possibly considers inapplicable characters? Or is there something wrong with the implementation of TreeSearch, or IQ-TREE?

Below are the results of my tests using the dataset of Aria et al. (2015), which is a sample file in TreeSearch.

Strict consensus tree with site concordance factor values calculated in TreeSearch

Strict consensus tree with site concordance factor values calculated in IQ-TREE

Here's the folder used for the analysis. I would appreciate it if you could review this.

Answered by ms609


ms609 · 2025-02-06T15:59:17Z

ms609
Feb 6, 2025
Maintainer

I've checked the maths in the TreeSearch QuartetConcordance() implementation, and worked through a couple of examples, and this seems to be behaving as I'd expect it to.

My guess would be that, because IQ-TREE uses a random subsample of quartets where TreeSearch enumerates all quartets, its figures reflect the quartets it happened to sample. If IQ-TREE is configured with much larger datasets in mind, it may sample only a very small proportion of quartets, which in a small dataset like this could lead to statistically misleading numbers (e.g. the alarmingly low '7' near the base of the leanchoiliids). I don't often use IQ-TREE. Could you check whether this is plausible, perhaps by setting different random seeds or replicating the analysis?

Another possibility that I ruled out is that this reflects a difference in how the QC value is averaged across sites. I've introduced this as an option to the user in #176.
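To illustrate why the averaging scheme can matter at all, here is a minimal sketch comparing two generic ways of combining per-site counts into one value. The tallies and both function names are hypothetical, and these are not necessarily the options offered in #176 or used by either program.

```python
# Hypothetical per-site tallies for one branch:
# (number of concordant quartets, number of discordant quartets).
site_counts = [(9, 1), (1, 1)]


def pooled(counts):
    """Sum the counts over all sites, then take a single ratio."""
    concordant = sum(c for c, _ in counts)
    total = sum(c + d for c, d in counts)
    return concordant / total


def mean_of_site_ratios(counts):
    """Take one ratio per site, then average the ratios with equal weight."""
    ratios = [c / (c + d) for c, d in counts if c + d > 0]
    return sum(ratios) / len(ratios)


print(round(pooled(site_counts), 3))               # 0.833
print(round(mean_of_site_ratios(site_counts), 3))  # 0.7
```

Pooling weights each site by the number of quartets it informs, whereas the mean of ratios gives every informative site the same weight, so the two can diverge when sites differ a lot in how many quartets they resolve.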

2 replies

HS6986 Feb 6, 2025
Author

Thank you for your prompt response and addressing my question.

I calculated sCF values in IQ-TREE five times with different seeds using the dataset, with the option --scf increased from 100 to 100000, meaning that IQ-TREE randomly samples 100000 quartets around each internal branch (see the Concordance Factor documentation of IQ-TREE). The values were nearly identical and generally in agreement with the figure above, including the clade around leanchoiliids. Here's the folder used for the analysis.

I don't know much about the theoretical details, but I suspect there is some significant difference between the IQ-TREE and TreeSearch implementations of the sCF.

P.S. From looking at the code, I understand that the sCF in TreeSearch does not consider ambiguous or inapplicable states. I also repeated my experiments with a larger dataset, that of Shao et al. (2020). The sCF values in IQ-TREE were almost identical over several replicates with different seeds, and were generally much lower and perhaps more convincing than those in TreeSearch.
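As a rough check on why replicates with 100000 sampled quartets would be expected to agree, the sketch below draws random samples from a made-up population of quartet outcomes under several seeds. It is a generic sampling simulation, not IQ-TREE's actual procedure; the population, the true proportion of 0.30 and the function name are all assumptions for illustration.

```python
import random
from statistics import pstdev


def sampled_estimate(population, n, seed):
    """Mean of n quartet scores drawn with replacement using a fixed seed."""
    rng = random.Random(seed)
    return sum(rng.choice(population) for _ in range(n)) / n


# Hypothetical population: 1 = concordant quartet, 0 = not concordant.
# The true proportion of concordant quartets is 0.30.
population = [1] * 300 + [0] * 700

results = {}
for n in (100, 100_000):
    # Five replicates, each with a different seed.
    results[n] = [sampled_estimate(population, n, seed) for seed in range(5)]
    spread = pstdev(results[n])
    print(n, [round(e, 3) for e in results[n]], round(spread, 4))
```

The standard error of a sampled proportion shrinks roughly as 1/sqrt(n), so with 100000 draws the seed-to-seed spread should be far too small to account for a large gap between two programs.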

HS6986 Feb 10, 2025
Author

This is just an intuitive guess, but I think TreeSearch's QuartetConcordance() may underestimate discordance.

Perhaps the proper approach would be to consider the two alternative topologies for each quartet, calculate DF1 and DF2, and then use the sum of the two as the discordance factor?
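For reference, the three quantities can be written out for a single quartet q as follows. This is a sketch assuming the parsimony-based definition in Minh et al. (2020); the symbols n_1, n_2 and n_3 are introduced here for illustration and denote the numbers of decisive sites supporting the topology on the tree and each of the two alternative topologies.

```latex
\mathrm{sCF}_q  = \frac{n_1}{n_1 + n_2 + n_3}, \qquad
\mathrm{sDF1}_q = \frac{n_2}{n_1 + n_2 + n_3}, \qquad
\mathrm{sDF2}_q = \frac{n_3}{n_1 + n_2 + n_3}

\mathrm{sCF}_q + \mathrm{sDF1}_q + \mathrm{sDF2}_q = 1
\quad\Longrightarrow\quad
\mathrm{sDF1}_q + \mathrm{sDF2}_q = 1 - \mathrm{sCF}_q
```

Under this normalisation the summed discordance is simply the complement of the concordance, so omitting one of the two alternative topologies from the denominator would inflate the concordance value.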

ms609 · 2025-02-10T16:50:08Z

ms609
Feb 10, 2025
Maintainer

I've looked back over Minh et al. 2020, and think I see where the difference in behaviour lies.

The interpretation of a branch in Minh et al. (2020) is shown in their Fig. 1:

They interpret a branch as defining four clades, and thus only consider quartets that contain one representative of each surrounding subtree, A, B, C and D.

My implementation interprets a branch as defining two clades – i.e. I consider any quartet that contains two taxa from (AB), and two taxa from (CD).

Since we are sampling from a different subset of quartets, it makes sense that we obtain different values.

I haven't yet wrapped my head around why the Minh et al. implementation would be preferable. I have some scripts lying around that I used to compare the performance of different edge support metrics – it would be interesting to see how the two implementations compare. I'll see whether I can implement the Minh et al. approach, and update the TreeSearch documentation accordingly.
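The two readings of a branch can be made concrete by enumerating the quartets each one admits. This is a minimal sketch with made-up subtree memberships and hypothetical function names; it is not code from TreeSearch or IQ-TREE.

```python
from itertools import combinations, product

# Hypothetical subtrees around one internal branch:
# A and B lie on one side of the branch, C and D on the other.
A = ["a1", "a2"]
B = ["b1"]
C = ["c1", "c2", "c3"]
D = ["d1"]


def four_clade_quartets(A, B, C, D):
    """One taxon from each of A, B, C and D."""
    return {frozenset(q) for q in product(A, B, C, D)}


def two_clade_quartets(A, B, C, D):
    """Any two taxa from (A + B) combined with any two taxa from (C + D)."""
    left, right = A + B, C + D
    return {
        frozenset(l + r)
        for l in combinations(left, 2)
        for r in combinations(right, 2)
    }


four = four_clade_quartets(A, B, C, D)
two = two_clade_quartets(A, B, C, D)

print(len(four))    # 6  = |A| * |B| * |C| * |D|
print(len(two))     # 18 = C(3, 2) * C(4, 2)
print(four <= two)  # True: the four-clade quartets are a subset
```

The extra quartets in the two-clade set are those that take both of their taxa on one side from the same subtree, for example {a1, a2, c1, c2}.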

8 replies

ms609 Feb 17, 2025
Maintainer

Ah, good catch, thanks for pointing this out!
Tests added to the test suite so I won't overlook this.

ms609 Feb 17, 2025
Maintainer

Now I come to think about this, I don't see how multifurcating trees can be handled – as the Minh et al. definition requires us to select a leaf from each of the four subtrees defined by a branch: if there is a polytomy, then a branch cannot be interpreted in this manner.

HS6986 Feb 17, 2025
Author

That's certainly true. However, IQ-TREE's sCF at least superficially supports multifurcating trees. We may need to ask the IQ-TREE developers or check their implementation.

HS6986 Apr 22, 2025
Author

Dear Professor Smith @ms609,

Although this matter may have already been resolved, the sCF function in IQ-TREE seems to be able to handle multifurcating trees correctly (iqtree/iqtree2@cd41adf).

ms609 Apr 23, 2025
Maintainer

OK – come to think of it I suppose the presence of a polytomy does not make the sCF incalculable elsewhere in the tree. I'll return to this once we have a clear picture of how the IQ-TREE sCF is calculated in practice.
