Replies: 4 comments 1 reply
Hello,

Complicated questions! I would say generally that you are looking for convergence in parameter values (like lambda) and not necessarily likelihoods. You could also report a mean and range of parameter values among your runs, if you wanted to be more transparent.

As for failures, this generally means that a particular parameter value used with a particular gene family results in an incalculable number. I would only really worry if your estimated lambda is near a region of parameter space with many failures.

Matt
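One way to report a mean and range of lambda across runs is sketched below. This is only an illustration: `summarize_runs`, the `rel_tol` threshold, and the ten lambda values are hypothetical and are not CAFE5 output or an official convergence criterion.

```python
from statistics import mean


def summarize_runs(lambdas, rel_tol=0.01):
    """Summarize lambda estimates from independent runs.

    rel_tol is an arbitrary, user-chosen threshold (an assumption),
    not a value recommended by CAFE5.
    """
    lo, hi = min(lambdas), max(lambdas)
    avg = mean(lambdas)
    # Range of the estimates relative to their mean.
    spread = (hi - lo) / avg if avg else float("inf")
    return {
        "mean": avg,
        "min": lo,
        "max": hi,
        "relative_spread": spread,
        "converged": spread <= rel_tol,
    }


# Hypothetical lambda estimates from ten runs (made up for illustration).
example_lambdas = [
    0.00671, 0.00670, 0.00671, 0.00672, 0.00671,
    0.00670, 0.00671, 0.00671, 0.00672, 0.00670,
]
summary = summarize_runs(example_lambdas)
print(summary)
```

Reporting the minimum and maximum alongside the mean lets readers judge for themselves how tightly the runs agree.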
Hi,

Sorry for the delay in answering. This seems like a rather complicated dataset, so it's a bit hard to know what's going on from here.

One thing I'm confused about is k and lambda: for a gamma model with k categories, there will also be k mean lambda values, one for each category. Are these the lambdas you mean, or are you also fitting varying numbers of branch-specific lambdas?

Regardless, I would be interested to know what lambda you are getting without the gamma model. Is it near the limit?

Matt
Hi Matt,

Thank you so much for your response. I previously misunderstood and thought I could not see the output for the k mean lambda values for each gamma rate category and could only obtain the branch-specific lambdas via the Gamma_results.txt file. Your response to issue #246 helped me understand that the Gamma Cat Mean column of Gamma_family_likelihoods.txt represents these k mean lambda values.

If I am now understanding correctly, the k mean lambda values from Gamma_family_likelihoods.txt represent multiplier values for rates of gene family evolution and would be what I am more interested in (in addition to which gene families have expanded and contracted). The number of discrete rate categories (k) you define determines how the model discretizes the gamma curve (in other words, it changes how many mean rates of gene evolution are tested to fit each gene family). For example, k = 2 would give you two mean rates, one high and one low, and whichever gamma cat mean fits the gene family better will be significant.

Since I now understand better, I can better answer your question about the limit. Without the gamma model, the base lambda is well below the limit:

With the gamma model (k = 2 and a single branch lambda not specified by a tree), the single global lambda is again not near the limit:

Increasing the number of gamma rate categories (for example, k = 3 with a single branch lambda) follows this same pattern. The provided single global lambda is 0.0067072431833528 while the k mean lambda values are 0.00443491, 0.288718, and 2.70685. When you multiply 1.92109 and 0.0067072431833528, you again get the limit.

My questions:

Thank you for reading my long discussion post. I write a lot, in part, to hopefully help others follow along in the future if they encounter similar questions. I look forward to hearing your response. You have been incredibly helpful while I learn and I cannot thank you enough.

Best,
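The multiplier arithmetic described in this post can be sketched as follows, assuming (as the post does) that each Gamma Cat Mean value scales the single global lambda. The helper name `category_rates` is hypothetical; the numbers are the k = 3 values quoted above.

```python
def category_rates(base_lambda, multipliers):
    """Scale a single global lambda by each gamma category multiplier.

    Assumes the Gamma Cat Mean values act as multipliers on lambda,
    following the interpretation given in the post.
    """
    return [base_lambda * m for m in multipliers]


# Values quoted in the post for k = 3 with a single global lambda.
base_lambda = 0.0067072431833528
gamma_cat_means = [0.00443491, 0.288718, 2.70685]

rates = category_rates(base_lambda, gamma_cat_means)
for multiplier, rate in zip(gamma_cat_means, rates):
    print(f"multiplier {multiplier:g} -> effective lambda {rate:.6g}")
```

Each product can then be compared by hand with the maximum lambda CAFE5 reports for the tree, to see whether any category sits near the limit.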
Hi Tatum,

First, a clarification: there is no sense of "significance" with the gamma categories, either for families within a category or in the choice of k. That is, families do not fit one category or another "significantly" better--they just have posterior probabilities of membership. Likewise, there is no statistical way to pick the best k, with simulations or otherwise.

As to your other questions:

cheers,
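To illustrate what "posterior probabilities of membership" means in general, here is a minimal sketch that applies Bayes' rule to one family's per-category log-likelihoods. It assumes equal prior weight for each category unless priors are supplied; the function name and the example log-likelihoods are hypothetical, and this is not CAFE5's own code.

```python
import math


def category_posteriors(log_likelihoods, priors=None):
    """Posterior probability of each rate category for one gene family.

    log_likelihoods: ln P(family data | category), one per category.
    priors: prior weight per category; equal weights are assumed if omitted.
    """
    k = len(log_likelihoods)
    if priors is None:
        priors = [1.0 / k] * k
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_joint)
    weights = [math.exp(x - shift) for x in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical per-category log-likelihoods for a single family (k = 3).
posteriors = category_posteriors([-42.0, -40.0, -45.0])
print(posteriors)
```

The output is a set of probabilities that sum to one, so a family is described by its degree of membership in each category and not by a significance test.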
Hello CAFE5 Forum,
I am new to using CAFE5 and have a few questions about evaluating convergence.
My plan is to follow previous discussion forum post methods to determine which model best fits my data as follows:
To complete the second step and compare models, I want to use the appropriate lambda values. My understanding is that you want to use ones that have converged, but I am confused about the definition of convergence and about what to do if, across ten runs, the values do not converge.
In the “Caveats” section of the CAFE5 tutorial, it is advised that one should always perform multiple runs to ensure convergence. I was wondering if you could clarify this caveat. Is convergence considered a prerequisite for using a model? If so, what is considered an appropriate amount of convergence? All the same value? Majority of runs have the same value? Runs being within a certain percentage of each other?
For example, I ran my global model 10 times, each writing to a separate output directory. All of my -lnL values were the same, with essentially the same resulting lambda. On the other hand, once I introduce multiple lambdas with multiple rate categories, there is greater variation between -lnL values. As I increase the number of lambdas and/or rate categories, I see less and less convergence of my -lnL values. The tutorial also states that “For all but the simplest of data sets, searching for multiple lambdas with multiple rate categories will result in a failure of convergence to a single optimum between runs”. If failure of convergence is expected, then why test for it in more complex models in the first place? If it fails to converge, then how do you select lambda values to test on a simulated dataset?
Relatedly, what if the -lnL are more varied, but the lambda values seem to generally converge?
Example dataset:
*All runs used a uniform root distribution, an ultrametric tree in millions of years (~20 mya), and no error model.
K=2
Global
For each of the 10 runs, -lnL and lambda are the same.
Lambda = 2
Of the 10 runs, 7 have basically the same lambda values, with -lnL values within 25 of each other (209495-209520) and no families with failures. The remaining 3 have much more variable -lnL values (216712-238671), lambda values dissimilar to each other and to the majority of the runs, and at least one gene family with failures.
Lambda = 3
-lnL values range from 203703 to 224745. Seven of the ten runs have no gene family failures, while 3 of the runs have one family with failures (and up to 11% of attempted optimizer values rejected). The third lambda result is essentially zero.
K=3
Global
Of the 10 runs, half have at least one gene family with failures and the other half have many more. -lnL values range from 212107 to 216571 (with 6 of the runs converging on the same value of 212107).
Multiple Lambda
Lambda = 2 and Lambda = 3 also progressively increase the variance in -lnL within this rate category of k = 3.
K=4
Most of the runs, regardless of lambda, have some degree of attempted optimizer values rejected.
Lastly, what relationship do rejected attempted optimizer percentages have with gene family failures? Is there a threshold of concern for either? I've seen previous posts on gene family failures, but less on rejected optimizer values and the implications of the combination of both.
Thank you in advance for your guidance and patience.