What to count? Prelude and fugue


You do not have to understand it in details to understand the main part of this post. Just enjoy the music of it 🙂

Meidanis and Feijao introduced the SCoJ model, the Single Cut or Join model in genome rearrangement. A model where the small parsimony problem is in P, wow, nice, isn’t it? Eric Tannier drew my attention to this paper and suggested we should work on approximative sampling of most parsimonious SCoJ scenarios on a rooted binary tree. It looked pretty similar to our DCJ sampling and counting work. In case of DCJs, we had an auxiliary graph in which vertices represents AA and BB paths (W-shaped and M-shaped paths), edges represent joint sorting of these components. Each 1-factor represents a subset of DCJ scenarios, naturally, which components are sorted together and which one independently, and we defined a rapidly mixing Markov chain on the 1-factors converging to the distribution proportional to the number of corresponding DCJ scenarios. In case of SCoJs, we can define some auxiliary graph in which vertices represents extremities and edges possible adjacencies in ancestral genomes. For a given 1-factor, it is easy to count the number of SCoJ scenarios, so we can define a Markov chain, the task is to prove its rapid mixing.

After a while we started scratching our heads, and later on it turned out that it is very unlikely that there is any such rapidly mixing Markov chain, since it would imply that RP = NP, in which we do not believe. So good I moved to Theoretical Computer Science, the negative result is a result in mathematics, even if it is not as groundbreaking as the Galois theorem…

But I hoped that at least we can have some positive results on medians. The SCoJ median of odd number of genomes is unique, actually, as Meidanis and Feijao already showed. It is not unique for even number of genomes, but hopefully the solution space is not as complicated as for arbitrary binary trees. At least I hoped, and I gave it as a summer project to Heather Smith, a PhD student at University of South Carolina. Poor girl… But at least she had the opportunity to show how clever she was: she also started to scratch her head that cannot copy our Lemma 22 (in this paper), and indeed, she gave a counterexample that the analogue lemma is simply not true. And after some careful inferring, it turns out that problems start even if we drop the SCoJ model, and consider only independently evolving characters, namely, classical phylogeny.

So this is the Prelude, and now I try to write down the remaining in an understandable way…


The fugue will be understandable, but something very peculiar, in its most complicated, complex, baroque style. If I want to introduce it in a more raw and profane way, I would say, “what the bloody hell is going on???”. Or if you would like to learn some Hungarian cursing: “mi a halál redvás fasza folyik itt???”. Definitely the strangest thing I ever see in mathematical phylogeny.

Imagine a star tree, the leaves are labelled with sequences over a finite alphabet. For sake of simplicity, we will work with the easy {0,1} alphabet until we say something else. Each string is equally long and we are looking for the median of them, namely, a string that minimizes the sum of Hamming distances from these strings. Trivial problem, isn’t it? Just take the majority of the characters in each position of the median string. We would like to  to label the center of the star with it, and consider how this median is transformed into the sequences at the leaves.

If the number of strings are even, and the alphabet size is 2, as we fixed it, then there might be multiple solutions in each position. No problem, the solutions might be combined in an arbitrary way, so still it is easy to count how many medians we have. However, assume that we would like to count more: we also would like to count how many ways this median can be transformed into the sequences given at the leaves of the star tree. And here comes something which does not seem to be a big thing but makes really-really a difference.

We can look at the star tree as an unresolved rooted tree, and count how many way the mutations might come in time. See this example:


We will call such a solution as a most parsimonious scenario. There are 6 leaves and six mutations necessary to transform the median sequence at the root to the sequences at the leaves. These mutations might come in any order, so there are 6! possible most parsimonious scenario. This is a general phenomena: whatever the median is, the number of mutations are always the same, so the number of most parsimonious scenarios are the same for each median.

One can also imagine the usual star tree, and then count how many ways the mutations might come on each edges of the star tree, see this:


The sequences at the leaves are the same as in the previous case, just we have a different median. There are still 6 mutations, but now they fall onto 4 edges in a 2, 2, 1, 1 split, and hence the possible most parsimonious scenarios for this median is 2! 2! 1! 1! = 4.

Got it? So in the later case, we distinguish in which order the mutations come at each edge, but does not compare the time of the mutations at different edges.

Also note that different medians have different number of most parsimonious scenarios. Consider the previous median, 1001, that would have only 2 most parsimonious scenario since the 6 mutations would fall onto 5 edges in a 2, 1, 1, 1, 1 split.

And this makes the difference if we would like to count the number of most parsimonious scenarios summed over all possible medians: in the first case it is simply 2k (k n/2+c)!, where k is the number of positions with ambiguous ancestral characters, n is the number of leaves and c is the number of other mutations. In the second case… I really, really have no idea… Different medians have different most parsimonious scenarios and it is unclear if this can be counted quickly or even approximately.

There are the following open questions:

  • What is the median that maximizes the number of most parsimonious scenarios?
  • It is possible to calculate the number or most parsimonious scenarios in polynomial time or is it a #P-complete counting problem?
  • Is it possible to count/sample the most parsimonious scenarios approximately in a stochastic manner? Namely, is there any FPRAS and/or FPAUS algorithm for them? Or is it a problem that cannot be approximated under some reasonable assumption (assuming that RP is not NP)?

What we know is the mixing behavior of the simple Markov chain that walks on the possible median genomes by changing one position in the ancestral sequence at a time and converges to the distribution proportional to the number of most parsimonious scenarios belonging to the actual median. It is torpidly mixing even if the number of leaves is constant 4. Ouch… But this does not prove that there is no FPAUS for this problem…

And please note if we go back from the second model to the first one, the problem is almost trivial, we have a closed form for the number of most parsimonious scenarios that is easy to calculate.

When we wrote our SCoJ paper I give a seminar on it with the title: “How can an easy counting problem be transformed into a hopelessly hard one? Well, very easily…”. Well, indeed, and here something similar happens again: the difference between the two problems is a looking innocent multinomial factor which stands for how many ways we can merge the mutation events on the edges of the tree if we consider they order in a time scale.

Have you ever seen anything like this?


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s