This is a pretty technical post seasoned with some personal memory. I hope I added sufficiently many references such that an eager reader without background in genome rearrangement can understand this post, even if it needs quite a lot of effort.
The Braga-Stoye Markov chain
Sophia Yancopoulos is a very interesting person. She visited RECOMB-CG ’09 in Budapest literally for one coffee break (coming from the U.S…). She arrived at the beginning of the break, pulling her cabin luggage and laughing. Spent a few words with a few of us, and took a taxi to catch the flight back to the U.S. It was a very busy period of her life, but she couldn’t resist the temptation to attend to this annual workshop, she explained.
Her double cut and join model (DCJ) revolutionized the research on genome rearrangement. Although the DCJ model is not quite realistic from a biological point of view, it is similar to the more realistic reversal (or Hannenhalli-Pevzner) model, and computationally much simpler. And at least we know something about its counting complexity. In fact, we do know that the number of most parsimonious DCJ solutions between two genomes is in FPRAS and FPAUS.
Having a trilateral grant with Eric Tannier and Haris Gavranovic, we had a meeting in Sarajevo in 2009 December to work on genome rearrangement problems. We were in a pub on Friday evening, the 11th of December, 2009, when I realized the fact what is now Theorem 15 in this paper. “It must be a rapidly mixing Markov chain” I said enthusiastically. However, it took a while to work out the details since we failed at the point that looked the simplest at first glance.
(Cultural intermezzo: Sarajevo is a beautiful city, a rendezvous of western and eastern culture. The coffee were served with sugar cubes during the coffee breaks, and the Bosnian students attending to our brainstorming sessions proffered the sugar cubes to each other with the question: “kocku?” Kocka means cube in Hungarian, here ‘c’ should not be pronounced as for example in ‘cigarette’ but with a sound similar to what a cricket gives. The conjugation a -> u is the objective case of female words in Slavish languages, recalling my weak memories of Russian language learning. So I could understand ‘kocku?’, and although I am not much of a linguist, I can pretty much enjoy such discoveries.)
So the simplest case (on which we worked in Sarajevo between coffee breaks) is when the adjacency graph is a single cycle. We have a closed formula for the number of DCJ scenarios in this case and we also know the structure of the solution space, thanks to Aïda Ouangraoua and Anne Bergeron. We know that small alterations are sufficient to get an irreducible Markov chain, ie. to get from any solution to any other solution with a series of such small alterations. What we do not know whether or not this Markov chain is rapidly mixing even in this simplest case. That’s why eventually we came up with a different solution: we used the fact that the solutions of the simple cases can be counted and sampled from the sharp uniform distribution in polynomial time, we factorized (partitioned) the state space of an arbitrary case into such simple cases and worked out a rapidly mixing Markov chain for the general case walking on the partitions. It was sufficient to prove that the number of most parsimonious DCJ scenarios is in FPRAS and FPAUS.
Marilia Braga and Jens Stoye proved that whatever the two genomes are, small alterations are sufficient to transform any solution to any other solution. In fact, the small alterations are the change of two (!!!) consecutive steps. It is easy to engineer a Markov chain Monte Carlo converging to the uniform distribution of the solutions using the small alterations that Marilia and Jens described, and we call this Markov chain the Braga-Stoye Markov chain. Although we already know that the number of most parsimonious DCJ scenarios are in FPRAS and FPAUS (and any proof or disproof on the rapid mixing of the Braga-Stoye Markov chain will not change this fact), I would go mad to see a proof (or disproof…) that this Markov chain is rapidly mixing. In the next section, I will tell you why.
The big project
Timothy Brooks Paige was my summer student in 2005. We discovered the ‘big islands’ in the solution spaces of reversal sorting scenarios. Consider two hurdles separated by a small oriented cycle (for a definition of hurdles and oriented cycles, see this presentation). The possible first sorting steps fall into three categories:
- merging the two hurdles
- cutting a hurdle
- sorting the oriented cycle
It is easy to show that any solution starting with the merge of two hurdles must finish with the sorting of the oriented cycle. This oriented cycle cannot be sorted earlier after the hurdle merge: either it is an unoriented cycle or the sorting of it would create a new hurdle. On the other hand, this cycle can anytime be sorted after a hurdle cut. What follows is that the solutions starting with a hurdle merge do not share any intermediate genome with the solutions starting with other sorting steps, except the last intermediate genome containing only the small oriented component. Namely, if we represent the solutions with a directed graph, in which the vertices are the start, the end and the possible intermediate genomes, and there is an edge from v to w iff there is a reversal transforming v into w, then there is a ‘big empty island’ between the two flows of solutions. With other words, if one would like to design an MCMC on the most parsimonious reversal sorting scenarios that cut out a part of the scenario, and draw a new path between the two intermediate genomes at the two ends of the resampled window, then this Markov chain will not be irreducible unless an arbitrary large window can be cut out. Compare this with the Braga-Stoye chain where a size two window resampling is sufficient to get an irreducible Markov chain.
A PhD student of Bernard Moret, Krister Swenson visited me in 2007 autumn, and he showed me a much simpler case having this big empty island property. Following his ideas, I was able to find the simplest possible case: the solution space consists of only two paths not sharing any intermediate genome. This was the key to our theorem proving the following. If one creates a Markov chain that walks on the most parsimonious reversal sorting scenarios by cutting out an arbitrary window and resampling it using sequential importance sampling, then this Markov chain will be torpidly mixing. And nobody has any better idea than using sequential importance sampling to resample a window…
So the only known simple (ie. not applying parallel chains) and irreducible Markov chain walking on the most parsimonious reversal scenarios is also known to be torpidly mixing. The big project is to design an irreducible and rapidly mixing Markov chain on the most parsimonious reversal scenarios. I have a few ideas how to design such Markov chains, and one of them is related to the Braga-Stoye Markov chain.
Cooling down DCJ sorting scenarios to Hannenhalli-Pevzner scenarios
When an MCMCer bumps into a problem of slow mixing, a handsome solution is Parallel Tempering. But when the target distribution is the uniform one, there is no use of heating in the usual sense. However, we had the following idea: consider the DCJ sorting scenarios. Some of them will have many circular chromosomes in the intermediate genomes, and some of them will have none. These later are the Hannenhalli-Pevzner solutions, having only reversals in case of unichromosomal genomes and only reversals, reciprocal translocations, end-chromosomal translocations, chromosome fusions and fissions, ie. Hannenhalli-Pevzner type rearrangements in case of multichromosomal genomes. So we can work out an energy function that counts the number of circular chromosomes along the DCJ sorting path, and then the corresponding Boltzmann distribution at low temperatures will be dominated by the HP solutions (and restricting the distribution to it is the uniform distribution), and at infinite temperature, it will be the uniform distribution of DCJ solutions. We proved the following (Andrew Wei Xu also had a remarkable contribution on these proofs):
- Having a topology defined by small alterations, the energy landscape is such that all local minima are global, ie. Hannenhalli-Pevzner solutions. Namely, the cooling process cannot be trapped into a minimum which is not global.
- a polynomial number of chains is sufficient with the following properties:
- the coldest chain is dominated by the HP solutions
- the hottest chain has infinite temperature
- the exchange probability (expressed as a Metropolis-Hastings ratio) is at least 1/2 between any two neighbor chain.
I remember this was the first case we had some computer aided proof. Eric Tannier did not believe in the first property, he gave me sort of pathological cases, and I wrote a Monte Carlo program to search for local minimum which happened to be always global. Looking at the particular ways how the global minimum can be reached, we managed to understand the background structure and work out a (now pure mathematical) proof.
These are pretty promising properties, but they do not prove rapid mixing in any sense. Just one thing: if we apply small perturbations in the individual chains, then the hottest chain will be the Braga-Stoye Markov chain. Shall I explain more why I am keen to know the rapidness of the Braga-Stoye chain?
MCMC stands for Markov chain Monte Carlo. MCMCMC is simply parallel tempering, but funny bioinformaticians call it Metropolis Coupled Markov chain Monte Carlo (bioinformaticians love to give funny names). Together with Aaron Darling, we managed to sneak in one more MC to get MCMCMCMC: Model Changing Metropolis Coupled Markov chain Monte Carlo. It is a parallel Markov chain Monte Carlo, ie. MCMCMC, with the additional complication that the Markov chains do not walk on the same state spaces, so the states does not simply exchanged, but transformed. In our particular case we developed for reversal sorting scenarios, the target distributions were the uniform distributions of k-long prefixes of most parsimonious reversal sorting scenarios, one chain for each k. The exchange between two Markov chains means the (random) elongation of the shorter prefix and the (deterministic) chop of the longer prefix, applying the appropriate Metropolis-Hastings ratio. At least it is an idea. If all moves are exchanges, then this Markov chain remain torpidly mixing, in spite of looking sophisticated, the same counterexample we developed to the window sequential importance sampling suffices to prove it. However, with exchange and small alterations in the individual chains, this MCMCMCMC is rapidly mixing on the same particular example. An open question if it is always rapidly mixing.
And the famous four reversal conjecture
As I mentioned, small alterations do not make a Markov chain irreducible on the most parsimonious reversal scenarios. As long as small alterations means changing a few consecutive steps, and the solutions are represented by the series of intermediate genomes. I have a conjecture that small alterations might suffice in the following way:
Let Sigma be the alphabet representing possible DCJ operations on k long permutations in the following way: a character (a b | c d) in Sigma means that extremities a and b form an adjacency, c and d also form an adjacency, and the DCJ operation shuffles b and c. With this notation, (a b | c d) ~ (b a | d c) ~ (d c | b a) ~ (c d | a b). Equivalent characters are treated as the same characters.
Let Pi be a k long, hurdle-free signed permutation. Construct a graph whose vertices are the possible most parsimonious reversal scenarios represented as sequences over Sigma. Connect two vertices iff the length of their longest common subsequence is at least the length of the individual sequences minus four. The conjecture is that this graph will be connected whatever Pi is. With other words, the conjecture is that the following Markov chain is irreducible: start with an arbitrary most parsimonious reversal scenario, represent it as a series of DCJ operations, rule out 4, not necessarily consecutive steps, add 4 new DCJ operations, not necessarily to the same place, not necessarily to consecutive places, and keep it if the resulting sequence is a meaningful, most parsimonious reversal scenario (with appropriate Metropolis-Hastings ratio), otherwise the Markov chain remains in the same place. The permutation -1 -2 -3 -4 is a counterexample that less connection is not sufficient for irreducibility.
Good luck proving this conjecture… and if this conjecture is true, then we still do not know the mixing time of this Markov chain, so some proof of rapid mixing is needed. Or disproof, and then this is another dead end in this project…
There is a simpler version of the four reversal conjecture on black and white graphs, so far we have partial results on linear graphs with two undergrad students, Eliot Bixby and Toby Flint. Some hope.
Isn’t the Braga-Stoye Markov chain more promising?
So we need a proof (or disproof) of rapid mixing of the Braga-Stoye Markov chain.
Furthermore, let SBR denote the problem of sorting by reversals (finding a most parsimonious reversal scenario), and let #SBR denote the corresponding counting problem (counting the number of most parsimonious reversal scenarios). We are looking for the computational complexity of #SBR. The possibilities are:
- it is in FP. Very unlikely. Adam Siepel had a paper on the algorithmics of looking one step ahead, it is already messy, and there is a clear combinatorial explosion as we go further, unclear how to handle.
- It is in FPAUS (and thus FPRAS since the problem is self-reducible). I am keen to prove this, but you can see that it seems to be extremely hard.
- It can happen that it is not in FPAUS (and thus not in FPRAS), at least under some reasonable computational complexity assumptions, ie. assuming that RP is not NP. I am planning to have a post on speculating that there seems to be always some NP-complete problem in the background of a non-approximability result, and well, there are plenty of NP-complete problems in genome rearrangement. It could be the case that all of our previous effort were just waste of time and energy, but if this is the case, then at least I would like to see some mathematically rigorous proof of it…
- Oh, and if somebody could prove that it is in #P-complete, I would pretty much appreciate it…