CSTB team: Complex Systems and Translational Bioinformatics

Theoretical Bioinformatics



INTRODUCTION

Since it began in 1983, the research in theoretical bioinformatics carried out by Christian Michel has focused on identifying properties of genes. This work has produced 71 papers in international refereed journals in two disciplines, bioinformatics-biomathematics and theoretical-combinatorial computer science, including 12 single-author articles and 39 two-author articles.

Three results are considered major by the bioinformatics community: (i) the discovery of circular codes in genes and their statistical and combinatorial study; (ii) stochastic models of gene evolution by substitution of genetic motifs, first for linear evolution (an extension of the classical nucleotide models) and then generalized to non-linear and pseudo-chaotic evolution; and (iii) stochastic models of gene evolution by substitution, insertion and deletion of genetic motifs. The theory of circular codes is currently the subject of numerous developments in combinatorics, bioinformatics and biology by different groups of researchers.

Results have also been obtained in other areas of bioinformatics: identification of signals in genes; computational models of gene evolution (rational languages, stochastic automata, Markov mixtures); phylogenetic distances and their inference methods; and the development of research software in bioinformatics.

SCIENTIFIC ACTIVITIES

Research in theoretical bioinformatics currently focuses on circular codes, from bioinformatics to combinatorics; stochastic models of gene evolution by substitution, insertion and deletion of genetic motifs; and the inference of genetic networks.

Combinatorial study of the circular codes of dinucleotides and trinucleotides (Christian Michel)
A new concept, the so-called "necklace", makes it possible to describe varieties of comma-free codes and circular codes. Its generalization builds a theoretical bridge between comma-free codes and circular codes, two classes of codes that had until then been studied separately. We identify a new class of codes, the strong circular codes, which are more constrained than comma-free codes. Dinucleotide circular codes (2-letter words over a 4-letter alphabet) are identified and characterized by properties of their prefixes and suffixes. More recently (2016), a graph-theoretic approach has yielded new theorems for circular codes whose words have any finite length over a finite alphabet.
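The graph approach can be illustrated with a short sketch. It assumes the characterization used in the 2016 graph-theoretic work: a code whose words all have the same length is circular exactly when its associated directed graph, which has one edge from each proper prefix of a word to the remaining suffix, contains no cycle. The function names `code_graph` and `is_circular` are illustrative and do not come from the team's software.

```python
from collections import defaultdict


def code_graph(words):
    """Directed graph of a code: for each word, one edge prefix -> suffix
    for every way of cutting the word into two non-empty parts."""
    graph = defaultdict(set)
    for word in words:
        for cut in range(1, len(word)):
            graph[word[:cut]].add(word[cut:])
    return graph


def is_circular(words):
    """Return True if the graph of the code is acyclic (assumed criterion
    for circularity of a code whose words share the same length)."""
    graph = code_graph(words)
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = defaultdict(int)

    def has_cycle(vertex):
        # Depth-first search; meeting a vertex still in progress closes a cycle.
        state[vertex] = IN_PROGRESS
        for successor in graph.get(vertex, ()):
            if state[successor] == IN_PROGRESS:
                return True
            if state[successor] == UNSEEN and has_cycle(successor):
                return True
        state[vertex] = DONE
        return False

    return not any(state[v] == UNSEEN and has_cycle(v) for v in list(graph))


# Dinucleotide examples: {AC, CA} contains a word and its circular permutation.
print(is_circular({"AC", "CG"}))
print(is_circular({"AC", "CA"}))
```

For dinucleotides the graph simply has one edge from the first letter to the second, so the test reduces to checking that the letters can be ordered without a loop.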

Probabilistic models of gene evolution by substitution of genetic motifs (Emmanuel Benard, Christian Michel)
The classical nucleotide evolution models (Jukes and Cantor, 1969; Kimura, 1980, 1981) are generalized to genetic motifs of any finite size with a mathematical approach based on Kronecker operators (product and sum). These extended models give the exact occurrence probability (analytical solution) of a genetic motif of any size (dinucleotides, trinucleotides, etc.) over time as a function of the substitution parameters (transitions and transversions) associated with each site of the motif. Evolution can be studied in the direct sense (from the past to the present) and also in the inverse sense (from the present to the past).
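To show how the Kronecker product lifts a nucleotide model to motifs, here is a minimal sketch under a strong simplifying assumption: each site evolves independently under its own Jukes-Cantor model. In that case the rate matrix of the motif is the Kronecker sum of the site rate matrices, and its transition matrix is the Kronecker product of the site transition matrices. The function names and the numerical rates are hypothetical; the published models are more general than this illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def jukes_cantor_matrix(alpha, t):
    """4x4 transition matrix P(t) of the Jukes-Cantor model, where alpha is
    the rate of each of the three possible substitutions at a site."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay   # probability of observing the same nucleotide
    diff = 0.25 - 0.25 * decay   # probability of one given different nucleotide
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def kronecker_product(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def motif_matrix(site_matrices):
    """Transition matrix of a motif whose sites evolve independently."""
    result = [[1.0]]
    for matrix in site_matrices:
        result = kronecker_product(result, matrix)
    return result


def motif_index(motif):
    """Row/column index of a motif, first site most significant."""
    index = 0
    for letter in motif:
        index = 4 * index + NUCLEOTIDES.index(letter)
    return index


# Dinucleotide example with a different (illustrative) rate at each site.
p = motif_matrix([jukes_cantor_matrix(0.1, 2.0), jukes_cantor_matrix(0.3, 2.0)])
print(p[motif_index("AC")][motif_index("AG")])  # P(AC -> AG) after time 2.0
```

The 16 x 16 matrix `p` is indexed by dinucleotides; adding a third site matrix gives the 64 x 64 trinucleotide case.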

Probabilistic models of gene evolution by substitution, insertion and deletion of genetic motifs (Sophie Lèbre, Christian Michel)
There are only two or three classes of probabilistic models of gene evolution that involve substitution, insertion and deletion of nucleotides together. One reason is the mathematical difficulty, both in the modeling and in the determination of analytical solutions. We develop a new, more general class of evolution models in which the insertion and deletion parameters are explicit parameters independent of the substitution parameters and in which, in addition, the insertion rate decreases as the length of the sequence grows.

The approach introduces a concept derived from population dynamics to obtain a system of differential equations that combines the classical substitution process with the insertion/deletion process. By deriving a general solution valid for any diagonalizable substitution matrix, we obtain an analytical expression of the occurrence probability of the nucleotides as a function of time, the eigenvalues and eigenvectors of the substitution matrix, the vector of nucleotide insertion rates, the total insertion rate, the initial and maximum lengths of the sequence, and the vector of initial nucleotide probabilities. The analytical solutions are nontrivial and involve Gaussian hypergeometric functions and Kronecker operators (product and sum). Various mathematical properties are obtained: time scale, time decomposition, time inversion and time transformation as a function of the length of the sequence.
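As a rough illustration of an insertion rate that decreases as the sequence grows between an initial and a maximum length, the sketch below uses the logistic growth equation from population dynamics. This choice is an assumption made for the example only: the text does not state which equation the model uses, and the names `sequence_length`, `per_site_insertion_rate`, `n0`, `n_max` and `rate` are hypothetical.

```python
import math


def sequence_length(t, n0, n_max, rate):
    """Closed-form solution of dn/dt = rate * n * (1 - n / n_max), n(0) = n0
    (logistic growth, assumed here purely for illustration)."""
    return n_max / (1.0 + (n_max / n0 - 1.0) * math.exp(-rate * t))


def per_site_insertion_rate(n, n_max, rate):
    """Insertion rate per site under the same assumption; it falls linearly
    to zero as the length n approaches the maximum length n_max."""
    return rate * (1.0 - n / n_max)


# Illustrative values: initial length 100, maximum length 1000.
for t in (0.0, 5.0, 10.0, 20.0):
    n = sequence_length(t, n0=100.0, n_max=1000.0, rate=0.5)
    print(t, round(n, 1), round(per_site_insertion_rate(n, 1000.0, 0.5), 4))
```

In this toy version the length saturates at `n_max`, which is why both the initial and the maximum lengths appear as parameters of the solution.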

Stochastic models for the inference of genetic networks (Sophie Lèbre)
Other stochastic approaches concern the reconstruction of gene regulatory networks. We have developed the ARTIVA (Auto Regressive TIme VArying) network model, whose distinctive feature is a dependency structure that varies over time for continuous data. A reversible-jump Markov chain Monte Carlo (MCMC) method has been specifically adapted to infer this model from gene expression time series. This approach has proved more efficient than the most recent methods on several datasets. We then refined the model by introducing an exchange of information between the successive structures of the network. Different adaptations of this model make it possible to adjust the type of information sharing (between or within genes), which clearly improves the quality of the estimation.

ARTICLES IN INTERNATIONAL JOURNALS

Articles in pdf format can be downloaded from Christian MICHEL

RESEARCH SOFTWARE

GETEC (Genome Evolution by Transformation, Expansion and Contraction) (Emmanuel Benard, Sophie Lèbre, Christian Michel)
GETEC (written in Mathematica and webMathematica) models the evolution of genes by determining the exact occurrence probabilities (analytical solutions) of genetic motifs of finite length over time. The implementation covers lengths from 1 to 5 (nucleotides, dinucleotides, trinucleotides, quadrinucleotides and pentanucleotides). The probabilities depend on (i) substitution parameters (from 1 to 3 levels per site of the motifs), (ii) a motif insertion rate vector, (iii) a deletion rate and (iv) an initial probability vector of the motifs. The mathematical modeling uses in particular Gaussian hypergeometric functions and Kronecker operators (sum and product). The resulting formulas can have several thousand terms. This site allows the community of bioinformaticians and biologists to build their own model of gene evolution according to their biological application.

To date, this research software has no equivalent in bioinformatics: it models the evolution of motifs and genes over time, in the direct sense (from the past to the present) or the inverse sense (from the present to the past), according to parameters of substitution, insertion and deletion of genetic motifs. This direct approach differs from phylogenetic and alignment methods.

ONGOING THESIS

Karim El Soufi's thesis deals with algorithms for searching for and visualizing circular code motifs. Motifs of the circular code X (X motifs for short) are identified in the 5' and/or 3' regions of the tRNAs of prokaryotes and eukaryotes and in the 16S ribosomal RNAs (16S rRNAs) of prokaryotes and eukaryotes, in particular in the decoding center of the ribosome. Unexpectedly, the nucleotides A1492 and A1493, universally conserved in prokaryotes and eukaryotes, and the nucleotide G530, conserved in prokaryotes, belong to X motifs. 3D visualization of the X motifs in the ribosome shows several spatial configurations involving X motifs of the mRNA, of the tRNAs and of the 16S rRNA. This work has led to two articles in international refereed journals and one submitted article.
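A search for X motifs can be sketched as follows. The sketch assumes that an X motif is a maximal run of consecutive trinucleotides of X, read in any of the three frames, with at least a chosen number of trinucleotides. The 20 trinucleotides of X are those of the code published by Arquès and Michel (1996) and are not listed in the text above; the function name `find_x_motifs` and the threshold `min_trinucleotides` are hypothetical and not taken from the thesis.

```python
# Circular code X (Arquès and Michel, 1996), written with the DNA alphabet.
X = frozenset({
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
})


def find_x_motifs(sequence, min_trinucleotides=4):
    """Return sorted (start, end) positions, end exclusive, of maximal runs
    of consecutive X trinucleotides found in each of the three frames."""
    sequence = sequence.upper()
    motifs = []
    for frame in range(3):
        run_start = None
        position = frame
        while position + 3 <= len(sequence):
            if sequence[position:position + 3] in X:
                if run_start is None:
                    run_start = position      # a new run begins here
            else:
                if (run_start is not None
                        and (position - run_start) // 3 >= min_trinucleotides):
                    motifs.append((run_start, position))
                run_start = None
            position += 3
        # Close a run that reaches the end of the sequence.
        if (run_start is not None
                and (position - run_start) // 3 >= min_trinucleotides):
            motifs.append((run_start, position))
    return sorted(motifs)


toy = "TTGAACTGGTTACCAAA"
for start, end in find_x_motifs(toy):
    print(start, end, toy[start:end])
```

For RNA sequences, U would first be replaced by T. The choice of a minimum length matters in practice, since short runs of X trinucleotides are expected by chance.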

RESEARCH NETWORKS

  • European research network composed of computer scientists, mathematicians, physicists and biologists (French, German, Italian, Spanish) whose research theme is the coding of genes. This young community met for the first time in Mannheim (Germany) in 2013. It has given rise to various scientific collaborations, publications (e.g. Michel and Seligmann, 2014; Fimmel, Michel and Strüngmann, 2016; Fimmel, Giannerini, Gonzalez and Strüngmann, 2014, 2015, etc.) and a special issue of Philosophical Transactions A (February 2016) containing 21 articles on genetic information coding.
  • Member of the GdR "Molecular Bioinformatics" and the GdR "Mathematical Informatics" in the working group "Combinatorics of words, text and genome algorithms" for several years.

THEORY OF CIRCULAR CODES IN GENES

The search for a code in genes is a very old problem, initiated in 1957 by Crick et al. with comma-free codes (codes without punctuation). The objective was to explain how a set of 20 trinucleotides out of 64 could code the 20 amino acids that make up proteins. In 1958, the mathematicians Golomb et al. obtained some theoretical results on this class of codes. The combinatorial explosion, with 3^20 (about 3.5 billion) possible codes, and the discovery of the genetic code led to the abandonment of the concept of comma-free codes.
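The figure of about 3.5 billion candidate codes can be checked with a short computation. A comma-free code can contain neither a periodic trinucleotide such as AAA nor two trinucleotides that are circular permutations of each other, so a 20-word candidate takes exactly one trinucleotide from each class of three circular permutations. The variable names below are illustrative.

```python
from itertools import product


def rotations(word):
    """Set of circular permutations of a word."""
    return {word[i:] + word[:i] for i in range(len(word))}


trinucleotides = ["".join(letters) for letters in product("ACGT", repeat=3)]

# Group the 64 trinucleotides into classes of circular permutations.
classes = {frozenset(rotations(word)) for word in trinucleotides}
periodic_classes = [c for c in classes if len(c) == 1]   # AAA, CCC, GGG, TTT
full_classes = [c for c in classes if len(c) == 3]

# One trinucleotide chosen from each full class gives the candidate codes.
candidate_codes = 3 ** len(full_classes)
print(len(periodic_classes), len(full_classes), candidate_codes)
```

This counts candidates only; most of these sets are not comma-free, which is part of what made the search difficult.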

This theory lay dormant for 40 years. In 1996, we discovered in genes a more general class of codes, the circular codes. In 2012, a second major step was reached with the identification of circular code motifs in transfer and ribosomal RNAs, particularly in the ribosome decoding center. Indeed, the AA dinucleotide (A1492 and A1493) and the nucleotide G530, which are universally conserved in the ribosome decoding center of species (eukaryotes, prokaryotes), belong to circular code motifs.

Gonzalez, Giannerini and Rosa ("Circular codes revisited: A statistical approach", J. Theor. Biol., 2011, 275, 21-28) state in the abstract of their article:

« In 1996 Arquès and Michel [...] discovered the existence of a common circular code in eukaryote and prokaryote genomes. Since then, circular code theory has provoked great interest and underwent a rapid development. »
« The results [obtained by the authors in their article] indicate that, on average, the code proposed by Arquès and Michel has the best covering capability ... »

Gladstone ("Autocorrelation genetic syntax of eukaryotic protein-coding sequences", 2013) writes in his work:

« Michel has theorized that two codes, the genetic code and the circular code, are used together as key components of the functioning of the ribosomal complex. He has proposed that while the genetic code conveys what amino acids to recruit to the ribosomal complex during translation, the circular code is used for frame identification and synchronization of the ribosomal complex with the ORF. Evidence has been provided that shows circular codes most likely play a role in ribosome synchronization with the ORF (Frey and Michel 2006). A recent analysis of frameshift genes found in eukaryotes and prokaryotes has found a significant correlation between frameshift signals and Michel’s proposed circular code (Ahmed, Frey et al. 2007). »
« … and our understanding of the role these circular codes play in vivo is largely a mystery. »

Fimmel and Strüngmann ("Codon distribution in error-detecting circular codes", Life, 2016, 6, 14) write in the abstract of their article:

« In 1957, Francis Crick et al. suggested an ingenious explanation for the process of frame maintenance. The idea was based on the notion of comma-free codes. Although Crick’s hypothesis proved to be wrong, in 1996, Arquès and Michel discovered the existence of a weaker version of such codes in eukaryote and prokaryote genomes, namely the so-called circular codes. Since then, circular code theory has invariably evoked great interest and made significant progress. »
« In 2015, by quantifying the approach used in 1996 and by applying massive statistical analysis of gene taxonomic groups, the circular code detected in 1996 was rediscovered extensively in genes of prokaryotes and eukaryotes and now also identified in the genes of plasmids and viruses (Michel, 2015). The codes discovered by Arquès and Michel in nature have even more interesting properties [compared with comma-free codes]. With each codon, its anticodon is also in the code (self-complementarity), and they also have the error detection property in frame 1 and 2 (C3-property). »

This circular code theory proposes that genes are constituted by two codes:

(i) the universal genetic code and its variant genetic codes that allow the coding of 61 trinucleotides of the genes into 20 amino acids of the proteins;

(ii) the universal circular code "X" and its variant circular codes (Arquès and Michel, 1996; Michel, 2015), which make it possible to (iia) automatically synchronize and retrieve each of the three frames of genes on the direct strand of DNA (the reading frame and its two shifted frames, thanks to the C3 property); (iib) by pairing, automatically synchronize and retrieve each of the three frames of genes on the complementary strand of DNA (complementarity property); and (iic) encode the 20 trinucleotides of X into 12 amino acids of proteins.
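Some of these properties can be checked directly on the code X. The sketch below tests self-complementarity, counts the amino acids encoded by X under the standard genetic code, and retrieves the reading frame (phase) of a toy sequence built from X trinucleotides. The 20 trinucleotides are those published by Arquès and Michel (1996) and are not listed in the text above; the function names and the toy sequence are illustrative assumptions.

```python
from itertools import product

# Circular code X (Arquès and Michel, 1996), written with the DNA alphabet.
X = frozenset({
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
})

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(codon): amino_acid
                for codon, amino_acid in zip(product("TCAG", repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Anticodon of a trinucleotide, read on the complementary strand."""
    return word.translate(COMPLEMENT)[::-1]


def is_self_complementary(code):
    """True if the anticodon of every word of the code is also in the code."""
    return all(reverse_complement(word) in code for word in code)


def encoded_amino_acids(code):
    """Set of amino acids encoded by the trinucleotides of the code."""
    return {GENETIC_CODE[word] for word in code}


def frames_fully_in_code(sequence, code):
    """Frames (0, 1 or 2) in which every complete trinucleotide is in the code."""
    frames = []
    for frame in range(3):
        words = [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]
        if words and all(word in code for word in words):
            frames.append(frame)
    return frames


print(is_self_complementary(X))
print(sorted(encoded_amino_acids(X)))
print(frames_fully_in_code("GAACTGGTTACCGAT", X))  # toy sequence of 5 X words
```

On a window that is too short, several frames may decompose into X trinucleotides, so the frame is retrieved unambiguously only once the window is long enough.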