Download Advances In Genomic Sequence Analysis And Pattern Discovery 2011
The simplest solution is to generate all possible motifs up to a maximum length l, and then to search separately for the approximate occurrences of each motif in the set of sequences. Once a list of candidate patterns is obtained, the ones with the highest significance scores are selected.
This approach is guaranteed to find all motifs that satisfy the input constraints. Moreover, the sequences can be organized in suitable indexing structures, such as suffix trees. This simplistic approach has an evident disadvantage: the number of candidate motifs, and therefore the time required by the algorithm, grows exponentially with the motif length. Computing a significance score for each motif further increases the time required by the algorithm.
A number of more efficient tools have been developed to address these issues, and in the next chapter we will discuss some of the more widely used ones. In this section, we will present some of the programs that are specifically designed to search for biologically significant motifs in protein sequences. The search for motifs in a set of unaligned sequences is a complex problem because many factors come into play, such as the precise start and end boundaries of the motif, the size variability (presence of gaps or not), or stronger or weaker motif conservation during evolution.
Such exhaustive motif finding approaches are guaranteed to report all instances of motifs in a set of sequences. However, the exponential complexity of such searches means that the problem quickly becomes intractable for large alphabets.
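The exhaustive strategy described above can be illustrated with a minimal sketch. It is not taken from any of the tools discussed here: the function names, the Hamming-distance definition of an approximate occurrence, and the toy DNA sequences are all assumptions made for illustration, and only a single motif length k is enumerated rather than every length up to l.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_approx(motif, seq, max_mismatches):
    """True if some window of seq matches motif with at most max_mismatches."""
    k = len(motif)
    return any(
        hamming(motif, seq[i:i + k]) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def enumerate_motifs(seqs, k, max_mismatches, min_support, alphabet="ACGT"):
    """Test every one of the len(alphabet)**k candidates against all sequences."""
    hits = {}
    for letters in product(alphabet, repeat=k):
        motif = "".join(letters)
        support = sum(occurs_approx(motif, s, max_mismatches) for s in seqs)
        if support >= min_support:
            hits[motif] = support
    return hits


# Toy input (made up): the only exact 4-mer shared by all three is ACGT.
seqs = ["TTACGTAA", "GGACGTCC", "CACGTTTG"]
found = enumerate_motifs(seqs, k=4, max_mismatches=0, min_support=3)
print(found)
```

The outer loop visits len(alphabet)**k candidates, which is where the exponential cost comes from: with the 20-letter protein alphabet, even k = 6 already gives 64 million candidates.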
Deterministic optimization alternates between two steps. The first uses a set of parameters to reconstruct the hidden motif structure. The second uses this structure to re-estimate the parameters. This method allows finding alternate sequences representing the motif and updating the motif model. Probabilistic optimization is an iterative method in which a random subsequence is extracted from each sequence to build an initial model. In each subsequent iteration, the i-th sequence is removed and the model is recalculated.
Then, a new motif is extracted from the i-th sequence. This process is repeated until convergence. Below, and in Table 2, we present the most used motif discovery programs and discuss their advantages and limitations.
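The probabilistic optimization loop described above (build a model from random subsequences, leave one sequence out, recompute the model, resample a motif position in the left-out sequence) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of any program named in this chapter: the pseudocount of 1, the fixed number of iterations used in place of a convergence test, and all function names are hypothetical choices. It expects at least two sequences, each at least `width` letters long.

```python
import random


def build_profile(segments, alphabet, pseudocount=1.0):
    """Column-wise letter probabilities, smoothed with pseudocounts."""
    profile = []
    for col in range(len(segments[0])):
        counts = {a: pseudocount for a in alphabet}
        for seg in segments:
            counts[seg[col]] += 1
        total = sum(counts.values())
        profile.append({a: c / total for a, c in counts.items()})
    return profile


def segment_prob(segment, profile):
    """Probability of a segment under the position-specific profile."""
    p = 1.0
    for letter, column in zip(segment, profile):
        p *= column[letter]
    return p


def gibbs_motif_search(seqs, width, iterations=200, alphabet="ACGT", seed=0):
    rng = random.Random(seed)
    # Initial model: one random start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Remove the i-th sequence and recompute the model from the rest.
            others = [
                s[p:p + width]
                for j, (s, p) in enumerate(zip(seqs, starts))
                if j != i
            ]
            profile = build_profile(others, alphabet)
            # Sample a new motif position in the i-th sequence.
            weights = [
                segment_prob(seq[p:p + width], profile)
                for p in range(len(seq) - width + 1)
            ]
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["TTACGTAA", "GGACGTCC", "CACGTTTG"]  # toy data, made up
starts = gibbs_motif_search(seqs, width=4, iterations=20)
print([s[p:p + 4] for s, p in zip(seqs, starts)])
```

Because positions are sampled rather than chosen greedily, different seeds can give different answers; in practice several runs are usually compared.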
Teiresias [11] is based on an enumeration algorithm. It operates in two phases: scanning and convolution. During the scanning phase, elementary motifs with sufficient support are identified.
These elementary motifs constitute the building blocks for the convolution phase. They are combined into progressively larger motifs until all the existing maximal motifs are generated. MEME [12] is an example of a deterministic optimization algorithm. MEME discovers at least three motifs, each of which may be present in some or all of the input sequences. With default parameters, only motif widths between 6 and 50 are considered, but the user can change this as well as several other motif discovery options.
Pratt [23] is based on probabilistic optimization. If the user has not switched off the refinement, these motifs will be input to one of the motif refinement algorithms. The most significant motifs resulting from this are then output to a file. The program searches for motifs in either DNA or protein sequences. It uses the (l, d) motif search algorithm known as the planted motif search. SLiMs are microdomains that have important functions in many diverse biological pathways. Finally, motifs that are overrepresented in a set of unrelated proteins are identified. Dilimot [25] proceeds as follows: in the first step, a user-provided set of protein sequences is filtered to eliminate repetitive sequences as well as the regions least likely to contain linear motifs.
In the second step, overrepresented motifs are identified in the nonfiltered sequences and ranked according to scores that take into account the background probability of the motif, the number of sequences containing the motif, the size of the sequence set, and the degree to which the motif is conserved in other orthologous proteins. MotifHound [26] is suitable for the discovery of small and degenerate linear motifs. The method needs two input datasets: a background set of protein sequences and a subset of this background set that represents the query sequences.
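A query-versus-background comparison of this kind can be sketched in a few lines. The sketch below is only an illustration and is not MotifHound's or Dilimot's actual procedure: it assumes exact k-mers as motifs, counts each motif once per sequence, and uses a hypergeometric tail probability as the significance score, which is a choice made here and is not specified in the text. All names and the toy sequences are hypothetical.

```python
from math import comb


def kmers_in(seq, k):
    """Set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hypergeom_tail(hits_query, n_query, hits_background, n_background):
    """P(X >= hits_query) when n_query sequences are drawn without replacement
    from n_background sequences, hits_background of which contain the motif."""
    upper = min(n_query, hits_background)
    favourable = sum(
        comb(hits_background, x)
        * comb(n_background - hits_background, n_query - x)
        for x in range(hits_query, upper + 1)
    )
    return favourable / comb(n_background, n_query)


def enriched_kmers(query, background, k):
    """Rank every k-mer seen in the query set by its enrichment p-value."""
    candidates = set().union(*(kmers_in(s, k) for s in query))
    results = []
    for motif in candidates:
        q = sum(motif in s for s in query)
        b = sum(motif in s for s in background)
        p = hypergeom_tail(q, len(query), b, len(background))
        results.append((p, motif, q, b))
    return sorted(results)


# Toy data (made up): the query set is a subset of the background set.
query = ["AAPPKAA", "GGPPKGG"]
background = query + ["AAAAAAA", "GGGGGGG", "AGAGAGA", "GAGAGAG"]
ranking = enriched_kmers(query, background, k=3)
print(ranking[0])
```

Since the query set is contained in the background set, every motif found in the query has a non-zero background count, so the tail probability is always well defined.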
MotifHound first enumerates all possible motifs present in the query sequences, and then calculates the frequency of each motif in both the query and the background sets. Its main goal is to discover protein motifs that correlate with the biological behavior of the corresponding proteins. Most of these programs need prior knowledge about either the input sequences or the motif structure.
Furthermore, they are generally designed to discover frequent motifs that occur in all or most of the sequences. Intelligent algorithms include optimization and nature-inspired algorithms. Among these, artificial immune systems (AIS) are especially well adapted to pattern discovery, and have been used recently for motif discovery in DNA sequences. The high complexity and dimensionality of the problems in bioinformatics are an interesting challenge for testing and validating new computational intelligence techniques.
Similarly, the application of AIS to bioinformatics may bring important contributions to the biological sciences, providing an alternative way of analyzing and interpreting the huge volume of data from molecular biology and genomics [28]. Artificial immune systems are a class of computationally intelligent systems inspired by the principles and processes of the vertebrate immune system. The algorithms typically apply the structure and function of the immune system to solving hard computational problems. Since their introduction, a number of common techniques have been developed. Clonal selection algorithms model how antibodies of the immune system adaptively learn the features of the intruding antigen and defend the body from it.
The algorithms are most commonly applied to optimization and pattern recognition domains.
The algorithms are typically used for classification and pattern recognition problems, especially in the anomaly detection domain. Immune network algorithms focus on the network graph structures involved where antibodies represent the nodes and the training algorithm involves growing or pruning edges between the nodes based on affinity.
The algorithms have been used to solve clustering, data visualization, control, and optimization problems. Dendritic cell algorithms are inspired by the danger theory of the mammalian immune system, and particularly the role and function of dendritic cells, from the molecular networks present within the cell to the behavior exhibited by a population of cells as a whole. Although a number of these different AIS can be used for pattern recognition, the clonal selection algorithm seems to be particularly well suited for protein motif discovery in large sets of sequences. In addition, the system does not require outside intervention, so it can automatically classify pathogens (motifs) and react to pathogens that the body has never seen before.
Another advantage of AIS is that varying types of elements protect the body from invaders, and there are different lines of defense, such as innate and adaptive immunity. These features can be abstracted to model the diverse types of motifs found in protein molecules (see Section 1). These different mechanisms are organized in multiple layers that act cooperatively to provide high noise tolerance and high overall security.
The use of such intelligent algorithmic approaches should improve the whole motif discovery process: from the selection of suitable sets of sequences, via data cleaning and preprocessing, motif identification and evaluation, to the final presentation and visualization of the results. Nevertheless, a number of issues remain to be addressed before such systems can be applied to the very large datasets produced by NGS technologies. In particular, the substantial time and memory requirements of AIS are a limiting factor, although these can be significantly reduced thanks to the inherently parallel nature of the algorithms.
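The clonal selection idea discussed above (candidates with high affinity are cloned, clones are mutated, and the best candidates survive) can be sketched for motif search as follows. This is a toy illustration only and does not reproduce any published AIS motif finder: the affinity function, the rank-based mutation rate, the population sizes, and every name in the code are assumptions made for the example. Each sequence is assumed to be at least k letters long.

```python
import random


def affinity(motif, seqs):
    """Sum, over sequences, of the best number of matching positions
    between the motif and any window of that sequence."""
    k = len(motif)
    return sum(
        max(
            sum(a == b for a, b in zip(motif, s[i:i + k]))
            for i in range(len(s) - k + 1)
        )
        for s in seqs
    )


def mutate(motif, n_changes, alphabet, rng):
    """Redraw n_changes randomly chosen positions of the motif."""
    letters = list(motif)
    for pos in rng.sample(range(len(letters)), n_changes):
        letters[pos] = rng.choice(alphabet)
    return "".join(letters)


def random_motif(k, alphabet, rng):
    return "".join(rng.choice(alphabet) for _ in range(k))


def clonal_selection(seqs, k, pop_size=20, n_best=5, clones_per=4,
                     generations=30, alphabet="ACGT", seed=0):
    rng = random.Random(seed)
    population = [random_motif(k, alphabet, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: affinity(m, seqs), reverse=True)
        best = population[:n_best]
        clones = []
        for rank, parent in enumerate(best):
            # Hypermutation: lower-ranked parents are mutated more heavily.
            n_changes = min(k, 1 + rank)
            clones += [mutate(parent, n_changes, alphabet, rng)
                       for _ in range(clones_per)]
        newcomers = [random_motif(k, alphabet, rng)
                     for _ in range(pop_size - n_best)]
        # Parents are kept, so the best affinity never decreases.
        pool = list(dict.fromkeys(best + clones + newcomers))
        pool.sort(key=lambda m: affinity(m, seqs), reverse=True)
        population = pool[:pop_size]
    return population[0], affinity(population[0], seqs)


seqs = ["TTACGTAA", "GGACGTCC", "CACGTTTG"]  # toy data, made up
motif, score = clonal_selection(seqs, k=4)
print(motif, score)
```

Every affinity evaluation is independent of the others, which is the property that makes this family of algorithms easy to parallelize.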
Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3. Edited by Srinivasan Ramakrishnan.
Keywords: motif discovery, bioinformatics, biological sequences, protein sequences, bioinspired algorithms.
Introduction
Biology has been transformed by the availability of numerous complete genome sequences for a wide variety of organisms, ranging from bacteria and viruses to model plants and animals to humans.
Pattern recognition applications follow a pattern recognition pipeline, a number of computational analysis steps taken to achieve the goal [11]. Figure 2 illustrates this for classification. The starting point of any application is the collection of a set of training objects, assumed to be representative of the problem at hand and thus of the new objects to which the system will be applied later.
The first stages then consist of translating raw measurements into data usable for further processing. Some pre-processing steps (A and B in Figure 2) are handled by the measurement devices or their accompanying software: next-generation sequencers deliver base calls and quality estimates extracted from raw trace data, and microarray scans are often normalized and summarized using device-specific algorithms.
Further pre-processing is usually specific to the problem at hand and depends on available prior knowledge. Quality inspection is also important: to avoid problems in subsequent analyses, erroneous measurements (outliers) should be detected and removed [19], and missing values imputed [20]. After pre-processing, measurements are adequately represented, usually as features (C). Then a subset of informative features is selected (D) and used to train a classifier (E).
Finally, the performance of the classifier is evaluated on test data not used before (F). In the remainder of this review, we will focus on steps C to F.
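Steps C to F can be made concrete with a deliberately small sketch. Everything in it is an assumption chosen for brevity rather than a method recommended by the text: features are ranked by the absolute difference of the two class means, the classifier is a nearest-centroid rule, and the numbers are made-up toy values.

```python
def select_features(X, y, n_keep):
    """Step D: rank features by |mean(class 0) - mean(class 1)|."""
    def class_mean(label, j):
        vals = [x[j] for x, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)

    n_features = len(X[0])
    scores = [abs(class_mean(0, j) - class_mean(1, j))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]


def train_centroids(X, y, features):
    """Step E: the 'classifier' is just one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [[x[j] for j in features]
                for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(x, centroids, features):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    point = [x[j] for j in features]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2
                            for a, b in zip(point, centroids[lab])),
    )


def accuracy(X, y, centroids, features):
    """Step F: fraction of correctly labelled objects."""
    hits = sum(predict(x, centroids, features) == lab for x, lab in zip(X, y))
    return hits / len(y)


# Step C: objects already represented as feature vectors (toy values).
X_train = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.1], [3.0, 5.1, 0.3], [3.2, 4.9, 0.2]]
y_train = [0, 0, 1, 1]
X_test = [[1.1, 5.0, 0.3], [2.9, 5.0, 0.1]]
y_test = [0, 1]

features = select_features(X_train, y_train, n_keep=1)
centroids = train_centroids(X_train, y_train, features)
test_accuracy = accuracy(X_test, y_test, centroids, features)
print(features, test_accuracy)
```

Note that feature selection and training both use only the training objects; the test objects are touched once, at the evaluation step.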
Figure 2. (A) The pattern recognition pipeline for classification. (B and C) Two examples in bioinformatics.
In step C, the measurements are represented in a format suitable for further processing. Having a good representation is perhaps the most important step toward satisfactorily solving a pattern recognition problem; no matter how advanced the algorithms in later stages, if too much information is lost in this step, good performance will be impossible to obtain.
Most pattern recognition algorithms assume that an object is represented by a feature vector x of real-valued numbers. Often, this is straightforward: for microarray data, a vector of gene expression levels represents an object. Additional real-valued data, for example, clinical data, can easily be added to such a feature vector. In some cases, prior knowledge on the problem or characteristics of the classifier make it useful to scale the features. For data that are not naturally vectors, it is often natural to use a relative rather than an absolute representation [21]. For example, while it is hard to represent a gene sequence by a vector of real values, it is quite natural to represent it by a vector of dissimilarity measures.
Likewise, gene expression profiles can be represented as correlations to expression profiles of other genes. Some algorithms, such as the nearest neighbor classifier, which assigns a test object the label of the most similar object in the training set, can directly use such dissimilarities as input. A relatively new development, driven to a large extent by applications in bioinformatics, is that of representing object pairs (X, Y) by kernels K(X, Y) [23].
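The dissimilarity representation and the nearest neighbor rule described above can be sketched as follows. The choice of edit distance as the dissimilarity measure, the prototype sequences, and the function names are assumptions made for this illustration; the text does not prescribe a particular measure.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def dissimilarity_vector(seq, prototypes):
    """Represent a sequence by its distances to a fixed set of prototypes."""
    return [edit_distance(seq, p) for p in prototypes]


def nearest_neighbor_label(seq, training):
    """training is a list of (sequence, label) pairs; return the label
    of the training sequence with the smallest dissimilarity."""
    return min(training, key=lambda item: edit_distance(seq, item[0]))[1]


prototypes = ["ACGT", "ACGA", "TTTT"]  # toy data, made up
vector = dissimilarity_vector("ACGT", prototypes)
label = nearest_neighbor_label("ACGA", [("ACGT", "x"), ("TTTT", "y")])
print(vector, label)
```

The vector returned by `dissimilarity_vector` is an ordinary real-valued feature vector, so any classifier that expects such vectors can be applied to it, whereas the nearest neighbor rule uses the dissimilarities directly.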
Copyright 2019 - All Right Reserved