Review of Different Sequence Motif Finding Algorithms




Hashim, Fatma

Department of Biomedical Engineering, Helwan University, Cairo, Egypt

Mabrouk, Mai

Department of Biomedical Engineering, Misr University for Science and Technology (MUST), Cairo, Egypt

AlAtabany, Walid

Department of Biomedical Engineering, Helwan University, Cairo, Egypt
Abstract: The DNA motif discovery is a primary step in many systems for studying gene function. Motif discovery plays a vital role in identification of Transcription Factor Binding Sites (TFBSs) that help in learning the mechanisms for regulation of gene expression. Over the past decades, different algorithms were used to design fast and accurate motif discovery tools. These algorithms are generally classified into consensus or probabilistic approaches that many of them are timeconsuming and easily trapped in a local optimum. Natureinspired algorithms and many of combinatorial algorithms are recently proposed to overcome these problems. This paper presents a general classification of motif discovery algorithms with new subcategories that facilitate building a successful motif discovery algorithm. It also presents a summary of comparison between them.
Introduction :
Introduction
Motif discovery is one of the sequence analysis problems under the application layer and it is one of the significant difficulties in bioinformatics applications. A DNA sequence motif is a subsequence of DNA sequence that is a short similar recurring pattern of nucleotides, and it has many biological functions ^{1}. A DNA motif refers to a short similar repeated pattern of nucleotides that has biological meaning. Sequence motifs also called regulatory elements exist in Regulatory Region (RR) in eukaryotic gene ^{2}.
Sequence motifs have constant size and are often repeated and conserved, but at the same time, they are tiny (about 612 bp) and the intergenic regions are very long and highly variable that make motif discovery a problematic task. These patterns play an essential role in recognizing Transcription Factor Binding Sites (TFBSs) that help in learning the mechanisms for regulation of gene expression ^{3}. Different types of motifs are planted motifs, structured motifs, sequence motifs, gapped motifs and network motifs ^{4}. Motif discovery problem in a simple form can be formulated as in figure 1 where the input is DNA sequence with unknown motifs at different unknown positions with various lengths and the output is the DNA motifs. The motif discovery technique consists of three main stages ^{5}:
A. Preprocessing: It is preparing the DNA sequences for accurate motif discovery by assembling and clean steps. In assembling step, it is advised to select as many target sequences as possible that may contain motifs, try to keep sequences as short as possible, and remove sequences that are unlikely to contain any motifs. Assembling step is done by clustering the input sequences based on some information and then extracting the desired sequences in an appropriate sequence database. Then, cleaning the input sequences to mask or remove confounding sequences is necessary.
B. Discovering: The middle stage is the motif discovery approach that begins by representing the sequences. There are two ways to represent the motifs: consensus string and Positionspecific Weight Matrices (PWM). Consensus string has the same length of DNA sequence motif; it allows to degenerate symbols in a string using IUPAC code while PWM is a matrix of 4xm where m is the motif length. Every position in the matrix represents the probability of each nucleotide at each index position of the motif. After motif representation, the suitable objective function is determined and finally appropriate search algorithm is applied. There are hundreds of algorithms for motif extraction that most of them are listed in table 1.
C. Postprocessing: Postprocessing evaluates the resultant motifs. This paper presents a more general classification of the sequence motifs extraction methods. Most of them are mentioned with a comparison among them.
Literature Review :
There are two principal types of motif discovery algorithms; i.e. enumeration approach and probabilistic technique. Enumeration approach searches for consensus sequences; motifs are predicted based on the enumeration of words and computing word similarities so this approach is sometimes called the word enumeration approach to solve Panted (l, d) Motif Problem (PMP) with motif length (l) and a maximum number of mismatches (d). The algorithms based on the word enumeration approach exhaustively search the whole search space to determine which ones appear with possible substitutions and therefore it typically locates the global optimum. However, this also means that they are exponentialtime algorithms that require a long time to detect the larger l and inefficient for handling dozens of sequences, so they are only suitable for short motifs ^{6}. Moreover, these algorithms require many parameters determined by the users such as motif length, the number of mismatches allowed, and a minimum number of sequences that the motif has to appear in ^{7}.
The word enumeration approach can be accelerated by using specialized data structures such as suffix trees or parallel processing ^{8}. Popular algorithms based on this approach are DREME ^{9}, CisFinder ^{10}, Weeder ^{7}, FMotif ^{11}, and MCES ^{12}.
A second group is a probabilistic approach. It constructs a probabilistic model called positionSpecific Weight Matrix (PSWM) or motif matrix that specifies a distribution of bases for each position in TFBS to distinguish motifs vs. nonmotifs and it requires few search parameters ^{13}. The most popular methods based on probabilistic approach are MEME ^{14}, STEME ^{15}, EXTREME ^{16}, AlignACE ^{17}, and BioProspector ^{18}.
Recently, new algorithms inspired from nature are presented that solve complex and dynamic problems with appropriate time and optimal cost. These algorithms simulate the behavior of insects or other animals for problemsolving. Evolutionary algorithms can overcome the disadvantages of local search and synthesize local search and global search ^{19}. Examples of evolutionary algorithms are: Genetic Algorithm (GA) ^{20}, Genetic Programming (Special type of GA) ^{21}, Differential Evolution (DE) ^{22}, Evolution Strategy ^{23}, Multimodal Optimization ^{24}, CuckooSearch (CS) ^{25}, Levy flight ^{26}, Bacterial Colony Optimization ^{27}, and Intelligent Water Drops algorithm ^{28}.
Swarm intelligence is a special class of evolutionary algorithm including Particle Swarm Optimization (PSO) ^{111}, Artificial Bee Colony (ABC) algorithm ^{127}, and Ant Colony Optimization (ACO) algorithm ^{128}.
The beauty of natureinspired algorithms is that they provide flexibility in evaluating the solutions by using fitness functions that score the solutions. These functions vary from problem to another and evaluate using different information types as biological information, functional information, etc. Moreover, these algorithms provide flexibility in motif representation ^{129}.
Finally, the last category is a combinatorial algorithm that mixes multiple algorithms. The classification of motif discovery algorithms is shown in figure 2. This paper presents a classification of motif discovery algorithms and gives an overview of the most common algorithms with many examples; also, the main features while designing a new algorithm and future work are proposed.
A. Enumerative approach: The enumerative approach can be classified into many classes.
1. Simple word enumeration:
The first class is based on simple word enumeration. Some existing algorithms in this class are YMF ^{134 }and DREME ^{9,29}. Sinha et al ^{29 }developed YMF (Yeast Motif Finder) algorithm that detects short motifs with a small number of degenerate positions in yeast genomes using consensus representation. YMF enumerates all motifs in the search space approach and calculates the zscore to produce those motifs with greatest zscores. Bailey et al ^{9 }proposed DREME (Discriminative Regular Expression Motif Elicitation) algorithm that also calculates the significance of motifs using Fisher’s Exact test. The algorithm starts with generating a set of short kmers, followed by applying Fisher’s Exact test on two sets of DNA sequences (Input set and background set) using a significance threshold to calculate the significance of each word (No wildcards) of length three to eight that occurs in the positive sequences and select the 100 most significant words for being used in the inner loop where they passed as "seed" REs to perform a beam search that determines the most significant generalizations of them (One wildcard). To ﬁnd multiple, nonredundant motifs in a set of sequences, outer loop determines the most significant motif using the heuristic search of RE motifs and the best motif found replaces its occurrences by a special letter; then the search process is repeated again many times until Evalue of the new motif is less than the determined significance threshold.
2. Clusteringbased method
Instead of using two loops for finding multiple motifs, the second class was proposed. Sharov et al ^{10} proposed word clustering method called CisFinder to detect short motif with high processing speed in large sequences (up to 50 Mb). Firstly, one should define nucleotide substitution matrix for each nmer word, then calculate Position Frequency Matrices (PFMs) for nmer word counts with and without gaps in both test and control sets. To generate nonredundant motifs, PFMs are extended over flanking and gap regions followed by means of clustering. Thomas et al ^{31} extended the CisFinder technique to deal with whole ChIPseq peak data sets.
3. Treebased method
The third class is a tree based search to accelerate the word enumeration technique. Pavesi et al ^{7} presented Weeder algorithm based on count matching patterns with specific and most extreme mismatches. At first, the motifs are represented using consensus sequence and based on the difference between the kmers of the input sequences and the consensus under a limited number of substitutions, kmers are assembled and each group is evaluated with a specific measure of significance. The constructed suffix tree was proposed in FMotif ^{11} algorithm for finding long (l, d) motifs in large DNA sequences under the ZOMOPS (Zero, one or multiple occurrence(s) of the motif instance(s) per sequence) constraints. This proposed tree is faster than standard suffix tree as it avoids a large number of repeated scans of sequences on the suffix tree. Afterward, Qiang et al ^{12} proposed MCES algorithm for a PMP that used both suffix tree and parallel processing to deal with large datasets. MCES algorithm starts with mining step that constructs the Sufﬁx Array (SA) and the Longest Common Preﬁx array (LCP) for the input datasets. At that point, combining step clusters substrings of various lengths to get predicted motifs.
4. Graph theoreticbased method
The graphtheoretic method represents a motif instance, as a clique; the graph G is built by representing each lmer in the input sequences by vertex and the edge between a pair of vertices representing a pair of lmer in different input sequences having the Hamming distance between the substrings which is less than or equal to 2d. Then, cliques of size N are searched for in this graph. Popular graphtheoretic methods are WINNOWER ^{39}, Pruner ^{40}, and cWINNOWER ^{41}.
5. Hashingbased method
Buhler et al^{ 64} developed random projection algorithm for a PMP that projects every lmer in the input data into a smaller space by hashing. Initially, a projection of ldimensional space onto a kdimensional subspace for all subsequences in the input set is developed, and random projection is constructed by choosing random k positions from l position. Using this projection, each lmer is hashed to its corresponding bucket. After m projections, each bucket contains lmer more than a threshold and this is called qualified bucket. Random hashing is repeated n times to ensure the qualified bucket at least more than once. Finally, profile for each of them should be computed to get the most probable lmer in the sequence that was represented as consensus sequences. In previous studies, random projection was developed using uniform projection and lowdispersion projection algorithms, respectively ^{65,66}.
6. Fixed candidates and modified candidatebased methods
The sixth class is fixed candidates that select candidate motifs from input sequences and use them for motif scanning while the seventh class is modified candidate that selects one candidate from the input sequence and modifies it letter by letter.
Finally, there is a proposed algorithm called RefSelect ^{69} to select reference sequences for PMP. The reference sequences are the sequences that don’t contain motif instances, so, this method tries to select the reference sequences that generate a small number of candidate motifs as possible. The algorithm consists of two steps; firstly, for every two sequences in input dataset D, the number of candidate motifs generated from them should be computed using the Hamming distance between every two lmers. Then, the set with candidate motifs, as small as possible, is selected as a reference set.
B. Probabilistic approach
Deterministic approach: ExpectationMaximization (EM) ^{130} is the famous example of deterministic approach. EM for motif finding was first introduced by Lawrence et al ^{132} and it consists of two main steps, the first called "Expectation step" that estimates the values of some set of unknowns based on a set of parameters. The second step is "Maximization step" that uses those estimated values to refine the parameters over several iterations. EM is used to identify conserved areas in unaligned DNA and proteins with an assumption that each sequence must contain one common site, the parameters; in this case, they are the entries in the PWM and the background nucleotide probabilities while our unknowns are the scores for each possible motif position in all of the sequences.
There are several algorithms based on EM. MEME (Multiple EM for Motif Elicitation) ^{14} is a popular motif recognition program that optimizes PWMs using the EM algorithm. It has several versions ^{132136}. The idea of MEME algorithm is to find an initial motif and then use expectation and maximization steps to improve the motif until the values in the PWM do not improve or the maximum number of iterations is reached. The MEME algorithm starts from a single site, i.e. kmer (Random or specified) and estimates motif model (PWM). For every possible location in every input sequence, the probability, given the PWM model, should be identified to detect examples of the model; then, the motif model should be reestimated by calculating a new PWM. EM alternates between examples of the model step and reestimates the motif model step. A single iteration for each kmer in target sequences should be performed and the best motif from this site needs to be selected and then only the one to converge should be iterated. The algorithm searches for new motifs after erasing the old discovered motif. It defines all three types of motif discovery sequence model: OOPS, ZOOPS, and TCMs corresponding to one occurrence per sequence, zero or one occurrence per sequence, and zero or more occurrences per sequence, respectively. Reid et al ^{15} presented STEME (Suffix Tree EM for Motif Elicitation) algorithm to accelerate the MEME algorithm and the first application of suffix trees to EM algorithm was considered. Quang et al ^{16} also tried to accelerate the MEME algorithm by developing EXTREME (Online EM algorithm for motif discovery) algorithm. EXTREME is an online web server that implements the MEME algorithm and can work on a large dataset without discarding any sequences.
2. Stochastic approach
Gibbs sampling ^{72} is a famous stochastic approach, similar to EM algorithm. Pseudocode of the Gibbs sampling algorithm for motif detection follows these steps ^{130}:
• Random initializing of motif positions in the input N sequences with an assumption of the presence of one motif per sequence,
• Choosing one sequence at random,
• Computing PWM for the other N1 sequences using staring positions of motifs and background probabilities for each base using the nonmotif positions,
• Calculating probability of each possible motif location in the removed sequence using PWM and background probabilities,
• For the removed sequence, choosing a new starting position based on step 4.
Steps 25 should be iterated until the values in the PWM do not improve or the maximum number of iterations has been reached.
Many methods ^{17,18} have been developed that implement the concept of Gibbs’ sampling to extend its functionality. Hughes et al ^{17} proposed Align ACE (Aligns Nucleic Acid Conserved Elements) algorithm based on Gibbs’ sampling with some improvements: (1) The motif model was changed to fit the source genome because the base frequencies for nonsite sequence is fixed, (2) Both strands of the input sequence are considered and no circumstance overlapping is allowed, (3) Iteratively, aligned sites were masked out to find multiple different motifs, and (4) It uses an improved nearoptimum sampling method.
BioProspector ^{18} algorithm is also based on Gibbs’ sampling with several improvements: (1) It uses a Markov model estimated from all promoter noncoding sequences to represent the nonmotif background in order to improve the motif specificity, (2) It can find twoblock motifs with variable gap, and (3) Sampling with two thresholds allows every input sequence to include zero or multiple copies of the motif.
3. Advanced approach
Different algorithms were proposed based on Bayesian approach ^{137}. Jensen et al ^{13} proposed an algorithm based on Bayesian approach with Markov chain Monte Carlo. Xing et al ^{77} proposed LOGOS (Integrated LOcal and GlObal motif sequence model) algorithm that combines between HMDM (Hidden Markov DirichletMultinomial) for local alignment model for each different motif and HMM (Hidden Markov model) for global motif distribution model for the occurrence of multiple motifs.
Recently, Siebert et al ^{78} developed a Bayesian Markov Model (BaMM) approach that trains higher order Markov models to build the dependency model. BaMM algorithm is more complex than PWMs wherein the PWMs cannot model correlations among nucleotides because PWMs nucleotide probabilities are independent of nucleotides at other positions. In the proposed algorithm, Bayesian approach using Markov models makes optimal use of the available information while avoids training by the decrease in number of parameters.
Other different algorithms ^{79,80} were presented like clustering methods based on Bayesian approach.
4. Others
EPP (Entropybased position projection)^{ 6} algorithm was proposed to escape from local optima. This algorithm based on projection process depends on the relative entropy in each position of motif instead of random projection.
C. Natureinspired algorithms
Natureinspired algorithms are classified according to the sources of inspiration into three main categories ^{138}. They are swarmintelligence, non swarmintelligence and physics/chemistry as shown in figure 2. Popular evolutionary algorithms used in motif discovery are GA and PSO and few of them are ABC, CS, and ACO.
1. GA
GA is a probabilistic optimization algorithm based on evolutionary computing. GA is inspired from biological evolution processes like selection, crossover, and mutation. The motivation for using GA comes from the idea of reducing the number of searches in a high number of DNA sequences. The basic structure of GA consists of a population of candidate solutions throughout several generations to find the best solution or set of possible solutions. The algorithm starts with the random generation of individuals that are then evaluated by a fitness function. At that point, a selection process selects new individuals called offspring that have some features of the parents and the others are discarded, then the genetic operators are applied on offspring. The parents are selected based on a probabilistic process biased by their fitness value using specified selection techniques that keep the diversity of the population large and prevent premature convergence on poor solutions. The most common technique for parent selection is the roulettewheel method. In a roulette wheel selection, based on the number of individual say n, the circular wheel is divided into n pies. The size of each pie is proportional to the fitness of the element. The roulette wheel is spun randomly and the element where it stops is chosen as the parent. The selected two individuals are used to produce offspring; this means the elements of higher chance for selection have higher fitness.
After two selections of parents, a crossover is applied to select a random site, and the rightmost strings are swapped to produce new children, followed by applying the mutation process by changing the value of some selected position. The process is iteratively repeated until some stop criterion is reached on the satisfactory fitness level ^{139}. Some methods 89,90 use standard GA. Liu et al ^{89} introduced FMGA algorithm that is based on simple GA. The genetic operators were specialized for motif discovery problem; the mutation operator was applied using PWM to reserve the completely conserved positions and the crossover operator was implemented with speciallydesigned gap penalties that optimize a scoring function. Che et al ^{90 }proposed MDGA algorithm that is also based on simple GA. Other methods were proposed and most of them ^{9193 }used population clustering technique that partitions population into subpopulations before mating. Clustering is done by the similarity between solutions using data clustering algorithms.
Using population clustering technique, Paul et al ^{91} proposed an algorithm for PMP to detect multiple and weak motifs. Firstly, population initialization by random selecting of subsequences of motif length used to form a candidate consensus motif is done and then all input sequences are scanned to detect all similar substrings followed by sorting them according to a number of mismatches of each substring from the candidate motif. Next, scoring function is applied on them. This method was extended to detect weak motifs using alignment score metric and clustering technique. Finally, fitness of an individual (Cluster) is calculated to select the parents for using in GA. Vijayvargiya et al ^{92} proposed an algorithm with the position based representations of individual and clustering of population scheme. Gutierrez et al ^{93} proposed a new statistical GA that mixes GA structure with several statistical coefficients. In this method, all the input sequences are joined in a single supersequence and the individuals are represented by a single position value; at that point, the supersequence is separated into subsequences of any length regardless of the length of each sequence. For each generation of the population, the fitness value will be calculated against one of the subsequences. The fitness function is a combination of different functions applied at different moments of the process. The algorithm is started by checking if they are overrepresented in the given subsequence by calculating the difference of similar words between the candidate motif and the background motif. Then, the selection process is applied consisting of two steps; the first is using the Fluffiness Coefficient by the mean and standard deviation of the simple overrepresentation values and the individuals with the lowest Fluffiness value are eliminated from the population. The second step is using different coefficients to decide if the candidates are the final solutions of the problem or not for an individual has survived for at least 10 generations. The first coefficient is Thinness coefficient where the individuals with a Thinness coefficient above 0.6 are eliminated from the population. MannWhitney is the second coefficient for two populations to quantify, if one of them has a tendency to have larger values than the other. The first sample corresponds to the vector with the values stored for the number of similar words for the candidate motif in each generation, and the second sample is formed by the same values for the background motif. If the probability of both data samples coming from the same population is lower than 0.05, at that point, the motif is considered as a possible final solution. The third step is the creation of offspring by applying genetic operators, onepoint crossover, and random mutation, on the selected parents. Finally, filtering and clustering of solutions is another method for every given motif width to generate the final solutions.
Some methods enhance the GA by using hybrid methods that combine GA with another technique ^{9496,140 }or propose additional operators in addition to basic genetic operators ^{19,102}.
Huo et al ^{94} proposed a new algorithm (GARP) that optimizes GA based on the random projection strategy (RPS) to identify planted (l, d) motifs. The idea behind using RPS before GA is to find good starting positions for being used in simple GA as an initial population instead of random population. Wang et al 95 presented GAEM algorithm that combines GA and EM for planted edited (l, d)motif finding problem. In planted edited (l, d) motif finding problem, the mutation in motifs due to substitution, deletion, and insertion, causes the motifs to have different lengths ranging from ld to l+d. The EM algorithm is used after the random initial population to get the best starting positions to be used as a seed to GA. Wei et al ^{19} introduced GAME (Genetic Algorithm for Motif Elicitation) that proposes two operators called ADJUST and SHIFT to escape from local optima. The ADJUST operator escapes from a local optimum that occurs if any of the motif sites has not been aligned correctly by checking every possible site position in a sequence and choosing the best match to the sites in the other sequences. But SHIFT operator escapes from local optima that occur when all motif sites are slightly misaligned by shifting the subsequences in the direction which gives the best fitness. Yetian et al ^{102} also presented a new operator called addition to the standard GA operators. The algorithm started with a motif length of three and then the proposed operator is used until the length of the optimal motif reaches to the standard level. The result achieves a higher score than the other three methods: Gibbs Sampler ^{141}, GA ^{142} and GARPS ^{94} algorithm.
2. PSO
PSO is a new global optimization technique ^{143} for solving continuous optimization problems. PSO algorithm is characterized by its simple computations and information sharing within the algorithm. It simulates the social behaviors of organisms movements in flocks of birds or schools of fish to find food sources and defenses against predators. Each particle uses its own flying experience and flying experience of other particles to adjust its "flying" so it combines selfexperiences with social experiences. In selfexperiences, the particle tries to get local best particle position and this is done by the particle stores. The best solution visited so far in its memory is called pbest, and it has an attraction towards this solution as it navigates through the solution search space. Social experiences are utilized to get best global particle position through the particle stores and the best solution visited by any particle and attraction towards this solution is called gbest. For an ndimensional search space, the position and velocity of the ith particle are represented by Yi= (yi1, yi2, …, yin) and Vi= (vi1, vi2, vi3 …, vin), respectively. The previous best position is denoted as Pi= (pi1, pi2, …, pin). Pg is the global best particle in the swarm. For the swarm S, the new velocity of each particle is calculated according to the following equation:
Vin (t+1) = vin (t)+c1 r1 (pinyin) +c2 r2 (pgyin)
(1) The position is updated using:
Yin (t+1) = Yin (t)+ Vin (t+1)
(2) Where i = 1, 2… S represents the particle index and n=1, 2… N represents the dimension. c1 and c2 are cognitive and social scaling parameters, respectively.
At each iteration, pbest and gbest are updated for every particle as per their fitness values. The procedure is iteratively repeated until some stop criterion is reached or satisfactory fitness level has been reached.
PSO has wide applications and has been proven to be effective in motif finding problems ^{112}. In recent years, there are few numbers of researches which utilized PSO to solve different types of motif finding problems. Chang et al ^{144} proposed a modified PSO algorithm where the position and velocity are adjusted to escape from local optima. Hardin et al ^{112} proposed a hybrid motif discovery approach based upon a combination of PSO and EM algorithm and PSO was utilized to generate a seed for the EM algorithm. In previous studies, modified PSO algorithm was proposed based on word dissimilarity graph ^{107,108}. Modified PSO algorithm began to break all input sequences into lmers and develop a novel mapping scheme that was used in the fitness function. Each particle kept track of a vector of locations in each given sequence and formed a consensus sequence. Then, the updated policy of PSO was modified where the new and current motif positions must be in the upper and lower bounds of the velocity. To escape from local optima, the algorithms scan all input sequences after gbest value reaches to a certain threshold to check if lmer has a fitness value in comparison to gbest. Also, the method used "reset move" that moves all the current solution, pbest and gbest by a random distance. Finally, the termination criteria are determined automatically using repeat based method. Reddy et al ^{4} developed PMbPSO (PSObased algorithm for Planted Motif Finding) algorithm. PMbPSO algorithm selects initial positions for all motifs by random and generates ten children for each parent (Motif) and then computes the fitness function for each parent and its children to get the best position; at that point, the best position from all particles is got followed by updating velocity and position for each particle for a number of iterations. Haruna et al ^{105 }integrate LinearPSO with binary search technique (LPBS) to minimize the execution time and increase the validity in motif discovery of DNA sequence for specific species. LPBS algorithm starts with initializing the population by selecting the target motif from the reference set and searching for similar motifs using the binary search, then applying the standard PSO algorithm to discover motifs. Recently, Ebtehal et al ^{113} introduced a hybrid GSAPSO algorithm that combines local search capabilities of the Gravitational Search Algorithm (GSA) and global search capabilities of PSO algorithm.
3. ABC algorithm
ABC algorithm is a type of swarmbased algorithm proposed by Karaboga ^{145}. It simulates the behavior of honey bees to find a food source. Two fundamental properties to obtain swarm intelligent behavior in honey bee colonies are selforganizing and division of labor. The bee colony contains two main groups which are employed and unemployed foragers. Employed bees are going to the food source which is visited previously and they are responsible for giving information to unemployed foragers about the quality of the assigned nectar supply. Many factors determine the value of a food source like its distance from the nest, its energy concentration and the degree of difficulty to extract this energy. Unemployed bees are categorized to scout and onlooker bees. Scout bee searches around the nest randomly to find new food sources while onlooker bee uses the information shared by employed foragers to establish a food source.
Employed bees’ numbers are the same for food sources’ numbers around the hive. At first, scout bees initialize all positions of food sources that represent possible solutions to the problem. Then, employed bees and onlooker bees exploit the nectar of food sources that corresponds to the quality (Fitness) of the associated solution, and this continual exploitation will finally cause them to become exhausted. The employed bee which turned into exploiting the exhausted food source turns into a scout bee looking for other food sources. The ABC algorithm involves four fundamental phases:
A neighbor food source FS is determined and calculated by the following equation:
(3)
Where i is a randomly selected parameter index, is a randomly selected food source, is a random number between 1, 1.
The fitness of this new food source is calculated and a greedy selection is applied between it and its parent i.e. between FS and F. From that point, employed bees share information about their food source via dancing on the dancing area with onlooker bees waiting inside the hive.
GonzalezAlvarez ^{et al 115 }developed multiobjective ABC algorithm that aims to adapt the ABC algorithm to multiobjective context. The multiobjective optimizes more than one objective function at the same time to get a set of optimal solutions known as Pareto set. This algorithm defines three conflicting objectives as motif length, support, and similarity and multiobjective adaptations of ABC including multiterm fitness function, ranking, and sorting methodology are used. González ^{et al 116} proposed Multiobjective Artificial Bee Colony with Differential Evolution algorithm (MOABC/DE) that combines the general schema of ABC with Differential Evolution. Recently, Karaboga et al ^{8} proposed consensus ABC algorithm. This is a discrete model based on a similarity value between consensus sequences of the ABC algorithm. The algorithm starts with random initialization, and then calculates the similarity value of the resultant consensus sequence followed by a new neighborhood selection method that is based totally on the similarity values of the consensus sequences.
4. ACO algorithm
The ACO algorithm ^{128} is a metaheuristic optimization technique that mimics the behavior of real ants, which try to find the shortest path to the food from their nest. The ants explore randomly the area surrounding their nest, and while moving, they leave a chemical pheromone trail on the ground that helps them to go to the nest. Ants interact with each other through this chemical component. The quantity of pheromone is proportional to the quantity and the quality of the food and this pheromone will be guided to other ants for the food source. When evaporation occurs, it reduces the attractive strength of pheromone. The evaporation takes a long time in the shorter path than the longest. The ants choose their way with strong pheromone concentrations. The characteristics of ACO algorithm are: (1) It depends on two variables including the amount and the evaporation of pheromone. The amount of pheromone is directly proportional to the richness of the food and inversely to evaporation; evaporation factor avoids the convergence to a locally optimal solution, (2) Ants act concurrently and independently, and (3) The behavior is stigmergy i.e. the interaction between ants is indirect, (4) Ants can explore vast areas without global view of the ground, (5) Starting point is selected at random.
The pheromone probability in terms of motif discovery is given by:
(4)
Where is the probability of ant a choosing c in position l, is the heuristic information to measure the frequency of letter c in input sequences, i.e. the weight of the letter c, β and α represent the influence of the pheromone trails, c is the character set of input sequences, is the amount of pheromone on the character c at position l at time t, and is the amount of pheromone on neighborhood at position l which ant a has not visited yet at time t. An updated pheromone trial is:
(5)
Where yic is the total number of ants, which carry the character c at position l, p is the rate of the pheromone trails evaporation (0<p<1), and is the variable of pheromone deposited by kth ant on the character c at position l. A lot of methods used ACO algorithm in motif discovery. Machhi et al ^{109} used ACO algorithm with a Gibbs sampling algorithm. An ACO algorithm finds better starting positions of the sequences provided as starting position for the Gibbs sampler method instead of random initialization. This algorithm starts with each ant choosing the path to construct a sample with motif length (m) and that depends on pheromone probability. Then, each ant is compared between the selected sample (m) and each substring in input sequences to get the set that represents the best matching substrings. Next, fitness function for each selected set is calculated. After that, the amount of pheromone is updated and finally iterated until no change.
5. CS algorithm
CS is a new simple heuristic search algorithm that is more efficient than GA and PSO ^{146148}. CS is inspired from brood parasitism reproduction behavior of some cuckoo species in combination with Lévy flight behavior ^{149}. The cuckoos lay their eggs in nests of the other birds with the abilities of selecting the lately spawned nests and removing existing eggs to increase the hatching probability of their eggs. If host birds discover these eggs, they either throw them away or abandon the nest and build a new nest. CS is characterized by the subsequent rules:
Each cuckoo lays one egg at a time and disposes its egg in a random selected nest.
The nests that have high quality of eggs (Solutions) are the best and will continue to the following generations.
The number of accessible host nests is fixed, and a host bird can discover a parasitic egg with a probability Pa E [0,1].
To simulate the behavior of cuckoo reproduction, each egg in a nest is a solution and each cuckoo’s egg is a new solution. The aim is to supplant a notso good solution in the nests with newer and better solutions by Lévy flights:
(6)
Where is a new solution, is the current location, α is the step size and Levy(λ) is the transition probability or random walk based on the Lévy flights. CS is an effective global optimization algorithm and has many applications in different fields 150. Ebtehal et al 120 applied CS and Modified Adaptive Cuckoo Search (MACS) algorithm on PMP. The MACS algorithm enhances the basic CS algorithm by grouping parallel, incentive, information and adaptive strategies.
DNA Motif Databases
Synthetic DNA sequence
The simulated data set was first introduced by Pevzner et al ^{39} to create planted (l, d) motif in n sequences selected randomly from a set of N sequences where n≤N. There are online sites 1^{51,152} to automatically create planted (l, d) motif.
Real DNA sequences
There are several online databases of DNA motifs listed in table 2 with a short description of each one.
Discussion :
As sequencing technology has improved, the volume of biological sequence data in public databases increases and this increases the importance of motif discovery in computer science and molecular biology ^{153}. Motif discovery has some difficulties: (1) Motifs are not identical to each other, (2) The motif sequence is unknown, (3) Motif location is unknown, (4) Existing of random motifs, and (5) The location of a motif in each sequence is unrelated to other ones.
The motif discovery algorithms are classified into two major groups as enumerative approach and probabilistic approach. As the name of the first class imparts, it is counting and comparing oligonucleotide frequencies for all possible motifs, based on specific motif model description. It has some advantages: (1) Global optimality, (2) It is appropriate for short motifs; therefore, it is useful for motif finding in eukaryotic genomes, (3) With optimized data structures, it becomes fast, and (4) Can find totally constrained motifs. However, the problems of this approach are: (1) It needs to be postprocessed with some clustering systems as the typical transcription factor motifs often have several weak constrained positions, (2) It suffers from the problem of producing too many spurious motifs, (3) It represents certain numbers of motifs (11 wild cards), and (4) Long time processing is another problem as it checks every possible substring in the input dataset.
There are many algorithms based on subcategories of this approach. The first subcategory is based on the enumerative approach without any enhancement; this leads to getting all possible motifs but at the same time there are many disadvantages which were mentioned above. YMF is designed for yeast genomes, it can’t detect motifs with large length or the number of degenerate positions is significant. DREME is a discriminative motif discovery tool to discover multiple, short, nonredundant, statistically significant motifs in short runtime using simplified form of regular expression words (11 wildcard characters). DREME algorithm was tested on ChIPseq datasets [13 mouse ES Cell (mESC)], 3 mouse erythrocytes and one human cell line (ChIPseq datasets)). DREME is compared to MEME algorithm and the results show that DREME algorithm can correctly predict motifs on ChIPseq experiment sequences in a shorter runtime than MEME. However, it can only find short motifs (from 4 to 8 bp) and an occurrence of the motif is well defined, as is the number of possible motifs of a given width.
The second subcategory is based on simple enumerative approach but it can discover multiple and weak motifs at the same time so, it is considered as the small enhancement of simplebased method. CisFinder technique tested on ChIPseq data of TFs was expressed in ES cells. CisFinder can accurately identify PFMs of TF binding motifs and it is faster than MEME ^{133}, WeederH ^{154}, and RSAT ^{31}. CisFinder can find motifs with a low level of enrichment, but it does not support outputting motifs of a speciﬁed length.
The algorithm proposed is similar to CisFinder algorithm, but it supports outputting motifs of a speciﬁed length ^{31}. They reported that this algorithm combined accuracy and low computational time, but it is limited to short (l, d) motifs.
The first two subcategories have nearly the same disadvantages, but they are timeconsuming. There are many suggestions to improve this problem; in treebased methods, it is also based on enumerating all possible motifs but it is timeconsuming using suffix tree. Weeder algorithm accelerates the word enumeration technique by using a suffix tree, but it operates with a low efficiency for long motifs ^{11}. FMotif algorithm could identify unknown motif lengths on ChIPenriched regions. Next, MCES algorithm is a more powerful algorithm and there are two contributions in the miming step; it uses an adaptive frequency threshold for each possible length and it is based on Map Reduce strategy to deal well with very large datasets. The MCES is tested on simulated data and real data (ChIPseq) and the results show that MCES can ﬁnd the motif like the published one and run in a short time. MCES has many advantages like identifying motifs without OOPS constraint, handling very large data sets, handling the fullsize input sequences and making full use of the information contained in the sequences, completing the computation with a good time performance and good identiﬁcation accuracy.
Graphbased techniques are the same simplebased techniques but they represent the motifinstance by a graph to facilitate the search strategy.
There are many search strategies to enhance time operator.
Sub categories from five to seven try to enhance time operator using random concept whether random searching about the motif like hashing strategy or selecting random candidate motif or many candidate motifs.
However, the random projection algorithm takes long time operations as it depends on random initialization and it repeats the process for n times.
In fixed candidates and modified candidatebased techniques, the technique scans all input sequences to get the matched motifs.
In the probabilistic approach, the probability of each nucleotide base to be present in that position of the sequence is multiplied to yield the probability of the sequence. PWM is an appealing model due to its simplicity and wide application and it can represent an infinite number of motifs 15 but it has some problems 155: (1) It scales poorly with dataset size, (2) PWM representation assumes the independence of each position within a binding site, while this may be not true in reality, and (3) It converges to locally optimal solution.
EM algorithm is a popular example of probabilistic approach, but it has some limitations: (1) It converges to a local maximum, (2) It is extremely sensitive to initial conditions, (3) It assumes one motif per sequence, and (4) The running time of EM is linear with the length of the input sequences. MEME added several extensions to overcome these limitations where MEME runs the EM algorithm many times from different starting points using every existing lmer in the sequence dataset. However, the running time scales poorly with the size of the dataset so, MEME algorithm can discover motifs in a satisfactory time by a compromising strategy such as discarding a majority of the sequences but discarding data is far from ideal as it can decrease the chance of discovering motifs corresponding to infrequent cofactors ^{156}. STEME runs faster than the MEME algorithm, but with a large dataset, it finds motifs up to width 8 as its efficiency decreases quickly as the motif width increases. EXTREME achieves better running time, but at the same time, it requires too much storage space for processing large data. Gibbs sampling algorithm is another example of a probabilistic approach. It converges to local optimum, and is less dependent on initial parameters, but more dependent on all sequences exhibiting the motif. Advanced methods based on Bayesian technique are a subclass of probabilistic approach; examples of this class are the speedy algorithms with better objective function and BaMM algorithm ^{13}. BaMM was tested on 446 human ChIPseq datasets and the results show that the precision increases by 3040% compared to PWM. However, BaMM algorithm has some limitation: (1) It uses EM algorithm which has some disadvantages as described above, (2) The running time is longer than EM algorithm which makes it difficult to apply on a big data set, (3) It defines ZOOPS model only, and (4) It requires motif length as an input parameter. Finally, EPP algorithm can be applied to OOPS, ZOOPS, and TCM sequence models and the results indicate that it can efficiently and effectively recognize motifs.
Evolutionary algorithms have been recognized due to their advantages of synthesizing local search and global search ^{94}. GA and PSO are the famous evolutionary algorithms. GA is a discrete technique, but PSO is a continuous technique that must be modified to handle discrete design variables. GA and PSO are similar in the presence of interaction between the population’ members, but unlike GA that changes the population from generation to another, PSO keeps the same population; moreover, PSO does not have genetic operators and no notion of the "survival of the fittest".
Based on GA, a lot of algorithms are proposed. The simple GA favors selection of the fittest, which may be a biologically meaningless solution and this tends to remove the diversity of the population. Simple GA identifies OOPS model only and ignores ZOOPS and TCM models which are not correct because some sequences contain multiple or other weak motifs that also need to be identified 157. FMGA and MDGA algorithms are examples on simple GA. FMGA performs better than MEME and Gibbs sampler algorithms. MDGA algorithm is compared with a Gibbs sampling algorithm when tested on real datasets ^{158} and the results showed that it achieves higher accuracy in short computation time; the computation time does not explicitly depend on the sequence length. To overcome simple GA problems, clustering approach was presented. Clustering scheme enables to retain the diversity of population over the generations and it can find various motifs. Based on clustering technique, a new scoring function was developed that takes some consideration like, the number of mutations, and the number of motifs per sequence ^{91}. The authors reported that it gives effective result when it’s applied to simulated and real data. Previous methods presented by Vijayvargiya et al ^{83} could identify multiple motifs of the same length and discover long motifs when tested on synthetic and real datasets^{ 92}. Gutierrez et al ^{93} presented a new algorithm that has many advantages; there are no assumptions about the presence of the motifs in the input sequences, it is a heuristic algorithm, variablelength gaps can be predicted, and the background set of sequences is generated by shuffling the candidate motifs instead of shuffling the sequences themselves. The algorithm was tested on 52 data sets of four different organisms and the efficiency was compared with 14 methods using eight statistics. The proposed method gives its best results with fly and mouse data sets. However, when connecting all input sequences into one sequence and then dividing it, it may cause loss of motifs. Based on hybrid methods, GARP algorithm is better than the projection algorithm when tested on both simulated and real biological data, while GAEM algorithm tested on a simulated dataset, gives a success rate higher than HIGEDA ^{140} and when tested on a real dataset, it performs better than HIGEDA, GLAM2, and MEME. Finally, based on algorithms that enhance genetic operators, the GAME algorithm performs better than MEME and BioProspector when applied on simulation and realdata; the algorithm presented by Fan et al ^{102} has a higher score than the other three methods of Gibbs Sampler ^{141}, GA ^{142}, and GARPS ^{94}.
From the proposed algorithms based on GA, it can be concluded that the methods presented previously ^{19,90,92,94,102} used simple fitness function although there are a lot of suggestions to improve this function. Though the methods proposed by Paul et al, Vijayvargiya et al, and Gutierrez et al ^{9193} can identify ZOOPS model, they cannot identify multiple motifs of variable lengths instead of multiple motifs of the same lengths. Methods by Paul et al and Vijayvargiya et al ^{91,92} ignore the TCM model, while a method by Gutierrez et al ^{93} identifies it. The methods of Paul et al, Vijayvargiya et al, and Gutierrez et al ^{91,93} enhance the selection strategy by proposing new fitness function. The methods proposed by Huo et al and Wang et al ^{94},95 enhance the GA algorithm by combining it with another algorithm to get the starting positions to be used as seed to the GA algorithm, but this increases the computational time also. The selection of EM in the method by Wang et al ^{95} is a bad choice as EM has many limitations as mentioned above. The proposed methods by Wei et al and Fan et al ^{19,102} try to enhance the GA algorithm in another aspect by adding operators to escape from local optimum, but they also didn’t overcome the limitations of the simple GA algorithm. All discussed methods based on the GA algorithm require some parameters determined by the user as motif length. It can be said that the GA algorithm can be enhanced by using a new method that can identify OOPS, ZOOPS, and TCM models, escape from local optimum, improve the fitness function, have good starting positions instead of random initialization, detect multiple motifs with variable lengths, and have intelligent operators in addition to selection, crossover and mutation operators.
PSO is an explorationexploitation trade off. Exploration is the ability to get the global optimum by search in various regions while the exploitation is the ability to locate the optimum by concentrating the search around a promising candidate solution ^{159}. PSO is a simple concept, easily programmable 160, but it can easily fall in a local optimum and low convergence rate ^{161}.
Based on PSO algorithm, Chang et al ^{144} obtained global optimum in protein sequences when the results were compared with the PROSITE database while Hardin et al’s algorithm ^{112} suffers from local solution. The methods in Lei et al ^{107,10}8 were tested on both simulated (PMP) and real biological data (E. coli); they are efficient and accurate in motif discovery, but they suffer from a long time delay due to full scan on all sequences to check the value of gbest and the repeat based method was used to automatically terminate the program. PMbPSO algorithm is tested on simulated and real biological datasets ^{162}; the results indicate that PMbPSO algorithm performs better than MbGA and PbGA ^{163} and it is able to find longer size motifs with a minimum number of mismatches. LPBS algorithm searches for motifs based on reference set. LPBS was tested on Genbank (COI) and there were only two selected species; BosTaurus (Cow10 DNA sequences with lengths between 658 and 715 bp) and GallusGallus (Chicken9 DNA sequences with lengths between 537 and 699 bp). The longest motifs which are able to discover LPBS algorithm are 294 and 261 bp in BosTaurus and GallusGallus, respectively. The authors did not evaluate the LPBS method with other previous algorithms of motif discovery. GSAPSO algorithm was tested on synthetic ^{39} and real data (TRANSFAC) 164 and the authors reported that the GSAPSO algorithm performs better than AlignACE, MEME, and GALF ^{165}. GSAPSO algorithm was compared with old algorithms, although there are a lot of recent algorithms for motif discovery. The presented methods based on the PSO algorithm have some limitations and the methods proposed by Reddy et al and Elewa et al ^{4,113 }used very simple fitness function and the program terminated manually by determining the numbers of iterations by the user. All the PSObased methods start with random initialization except the method proposed by Abdullah et al ^{105}; this leads to time consuming operations and may not provide any correct solution. Moreover, PSObased algorithms detect the motifs with OOPS model and ignore ZOOPS, and TCM models, and require motif length as the user input.
Few methods based on ABC, ACO, and CS are the most recent techniques in motif discovery.
ABC algorithm is a popular stochastic algorithm due to it is simple mechanism, and it requires few parameters which is easy to implement 166. Based on ABC algorithm, MOABC/DE algorithm is the same as the multiobjective ABC algorithm except for the generation of new candidate solutions; the DE operator was used to generate new candidates that combine existing ones according to a set of simple crossovermutation schemes. The consensus ABC algorithm, another example of ABC technique, was tested on three different data sets and compared with a GAbased technique (MOGAMOD), a DEbased technique (DEPT) ^{167}, and a genetic operatorbased ABC algorithm (MOABC) ^{168}. Consensus ABC algorithm achieves the highest similarity values of the motifs with different lengths; it used the neighbor selection strategy based on similarity values between consensus sequences and this improved the probability of the selection of a food source.
ACO algorithm has some advantages as; it usually avoids fault convergence due to distributed computation and can be used in dynamic applications.
From ACO algorithm, it can be concluded that there are some disadvantages: (1) It can easily trap local optimum, (2) The consuming time is uncertain, i.e. it may take short/long time to converge, (3) The code is not obvious, (4) The ants must visit all points to get a good result that is unsuitable for motif discovery as it takes long time to check every substring in the input sequences and finally, (5) ACO requires many parameters. Based on ACO, Machhi et al ^{117 }reported that the total required computing time is reduced.
Finally, there are several advantages for CS algorithm: (1) It usually converges to the global optimality, (2) It combines local and global capabilities and local search takes a quarter of the total search time and the remaining time is for global search which makes CS algorithm more efficient on the global scale, (3) Levy flight is used in its global search instead of standard random walks, so the CS can explore the search space more efficiently, and (4) It is easy to implement, compared with another metaheuristic search which essentially depends on only a single parameter (pa). In a previous study, it was reported that the CSO and MACS can be effectively used to obtain global optimum motif patterns for all used input sequences and they are faster than other nature inspired algorithms ^{120}. Recently, Grey Wolf Optimization for motif finding (GWOMF) ^{169} was applied to get motifs in DNA input sequences.
Conclusion :
The motif discovery algorithms are classified into four classes of enumerative, probability, nature inspired and combinatorial ones and each one has many subclasses. The comparison between them is listed in table 3. The enumerative technique is an exhaustive search with a simple concept, and it is the only technique that ensures to find all motifs (Except weak motifs). However, it is very slow, and requires a lot of parameters; as a result, it becomes difficult to deal with either long motifs or big data. Moreover, the degenerative positions are limited because of restricted representation of motifs.
Probability approach overcomes many weak points of the enumerative approach like speed, dealing with long motifs and big data, numbers of required parameters and degenerated positions, and can find weak motifs. But the probability approach is a complex concept and can’t find all motifs.
The third category, natured inspired approach, combines the main features of the first two approaches.
This approach is a simple concept and it is a global search but at the same time can deal with the big data and long motifs. It has a flexible representation of motifs and this lead to an unlimited number of degenerated positions.
The last category is the combinatorial approach; its ability depends on the hybrid algorithms that combine to form the required algorithm.
The common features of all algorithms are the flexibility of objective function.
The presented classification of motif discovery algorithms is useful to get a general overview and to build a good motif discovery algorithm.
From various suggested methods for motif discovery problem, a good tool for motif discovery can be built. The tool must contain these features: (1) It should identify all models, i.e. OOPS, ZOOPS, TCM, (2) It should possess global search ability, (3) It should optimize the scoring function, (4) Parallel processing ability is a necessity, (5) It should have optimized data structures, (6) It should be able to detect long and short motifs, (7) It should have the capacity of multiple motif discoveries at the same time, i.e. without removing the discovered motif to find the next, (8) And multiple motifs discovery with variable lengths, and (9) It needs to have an automatic system by decreasing the number of required parameters determined by the user. The next phase of this work is to develop a new motif discovery algorithm that combines the main features of enumerative and probabilistic approaches and to use it as a seed to a natureinspired algorithm by considering the factors mentioned above.
