To assist in tracing the origin of swine flu, we have developed a prediction tool for human influenza A virus hosts. The tool was built using LAMP technology. It allows users to submit sequences, choose the input segment, and select options for viewing the result. The submitted sequence is aligned, and the informative positions are extracted and compared with the HMM profiles. The prediction result is then displayed based on the matching scores.
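The last two steps of this workflow (extracting the informative positions from the aligned query and reporting hosts by matching score) can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the function names, the 0-based position list, the query sequence and the score values are all hypothetical placeholders.

```python
def extract_sites(aligned_query, positions):
    """Keep only the residues at the given 0-based alignment columns."""
    return "".join(aligned_query[p] for p in positions)


def rank_hosts(scores):
    """Sort (host, score) pairs so that the best-matching host comes first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs: an already aligned query and three informative columns.
informative_positions = [1, 4, 6]
aligned_query = "ATGCCGTA"
sites = extract_sites(aligned_query, informative_positions)

# Placeholder matching scores; in the real tool these would come from
# comparing `sites` with each host's HMM profile.
scores = {"Human": 12.5, "Swine": 30.1, "outbreak": 27.8}
ranking = rank_hosts(scores)
```

Keeping the extraction step separate from the ranking step means the same list of positions can be reused for every submitted sequence, whatever scoring method sits in between.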
A decision tree is a simple but powerful machine learning algorithm that has been used successfully for classification problems. The technique takes a supervised approach to classification: the leaves of the tree represent classifications, and the branches represent conjunctions of features that lead to those classifications. An instance is classified through a series of decisions made along the path from the root to a leaf node, and it is assigned to the class associated with the leaf node at the end of the traversal. Each internal node is a decision node, at which a value of the given instance is compared against the decision function to decide which branch to follow. A decision tree is built from a training data set so as to reduce the average depth of the paths from the root to the leaf nodes. As a standard machine learning technique, decision tree classification has been used for a wide range of applications in bioinformatics.
The software package Weka (Waikato Environment for Knowledge Analysis) (http://www.cs.waikato.ac.nz/ml/weka/), which comprises a number of machine learning algorithms, was used for decision tree analysis of the aligned sequences. The decision trees were generated using the C4.5 algorithm, which Weka implements as J48. We used the aligned sequences of each host to train decision tree classifiers in J48, which allowed the most informative nucleotide positions to be found. In each iteration, one or more critical positions, at which the different subtypes are most likely to be distinguished, were determined. These positions were collectively used to build HMMs for further host prediction. We applied cross-validation for testing. The three groups (Human, Swine and outbreak strains) were used for training and then classified.
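To illustrate what makes an alignment position "informative", the sketch below ranks columns by information gain, that is, by how much splitting on a column reduces the entropy of the host labels. It is a simplified stand-in and not J48 itself: C4.5 uses the related gain ratio criterion, applies it recursively and prunes the tree. The toy alignment, the labels and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(column, labels):
    """Entropy reduction obtained by splitting the sequences on one column."""
    total = len(labels)
    groups = {}
    for residue, label in zip(column, labels):
        groups.setdefault(residue, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_positions(aligned_seqs, labels):
    """Return (gain, 0-based column index) pairs, most informative first."""
    gains = []
    for i in range(len(aligned_seqs[0])):
        column = [seq[i] for seq in aligned_seqs]
        gains.append((information_gain(column, labels), i))
    # Sort by decreasing gain; ties are broken by column index.
    return sorted(gains, key=lambda g: (-g[0], g[1]))


# Hypothetical toy alignment: only column 1 separates the two hosts.
aligned_seqs = ["ACGT", "ACGA", "ATGT", "ATGA"]
labels = ["Human", "Human", "Swine", "Swine"]
ranking = rank_positions(aligned_seqs, labels)
best_gain, best_position = ranking[0]
```

In this toy case column 1 (C in Human, T in Swine) receives the full gain of one bit, whereas the conserved columns and the column that varies independently of host receive none.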
A hidden Markov model (HMM) is used to model the informative positions generated from the decision trees. An HMM is a statistical model that represents the sequences of a gene family. An advantage of HMMs over other methods is their formal probabilistic basis. An HMM profile captures more flexible information about a given set of sequences than a single sequence does. Database search methods that use profiles are therefore more sensitive to remote similarities than those based on pairwise alignments (e.g., regular BLAST). HMMER, a package that uses HMMs for sequence database searching, was used to build HMM models from the most informative sites determined by the decision tree method. We built one HMM profile for each influenza A virus host group (Human, Swine and outbreak) using the HMMER program hmmbuild, with a multiple sequence alignment of the most significant sites as input. These profiles are used in the prediction system to determine the host of a strain.
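The idea of scoring a query against one profile per host can be sketched with a much simpler model than the one HMMER builds. The code below uses only per-column nucleotide frequencies (in effect the match states of a profile, with no insert or delete states and no null-model correction), so it is an illustrative approximation rather than a reimplementation of hmmbuild. The training sites, the pseudocount and all function names are assumptions made for this example, and the query is assumed to contain only A, C, G and T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_profile(aligned_sites, pseudocount=1.0):
    """Per-column log-probabilities of each nucleotide, smoothed by a pseudocount."""
    total = len(aligned_sites) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(aligned_sites[0])):
        counts = Counter(seq[i] for seq in aligned_sites)
        profile.append(
            {b: math.log((counts[b] + pseudocount) / total) for b in ALPHABET}
        )
    return profile


def score(profile, sites):
    """Log-likelihood of the query sites under one profile."""
    return sum(column[base] for column, base in zip(profile, sites))


def predict_host(profiles, sites):
    """Return the host whose profile gives the query the highest score."""
    return max(profiles, key=lambda host: score(profiles[host], sites))


# Hypothetical informative sites extracted from each host's alignment.
training = {
    "Human": ["ACG", "ACG", "ACT"],
    "Swine": ["TGA", "TGA", "TCA"],
}
profiles = {host: build_profile(seqs) for host, seqs in training.items()}
predicted = predict_host(profiles, "ACG")
```

The pseudocount keeps a nucleotide that was never observed at a column from receiving zero probability, so a single unexpected residue lowers a host's score without ruling that host out entirely.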