
Saturday, December 18, 2010

Biological Data Mining for Genomic Clustering Using Unsupervised Neural Learning:

Last edited by
Quratt ul ain Siddique

Summary:

Well-known techniques for DNA string matching include the Smith-Waterman algorithm for local alignment, the Needleman-Wunsch algorithm for global alignment, hidden Markov models, matrix models, and evolutionary algorithms for multiple sequence alignment. These works, though extremely valuable, have their limitations.
Principal component analysis (PCA) is employed on the DNA-descriptors for N sampled instances. It yields a unique feature descriptor for identifying a species from its genome sequence. Because the variance of the descriptors for a given genome sequence is negligible, the proposed scheme finds extensive applications in automatic species identification. PCA is used to find the structural signature within a sequence or species, and this signature is used to differentiate the species without loss of accuracy. Since PCA is a well-known tool for data reduction that preserves most of the variance in the data, we claim that our results on feature extraction from the genome database are also free from significant loss of accuracy.
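As a rough illustration of the PCA step, the sketch below finds the leading principal direction of a small set of descriptor vectors by power iteration on their covariance matrix. The function names, the choice to keep only the first component, and the toy data are assumptions made for this example; the summary does not say how many components the feature descriptor retains.

```python
import math
import random

def first_principal_component(rows, iterations=200, seed=0):
    """Return (mean, unit vector) for the leading principal direction of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]  # strictly positive start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # all rows identical: no direction of variance
        v = [x / norm for x in w]
    return mean, v

def project(row, mean, component):
    """Coordinate of one descriptor along the principal direction."""
    return sum((row[j] - mean[j]) * component[j] for j in range(len(row)))

# Toy descriptors lying on the line y = 2x (hypothetical data).
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, pc1 = first_principal_component(samples)
scores = [project(s, mean, pc1) for s in samples]
```

For real descriptors with many dimensions, a library routine based on singular value decomposition would normally replace the hand-written power iteration.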
It is quite evident that the feature descriptors provide a more distinctive identifier for a species from its genomic data. Thus we have gained an advantage by incorporating the data reduction tool PCA into our search for an effective identifier for a species. It has been found that the DNA-descriptors obtained from different samples of the same species contain wide disparities, whereas the feature descriptors obtained after processing different sets of DNA-descriptors are consistent and present no significant disparities. Hence the Feature Descriptor Diagrams can be used as the unique representation of the genomic characteristics of the different species.
Feature descriptors are more accurate in identifying different species because they are obtained from the mitochondrial genome rather than from the whole genome sequence, and applying PCA to them gave accurate results.
If only the frequency count is plotted, we do see some difference from species to species, but it is not enough to distinguish between them. This is where PCA comes in. When we applied PCA to the data to obtain Feature Descriptor Diagrams of different species, we were able to differentiate the species. Moreover, when feature descriptor vectors for similar species are calculated, they are effective in bringing out the similarities between the species while still retaining their individual distinguishing features. By constructing the Feature Descriptor Diagram for a species, we get the best identifier for that particular species.
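The summary does not define the frequency count precisely. A minimal sketch, assuming it means the relative frequency of each k-mer (substring of length k) over the alphabet ACGT, could look like this; the function name and the default k are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequence, k=2):
    """Relative frequency of every k-mer over ACGT, in a fixed lexicographic order."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers made purely of A, C, G, T contribute to the total,
    # so ambiguous bases such as N are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

# A 16-dimensional dinucleotide descriptor for a short, made-up fragment.
descriptor = kmer_frequencies("ATGCGATACGCTTGA", k=2)
```

Vectors of this kind, computed for several sampled fragments of a genome, would be the rows fed to PCA.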

An alternative approach to automatic species classification and identification, using a Self-Organizing Feature Map (SOFM), is also discussed in the paper. The computational map is trained using the DNA-descriptors from different species as the training inputs. The scheme presents a novel method for identifying a species from its genome sequence with the help of a two-dimensional map of neuronal clusters, where each cluster represents a particular species. The map is shown to provide an easier technique for recognizing and classifying a species based on its genomic data. Maps of different dimensions are constructed and analyzed for optimum performance, on the basis of their efficiency in clustering the features extracted from the genomic data of different species.
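The training procedure can be sketched as a textbook self-organizing map. The grid size, the linearly decaying learning rate, the Gaussian neighbourhood and the toy two-dimensional inputs below are all assumptions for illustration; the summary does not give the settings used in the paper.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid coordinate of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))

def train_som(samples, rows=3, cols=3, epochs=100, lr0=0.5, seed=0):
    """Train a rows x cols map; inputs are assumed to be scaled to [0, 1]."""
    rng = random.Random(seed)
    dim = len(samples[0])
    sigma0 = max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                   # learning rate shrinks linearly
        sigma = max(sigma0 * decay, 0.5)   # neighbourhood radius shrinks, with a floor
        for x in rng.sample(samples, len(samples)):  # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for j in range(dim):
                    w[j] += lr * h * (x[j] - w[j])
    return weights

# Two hypothetical "species", each represented by three toy descriptors.
species_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
species_b = [[1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(species_a + species_b)
```

After training, descriptors from the same species should map to the same or neighbouring neurons, which is what lets a region of the map stand for a species. Comparing maps of different sizes would amount to calling `train_som` with different `rows` and `cols`.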
The SOFM can also help us demonstrate homology between new sequences and existing phyla. When a new sequence is obtained, its DNA-descriptor is computed and its distance to each existing neuron is calculated. The winning neuron, the one at the smallest distance, indicates which species the sequence belongs to; if no neuron is sufficiently close, the sequence is taken to belong to a new species. The phylum to which the species belongs can then be determined.
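The lookup for a new sequence can be sketched as a nearest-neuron search with a cut-off. The `max_distance` threshold, the label dictionary and the example weights are assumptions; the summary only says that a very low distance to the winning neuron decides the species.

```python
import math

def classify_descriptor(descriptor, neuron_weights, neuron_labels, max_distance):
    """Return (label, distance); label is None when no neuron is close enough."""
    best_node, best_dist = None, math.inf
    for node, weights in neuron_weights.items():
        dist = math.dist(descriptor, weights)  # Euclidean distance
        if dist < best_dist:
            best_node, best_dist = node, dist
    if best_dist > max_distance:
        return None, best_dist  # treated as a previously unseen species
    return neuron_labels.get(best_node), best_dist

# Hypothetical trained map with two labelled neurons.
weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
labels = {(0, 0): "species_a", (0, 1): "species_b"}

known = classify_descriptor([0.1, 0.0], weights, labels, max_distance=0.5)
unknown = classify_descriptor([5.0, 5.0], weights, labels, max_distance=0.5)
```

How to choose the threshold is left open here; one option would be to base it on the distances observed for training descriptors of known species.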
Currently, work in bioinformatics and biological data mining is aimed at discovering which parts of the DNA sequence are translated into proteins and which functions they serve in forming different parts of the body, i.e. at identifying genes and their functionality. Another trend is to predict the structure and function of these sequences. The novelty of this work is automatic species identification from genomic data.



