Monday, December 20, 2010
Saturday, December 18, 2010
User-defined Functions:gcufbioinfo
2:53 PM
GCbioinfo
Let's create another function. How about reverse?
>>> def reverse(s):
...     """Return the sequence string in reverse order."""
...     letters = list(s)
...     letters.reverse()
...     return ''.join(letters)
...
>>> reverse('CCGGAAGAGCTTACTTAG')
'GATTCATTCGAGAAGGCC'
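As a rough sketch outside the shell, the same function can be saved in a script and checked against the result shown above. The second function, reverse_by_slice, is a hypothetical name added here purely for comparison; it relies on Python's slice syntax with a step of -1, which this article does not cover.

```python
def reverse(s):
    """Return the sequence string in reverse order."""
    letters = list(s)          # split the string into a list of characters
    letters.reverse()          # reverse the list in place
    return ''.join(letters)    # join the characters back together, no separator


def reverse_by_slice(s):
    """Hypothetical alternative: reverse a string with a slice (step of -1)."""
    return s[::-1]


seq = 'CCGGAAGAGCTTACTTAG'
print(reverse(seq))
print(reverse(seq) == reverse_by_slice(seq))

# A non-empty separator, for contrast with the empty string used in reverse().
dashed = '-'.join(list('ACGT'))
print(dashed)
```

Both versions work on any string, which is the reason for the generic argument name.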
There are a few new things in this function that need explanation. First, we've used an argument name of "s" instead of "dna". You can name your arguments whatever you like in Python. It is something of a convention to use short names based on their expected value or meaning, so "s" for string is fairly common in Python code. The other reason to use "s" instead of "dna" in this example is that this function works correctly on any string, not just strings representing DNA sequences. So "s" is a better reflection of the generic utility of this function than "dna".
You can see that the reverse function takes in a string, creates a list based on the string, and reverses the order of the list. Now we need to put the list back together as a string so we can return a string. Python string objects have a join() method that joins together a list into a string, separating each list element by a string value. Since we do not want any character as a separator, we use the join() method on an empty string, represented by two quotes ('' or "").
User-defined Functions:gcufbioinfo
2:29 PM
GCbioinfo
Here is the process for creating your own function in Python. The first line begins with the keyword def, is followed by the name of the function and any arguments (expected input values) surrounded by parentheses, and ends with a colon. Subsequent lines make up the body of the function and must be indented. If a string comment appears in the first line of the body, it becomes part of the documentation for the function. The last line of a function typically returns a result.
Let's define some functions in the PyCrust shell. Then we can try each function with some sample data and see the result returned by the function.
>>> def transcribe(dna):
...     """Return dna string as rna string."""
...     return dna.replace('T', 'U')
...
>>> transcribe('CCGGAAGAGCTTACTTAG')
'CCGGAAGAGCUUACUUAG'
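To illustrate the point about documentation strings, here is a minimal sketch that defines transcribe in a script and reads its documentation back through the function's __doc__ attribute. Nothing here goes beyond the function defined above; only the variable name rna is introduced for the example.

```python
def transcribe(dna):
    """Return dna string as rna string."""
    return dna.replace('T', 'U')


# The string on the first line of the body is stored as the docstring.
print(transcribe.__doc__)

# Calling the function hands back the value given to the return statement.
rna = transcribe('CCGGAAGAGCTTACTTAG')
print(rna)
```

The same docstring is what interactive tools display when you ask for help on the function.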
Python Functions:gcufbioinfo
1:41 PM
GCbioinfo
Functions perform an operation on one or more values and return a result. Python comes with many pre-defined functions, as well as the ability to define your own functions. Let's look at a couple of the built-in functions: len() returns the number of items in a sequence; dir() returns a list of strings representing the attributes of an object; list() returns a new list initialized from some other sequence.
>>> dna = 'CTGACCACTTTACGAGGTTAGC'
>>> bases = ['A', 'C', 'G', 'T']
>>> len(dna)
22
>>> len(bases)
4
>>> dir(dna)
['__add__', '__class__', '__contains__', '__delattr__',
'__doc__', '__eq__', '__ge__', '__getattribute__', '__getitem__',
'__getslice__', '__gt__', '__hash__', '__init__', '__le__',
'__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__',
'__repr__', '__rmul__', '__setattr__', '__str__', 'capitalize',
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split',
'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper']
>>> dir(bases)
['__add__', '__class__', '__contains__', '__delattr__',
'__delitem__', '__delslice__', '__doc__', '__eq__', '__ge__',
'__getattribute__', '__getitem__', '__getslice__', '__gt__',
'__hash__', '__iadd__', '__imul__', '__init__', '__le__', '__len__',
'__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__repr__',
'__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__',
'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',
'reverse', 'sort']
>>> list(dna)
['C', 'T', 'G', 'A', 'C', 'C', 'A', 'C', 'T', 'T', 'T',
'A', 'C', 'G', 'A', 'G', 'G', 'T', 'T', 'A', 'G', 'C']
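The dir() listings above reflect the Python version in use when this was written; newer versions list somewhat different names. As a hedged sketch, the checks below search the dir() result for a few method names instead of relying on an exact listing. The variable name letters is introduced only for this example.

```python
dna = 'CTGACCACTTTACGAGGTTAGC'
bases = ['A', 'C', 'G', 'T']

# len() counts items: characters in a string, elements in a list.
print(len(dna), len(bases))

# dir() returns attribute names as strings, so the result can be searched.
print('replace' in dir(dna))
print('reverse' in dir(bases))
print('reverse' in dir(dna))     # strings have no reverse() method

# list() builds a new list; the original string is left unchanged.
letters = list(dna)
print(len(letters) == len(dna))
print(dna)
```

The missing reverse() method on strings is why a string has to be turned into a list before it can be reversed in place.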
Python Lists:gcufbioinfo
1:20 PM
GCbioinfo
Where Python strings are limited to characters, Python lists have no limitations. Python lists are ordered sequences of arbitrary Python objects, including other lists. In addition, you can insert, delete and replace elements in a list. Lists are written as a series of objects, separated by commas, inside of square brackets. Let's look at some lists, and some operations you can perform on lists.
>>> bases = ['A', 'C', 'G', 'T']
>>> bases
['A', 'C', 'G', 'T']
>>> bases.append('U')
>>> bases
['A', 'C', 'G', 'T', 'U']
>>> bases.reverse()
>>> bases
['U', 'T', 'G', 'C', 'A']
>>> bases[0]
'U'
>>> bases[1]
'T'
>>> bases.remove('U')
>>> bases
['T', 'G', 'C', 'A']
>>> bases.sort()
>>> bases
['A', 'C', 'G', 'T']
In this example we created a list of single characters that we called bases. Then we added an element to the end, reversed the order of all the elements, retrieved elements by their index position, removed an element with the value 'U', and sorted the elements. Removing an element from a list illustrates a situation where we need to supply the remove() method with an additional piece of information, namely the value that we want to remove from the list. As you can see in the picture below, PyCrust takes advantage of Python's ability to let us know what is required for most operations by displaying that information in a call tip pop-up window.
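The list operations above can also be collected into a short script. This is only a sketch of the same steps; the variable names first and result are introduced for illustration, and the final checks show two behaviours the shell session does not: remove() raises ValueError for a missing value, and the in-place methods return None.

```python
bases = ['A', 'C', 'G', 'T']
bases.append('U')        # add an element to the end
bases.reverse()          # reverse the order in place
first = bases[0]         # index positions start at zero
bases.remove('U')        # remove() needs the value to delete
bases.sort()             # sort in place
print(first, bases)

# remove() raises ValueError when the value is not in the list.
try:
    bases.remove('U')
except ValueError:
    print("'U' is no longer in the list")

# sort() changes the list itself and returns None, as do append(),
# reverse() and remove().
result = bases.sort()
print(result)
```

Because these methods return None, writing bases = bases.sort() would discard the list.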
Python Strings:gcufbioinfo
1:05 PM
GCbioinfo
A string in Python is written between single ('), double (") or triple (''' or """) quotes. In the example we assigned the string literal CTGACCACTTTACGAGGTTAGC to the variable named dna.
>>> dna = 'CTGACCACTTTACGAGGTTAGC'
Then we simply typed the name of the variable, and Python responded by displaying the value of that variable, surrounding the value with quotes to remind us that the value is a string.
>>> dna
'CTGACCACTTTACGAGGTTAGC'
A Python string has several built-in capabilities. One of them is the ability to return a copy of itself with all lowercase letters. These capabilities are known as methods. To invoke a method of an object, use the dot syntax. That is, you type the name of the variable (which in this case is a reference to a string object) followed by the dot (.) operator, then the name of the method followed by opening and closing parentheses.
>>> dna.lower()
'ctgaccactttacgaggttagc'
You can access part of a string using the indexing operator s[i]. Indexing begins at zero, so s[0] returns the first character in the string, s[1] returns the second, and so on.
>>> dna[0]
'C'
>>> dna[1]
'T'
>>> dna[2]
'G'
>>> dna[3]
'A'
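As a small sketch of the indexing rules just described, the script below reads the first and last characters and shows what happens one position past the end. The variable names first, last and lowered are introduced only for this example.

```python
dna = 'CTGACCACTTTACGAGGTTAGC'

# Valid index positions run from 0 to len(dna) - 1.
first = dna[0]
last = dna[len(dna) - 1]
print(first, last)

# Asking for a position past the end raises IndexError.
try:
    dna[len(dna)]
except IndexError:
    print('index out of range')

# Methods such as lower() return a new string; dna itself is unchanged.
lowered = dna.lower()
print(lowered)
print(dna)
```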
The final line in our screen shot shows PyCrust's autocompletion feature, whereby a list of valid methods (and properties) of an object is displayed when a dot is typed following an object variable. As you can see, Python strings have many built-in capabilities that you can experiment with in the Python shell. Now let's look at one of the other Python sequence types, the list.
Beginning Python for Bioinformatics:gcufbioinfo
12:36 PM
GCbioinfo
Bioinformatics, the use of computers in biological research, is the newest wrinkle on one of the oldest pursuits--trying to uncover the secret of life. While we may not know all of life's secrets, at the very least computers are helping us understand many of the biological processes that take place inside of living things. In fact, the use of computers in biological research has risen to such a degree that computer programming has now become an important and almost essential skill for today's biologists.
The purpose of this article is to introduce Python as a useful and viable development language for the computer programming needs of the bioinformatics community. In this introduction, we'll identify some of the advantages of using Python for bioinformatics. Then we'll create and demonstrate examples of working code to get you started. In subsequent articles we'll explore some significant bioinformatics projects that make use of Python.
What I concluded:
6:57 AM
GCbioinfo
Conclusion
From all these research papers, I have concluded that, among the data mining techniques reviewed, PCA is considered the best because it reduces a large data set to a smaller one and narrows down our research, so that we can extract useful information from the smaller data set. The dimension of a large data set can be reduced using principal component analysis, which is considered one of the most popular and useful statistical methods. This method transforms the original data into new dimensions.
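To make the idea of transforming data into new dimensions concrete, here is a minimal sketch in plain Python that reduces two-column data to a single column along the first principal component. It is an illustration under stated assumptions, not the procedure used in any of the reviewed papers: the function names and the data are made up, and power iteration is used only to avoid third-party libraries.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the column means and the first principal direction."""
    n = len(rows)
    dims = len(rows[0])
    # Centre the data: PCA works on deviations from the column means.
    mean = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - mean[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix of the centred columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges towards the eigenvector with the largest eigenvalue.
    # Assumes the data is not constant (otherwise the norm would be zero).
    direction = [1.0] * dims
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    return mean, direction


def project(rows, mean, direction):
    """Reduce each row to one number: its coordinate along the direction."""
    return [sum((row[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for row in rows]


# Made-up two-column data lying close to a straight line.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, direction = first_principal_component(data)
scores = project(data, mean, direction)
print(direction)
print(scores)
```

Each pair of values is replaced by one score, which keeps most of the variation because the made-up points lie almost on a line.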
Predicting Breast Cancer Survivability Using Data Mining Techniques
6:55 AM
GCbioinfo
Last edited by
Quratt ul ain Siddique
Summary:
This paper presents the prediction of breast cancer survivability using data mining techniques. The authors investigated three data mining techniques: Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree algorithm. The C4.5 decision tree algorithm was found to perform better than the other two methods. The study uses SEER data and introduces a pre-classification approach that takes into account three variables: Survival Time Recode (STR), Vital Status Recode (VSR), and Cause of Death (COD).
The three techniques are compared to find which one best predicts the breast cancer survivability rate. The Weka toolkit is used to run the experiments with the three data mining algorithms, and the raw SEER data is first obtained and pre-processed using different tools.
In this approach, missing information in the SEER data is ignored entirely, and the three variables STR, VSR, and COD are included. The goal is to attain high precision and accuracy from the three techniques compared. These metrics are mostly used in information retrieval, but here they are treated as related to other existing metrics such as specificity and sensitivity.
The paper discusses and addresses the issues, algorithms, and techniques involved in predicting breast cancer survivability using the SEER database. It also reports that, among the three data mining techniques, the C4.5 decision tree is considered the best because it showed the highest accuracy, precision, and recall.
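The metrics named in this summary can all be computed from the four counts of a confusion matrix. The sketch below uses the standard definitions and assumes Python 3 division; the function name and the counts are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from the four confusion-matrix counts."""
    return {
        'accuracy': (tp + tn) / (tp + fp + tn + fn),
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),          # also called sensitivity
        'specificity': tn / (tn + fp),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=95, fn=15)
for name in sorted(metrics):
    print(name, round(metrics[name], 3))
```

Recall and sensitivity are the same quantity, which is one way the information-retrieval metrics relate to specificity and sensitivity.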
Combined Supervised and Unsupervised Learning in Genomic Data Mining
6:50 AM
GCbioinfo
Last edited by
Quratt ul ain Siddique
Summary:
In this paper the authors introduce a comprehensive method for predicting the function of proteins. Their approach differs in several respects from earlier work in that it uses a multistage decomposition that makes use of both unsupervised and supervised machine learning techniques; they refer to this as the Unsupervised-Supervised Tree (UST) algorithm.
The typical first stage (optional) of the UST uses clustering algorithms such as neural network self organizing maps (SOMs) and K-means; this is the unsupervised stage. Subsequent indispensable stages typically involve constructing a Maximum Contrast Tree (MCT) so that protein functional relationships can be mapped onto the relational tree structure.
The MCTs are a family of completely independent algorithms that can be used alone. Testing is based on a newly developed MLIC (Multiple-Labeled Instance Classifier) based on supervised K nearest neighbor classifier on the tree structure. Performance has been compared with the decision tree C4.5 and C5 programs and with support vector machines.
Based on the experiments, UST algorithms appear to perform considerably better than decision tree algorithms C4.5 and C5, and support vector machines, and can provide a viable alternative to supervised or unsupervised methods alone. In addition, UST and MLIC classifiers are capable of handling protein functional classes with a small number of proteins (rare events), and also handle multifunctional proteins. The abilities of the USTs and MLICs to handle such cases means that a larger dataset can be used, which may provide deeper insight into protein functional relationships at the genomic level, and thus may lead to a better understanding of evolution at a molecular and genomic level.
Acute Coronary Syndrome Prediction Using Data Mining Techniques- An Application
6:45 AM
GCbioinfo
Last edited by
Quratt ul ain Siddique
Summary:
In this research paper, data mining techniques are used to investigate the factors that increase the risk of acute coronary syndrome. The authors applied binary regression to the factors affecting the dependent variable. To improve the performance of the regression model in predicting coronary syndrome, a data reduction technique, principal component analysis, is applied. Based on the results of the data reduction, they considered only 14 of the 16 factors.
A logistic regression model is used to find the factors responsible for Acute Coronary Syndrome (ACS). For this analysis, data mining techniques are used to compare persons who have ACS with those who do not.
First, data reduction techniques are applied to reduce the dimensions. After data reduction, the fourteen independent variables are age, gender, smoking, hypertension, family history, diabetes mellitus, fasting blood sugar, random blood sugar (RBS), cholesterol, streptokinase, systolic blood pressure (BPS), diastolic blood pressure (BPd), heart rate (HR), and pulse rate (PR). The calculated significance value for smoking is “0”, indicating that it contributes strongly to the risk of ACS. The Wald statistics show positive coefficients for HR, RBS, and BPS, indicating that the risk of ACS increases as the values of these factors increase.
The negative coefficients of BPd and PR indicate that the risk of the disease increases as these values decrease. The authors observed that smoking is considered the worst risk factor for Acute Coronary Syndrome.
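The role of the coefficient signs can be sketched with the logistic function itself. The coefficients, intercept, and input values below are hypothetical, chosen only to show the direction of the effect; they are not the values estimated in the paper, and the function name is made up for this example.

```python
import math


def logistic_probability(intercept, coefficients, values):
    """Return P(outcome = 1) under a logistic regression model."""
    linear = intercept + sum(coefficients[name] * values[name]
                             for name in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical numbers: one positive and one negative coefficient.
coefficients = {'HR': 0.04, 'PR': -0.03}
intercept = -1.0

baseline = logistic_probability(intercept, coefficients, {'HR': 70, 'PR': 70})
higher_hr = logistic_probability(intercept, coefficients, {'HR': 90, 'PR': 70})
higher_pr = logistic_probability(intercept, coefficients, {'HR': 70, 'PR': 90})

print(round(baseline, 3), round(higher_hr, 3), round(higher_pr, 3))
```

Raising the factor with the positive coefficient raises the predicted probability, while raising the factor with the negative coefficient lowers it.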
Biological Data Mining for Genomic Clustering Using Unsupervised Neural Learning:
6:40 AM
GCbioinfo
Last edited by
Quratt ul ain Siddique
Summary:
Well-known techniques for DNA-string matching include the Smith-Waterman algorithm for local alignment, the Needleman-Wunsch algorithm for global alignment, hidden Markov models, matrix models, and evolutionary algorithms for multiple sequence alignment. These works, though extremely valuable, have their limitations.
Principal component analysis (PCA) is then employed on the DNA-descriptors for N sampled instances. The principal component analysis yields a unique feature descriptor for identifying the species from its genome sequence. The variance of the descriptors for a given genome sequence being negligible, the proposed scheme finds extensive applications in automatic species identification. PCA is used to find the structural signature within a sequence or species, which is then used to differentiate the species without loss of accuracy. Since PCA is a well-known tool for data reduction without loss of accuracy, we claim that our results on feature extraction from the genome database are also free from loss of accuracy.
It is quite evident that the feature descriptors provide a more unique identifier for the species from its genomic data. Thus we have certainly gained an advantage by incorporating the data reduction tool PCA into our search for an effective identifier for a species. It has been found out that the DNA-descriptors obtained from different samples of the same species contain wide disparities. But the Feature Descriptors obtained after processing a different set of DNA-descriptors are unique and present absolutely no significant disparities. Hence the Feature Descriptor Diagrams can be used as the unique representation of the genomic characteristics of the different species.
Feature Descriptors are more accurate in identifying different species because they are obtained from the mitochondrial genome rather than from the whole genome sequence, and applying PCA to them gave accurate results.
If only the frequency count is plotted, we do get some difference from species to species, but it is not enough to distinguish between them. This is where PCA comes in. When we applied PCA to the data to get Feature Descriptor Diagrams of different species, we were able to differentiate the species. Moreover, when feature descriptor vectors for similar species are calculated, they are effective in bringing out the similarities between the species, though they still retain their individual distinguishing features. By constructing the Feature Descriptor Diagram for a species, we get the best identifier for that particular species.
An alternative approach to automatic species classification and identification of species using Self-Organizing Feature Map is also discussed in the paper. The computational map is trained by using the DNA-descriptors from different species as the training inputs. The maps for different dimensions are constructed and analyzed for optimum performance. The scheme presents a novel method for identifying a species from its genome sequence with the help of a two dimensional map of neuronal clusters, where each cluster represents a particular species. The map is shown to provide an easier technique for recognition and classification of a species based on its genomic data. Maps of different dimensions are constructed and analyzed on the basis of their efficiency in clustering the extracted features from genomic data of different species.
The SOFM can also help demonstrate homology between new sequences and existing phyla. When a new sequence is obtained, its DNA-descriptor is computed and its distance to each existing neuron is calculated. The winning neuron, the one at the smallest distance, indicates which species the sequence belongs to, or whether it belongs to a new species, and in turn which phylum that species belongs to.
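The winning-neuron step can be sketched as a nearest-vector search. Everything specific here is an assumption made for illustration: the use of Euclidean distance, the distance threshold for declaring a new species, the function names, and the made-up four-number descriptors. The summary does not state how the paper defines these.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def winning_neuron(descriptor, neurons):
    """Return (label, distance) of the neuron closest to the descriptor."""
    label = min(neurons, key=lambda name: euclidean(descriptor, neurons[name]))
    return label, euclidean(descriptor, neurons[label])


def classify(descriptor, neurons, threshold):
    """Assign the winner's label, or 'new species' if the winner is too far."""
    label, distance = winning_neuron(descriptor, neurons)
    return label if distance <= threshold else 'new species'


# Made-up trained neuron weights, one neuron per species.
neurons = {
    'species A': [0.30, 0.20, 0.20, 0.30],
    'species B': [0.20, 0.30, 0.30, 0.20],
}

close_match = classify([0.29, 0.21, 0.20, 0.30], neurons, threshold=0.05)
far_away = classify([0.90, 0.00, 0.00, 0.10], neurons, threshold=0.05)
print(close_match, far_away)
```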
Currently, work in bioinformatics and biological data mining is aimed at discovering the parts of the DNA sequence that are translated into proteins and the functions they serve in forming different parts of the body, i.e. identifying the genes and their functionality. Another trend is to predict the structure and function of these sequences. The novelty of this work, however, is automatic species identification from genomic data.
Data Mining and Visualization of Mouse Genome Data:
6:35 AM
GCbioinfo
Last edited by
Quratt ul ain Siddique
Summary:
This paper discusses data mining of mouse genomics, an area of importance because of its relationship to understanding the basic genetics of other mammals, in particular humans, as well as livestock genetics and breeding.
The data mining tools of multiplot, data partition, clustering, self-organized maps (SOM), regression, association, and neural networks were all used in this research. The paper demonstrates data mining and visualization results including a virtual gene map, mouse genomic features on the chromosome, clustering, cluster proximity, T-Scores effect, self-organizing map, and regression analysis. One of the novelties of this research is that the data mining is performed at the genomic level of a mammal that is commonly used as a prototype for testing in humans.
The data mining performed on the mouse genome data indicated a linearity of regression for the B16F0 chromosome, a significant reduction in the average error upon using neural network algorithms, a significant effect in the visualization plots upon using self-organized maps (SOM), and a nonlinear relationship of the cubic clustering criterion, with discontinuities when the number of clusters reached 22 and 38.
The results of the data mining also indicated that it was useful to visualize the mouse data at the genomic level. The analysis shown here can also help researchers who are interested in genome data, and others, to visualize the use of data mining at this micro-dimensional level.
Future directions of the research are to continue to perform more data mining of the mouse genome data. This may entail using other data mining tools and software. Other future directions are to perform data mining on other databases, such as those for other mammals that have an evolutionary relationship to humans, and on genomic databases of differing dimensionalities, to contrast with the findings of the research presented in this paper.