Monday, December 20, 2010

imageJ download:gcufbioinfo


Saturday, December 18, 2010

Beginning Python for Bioinformatics:gcufbioinfo

download and install python:gcufbioinfo

python installation:gcufbioinfo

User-defined Functions:gcufbioinfo

Let's create another function. How about reverse?
>>> def reverse(s): 
...     """Return the sequence string in reverse order.""" 
...     letters = list(s) 
...     letters.reverse() 
...     return ''.join(letters) 
...      
>>> reverse('CCGGAAGAGCTTACTTAG') 
'GATTCATTCGAGAAGGCC' 
There are a few new things in this function that need explanation. First, we've used an argument name of "s" instead of "dna". You can name your arguments whatever you like in Python. It is something of a convention to use short names based on their expected value or meaning, so "s" for string is fairly common in Python code. The other reason to use "s" instead of "dna" in this example is that this function works correctly on any string, not just strings representing DNA sequences. So "s" is a better reflection of the generic utility of this function than "dna".
You can see that the reverse function takes in a string, creates a list based on the string, and reverses the order of the list. Now we need to put the list back together as a string so we can return a string. Python string objects have a join() method that joins together a list into a string, separating each list element by a string value. Since we do not want any character as a separator, we use the join() method on an empty string, represented by two quotes ('' or "").
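As a side note, the same result can be had with Python's slice notation. The sketch below is only an alternative for comparison; the name reverse_slice is our own and does not appear in the original example.

```python
def reverse_slice(s):
    """Return the sequence string in reverse order using an extended slice."""
    # A step of -1 walks the string from the last character to the first.
    return s[::-1]


print(reverse_slice('CCGGAAGAGCTTACTTAG'))  # GATTCATTCGAGAAGGCC
```

The list-based version above is still worth knowing, because it shows how list() and join() convert between strings and lists.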

User-defined Functions:gcufbioinfo

Here is the process for creating your own function in Python. The first line begins with the keyword def, is followed by the name of the function and any arguments (expected input values) surrounded by parentheses, and ends with a colon. Subsequent lines make up the body of the function and must be indented. If a string literal appears as the first statement of the body, it becomes the documentation string (docstring) for the function. A return statement, usually the last line of the function, hands a result back to the caller.
Let's define some functions in the PyCrust shell. Then we can try each function with some sample data and see the result returned by the function.
>>> def transcribe(dna): 
...     """Return dna string as rna string.""" 
...     return dna.replace('T', 'U') 
...      
>>> transcribe('CCGGAAGAGCTTACTTAG') 
'CCGGAAGAGCUUACUUAG' 
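To see that the string on the first line of the body really does become the function's documentation, we can read it back through the __doc__ attribute. This is a small sketch written as a script rather than a shell session, and it simply repeats the transcribe function from above.

```python
def transcribe(dna):
    """Return dna string as rna string."""
    return dna.replace('T', 'U')


# The docstring is stored on the function object itself.
print(transcribe.__doc__)                  # Return dna string as rna string.
print(transcribe('CCGGAAGAGCTTACTTAG'))    # CCGGAAGAGCUUACUUAG
```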

Python Functions:gcufbioinfo

Functions perform an operation on one or more values and return a result. Python comes with many pre-defined functions, as well as the ability to define your own functions. Let's look at a couple of the built-in functions: len() returns the number of items in a sequence; dir() returns a list of strings representing the attributes of an object; list() returns a new list initialized from some other sequence.
>>> dna = 'CTGACCACTTTACGAGGTTAGC' 
>>> bases = ['A', 'C', 'G', 'T'] 
>>> len(dna) 
22 
>>> len(bases) 
4 
>>> dir(dna) 
['__add__', '__class__', '__contains__', '__delattr__',  
'__doc__', '__eq__', '__ge__', '__getattribute__', '__getitem__',  
'__getslice__', '__gt__', '__hash__', '__init__', '__le__',  
'__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__',  
'__repr__', '__rmul__', '__setattr__', '__str__', 'capitalize',  
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs',  
'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',  
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',  
'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split',  
'splitlines', 'startswith', 'strip', 'swapcase', 'title',  
'translate', 'upper'] 
>>> dir(bases) 
['__add__', '__class__', '__contains__', '__delattr__',  
'__delitem__', '__delslice__', '__doc__', '__eq__', '__ge__',  
'__getattribute__', '__getitem__', '__getslice__', '__gt__',  
'__hash__', '__iadd__', '__imul__', '__init__', '__le__', '__len__',  
'__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__repr__',  
'__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__',  
'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',  
'reverse', 'sort'] 
>>> list(dna) 
['C', 'T', 'G', 'A', 'C', 'C', 'A', 'C', 'T', 'T', 'T',  
'A', 'C', 'G', 'A', 'G', 'G', 'T', 'T', 'A', 'G', 'C'] 
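The dir() listings above are long, and the exact names vary between Python versions. A minimal sketch of how we might trim them down to the everyday methods, assuming we only care about names that do not start with an underscore (the variable public_methods is our own):

```python
dna = 'CTGACCACTTTACGAGGTTAGC'

# Keep only the attribute names that do not start with an underscore.
public_methods = [name for name in dir(dna) if not name.startswith('_')]

print(len(dna))                    # 22
print('count' in public_methods)   # True
print(list(dna)[:4])               # ['C', 'T', 'G', 'A']
```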

Python Lists:gcufbioinfo

Where Python strings are limited to characters, Python lists have no limitations. Python lists are ordered sequences of arbitrary Python objects, including other lists. In addition, you can insert, delete and replace elements in a list. Lists are written as a series of objects, separated by commas, inside of square brackets. Let's look at some lists, and some operations you can perform on lists.
>>> bases = ['A', 'C', 'G', 'T'] 
>>> bases 
['A', 'C', 'G', 'T'] 
>>> bases.append('U') 
>>> bases 
['A', 'C', 'G', 'T', 'U'] 
>>> bases.reverse() 
>>> bases 
['U', 'T', 'G', 'C', 'A'] 
>>> bases[0] 
'U' 
>>> bases[1] 
'T' 
>>> bases.remove('U') 
>>> bases 
['T', 'G', 'C', 'A'] 
>>> bases.sort() 
>>> bases 
['A', 'C', 'G', 'T']
In this example we created a list of single characters that we called bases. Then we added an element to the end, reversed the order of all the elements, retrieved elements by their index position, removed an element with the value 'U', and sorted the elements. Removing an element from a list illustrates a situation where we need to supply the remove() method with an additional piece of information, namely the value that we want to remove from the list. As you can see in the picture below, PyCrust takes advantage of Python's ability to let us know what is required for most operations by displaying that information in a call tip pop-up window.
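One detail worth trying for yourself is what happens when remove() is given a value that is not in the list. The short script below is a sketch of that case; Python raises a ValueError, which we catch here only to print a message.

```python
bases = ['A', 'C', 'G', 'T']
bases.append('U')

# remove() deletes the first element equal to the value we pass in.
bases.remove('U')

try:
    # 'U' is gone now, so a second attempt raises ValueError.
    bases.remove('U')
except ValueError:
    print("'U' is no longer in the list")

print(bases)  # ['A', 'C', 'G', 'T']
```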

Python Strings:gcufbioinfo


Let's take a look at the example code in more detail. The first thing we did was to create a string and assign it to a variable. Strings in Python are sequences of characters. You create a string literal by enclosing the characters in single ('), double (") or triple (''' or """) quotes. In the example we assigned the string literal CTGACCACTTTACGAGGTTAGC to the variable named dna.
>>> dna = 'CTGACCACTTTACGAGGTTAGC'
Then we simply typed the name of the variable, and Python responded by displaying the value of that variable, surrounding the value with quotes to remind us that the value is a string.
>>> dna 
'CTGACCACTTTACGAGGTTAGC' 
A Python string has several built-in capabilities. One of them is the ability to return a copy of itself with all lowercase letters. These capabilities are known as methods. To invoke a method of an object, use the dot syntax. That is, you type the name of the variable (which in this case is a reference to a string object) followed by the dot (.) operator, then the name of the method followed by opening and closing parentheses.
>>> dna.lower() 
'ctgaccactttacgaggttagc' 
You can access part of a string using the indexing operator s[i]. Indexing begins at zero, so s[0] returns the first character in the string, s[1] returns the second, and so on.
>>> dna[0] 
'C' 
>>> dna[1] 
'T' 
>>> dna[2] 
'G' 
>>> dna[3] 
'A' 
The final line in our screen shot shows PyCrust's autocompletion feature, whereby a list of valid methods (and properties) of an object is displayed when a dot is typed following an object variable. As you can see, Python strings have many built-in capabilities that you can experiment with in the Python shell. Now let's look at one of the other Python sequence types, the list.
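Indexing also works from the end of the string, and a range of characters can be pulled out with a slice. Slicing is not covered in the text above, so treat this as a small preview rather than part of the original example; the variable names are our own.

```python
dna = 'CTGACCACTTTACGAGGTTAGC'

last_base = dna[-1]      # negative indexes count back from the end
first_codon = dna[0:3]   # characters at positions 0, 1 and 2

print(last_base)         # C
print(first_codon)       # CTG
print(dna.count('T'))    # 6
```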

Beginning Python for Bioinformatics:gcufbioinfo

Bioinformatics, the use of computers in biological research, is the newest wrinkle on one of the oldest pursuits--trying to uncover the secret of life. While we may not know all of life's secrets, at the very least computers are helping us understand many of the biological processes that take place inside of living things. In fact, the use of computers in biological research has risen to such a degree that computer programming has now become an important and almost essential skill for today's biologists.
The purpose of this article is to introduce Python as a useful and viable development language for the computer programming needs of the bioinformatics community. In this introduction, we'll identify some of the advantages of using Python for bioinformatics. Then we'll create and demonstrate examples of working code to get you started. In subsequent articles we'll explore some significant bioinformatics projects that make use of Python.

download python cookbook:gcufbioinfo


ucsf Chimera full download:

download UCSF Chimera:

What I concluded:


Conclusion
From all these research papers I have concluded that, among the data mining techniques discussed, PCA is considered the best because it reduces large data sets to smaller ones and narrows down our research so that we can extract useful information from small data sets. The dimensionality of a large dataset can be reduced using principal component analysis, which is considered one of the most popular and useful statistical methods. This method transforms the original data into new dimensions.
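To make the idea of transforming data into new dimensions more concrete, here is a minimal sketch of finding the first principal component in plain Python. It is an illustration only, not the method used in any of the papers: the function name, the use of power iteration, and the toy data are all our own assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) for the first principal component."""
    n = len(rows)
    dims = len(rows[0])
    # Center each column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix (dims x dims).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project every centered sample onto that direction: one number per sample.
    scores = [sum(r[j] * vec[j] for j in range(dims)) for r in centered]
    return vec, scores


# Hypothetical toy data: three measurements per sample, not taken from any paper.
data = [
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 0.4],
    [2.2, 2.9, 0.6],
    [1.9, 2.2, 0.5],
    [3.1, 3.0, 0.4],
]
direction, scores = first_principal_component(data)
print(direction)
print(scores)
```

Each sample started with three values and ends up with a single score along the direction of greatest variance, which is the kind of reduction described above. In practice a numerical library would normally be used instead of hand-written loops.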

Predicting Breast Cancer Survivability Using Data Mining Techniques


Last edited by
Quratt ul ain Siddique
Summary:
This paper presents the prediction of breast cancer survivability using data mining techniques. The authors investigated three data mining techniques: Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree algorithm. The C4.5 decision tree algorithm is considered the best of the three. The study uses SEER data and introduces a pre-classification approach that takes into account three variables: Survival Time Recode (STR), Vital Status Recode (VSR), and Cause of Death (COD).
The three data mining techniques are compared to find which one is best at predicting the breast cancer survivability rate. The Weka toolkit is used for experimentation with the three algorithms, and the raw SEER data is pre-processed using different tools.
In this approach, missing information in the SEER data is ignored entirely, and the three variables STR, VSR, and COD are taken into account. The goal is to attain high precision and accuracy from these techniques. These metrics are mostly used in information retrieval, but here they are considered in relation to other existing metrics such as specificity and sensitivity.
The paper discusses and resolves the issues, algorithms, techniques, and problems related to predicting breast cancer survivability using the SEER database. It also reports that, among the three data mining techniques, the C4.5 decision tree is considered the best because it showed the highest accuracy, precision, and recall.
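For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, recall (sensitivity), and specificity are computed from the four counts of a confusion matrix. The function name and the counts are hypothetical and are not results from the paper; Python 3 division is assumed.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common metrics computed from confusion-matrix counts."""
    return {
        'accuracy': (tp + tn) / (tp + fp + tn + fn),
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),         # also called sensitivity
        'specificity': tn / (tn + fp),
    }


# Hypothetical counts, not results from the paper.
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```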



Combined Supervised and Unsupervised Learning in Genomic Data Mining


Last edited by
Quratt ul ain Siddique
Summary:
In this paper the authors introduce the most comprehensive method for predicting the function of proteins. Their approach differs in several respects from the earlier work in that it uses a multistage decomposition that makes use of both unsupervised and supervised machine learning techniques; they refer to this as the Unsupervised-Supervised Tree (UST) algorithm.
The typical first stage (optional) of the UST uses clustering algorithms such as neural network self-organizing maps (SOMs) and K-means; this is the unsupervised stage. Subsequent indispensable stages typically involve constructing a Maximum Contrast Tree (MCT) so that protein functional relationships can be mapped onto the relational tree structure.
The MCTs are a family of completely independent algorithms that can be used alone. Testing is based on a newly developed MLIC (Multiple-Labeled Instance Classifier) based on supervised K nearest neighbor classifier on the tree structure. Performance has been compared with the decision tree C4.5 and C5 programs and with support vector machines.
Based on the experiments, UST algorithms appear to perform considerably better than decision tree algorithms C4.5 and C5, and support vector machines, and can provide a viable alternative to supervised or unsupervised methods alone. In addition, UST and MLIC classifiers are capable of handling protein functional classes with a small number of proteins (rare events), and also handle multifunctional proteins. The abilities of the USTs and MLICs to handle such cases means that a larger dataset can be used, which may provide deeper insight into protein functional relationships at the genomic level, and thus may lead to a better understanding of evolution at a molecular and genomic level.






Acute Coronary Syndrome Prediction Using Data Mining Techniques- An Application


Last edited by
Quratt ul ain Siddique
Summary:
In this research paper, data mining techniques are used to investigate the factors that increase the risk of acute coronary syndrome. The authors applied binary regression to the factors affecting the dependent variable. To improve the performance of the regression model in predicting coronary syndrome, a data reduction technique, principal component analysis, is applied. Based on the results of the data reduction, they considered only 14 of the 16 factors.
A logistic regression model is used to find the factors responsible for Acute Coronary Syndrome (ACS). To analyze the problem, data mining techniques are used to compare persons who have ACS with those who do not.
First, data reduction techniques are applied to reduce the dimensions. After data reduction, the fourteen independent variables are age, gender, smoke, hypertension, family history, diabetes mellitus, fasting blood sugar, random blood sugar, cholesterol, streptokinase, blood pressure (systolic), blood pressure (diastolic), heart rate and pulse rate. The calculated significance value for smoking is "0", indicating that it is strongly associated with the risk of ACS. The Wald statistics show positive coefficients for heart rate (HR), random blood sugar (RBS), and systolic blood pressure (BPS), meaning that the risk of ACS increases as the values of these factors increase.
The negative coefficients of diastolic blood pressure (BPd) and pulse rate (PR) indicate that the risk of ACS increases as these values decrease. They observed that smoking is considered the most serious cause of Acute Coronary Syndrome.
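The effect of a coefficient's sign in a logistic regression model can be seen with a small sketch. The intercept, coefficient values, and function name below are hypothetical, chosen only to illustrate the direction of the effect; they are not the values estimated in the paper.

```python
import math


def acs_probability(intercept, coefficients, values):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * values[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients: positive for heart rate, negative for pulse rate.
coefficients = {'HR': 0.03, 'PR': -0.02}

low_hr = acs_probability(-2.0, coefficients, {'HR': 60, 'PR': 70})
high_hr = acs_probability(-2.0, coefficients, {'HR': 100, 'PR': 70})
low_pr = acs_probability(-2.0, coefficients, {'HR': 80, 'PR': 60})
high_pr = acs_probability(-2.0, coefficients, {'HR': 80, 'PR': 90})

print(high_hr > low_hr)   # True: a positive coefficient raises the risk as the value rises
print(low_pr > high_pr)   # True: a negative coefficient raises the risk as the value falls
```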



Biological Data Mining for Genomic Clustering Using Unsupervised Neural Learning:

Last edited by
Quratt ul ain Siddique

Summary:

Among the well-known techniques of DNA-string matching are the Smith-Waterman algorithm for local alignment, the Needleman-Wunsch algorithm for global alignment, hidden Markov models, matrix models, and evolutionary algorithms for multiple sequence alignment. These works, though extremely valuable, have their limitations.
Principal component analysis is then employed on the DNA-descriptors for N sampled instances. The principal component analysis yields a unique feature descriptor for identifying the species from its genome sequence. The variance of the descriptors for a given genome sequence being negligible, the proposed scheme finds extensive applications in automatic species identification. PCA is used to find the structural signature within a sequence or species, which is then used to differentiate species without loss of accuracy. Since PCA is a well-known tool for data reduction without loss of accuracy, we claim that our results on feature extraction from the genome database are also free from loss of accuracy.
It is quite evident that the feature descriptors provide a more unique identifier for the species from its genomic data. Thus we have certainly gained an advantage by incorporating the data reduction tool PCA into our search for an effective identifier for a species. It has been found out that the DNA-descriptors obtained from different samples of the same species contain wide disparities. But the Feature Descriptors obtained after processing a different set of DNA-descriptors are unique and present absolutely no significant disparities. Hence the Feature Descriptor Diagrams can be used as the unique representation of the genomic characteristics of the different species.
Feature Descriptors are more accurate in identifying different species because they are obtained from the mitochondrial genome rather than the whole genome sequence, and applying PCA to them gave accurate results.
If only the frequency count is plotted, we do get some difference from species to species, but it is not enough to distinguish between them. This is where PCA comes in. When PCA is applied to the data to obtain the Feature Descriptor Diagrams of different species, we are able to differentiate the species. Moreover, when feature descriptor vectors for similar species are calculated, they are effective in bringing out the similarities between the species while still retaining their individual distinguishing features. By constructing the Feature Descriptor Diagram for a species we get the best identifier for that particular species.
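The summary does not say exactly how the DNA-descriptors are built, so the sketch below only illustrates the simpler "frequency count" idea mentioned above, assuming it means the relative frequency of each base. The function name is our own.

```python
from collections import Counter


def base_frequencies(dna):
    """Return the relative frequencies of A, C, G and T, in that order."""
    counts = Counter(dna)
    total = len(dna)
    return [counts[base] / total for base in 'ACGT']


print(base_frequencies('CTGACCACTTTACGAGGTTAGC'))
```

Vectors like this one could then serve as the input on which a reduction technique such as PCA is run.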

An alternative approach to automatic species classification and identification of species using Self-Organizing Feature Map is also discussed in the paper. The computational map is trained by using the DNA-descriptors from different species as the training inputs. The maps for different dimensions are constructed and analyzed for optimum performance. The scheme presents a novel method for identifying a species from its genome sequence with the help of a two dimensional map of neuronal clusters, where each cluster represents a particular species. The map is shown to provide an easier technique for recognition and classification of a species based on its genomic data. Maps of different dimensions are constructed and analyzed on the basis of their efficiency in clustering the extracted features from genomic data of different species.
The SOFM can also help demonstrate homology between new sequences and existing phyla. When a new sequence is obtained, its DNA descriptor is computed and its distance to each existing neuron is calculated. The winning neuron, the one at the smallest distance, indicates which species the sequence belongs to, or that it belongs to a new species. From this, the phylum to which the species belongs can then be determined.
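The winning-neuron step can be sketched as a nearest-neighbour search. This is only an illustration under our own assumptions: Euclidean distance, a distance threshold for declaring a new species, and made-up neuron weights. None of these details are taken from the paper.

```python
import math


def classify_descriptor(descriptor, neurons, threshold):
    """Return the label of the nearest neuron, or None if none is close enough."""
    best_label, best_distance = None, float('inf')
    for label, weights in neurons.items():
        distance = math.sqrt(sum((d - w) ** 2 for d, w in zip(descriptor, weights)))
        if distance < best_distance:
            best_label, best_distance = label, distance
    # A descriptor far from every neuron is treated as a possible new species.
    return best_label if best_distance <= threshold else None


# Hypothetical trained neuron weights, one neuron per species.
neurons = {
    'species_a': [0.20, 0.30, 0.25, 0.25],
    'species_b': [0.35, 0.15, 0.15, 0.35],
}

print(classify_descriptor([0.22, 0.28, 0.25, 0.25], neurons, threshold=0.1))  # species_a
print(classify_descriptor([0.90, 0.05, 0.03, 0.02], neurons, threshold=0.1))  # None
```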
Currently, work in bioinformatics and biological data mining is aimed at discovering the parts of the DNA sequence that are translated into proteins and the functions they serve in forming different parts of the body, i.e. identifying the genes and their functionality. Another trend is to predict the structure and functions of these sequences. The novelty of this work, however, is automatic species identification from genomic data.




Data Mining and Visualization of Mouse Genome Data:

Last edited by
Quratt ul ain Siddique


Summary:

This paper discusses data mining of the genomics of the mouse, an area of importance because of its relationship to the understanding of the basic genetics of other mammals, in particular humans, as well as livestock genetics and breeding.
The data mining tools of multiplot, data partition, clustering, self-organized maps (SOM), regression, association, and neural networks were all used in this research. The paper has demonstrated the data mining and visualization results including virtual gene map, mouse genomic features on chromosome, clustering, cluster proximity, T-Scores effect, self-organizing map, and regression analysis. One of the novelties of this research is that the data mining is performed at the genomic level of a mammal that is commonly used in prototype testing for humans.
The data mining performed on the mouse genome data indicated a linearity of regression for the B16F0 Chromosome, a significant reduction in the average error upon using neural network algorithms, a significant effect in the visualization plots upon using self-organized maps (SOM), and a nonlinear relationship of the cubic clustering criterion with discontinuities when the number of clusters reached 22 and 38.
The results of the data mining performed also indicated that it was useful to visualize the mouse data at the genomic level. The analysis shown here can also help researchers who are interested in genome data, and others, to visualize the use of data mining at this micro-dimensional level.
Future directions of the research are to continue to perform more data mining of the mouse genome data. This may entail using other data mining tools and software. Other future directions are to perform data mining for other databases, such as those for other mammals that have an evolutionary relationship to humans, and also for other genomic databases of differing dimensionalities, to contrast with the findings of the research presented in this paper.

