The support vector machine (SVM) method based on n-peptide composition (Yu et al., Proteins: Struct. Funct. Genet. 2003; 50:531-536) is used to predict the subcellular localizations of proteins. For an unbiased assessment of the results, we apply our approach to two independent data sets. The first (Reinhardt and Hubbard, Nucleic Acids Res. 1998; 26:2230-2236) consists of two parts: a prokaryotic set of 997 protein sequences in three categories and a eukaryotic set of 2427 sequences in four localization categories. The second comprises 2191 proteins in 12 subcellular localizations (Chou and Cai, J. Biol. Chem. 2002; 277:45765-45769). Our approach provides excellent results for both data sets. For the first data set, it gives an overall prediction accuracy of 93.2% for prokaryotic sequences and 88.1% for eukaryotic sequences. It also yields a significantly better Matthews correlation coefficient for each subcellular localization than the existing approaches. For the second data set, our approach achieves an overall prediction accuracy of 83.2%, which is around 10% higher than the best existing result. Our approach should be valuable in the high-throughput analysis of genomics and proteomics.
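As a rough illustration of the two quantities named in the abstract, the sketch below computes an n-peptide composition vector for a sequence and a Matthews correlation coefficient from a confusion table. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `n_peptide_composition` and `matthews_cc`, the choice of the 20 standard residues as the alphabet, the skipping of windows that contain other characters, and the convention of returning 0 when the coefficient is undefined are all hypothetical choices made for illustration.

```python
from itertools import product
from math import sqrt

# Assumption: features are built over the 20 standard amino acids only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_peptide_composition(sequence, n=2):
    """Return the relative frequency of every possible n-peptide.

    The vector has 20**n entries, ordered lexicographically over
    AMINO_ACIDS (for n=2: AA, AC, AD, ...).
    """
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    # Slide a window of length n along the sequence.
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if window in counts:  # skip windows with non-standard residues
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one localization class.

    tp, tn, fp, fn are the true/false positive/negative counts obtained
    by treating that class as positive and all others as negative.
    Returns 0.0 when the denominator vanishes (an assumed convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    features = n_peptide_composition("MKTAYIAKQR", n=2)
    print(len(features))            # number of dipeptide features
    print(matthews_cc(40, 50, 5, 5))
```

Vectors of this kind could then be passed to any SVM implementation; which kernel, value of n, and parameters the cited method uses is not stated in this abstract.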
State: Published, 2008
- AMINO-ACID-COMPOSITION; SUPPORT VECTOR MACHINES; FUNCTIONAL DOMAIN COMPOSITION; SECONDARY STRUCTURE; NEURAL-NETWORKS; LOCATION; ACCURACY; SEQUENCE