R package for generating various numerical representation schemes of protein sequences (2024)

Nan Xiao <https://nanx.me>

Introduction

The protr package offers a unique and comprehensive toolkit forgenerating various numerical representation schemes of proteinsequences. The descriptors included are extensively utilized inbioinformatics and chemogenomics research.

The commonly used descriptors listed in protr include amino acidcomposition, autocorrelation, CTD, conjoint traid, quasi-sequence order,pseudo amino acid composition, and profile-based descriptors derived byPosition-Specific Scoring Matrix (PSSM).

The descriptors for proteochemometric (PCM) modeling include thescales-based descriptors derived by principal components analysis,factor analysis, multidimensional scaling, amino acid properties(AAindex), 20+ classes of 2D and 3D molecular descriptors (for example,Topological, WHIM, VHSE, among others), and BLOSUM/PAM matrix-deriveddescriptors.

The protr package also implemented parallelized similaritycomputation derived by pairwise protein sequence alignment and GeneOntology (GO) semantic similarity measures. ProtrWeb, the webapplication built on protr, can be accessed from http://protr.org.

If you find protr useful in your research, please feel free to citeour paper:

Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015).protr/ProtrWeb: R package and web server for generating variousnumerical representation schemes of protein sequences.Bioinformatics 31 (11), 1857-1859.

BibTeX entry:

@article{Xiao2015, author = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song.}, title = {protr/{ProtrWeb}: {R} package and web server for generating various numerical representation schemes of protein sequences}, journal = {Bioinformatics}, year = {2015}, volume = {31}, number = {11}, pages = {1857--1859}, doi = {10.1093/bioinformatics/btv042}}

An example predictive modeling workflow

Here, we use the subcellular localization dataset of human proteinspresented in Chou and Shen (2008) todemonstrate the basic usage of protr.

The complete dataset includes 3,134 protein sequences (2,750different proteins) classified into 14 human subcellular locations. Weselected two classes of proteins as our benchmark dataset. Class 1contains 325 extracell proteins and class 2 includes 307mitochondrion proteins. We aim to build a random forestclassification model to classify these two types of proteins.

First, we load the protr package, then read the protein sequencesstored in two separated FASTA files with readFASTA():

library("protr")# Load FASTA files# Note that `system.file()` is for accessing example files# in the protr package. Replace it with your own file path.extracell <- readFASTA( system.file( "protseq/extracell.fasta", package = "protr" ))mitonchon <- readFASTA( system.file( "protseq/mitochondrion.fasta", package = "protr" ))

To read protein sequences stored in PDB format files, usereadPDB() instead. The loaded sequences will be stored astwo lists in R, and each component is a character string representingone protein sequence. In this case, there are 325 extracellprotein sequences and 306 mitonchon protein sequences:

length(extracell)
## [1] 325
length(mitonchon)
## [1] 306

To ensure that the protein sequences only have the 20 standard aminoacid types, which is usually required for the descriptor computation, weuse the protcheck() function to do the amino acid typesanity check and remove the non-standard sequences:

extracell <- extracell[(sapply(extracell, protcheck))]mitonchon <- mitonchon[(sapply(mitonchon, protcheck))]
length(extracell)
## [1] 323
length(mitonchon)
## [1] 304

Two protein sequences were removed from each class. For the remainingsequences, we calculate the Type II PseAAC descriptor, i.e., theamphiphilic pseudo amino acid composition (APseAAC) descriptor (Chou 2005) and make class labels forclassification modeling.

# Calculate APseAAC descriptorsx1 <- t(sapply(extracell, extractAPAAC))x2 <- t(sapply(mitonchon, extractAPAAC))x <- rbind(x1, x2)# Make class labelslabels <- as.factor(c(rep(0, length(extracell)), rep(1, length(mitonchon))))

In protr, the functions of commonly used descriptors for proteinsequences and proteochemometric (PCM) modeling descriptors are namedafter extract...().

Next, we will split the data into a 75% training set and a 25% testset.

set.seed(1001)# Split training and test settr.idx <- c( sample(1:nrow(x1), round(nrow(x1) * 0.75)), sample(nrow(x1) + 1:nrow(x2), round(nrow(x2) * 0.75)))te.idx <- setdiff(1:nrow(x), tr.idx)x.tr <- x[tr.idx, ]x.te <- x[te.idx, ]y.tr <- labels[tr.idx]y.te <- labels[te.idx]

We will train a random forest classification model on the trainingset with 5-fold cross-validation, using the randomForestpackage.

library("randomForest")rf.fit <- randomForest(x.tr, y.tr, cv.fold = 5)rf.fit

The training result is:

## Call:## randomForest(x = x.tr, y = y.tr, cv.fold = 5)## Type of random forest: classification## Number of trees: 500## No. of variables tried at each split: 8#### OOB estimate of error rate: 25.11%## Confusion matrix:## 0 1 class.error## 0 196 46 0.1900826## 1 72 156 0.3157895

With the model trained on the training set, we predict on the testset and plot the ROC curve with the pROC package, as isshown in Figure 1.

# Predict on test setrf.pred <- predict(rf.fit, newdata = x.te, type = "prob")[, 1]# Plot ROC curvelibrary("pROC")plot.roc(y.te, rf.pred, grid = TRUE, print.auc = TRUE)

The area under the ROC curve (AUC) is:

## Call:## plot.roc.default(x = y.te, predictor = rf.pred, col = "#0080ff",## grid = TRUE, print.auc = TRUE)#### Data: rf.pred in 81 controls (y.te 0) > 76 cases (y.te 1).## Area under the curve: 0.8697

R package for generating various numerical representation schemes of protein sequences (1)

Figure 1: ROC curve for the test set of protein subcellular localizationdata.

Package overview

The protr package (Xiao et al. 2015)implemented most of the state-of-the-art protein sequence descriptorswith R. Generally, each type of the descriptor (feature) can becalculated with a function named extractX() in the protrpackage, where X stands for the abbreviation of thedescriptor name. The descriptors are:

  • Amino acid composition
    • extractAAC() - Amino acid composition
    • extractDC() - Dipeptide composition
    • extractTC() - Tripeptide composition
  • Autocorrelation
    • extractMoreauBroto() - Normalized Moreau-Brotoautocorrelation
    • extractMoran() - Moran autocorrelation
    • extractGeary() - Geary autocorrelation
  • CTD descriptors
    • extractCTDC() - Composition
    • extractCTDT() - Transition
    • extractCTDD() - Distribution
  • Conjoint triad descriptors
    • extractCTriad() - Conjoint triad descriptors
  • Quasi-sequence-order descriptors
    • extractSOCN() - Sequence-order-coupling number
    • extractQSO() - Quasi-sequence-order descriptors
  • Pseudo-amino acid composition
    • extractPAAC() - Pseudo-amino acid composition(PseAAC)
    • extractAPAAC() - Amphiphilic pseudo-amino acidcomposition (APseAAC)
  • Profile-based descriptors
    • extractPSSM()
    • extractPSSMAcc()
    • extractPSSMFeature()

The descriptors commonly used in Proteochemometric Modeling (PCM)implemented in protr include:

  • extractScales(), extractScalesGap() -Scales-based descriptors derived by Principal Components Analysis
    • extractProtFP(), extractProtFPGap() -Scales-based descriptors derived by amino acid properties from AAindex(a.k.a. Protein Fingerprint)
    • extractDescScales() - Scales-based descriptors derivedby 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM,VHSE, etc.)
  • extractFAScales() - Scales-based descriptors derived byFactor Analysis
  • extractMDSScales() - Scales-based descriptors derivedby Multidimensional Scaling
  • extractBLOSUM() - BLOSUM and PAM matrix-deriveddescriptors

The protr package integrates the function of parallelized similarityscore computation derived by local or global protein sequence alignmentbetween a list of protein sequences; Biostrings and pwalign provide thesequence alignment computation, and the corresponding functions listedin the protr package include:

  • twoSeqSim() - Similarity calculation derived bysequence alignment between two protein sequences
  • parSeqSim() - Parallelized pairwise similaritycalculation with a list of protein sequences (in-memory version)
  • parSeqSimDisk() - Parallelized pairwise similaritycalculation with a list of protein sequences (disk-based version)
  • crossSetSim() - Parallelized pairwise similaritycalculation between two sets of protein sequences (in-memoryversion)

protr also integrates the function for parallelized similarity scorecomputation derived by Gene Ontology (GO) semantic similarity measuresbetween a list of GO terms / Entrez Gene IDs, the GO similaritycomputation is provided by GOSemSim, the corresponding functions listedin the protr package include:

  • twoGOSim() - Similarity calculation derived by GO-termssemantic similarity measures between two GO terms / Entrez GeneIDs;
  • parGOSim() - Pairwise similarity calculation with alist of GO terms / Entrez Gene IDs.

To use the parSeqSim() function, we suggest the users toinstall the packages foreach and doParallel first in order to make theparallelized pairwise similarity computation available.

In the following sections, we will introduce the descriptors andfunction usage in this order.

Commonly used descriptors

Note: Users need to evaluate the underlying detailsof the descriptors provided intelligently, instead of using protr withtheir data blindly, especially for the descriptor types that have moreflexibility. It would be wise for the users to use some negative andpositive control comparisons where relevant to help guide theinterpretation of the results.

A protein or peptide sequence with \(N\) amino acid residues can be generallyrepresented as \(\{\,R_1, R_2, \ldots,R_n\,\}\), where \(R_i\)represents the residue at the \(i\)-thposition in the sequence. The labels \(i\) and \(j\) are used to index amino acid positionin a sequence, and \(r\), \(s\), \(t\)are used to represent the amino acid type. The computed descriptors areroughly categorized into 4 groups according to their majorapplications.

A protein sequence can be partitioned equally into segments. Thedescriptors designed for the complete sequence can often be applied toeach individual segment.

Amino acid composition descriptor

The amino acid composition describes the fraction of each amino acidtype within a protein sequence. The fractions of all 20 natural aminoacids are calculated as follows:

\[f(r) = \frac{N_r}{N} \quad r = 1, 2, \ldots, 20.\]

where \(N_r\) is the number of theamino acid type \(r\) and \(N\) is the length of the sequence.

As was described above, we can use the functionextractAAC() to extract the descriptors (features) fromprotein sequences:

library("protr")x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr"))[[1]]extractAAC(x)#> A R N D C E Q #> 0.06405694 0.07117438 0.03914591 0.05160142 0.06761566 0.04804270 0.04804270 #> G H I L K M F #> 0.08185053 0.03024911 0.03558719 0.07651246 0.03914591 0.01245552 0.03202847 #> P S T W Y V #> 0.05338078 0.08896797 0.04448399 0.02313167 0.04270463 0.04982206

Here, using the function readFASTA(), we loaded a singleprotein sequence (P00750, Tissue-type plasminogen activator) from aFASTA format file. Then, we extracted the AAC descriptors withextractAAC(). The result returned is a named vector whoseelements are tagged with the name of each amino acid.

Dipeptide composition descriptor

Dipeptide composition gives a 400-dimensional descriptor, definedas:

\[f(r, s) = \frac{N_{rs}}{N - 1} \quad r, s = 1, 2, \ldots, 20.\]

where \(N_{rs}\) is the number ofdipeptides represented by amino acid type \(r\) and type \(s\). Similar to extractAAC(),we use extractDC() to compute the descriptors:

dc <- extractDC(x)head(dc, n = 30L)#> AA RA NA DA CA EA #> 0.003565062 0.003565062 0.000000000 0.007130125 0.003565062 0.003565062 #> QA GA HA IA LA KA #> 0.007130125 0.007130125 0.001782531 0.003565062 0.001782531 0.001782531 #> MA FA PA SA TA WA #> 0.000000000 0.005347594 0.003565062 0.007130125 0.003565062 0.000000000 #> YA VA AR RR NR DR #> 0.000000000 0.000000000 0.003565062 0.007130125 0.005347594 0.001782531 #> CR ER QR GR HR IR #> 0.005347594 0.005347594 0.000000000 0.007130125 0.001782531 0.003565062

Here, we only showed the first 30 elements of the result vector andomitted the rest of the result. The element names of the returned vectorare self-explanatory, as before.

Tripeptide composition descriptor

Tripeptide composition gives an 8000-dimensional descriptor, definedas:

\[f(r, s, t) = \frac{N_{rst}}{N - 2} \quad r, s, t = 1, 2, \ldots, 20\]

where \(N_{rst}\) is the number oftripeptides represented by amino acid type \(r\), \(s\), and \(t\). With the functionextractTC(), we can easily obtain the length-8000descriptor, to save some space, here we also omitted the longoutputs:

tc <- extractTC(x)head(tc, n = 36L)#> AAA RAA NAA DAA CAA EAA #> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 #> QAA GAA HAA IAA LAA KAA #> 0.001785714 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 #> MAA FAA PAA SAA TAA WAA #> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000 #> YAA VAA ARA RRA NRA DRA #> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 #> CRA ERA QRA GRA HRA IRA #> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000 #> LRA KRA MRA FRA PRA SRA #> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000

Autocorrelation descriptors

Autocorrelation descriptors are defined based on the distribution ofamino acid properties along the sequence. The amino acid properties usedhere are various types of amino acids index (Retrieved from the AAindex database,see Kawashima, Ogata, and Kanehisa (1999),Kawashima and Kanehisa (2000), and Kawashima et al. (2008); see Figure 2 for anillustrated example). Three types of autocorrelation descriptors aredefined here and described below.

All the amino acid indices are centralized and standardized beforethe calculation, i.e.

\[P_r = \frac{P_r - \bar{P}}{\sigma}\]

where \(\bar{P}\) is the average ofthe property of the 20 amino acids:

\[\bar{P} = \frac{\sum_{r=1}^{20} P_r}{20} \quad \textrm{and} \quad \sigma= \sqrt{\frac{1}{2} \sum_{r=1}^{20} (P_r - \bar{P})^2}\]

R package for generating various numerical representation schemes of protein sequences (2)

Figure 2: An illustrated example in the AAIndex database.

Normalized Moreau-Broto autocorrelation descriptors

For protein sequences, the Moreau-Broto autocorrelation descriptorscan be defined as:

\[AC(d) = \sum_{i=1}^{N-d} P_i P_{i + d} \quad d = 1, 2, \ldots,\textrm{nlag}\]

where \(d\) is called the lag of theautocorrelation; \(P_i\) and \(P_{i+d}\) are the properties of the aminoacids at position \(i\) and \(i+d\); \(\textrm{nlag}\) is the maximum value of thelag.

The normalized Moreau-Broto autocorrelation descriptors are definedas:

\[ATS(d) = \frac{AC(d)}{N-d} \quad d = 1, 2, \ldots, \textrm{nlag}\]

The corresponding function for this descriptor isextractMoreauBroto(). A typical call would be:

moreau <- extractMoreauBroto(x)head(moreau, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.081573213 -0.016064817 -0.015982990 -0.025739038 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.079058632 -0.042771564 -0.036320847 0.024087298 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> -0.005273958 0.052274763 0.082170073 0.005419919 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.083292042 0.004810584 0.001872446 -0.001531495 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> -0.011917230 0.071161551 0.033473197 0.026882737 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.073075402 0.115272790 0.041517897 -0.027025993 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.033477388 -0.003245255 0.078117010 -0.028177304 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.046695832 0.020584423 0.052740185 0.030804784 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> 0.037170476 -0.058993771 0.070641780 -0.089192490

The eight default properties used here are:

  • AccNo. CIDH920105 - Normalized Average Hydrophobicity Scales
  • AccNo. BHAR880101 - Average Flexibility Indices
  • AccNo. CHAM820101 - Polarizability Parameter
  • AccNo. CHAM820102 - Free Energy of Solution in Water, kcal/mole
  • AccNo. CHOC760101 - Residue Accessible Surface Area inTripeptide
  • AccNo. BIGC670101 - Residue Volume
  • AccNo. CHAM810101 - Steric Parameter
  • AccNo. DAYM780201 - Relative Mutability

Users can change the property names of the AAindex database with theargument props. The AAindex data shipped with protr can beloaded by data(AAindex) which has detailed information oneach property. With the arguments customprops andnlag, users can specify their own properties and lag valueto calculate with. For example:

# Define 3 custom propertiesmyprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43))# Use 4 properties in the AAindex database and 3 customized propertiesmoreau2 <- extractMoreauBroto( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ))head(moreau2, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.081573213 -0.016064817 -0.015982990 -0.025739038 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.079058632 -0.042771564 -0.036320847 0.024087298 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> -0.005273958 0.052274763 0.082170073 0.005419919 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.083292042 0.004810584 0.001872446 -0.001531495 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> -0.011917230 0.071161551 0.033473197 0.026882737 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.073075402 0.115272790 0.041517897 -0.027025993 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.033477388 -0.003245255 0.078117010 -0.028177304 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.046695832 0.020584423 0.052740185 0.030804784 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> 0.037170476 -0.058993771 0.070641780 -0.089192490

About the standard input format of props andcustomprops, see ?extractMoreauBroto fordetails.

Moran autocorrelation descriptors

For protein sequences, the Moran autocorrelation descriptors can bedefined as:

\[I(d) = \frac{\frac{1}{N-d} \sum_{i=1}^{N-d} (P_i - \bar{P}')(P_{i+d} - \bar{P}')}{\frac{1}{N} \sum_{i=1}^{N} (P_i -\bar{P}')^2} \quad d = 1, 2, \ldots, 30\]

where \(d\), \(P_i\), and \(P_{i+d}\) are defined in the same way as inthe first place; \(\bar{P}'\) isthe considered property \(P\) along thesequence, i.e.,

\[\bar{P}' = \frac{\sum_{i=1}^N P_i}{N}\]

\(d\), \(P\), \(P_i\) and \(P_{i+d}\), \(\textrm{nlag}\) have the same meaning asabove.

With extractMoran() (which has the identical parametersas extractMoreauBroto()), we can compute the Moranautocorrelation descriptors (only print out the first 36 elements):

# Use the 3 custom properties defined before# and 4 properties in the AAindex databasemoran <- extractMoran( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ))head(moran, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.062895724 -0.044827681 -0.045065117 -0.055955678 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.060586377 -0.074128412 -0.067308852 -0.001293384 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> -0.033747588 0.029392193 0.061789800 -0.023368437 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.062769417 -0.024912264 -0.028298043 -0.031584063 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> -0.043466730 0.047830694 0.005883901 -0.001769769 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.049334048 0.096427969 0.015147594 -0.060092509 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.007549152 -0.033987885 0.056307675 -0.061844453 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.021484780 -0.008461776 0.014229951 -0.009142419 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> -0.003272262 -0.109613332 0.033346233 -0.141538598

Geary autocorrelation descriptors

Geary autocorrelation descriptors for protein sequences can bedefined as:

\[C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} (P_i -P_{i+d})^2}{\frac{1}{N-1} \sum_{i=1}^{N} (P_i - \bar{P}')^2} \quad d= 1, 2, \ldots, 30\]

where \(d\), \(P\), \(P_i\), \(P_{i+d}\), and \(\textrm{nlag}\) have the same meaning asabove.

For each amino acid index, there will be \(3 \times \textrm{nlag}\) autocorrelationdescriptors. The usage of extractGeary() is exactly thesame as extractMoreauBroto() andextractMoran():

# Use the 3 custom properties defined before# and 4 properties in the AAindex databasegeary <- extractGeary( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ))head(geary, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.9361830 1.0442920 1.0452843 1.0563467 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.9406031 1.0765517 1.0675786 0.9991363 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> 1.0316555 0.9684585 0.9353130 1.0201990 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.9340933 1.0207373 1.0251486 1.0290464 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> 1.0414375 0.9494403 0.9905987 0.9987183 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.9472542 0.9010009 0.9828848 1.0574098 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.9897955 1.0290018 0.9400066 1.0584150 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.9762904 1.0029734 0.9818711 1.0051730 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> 0.9967069 1.1012905 0.9595859 1.1337056

Composition/Transition/Distribution

The CTD descriptors are developed by Dubchaket al. (1995) and Dubchak et al.(1999).

R package for generating various numerical representation schemes of protein sequences (3)

Figure 3: The sequence of a hypothetic protein indicating theconstruction of composition, transition, and distribution descriptors ofa protein. The sequence index indicates the position of an amino acid inthe sequence. The index for each type of amino acid in the sequence(1, 2 or 3) indicates theposition of the first, second, third, … of that type of amino acid. 1/2transition indicates the position of 12 or 21pairs in the sequence (1/3 and 2/3 are defined similarly).

Step 1: Sequence Encoding

The amino acids are categorized into three classes according to theirattributes, and each amino acid is encoded by one of the indices 1, 2, 3according to which class it belongs. The attributes used here includehydrophobicity, normalized van der Waals volume, polarity, andpolarizability. The corresponding classification for each attribute islisted in Table 1.

Table 1: Amino acid attributes and the three-groupclassification of the 20 amino acids by each attribute
Group 1Group 2Group 3
HydrophobicityPolarNeutralHydrophobicity
R, K, E, D, Q, NG, A, S, T, P, H, YC, L, V, I, M, F, W
Normalized van der Waals Volume0-2.782.95-4.04.03-8.08
G, A, S, T, P, D, CN, V, E, Q, I, LM, H, K, F, R, Y, W
Polarity4.9-6.28.0-9.210.4-13.0
L, I, F, W, C, M, V, YP, A, T, G, SH, Q, R, K, N, E, D
Polarizability0-1.080.128-0.1860.219-0.409
G, A, S, D, TC, P, N, V, E, Q, I, LK, M, H, F, R, Y, W
ChargePositiveNeutralNegative
K, RA, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, VD, E
Secondary StructureHelixStrandCoil
E, A, L, M, Q, K, R, HV, I, Y, C, W, F, TG, N, P, S, D
Solvent AccessibilityBuriedExposedIntermediate
A, L, F, C, G, I, V, WR, K, Q, E, N, DM, S, P, T, H, Y

For example, for a given sequence MTEITAAMVKELRESTGAGA,it will be encoded as 32132223311311222222 according to itshydrophobicity.

Step 2: Compute Composition, Transition, and DistributionDescriptors

Three types of descriptors, Composition (C),Transition (T), and Distribution (D)can be calculated for a given attribute as follows.

Composition

Composition is defined as the global percentage for each encodedclass in the protein sequence. In the above example, using thehydrophobicity classification, the numbers for encoded classes1, 2, 3 are 5, 10, and 5, so thatthe compositions for them will be \(5/20=25\%\), \(10/20=10\%\), and \(5/20=25\%\), where 20 is the length of theprotein sequence. The composition descriptor can be expressed as

\[C_r = \frac{n_r}{n} \quad r = 1, 2, 3\]

where \(n_r\) is the number of aminoacid type \(r\) in the encodedsequence; \(N\) is the length of thesequence. An example for extractCTDC():

extractCTDC(x)#> hydrophobicity.Group1 hydrophobicity.Group2 hydrophobicity.Group3 #> 0.29715302 0.40569395 0.29715302 #> normwaalsvolume.Group1 normwaalsvolume.Group2 normwaalsvolume.Group3 #> 0.45195730 0.29715302 0.25088968 #> polarity.Group1 polarity.Group2 polarity.Group3 #> 0.33985765 0.33274021 0.32740214 #> polarizability.Group1 polarizability.Group2 polarizability.Group3 #> 0.33096085 0.41814947 0.25088968 #> charge.Group1 charge.Group2 charge.Group3 #> 0.11032028 0.79003559 0.09964413 #> secondarystruct.Group1 secondarystruct.Group2 secondarystruct.Group3 #> 0.38967972 0.29537367 0.31494662 #> solventaccess.Group1 solventaccess.Group2 solventaccess.Group3 #> 0.43060498 0.29715302 0.27224199

The result shows the elements that are named asPropertyNumber.GroupNumber in the returned vector.

Transition

A transition from class 1 to 2 is the percent frequency with which 1is followed by 2 or 2 is followed by 1 in the encoded sequences. Thetransition descriptor can be computed as

\[T_{rs} = \frac{n_{rs} + n_{sr}}{N - 1} \quad rs = \text{12}, \text{13},\text{23}\]

where \(n_{rs}\), \(n_{sr}\) are the numbers of dipeptideencoded as rs and sr in the sequence; \(N\) is the length of the sequence. Anexample for extractCTDT():

extractCTDT(x)#> prop1.Tr1221 prop1.Tr1331 prop1.Tr2332 prop2.Tr1221 prop2.Tr1331 prop2.Tr2332 #> 0.27094474 0.16042781 0.23351159 0.26737968 0.22638146 0.17112299 #> prop3.Tr1221 prop3.Tr1331 prop3.Tr2332 prop4.Tr1221 prop4.Tr1331 prop4.Tr2332 #> 0.21033868 0.20499109 0.23707665 0.27272727 0.15151515 0.24598930 #> prop5.Tr1221 prop5.Tr1331 prop5.Tr2332 prop6.Tr1221 prop6.Tr1331 prop6.Tr2332 #> 0.18181818 0.02139037 0.15686275 0.21925134 0.22816399 0.15864528 #> prop7.Tr1221 prop7.Tr1331 prop7.Tr2332 #> 0.25133690 0.21568627 0.18003565

Distribution

The distribution descriptor describes the distribution ofeach attribute in the sequence.

There are five “distribution” descriptors for each attribute, andthey are the position percent in the whole sequence for the firstresidue, 25% residues, 50% residues, 75% residues, and 100% residues fora certain encoded class. For example, there are 10 residues encoded as2 in the above example, the positions for the first residue2, the 2nd residue 2 (25% * 10 = 2), the 5th2 residue (50% * 10 = 5), the 7th 2 (75% * 10= 7) and the 10th residue 2 (100% * 10) in the encodedsequence are 2, 5, 15, 17, 20, so that the distribution descriptors for2 are: 10.0 (2/20 * 100), 25.0 (5/20 * 100), 75.0 (15/20 *100), 85.0 (17/20 * 100), 100.0 (20/20 * 100).

To compute the distribution descriptor, useextractCTDD():

extractCTDD(x)#> prop1.G1.residue0 prop1.G1.residue25 prop1.G1.residue50 prop1.G1.residue75 #> 0.3558719 23.1316726 50.1779359 73.8434164 #> prop1.G1.residue100 prop1.G2.residue0 prop1.G2.residue25 prop1.G2.residue50 #> 99.8220641 0.5338078 27.4021352 47.3309609 #> prop1.G2.residue75 prop1.G2.residue100 prop1.G3.residue0 prop1.G3.residue25 #> 75.2669039 100.0000000 0.1779359 19.5729537 #> prop1.G3.residue50 prop1.G3.residue75 prop1.G3.residue100 prop2.G1.residue0 #> 51.7793594 75.6227758 99.6441281 0.3558719 #> prop2.G1.residue25 prop2.G1.residue50 prop2.G1.residue75 prop2.G1.residue100 #> 25.6227758 48.0427046 75.4448399 100.0000000 #> prop2.G2.residue0 prop2.G2.residue25 prop2.G2.residue50 prop2.G2.residue75 #> 1.4234875 23.3096085 54.4483986 76.3345196 #> prop2.G2.residue100 prop2.G3.residue0 prop2.G3.residue25 prop2.G3.residue50 #> 99.4661922 0.1779359 22.7758007 48.9323843 #> prop2.G3.residue75 prop2.G3.residue100 prop3.G1.residue0 prop3.G1.residue25 #> 69.5729537 99.8220641 0.1779359 20.9964413 #> prop3.G1.residue50 prop3.G1.residue75 prop3.G1.residue100 prop3.G2.residue0 #> 50.8896797 74.5551601 99.6441281 0.5338078 #> prop3.G2.residue25 prop3.G2.residue50 prop3.G2.residue75 prop3.G2.residue100 #> 26.5124555 46.2633452 75.4448399 100.0000000 #> prop3.G3.residue0 prop3.G3.residue25 prop3.G3.residue50 prop3.G3.residue75 #> 0.3558719 24.1992883 50.5338078 73.8434164 #> prop3.G3.residue100 prop4.G1.residue0 prop4.G1.residue25 prop4.G1.residue50 #> 99.8220641 0.3558719 26.5124555 48.3985765 #> prop4.G1.residue75 prop4.G1.residue100 prop4.G2.residue0 prop4.G2.residue25 #> 76.1565836 99.2882562 1.4234875 21.5302491 #> prop4.G2.residue50 prop4.G2.residue75 prop4.G2.residue100 prop4.G3.residue0 #> 51.4234875 75.8007117 100.0000000 0.1779359 #> prop4.G3.residue25 prop4.G3.residue50 prop4.G3.residue75 prop4.G3.residue100 #> 22.7758007 48.9323843 69.5729537 99.8220641 #> prop5.G1.residue0 prop5.G1.residue25 prop5.G1.residue50 prop5.G1.residue75 #> 0.8896797 20.8185053 48.9323843 69.5729537 #> prop5.G1.residue100 prop5.G2.residue0 prop5.G2.residue25 prop5.G2.residue50 #> 99.8220641 0.1779359 24.9110320 49.1103203 #> prop5.G2.residue75 prop5.G2.residue100 prop5.G3.residue0 prop5.G3.residue25 #> 75.2669039 100.0000000 0.3558719 26.1565836 #> prop5.G3.residue50 prop5.G3.residue75 prop5.G3.residue100 prop6.G1.residue0 #> 64.2348754 77.4021352 99.2882562 0.1779359 #> prop6.G1.residue25 prop6.G1.residue50 prop6.G1.residue75 prop6.G1.residue100 #> 22.9537367 50.8896797 74.3772242 99.8220641 #> prop6.G2.residue0 prop6.G2.residue25 prop6.G2.residue50 prop6.G2.residue75 #> 1.6014235 21.5302491 49.2882562 70.8185053 #> prop6.G2.residue100 prop6.G3.residue0 prop6.G3.residue25 prop6.G3.residue50 #> 98.9323843 0.3558719 29.0035587 48.2206406 #> prop6.G3.residue75 prop6.G3.residue100 prop7.G1.residue0 prop7.G1.residue25 #> 77.4021352 100.0000000 0.5338078 23.4875445 #> prop7.G1.residue50 prop7.G1.residue75 prop7.G1.residue100 prop7.G2.residue0 #> 50.0000000 74.5551601 98.9323843 0.3558719 #> prop7.G2.residue25 prop7.G2.residue50 prop7.G2.residue75 prop7.G2.residue100 #> 23.1316726 50.1779359 73.8434164 99.8220641 #> prop7.G3.residue0 prop7.G3.residue25 prop7.G3.residue50 prop7.G3.residue75 #> 0.1779359 27.2241993 48.0427046 75.4448399 #> prop7.G3.residue100 #> 100.0000000

Conjoint triad descriptors

Conjoint triad descriptors are proposed by Shen et al. (2007). The conjoint triaddescriptors were used to model protein-protein interactions based on theclassification of amino acids. In this approach, each protein sequenceis represented by a vector space consisting of descriptors of aminoacids. To reduce the dimensions of vector space, the 20 amino acids wereclustered into several classes according to their dipoles and volumes ofthe side chains. The conjoint triad descriptors are calculated asfollows:

Step 1: Classification of Amino Acids

Electrostatic and hydrophobic interactions dominate protein-proteininteractions. These two kinds of interactions may be reflected by thedipoles and volumes of the side chains of amino acids, respectively.Accordingly, these two parameters were calculated using thedensity-functional theory method B3LYP/6-31G and the molecular modelingapproach. Based on the dipoles and volumes of the side chains, the 20amino acids can be clustered into seven classes (See Table 2). Aminoacids within the same class likely involve synonymous mutations becauseof their similar characteristics.

Table 2: Classification of amino acids based on dipoles andvolumes of the side chains
No.Dipole Scale\(^1\)Volume Scale\(^2\)Class
1\(-\)\(-\)Ala, Gly, Val
2\(-\)\(+\)Ile, Leu, Phe, Pro
3\(+\)\(+\)Tyr, Met, Thr, Ser
4\(++\)\(+\)His, Asn, Gln, Tpr
5\(+++\)\(+\)Arg, Lys
6\(+'+'+'\)\(+\)Asp, Glu
7\(+^3\)\(+\)Cys

1 Dipole Scale (Debye): \(-\), Dipole < 1.0; \(+\), 1.0 < Dipole < 2.0; \(++\), 2.0 < Dipole < 3.0; \(+++\), Dipole > 3.0; \(+'+'+'\), Dipole > 3.0 withopposite orientation.

2 Volume Scale (\(\overset{\lower.5em\circ}{\mathrm{A}}\lower.01em^3\)):\(-\), Volume < 50; \(+\), Volume > 50.

3 Cys is separated from class 3 because of its ability toform disulfide bonds.

Step 2: Conjoint Triad Calculation

The conjoint triad descriptors considered the properties of one aminoacid and its vicinal amino acids and regarded any three continuous aminoacids as a unit. Thus, the triads can be differentiated according to theclasses of amino acids, i.e., triads composed of three amino acidsbelonging to the same classes, such as ART and VKS, can be treatedidentically. To conveniently represent a protein, we first use a binaryspace \((\mathbf{V}, \mathbf{F})\) torepresent a protein sequence. Here, \(\mathbf{V}\) is the vector space of thesequence features, and each feature \(v_i\) represents a sort of triad type;\(\mathbf{F}\) is the frequency vectorcorresponding to \(\mathbf{V}\), andthe value of the \(i\)-th dimension of\(\mathbf{F} (f_i)\) is the frequencyof type \(v_i\) appearing in theprotein sequence. For the amino acids that have been categorized intoseven classes, the size of \(\mathbf{V}\) should be \(7 \times 7 \times 7\); thus \(i = 1, 2, \ldots, 343\). The detaileddescription for (\(\mathbf{V}\), \(\mathbf{F}\)) is illustrated in Figure4.

R package for generating various numerical representation schemes of protein sequences (4)

Figure 4: Schematic diagram for constructing the vector space (\(\mathbf{V}\), \(\mathbf{F}\)) of protein sequences. \(\mathbf{V}\) is the vector space of thesequence features; each feature (\(v_i\)) represents a triad composed of threeconsecutive amino acids; \(\mathbf{F}\)is the frequency vector corresponding to \(\mathbf{V}\), and the value of the \(i\)-th dimension of \(\mathbf{F} (f_i)\) is the frequency that\(v_i\) triad appeared in the proteinsequence.

Clearly, each protein correlates to the length (number of aminoacids) of protein. In general, a long protein would have a large valueof \(f_i\), which complicates thecomparison between two heterogeneous proteins. Thus, we defined a newparameter, \(d_i\), by normalizing\(f_i\) with the followingequation:

\[d_i = \frac{f_i - \min\{\,f_1, f_2 , \ldots, f_{343}\,\}}{\max\{\,f_1,f_2, \ldots, f_{343}\,\}}\]

The numerical value of \(d_i\) ofeach protein ranges from 0 to 1, enabling the comparison betweenproteins. Accordingly, we obtain another vector space (designated \(\mathbf{D}\)) consisting of \(d_i\) to represent protein.

To compute conjoint triads of protein sequences, we can use:

ctriad <- extractCTriad(x)head(ctriad, n = 65L)#> VS111 VS211 VS311 VS411 VS511 VS611 VS711 VS121 VS221 VS321 VS421 VS521 VS621 #> 0.1 0.3 0.6 0.2 0.4 0.0 0.3 1.0 0.6 0.5 0.0 0.2 0.3 #> VS721 VS131 VS231 VS331 VS431 VS531 VS631 VS731 VS141 VS241 VS341 VS441 VS541 #> 0.0 0.2 0.4 0.5 0.2 0.3 0.3 0.1 0.3 0.3 0.2 0.2 0.0 #> VS641 VS741 VS151 VS251 VS351 VS451 VS551 VS651 VS751 VS161 VS261 VS361 VS461 #> 0.1 0.2 0.2 0.2 0.5 0.1 0.2 0.0 0.0 0.1 0.4 0.2 0.3 #> VS561 VS661 VS761 VS171 VS271 VS371 VS471 VS571 VS671 VS771 VS112 VS212 VS312 #> 0.2 0.0 0.1 0.1 0.3 0.1 0.0 0.1 0.0 0.1 0.8 0.4 0.4 #> VS412 VS512 VS612 VS712 VS122 VS222 VS322 VS422 VS522 VS622 VS722 VS132 VS232 #> 0.6 0.1 0.5 0.2 0.8 0.5 0.2 0.3 0.2 0.0 0.2 0.1 0.3

by which we only outputted the first 65 of 343 dimensions to savespace.

Quasi-sequence-order descriptors

The quasi-sequence-order descriptors are proposed by Chou (2000). They are derived from the distancematrix between the 20 amino acids.

Sequence-order-coupling number

The \(d\)-th ranksequence-order-coupling number is defined as:

\[\tau_d = \sum_{i=1}^{N-d} (d_{i, i+d})^2 \quad d = 1, 2, \ldots,\textrm{maxlag}\]

where \(d_{i, i+d}\) is the distancebetween the two amino acids at position \(i\) and \(i+d\).

Note: maxlag is the maximum lag and the length ofthe protein must be not less than \(\textrm{maxlag}\).

The function extractSOCN() is used for computing thesequence-order-coupling numbers:

extractSOCN(x)#> Schneider.lag1 Schneider.lag2 Schneider.lag3 Schneider.lag4 Schneider.lag5 #> 204.2036 199.8708 206.8102 197.4828 193.3366 #> Schneider.lag6 Schneider.lag7 Schneider.lag8 Schneider.lag9 Schneider.lag10 #> 208.1936 195.5476 200.9789 196.7110 193.9931 #> Schneider.lag11 Schneider.lag12 Schneider.lag13 Schneider.lag14 Schneider.lag15 #> 199.7031 204.9389 187.0140 198.4702 205.4526 #> Schneider.lag16 Schneider.lag17 Schneider.lag18 Schneider.lag19 Schneider.lag20 #> 193.1274 187.3529 190.4949 202.8853 198.5299 #> Schneider.lag21 Schneider.lag22 Schneider.lag23 Schneider.lag24 Schneider.lag25 #> 191.1013 185.0074 189.9857 202.7113 201.6267 #> Schneider.lag26 Schneider.lag27 Schneider.lag28 Schneider.lag29 Schneider.lag30 #> 194.5770 185.9939 204.1297 191.1629 183.9073 #> Grantham.lag1 Grantham.lag2 Grantham.lag3 Grantham.lag4 Grantham.lag5 #> 6674686.0000 6761609.0000 7138892.0000 6748261.0000 6291229.0000 #> Grantham.lag6 Grantham.lag7 Grantham.lag8 Grantham.lag9 Grantham.lag10 #> 6839853.0000 6594164.0000 6556148.0000 6620183.0000 6770614.0000 #> Grantham.lag11 Grantham.lag12 Grantham.lag13 Grantham.lag14 Grantham.lag15 #> 6495689.0000 6865537.0000 6297267.0000 6498247.0000 6615566.0000 #> Grantham.lag16 Grantham.lag17 Grantham.lag18 Grantham.lag19 Grantham.lag20 #> 6572680.0000 6569081.0000 6173947.0000 6570829.0000 6471308.0000 #> Grantham.lag21 Grantham.lag22 Grantham.lag23 Grantham.lag24 Grantham.lag25 #> 6461649.0000 5939432.0000 6532121.0000 6652472.0000 6480660.0000 #> Grantham.lag26 Grantham.lag27 Grantham.lag28 Grantham.lag29 Grantham.lag30 #> 6382281.0000 6276521.0000 6537634.0000 6442991.0000 6350157.0000

Users can also specify the maximum lag value with thenlag argument.

Note: Besides the Schneider-Wrede physicochemicaldistance matrix (Schneider and Wrede 1994)used by Kuo-Chen Chou, another chemical distance matrix by Grantham (1974) is also used here. So thedescriptors dimension will be nlag * 2. Thequasi-sequence-order descriptors described next also utilized the twomatrices.

Quasi-sequence-order descriptors

For each amino acid type, a quasi-sequence-order descriptor can bedefined as:

\[X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{\textrm{maxlag}}\tau_d} \quad r = 1, 2, \ldots, 20\]

where \(f_r\) is the normalizedoccurrence for amino acid type \(i\)and \(w\) is a weighting factor (\(w=0.1\)). These are the first 20quasi-sequence-order descriptors. The other 30 quasi-sequence-order aredefined as:

\[X_d = \frac{w \tau_{d-20}}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{\textrm{maxlag}} \tau_d} \quad d = 21, 22, \ldots, 20 +\textrm{maxlag}\]

R package for generating various numerical representation schemes of protein sequences (5)

Figure 5: A schematic drawing to show (a) the 1st-rank, (b) the2nd-rank, and (3) the 3rd-rank sequence-order-coupling mode along aprotein sequence. (a) Reflects the coupling mode between all the mostcontiguous residues, (b) that between all the 2nd most contiguousresidues, and (c) that between all the 3rd most contiguous residues.This figure is from Chou (2000).

A minimal example for extractQSO():

extractQSO(x)#> Schneider.Xr.A Schneider.Xr.R Schneider.Xr.N Schneider.Xr.D Schneider.Xr.C #> 6.096218e-02 6.773576e-02 3.725467e-02 4.910842e-02 6.434897e-02 #> Schneider.Xr.E Schneider.Xr.Q Schneider.Xr.G Schneider.Xr.H Schneider.Xr.I #> 4.572164e-02 4.572164e-02 7.789612e-02 2.878770e-02 3.386788e-02 #> Schneider.Xr.L Schneider.Xr.K Schneider.Xr.M Schneider.Xr.F Schneider.Xr.P #> 7.281594e-02 3.725467e-02 1.185376e-02 3.048109e-02 5.080182e-02 #> Schneider.Xr.S Schneider.Xr.T Schneider.Xr.W Schneider.Xr.Y Schneider.Xr.V #> 8.466970e-02 4.233485e-02 2.201412e-02 4.064145e-02 4.741503e-02 #> Grantham.Xr.A Grantham.Xr.R Grantham.Xr.N Grantham.Xr.D Grantham.Xr.C #> 1.835033e-06 2.038926e-06 1.121409e-06 1.478221e-06 1.936980e-06 #> Grantham.Xr.E Grantham.Xr.Q Grantham.Xr.G Grantham.Xr.H Grantham.Xr.I #> 1.376275e-06 1.376275e-06 2.344765e-06 8.665435e-07 1.019463e-06 #> Grantham.Xr.L Grantham.Xr.K Grantham.Xr.M Grantham.Xr.F Grantham.Xr.P #> 2.191845e-06 1.121409e-06 3.568120e-07 9.175167e-07 1.529194e-06 #> Grantham.Xr.S Grantham.Xr.T Grantham.Xr.W Grantham.Xr.Y Grantham.Xr.V #> 2.548657e-06 1.274329e-06 6.626509e-07 1.223356e-06 1.427248e-06 #> Schneider.Xd.1 Schneider.Xd.2 Schneider.Xd.3 Schneider.Xd.4 Schneider.Xd.5 #> 3.457972e-02 3.384600e-02 3.502111e-02 3.344162e-02 3.273951e-02 #> Schneider.Xd.6 Schneider.Xd.7 Schneider.Xd.8 Schneider.Xd.9 Schneider.Xd.10 #> 3.525537e-02 3.311390e-02 3.403364e-02 3.331093e-02 3.285068e-02 #> Schneider.Xd.11 Schneider.Xd.12 Schneider.Xd.13 Schneider.Xd.14 Schneider.Xd.15 #> 3.381760e-02 3.470422e-02 3.166883e-02 3.360882e-02 3.479121e-02 #> Schneider.Xd.16 Schneider.Xd.17 Schneider.Xd.18 Schneider.Xd.19 Schneider.Xd.20 #> 3.270408e-02 3.172623e-02 3.225829e-02 3.435647e-02 3.361893e-02 #> Schneider.Xd.21 Schneider.Xd.22 Schneider.Xd.23 Schneider.Xd.24 Schneider.Xd.25 #> 3.236099e-02 3.132904e-02 3.217206e-02 3.432701e-02 3.414334e-02 #> Schneider.Xd.26 Schneider.Xd.27 Schneider.Xd.28 Schneider.Xd.29 Schneider.Xd.30 #> 3.294954e-02 3.149609e-02 3.456720e-02 3.237140e-02 3.114275e-02 #> Grantham.Xd.1 Grantham.Xd.2 Grantham.Xd.3 Grantham.Xd.4 Grantham.Xd.5 #> 3.402298e-02 3.446605e-02 3.638918e-02 3.439801e-02 3.206838e-02 #> Grantham.Xd.6 Grantham.Xd.7 Grantham.Xd.8 Grantham.Xd.9 Grantham.Xd.10 #> 3.486488e-02 3.361253e-02 3.341875e-02 3.374516e-02 3.451195e-02 #> Grantham.Xd.11 Grantham.Xd.12 Grantham.Xd.13 Grantham.Xd.14 Grantham.Xd.15 #> 3.311057e-02 3.499580e-02 3.209915e-02 3.312361e-02 3.372162e-02 #> Grantham.Xd.16 Grantham.Xd.17 Grantham.Xd.18 Grantham.Xd.19 Grantham.Xd.20 #> 3.350302e-02 3.348467e-02 3.147055e-02 3.349358e-02 3.298629e-02 #> Grantham.Xd.21 Grantham.Xd.22 Grantham.Xd.23 Grantham.Xd.24 Grantham.Xd.25 #> 3.293706e-02 3.027516e-02 3.329628e-02 3.390974e-02 3.303396e-02 #> Grantham.Xd.26 Grantham.Xd.27 Grantham.Xd.28 Grantham.Xd.29 Grantham.Xd.30 #> 3.253250e-02 3.199340e-02 3.332438e-02 3.284195e-02 3.236875e-02

where users can also specify the maximum lag with the argumentnlag and the weighting factor with the argumentw.

Pseudo-amino acid composition (PseAAC)

This group of descriptors is proposed by Chou(2001). PseAAC descriptors are also named as the type 1pseudo-amino acid composition. Let \(H_1^o (i)\), \(H_2^o (i)\), \(M^o (i)\) (\(i=1,2, 3, \ldots, 20\)) be the original hydrophobicity values, theoriginal hydrophilicity values and the original side chain masses of the20 natural amino acids, respectively. They are converted to thefollowing qualities by a standard conversion:

\[H_1 (i) = \frac{H_1^o (i) - \frac{1}{20} \sum_{i=1}^{20} H_1^o(i)}{\sqrt{\frac{\sum_{i=1}^{20} [H_1^o (i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o (i) ]^2}{20}}}\]

\(H_2^o (i)\) and \(M^o (i)\) are normalized as \(H_2 (i)\) and \(M(i)\) in the same way.

R package for generating various numerical representation schemes of protein sequences (6)

Figure 6: A schematic drawing to show (a) the first-tier, (b) thesecond-tier, and (3) the third-tier sequence order correlation modealong a protein sequence. Panel (a) reflects the correlation modebetween all the most contiguous residues, panel (b) between all thesecond-most contiguous residues, and panel (c) between all thethird-most contiguous residues. This figure is from Chou (2001).

Then, a correlation function can be defined as

\[\Theta (R_i, R_j) = \frac{1}{3} \bigg\{ [ H_1 (R_i) - H_1 (R_j) ]^2 + [H_2 (R_i) - H_2 (R_j) ]^2 + [ M (R_i) - M (R_j) ]^2 \bigg\}\]

This correlation function is an average value for the three aminoacid properties: hydrophobicity, hydrophilicity, and side chain mass.Therefore, we can extend this definition of correlation functions forone amino acid property or for a set of \(n\) amino acid properties.

For one amino acid property, the correlation can be defined as:

\[\Theta (R_i, R_j) = [H_1 (R_i) - H_1(R_j)]^2\]

where \(H (R_i)\) is the amino acidproperty of amino acid \(R_i\) afterstandardization.

For a set of n amino acid properties, it can be defined as: where\(H_k (R_i)\) is the \(k\)-th property in the amino acid propertyset for amino acid \(R_i\).

\[\Theta (R_i, R_j) = \frac{1}{n} \sum_{k=1}^{n} [H_k (R_i) - H_k (R_j)]^2\]

where \(H_k(R_i)\) is the \(k\)-th property in the amino acid propertyset for amino acid \(R_i\).

A set of descriptors named sequence order-correlated factors aredefined as:

\[\begin{align*}\theta_1 & = \frac{1}{N-1} \sum_{i=1}^{N-1} \Theta (R_i, R_{i+1})\\\theta_2 & = \frac{1}{N-2} \sum_{i=1}^{N-2} \Theta (R_i, R_{i+2})\\\theta_3 & = \frac{1}{N-3} \sum_{i=1}^{N-3} \Theta (R_i, R_{i+3})\\ & \ldots \\\theta_\lambda & = \frac{1}{N-\lambda} \sum_{i=1}^{N-\lambda}\Theta (R_i, R_{i+\lambda})\end{align*}\]

\(\lambda\) (\(\lambda < L\)) is a parameter to bespecified. Let \(f_i\) be thenormalized occurrence frequency of the 20 amino acids in the proteinsequence, a set of \(20 + \lambda\)descriptors called the pseudo-amino acid composition for a proteinsequence can be defined as:

\[X_c = \frac{f_c}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{\lambda} \theta_j}\quad (1 < c < 20)\]

\[X_c = \frac{w \theta_{c-20}}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{\lambda} \theta_j} \quad (21 < c < 20 + \lambda)\]

where \(w\) is the weighting factorfor the sequence-order effect and is set to \(w = 0.05\) in protr as suggested byKuo-Chen Chou.

With extractPAAC(), we can compute the PseAACdescriptors directly:

extractPAAC(x)#> Xc1.A Xc1.R Xc1.N Xc1.D Xc1.C #> 9.07025432 10.07806035 5.54293319 7.30659376 9.57415734 #> Xc1.E Xc1.Q Xc1.G Xc1.H Xc1.I #> 6.80269074 6.80269074 11.58976941 4.28317565 5.03903018 #> Xc1.L Xc1.K Xc1.M Xc1.F Xc1.P #> 10.83391488 5.54293319 1.76366056 4.53512716 7.55854527 #> Xc1.S Xc1.T Xc1.W Xc1.Y Xc1.V #> 12.59757544 6.29878772 3.27536961 6.04683621 7.05464225 #> Xc2.lambda.1 Xc2.lambda.2 Xc2.lambda.3 Xc2.lambda.4 Xc2.lambda.5 #> 0.02514092 0.02500357 0.02527773 0.02553159 0.02445265 #> Xc2.lambda.6 Xc2.lambda.7 Xc2.lambda.8 Xc2.lambda.9 Xc2.lambda.10 #> 0.02561910 0.02486131 0.02506656 0.02553952 0.02437663 #> Xc2.lambda.11 Xc2.lambda.12 Xc2.lambda.13 Xc2.lambda.14 Xc2.lambda.15 #> 0.02491262 0.02533803 0.02351915 0.02479912 0.02548431 #> Xc2.lambda.16 Xc2.lambda.17 Xc2.lambda.18 Xc2.lambda.19 Xc2.lambda.20 #> 0.02478210 0.02513770 0.02457224 0.02543046 0.02500889 #> Xc2.lambda.21 Xc2.lambda.22 Xc2.lambda.23 Xc2.lambda.24 Xc2.lambda.25 #> 0.02476967 0.02342389 0.02431684 0.02610300 0.02626722 #> Xc2.lambda.26 Xc2.lambda.27 Xc2.lambda.28 Xc2.lambda.29 Xc2.lambda.30 #> 0.02457082 0.02343049 0.02588823 0.02490463 0.02451951

The extractPAAC() function also provides the additionalarguments props and customprops, which aresimilar to those arguments for Moreau-Broto/Moran/Geary autocorrelationdescriptors. For their minor differences, please see?extracPAAC. Users can specify the lambda parameter and theweighting factor with the arguments lambda andw.

Note: In the work of Kuo-Chen Chou, the definitionfor “normalized occurrence frequency” was not given. In this work, wedefine it as the occurrence frequency of amino acids in the sequencenormalized to 100%, and hence, our calculated values are not the same astheir values.

Amphiphilic pseudo-amino acid composition (APseAAC)

Amphiphilic Pseudo-Amino Acid Composition (APseAAC) was proposed inChou (2001). APseAAC is also recognized asthe type 2 pseudo-amino acid composition. The definitions ofthese qualities are similar to the PAAC descriptors. From \(H_1 (i)\) and \(H_2 (j)\) defined before, thehydrophobicity and hydrophilicity correlation functions are definedas:

\[\begin{align*}H_{i, j}^1 & = H_1 (i) H_1 (j)\\H_{i, j}^2 & = H_2 (i) H_2 (j)\end{align*}\]

From these qualities, sequence order factors can be defined as:

\[\begin{align*}\tau_1 & = \frac{1}{N-1} \sum_{i=1}^{N-1} H_{i, i+1}^1\\\tau_2 & = \frac{1}{N-1} \sum_{i=1}^{N-1} H_{i, i+1}^2\\\tau_3 & = \frac{1}{N-2} \sum_{i=1}^{N-2} H_{i, i+2}^1\\\tau_4 & = \frac{1}{N-2} \sum_{i=1}^{N-2} H_{i, i+2}^2\\ & \ldots \\\tau_{2 \lambda - 1} & = \frac{1}{N-\lambda} \sum_{i=1}^{N-\lambda}H_{i, i+\lambda}^1\\\tau_{2 \lambda} & = \frac{1}{N-\lambda} \sum_{i=1}^{N-\lambda}H_{i, i+\lambda}^2\end{align*}\]

R package for generating various numerical representation schemes of protein sequences (7)

Figure 7: A schematic diagram to show(a1/a2) the first-rank,(b1/b2) the second-rank and(c1/c2) the third-ranksequence-order-coupling mode along a protein sequence through ahydrophobicity/hydrophilicity correlation function, where \(H_{i, j}^1\) and \(H_{i, j}^2\) are given by Equation (3).Panel (a1/a2) reflects the coupling mode between all the most contiguousresidues, panel (b1/b2) between all the second-most contiguous residues,and panel (c1/c2) between all the third-most contiguous residues. Thisfigure is from Chou (2005).

Then a set of descriptors called Amphiphilic Pseudo-Amino AcidComposition (APseAAC) are defined as:

\[P_c = \frac{f_c}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{2 \lambda} \tau_j}\quad (1 < c < 20)\]

\[P_c = \frac{w \tau_u}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{2 \lambda}\tau_j} \quad (21 < u < 20 + 2 \lambda)\]

where \(w\) is the weighting factor,its default value is set to \(w = 0.5\)in protr.

A minimal example for extracAPAAC() is:

extractAPAAC(x)#> Pc1.A Pc1.R Pc1.N #> 3.537412e+01 3.930458e+01 2.161752e+01 #> Pc1.D Pc1.C Pc1.E #> 2.849582e+01 3.733935e+01 2.653059e+01 #> Pc1.Q Pc1.G Pc1.H #> 2.653059e+01 4.520027e+01 1.670445e+01 #> Pc1.I Pc1.L Pc1.K #> 1.965229e+01 4.225242e+01 2.161752e+01 #> Pc1.M Pc1.F Pc1.P #> 6.878302e+00 1.768706e+01 2.947844e+01 #> Pc1.S Pc1.T Pc1.W #> 4.913073e+01 2.456536e+01 1.277399e+01 #> Pc1.Y Pc1.V Pc2.Hydrophobicity.1 #> 2.358275e+01 2.751321e+01 2.196320e-04 #> Pc2.Hydrophilicity.1 Pc2.Hydrophobicity.2 Pc2.Hydrophilicity.2 #> 1.025766e-03 -3.088876e-04 -1.834385e-04 #> Pc2.Hydrophobicity.3 Pc2.Hydrophilicity.3 Pc2.Hydrophobicity.4 #> 1.174146e-03 7.400156e-04 -1.105715e-03 #> Pc2.Hydrophilicity.4 Pc2.Hydrophobicity.5 Pc2.Hydrophilicity.5 #> -4.493680e-04 1.766358e-03 1.471212e-03 #> Pc2.Hydrophobicity.6 Pc2.Hydrophilicity.6 Pc2.Hydrophobicity.7 #> -1.441572e-03 -4.913600e-03 -1.678053e-05 #> Pc2.Hydrophilicity.7 Pc2.Hydrophobicity.8 Pc2.Hydrophilicity.8 #> 7.312356e-04 -1.885399e-03 -1.928708e-03 #> Pc2.Hydrophobicity.9 Pc2.Hydrophilicity.9 Pc2.Hydrophobicity.10 #> -2.931177e-03 -1.555660e-03 2.916597e-03 #> Pc2.Hydrophilicity.10 Pc2.Hydrophobicity.11 Pc2.Hydrophilicity.11 #> 3.602591e-03 1.055082e-04 8.697920e-04 #> Pc2.Hydrophobicity.12 Pc2.Hydrophilicity.12 Pc2.Hydrophobicity.13 #> -9.276413e-04 -2.001384e-03 1.705044e-03 #> Pc2.Hydrophilicity.13 Pc2.Hydrophobicity.14 Pc2.Hydrophilicity.14 #> 4.364007e-03 7.883453e-04 -9.441693e-04 #> Pc2.Hydrophobicity.15 Pc2.Hydrophilicity.15 Pc2.Hydrophobicity.16 #> -3.133437e-04 -3.599332e-03 3.689079e-05 #> Pc2.Hydrophilicity.16 Pc2.Hydrophobicity.17 Pc2.Hydrophilicity.17 #> 2.483867e-03 4.832798e-04 2.465788e-03 #> Pc2.Hydrophobicity.18 Pc2.Hydrophilicity.18 Pc2.Hydrophobicity.19 #> -3.142728e-04 2.021961e-03 6.421283e-05 #> Pc2.Hydrophilicity.19 Pc2.Hydrophobicity.20 Pc2.Hydrophilicity.20 #> -8.896690e-04 -2.986886e-04 9.304039e-04 #> Pc2.Hydrophobicity.21 Pc2.Hydrophilicity.21 Pc2.Hydrophobicity.22 #> -6.777458e-04 1.646818e-03 3.193506e-03 #> Pc2.Hydrophilicity.22 Pc2.Hydrophobicity.23 Pc2.Hydrophilicity.23 #> 3.270656e-03 2.533569e-03 2.478252e-03 #> Pc2.Hydrophobicity.24 Pc2.Hydrophilicity.24 Pc2.Hydrophobicity.25 #> -2.489106e-03 -1.031008e-03 -3.992322e-03 #> Pc2.Hydrophilicity.25 Pc2.Hydrophobicity.26 Pc2.Hydrophilicity.26 #> -2.596060e-03 8.690771e-04 -1.221378e-03 #> Pc2.Hydrophobicity.27 Pc2.Hydrophilicity.27 Pc2.Hydrophobicity.28 #> 5.208649e-03 4.617400e-03 -1.088584e-03 #> Pc2.Hydrophilicity.28 Pc2.Hydrophobicity.29 Pc2.Hydrophilicity.29 #> -2.512263e-03 1.387641e-03 2.060890e-03 #> Pc2.Hydrophobicity.30 Pc2.Hydrophilicity.30 #> 3.177340e-04 1.451909e-03

This function has the same arguments asextractPAAC().

Profile-based descriptors

The profile-based descriptors for protein sequences are available inthe protr package. The feature vectors of profile-based methods werebased on the PSSM by running PSI-BLAST and often show good performance.See Ye, Wang, and Altschul (2011) andRangwala and Karypis (2005) for details.The functions extractPSSM(), extractPSSMAcc(),and extractPSSMFeature() are used to generate thesedescriptors. Users need to install the NCBI-BLAST+ software packagefirst to make the functions fully functional.

Proteochemometric modeling descriptors

Proteochemometric (PCM) modeling utilizes statistical modeling tomodel the ligand-target interaction space. The below descriptorsimplemented in protr are extensively used in proteochemometricmodeling.

  • Scales-based descriptors derived by Principal ComponentsAnalysis

    • Scales-based descriptors derived by Amino Acid Properties fromAAindex (Protein Fingerprint)
    • Scales-based descriptors derived by 20+ classes of 2D and 3Dmolecular descriptors (Topological, WHIM, VHSE, etc.)
  • Scales-based descriptors derived by Factor Analysis

  • Scales-based descriptors derived by MultidimensionalScaling

  • BLOSUM and PAM matrix-derived descriptors

Note that every scales-based descriptor function can freely combinemore than 20 classes of 2D and 3D molecular descriptors to constructhighly customized scales-based descriptors. Of course, these functionsare designed to be flexible enough that users can provide totallyself-defined property matrices to construct scales-baseddescriptors.

For example, to compute the “topological scales” derived by PCA(using the first 5 principal components), one can useextractDescScales():

x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr"))[[1]]descscales <- extractDescScales( x, propmat = "AATopo", index = c(37:41, 43:47), pc = 5, lag = 7, silent = FALSE)#> Summary of the first 5 principal components: #> PC1 PC2 PC3 PC4 PC5#> Standard deviation 2.581537 1.754133 0.4621854 0.1918666 0.08972087#> Proportion of Variance 0.666430 0.307700 0.0213600 0.0036800 0.00080000#> Cumulative Proportion 0.666430 0.974130 0.9954900 0.9991700 0.99998000

The propmat argument invokes the AATopodataset shipped with the protr package. The argument indexselects the 37 to 41 and the 43 to 47 columns (molecular descriptors) inthe AATopo dataset to use. The parameter lagwas set for the Auto Cross Covariance (ACC) to generate scales-baseddescriptors of the same length. At last, we printed the summary of thefirst 5 principal components (standard deviation, proportion ofvariance, cumulative proportion of variance).

The result is a length 175 named vector, which is consistent with thedescriptors before:

length(descscales)#> [1] 175head(descscales, 15)#> scl1.lag1 scl2.lag1 scl3.lag1 scl4.lag1 scl5.lag1 #> -2.645644e-01 -1.717847e-02 1.975438e-02 -7.930659e-05 -3.710597e-05 #> scl1.lag2 scl2.lag2 scl3.lag2 scl4.lag2 scl5.lag2 #> 3.548612e-01 1.343712e-01 5.699395e-03 -5.489472e-04 -6.364577e-05 #> scl1.lag3 scl2.lag3 scl3.lag3 scl4.lag3 scl5.lag3 #> 2.011431e-02 -9.211136e-02 -1.461755e-03 6.747801e-04 2.386782e-04

For another example, to compute the descriptors derived by theBLOSUM62 matrix and use the first 5 scales, one can use:

x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr"))[[1]]blosum <- extractBLOSUM( x, submat = "AABLOSUM62", k = 5, lag = 7, scale = TRUE, silent = FALSE)#> Relative importance of all the possible 20 scales: #> [1] 1.204960e+01 7.982007e+00 6.254364e+00 4.533706e+00 4.326286e+00#> [6] 3.850579e+00 3.752197e+00 3.538207e+00 3.139155e+00 2.546405e+00#> [11] 2.373286e+00 1.666259e+00 1.553126e+00 1.263685e+00 1.024699e+00#> [16] 9.630187e-01 9.225759e-01 7.221636e-01 1.020085e-01 -4.592648e-17

The result is a length 175 named vector:

length(blosum)#> [1] 175head(blosum, 15)#> scl1.lag1 scl2.lag1 scl3.lag1 scl4.lag1 scl5.lag1 #> 0.0042370555 -0.0021502057 0.0005993291 0.0006456375 0.0014849592 #> scl1.lag2 scl2.lag2 scl3.lag2 scl4.lag2 scl5.lag2 #> -0.0014919096 0.0032873726 0.0011734162 -0.0021758536 -0.0018127568 #> scl1.lag3 scl2.lag3 scl3.lag3 scl4.lag3 scl5.lag3 #> -0.0029413528 0.0001494193 0.0003298806 -0.0017877430 -0.0051044133

Dealing with gaps. In proteochemometrics (sequencealignment), gaps can be very useful since a gap in a certain positioncontains information. The protr package has built-in support for suchgaps. We deal with the gaps by using a dummy descriptor to code for the21st type of amino acid. The function extractScalesGap()and extractProtFPGap() can be used to deal with such gaps.See ?extractScalesGap and ?extractProtFPGapfor details.

Similarity calculation by sequence alignment

Similarity computation derived by local or global protein sequencealignment between a list of protein sequences is of great need inprotein research. However, this type of pairwise similarity computationis often computationally intensive, especially when many proteinsequences exist. Luckily, this process is also highly parallelizable.The protr package integrates the function of parallelized similaritycomputation derived by local or global protein sequence alignmentbetween a list of protein sequences.

The function twoSeqSim() calculates the alignment resultbetween two protein sequences. The function parSeqSim()calculates the pairwise similarity with a list of protein sequences inparallel:

s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]plist <- list(s1, s2, s3, s4, s5)psimmat <- parSeqSim(plist, cores = 4, type = "local", submat = "BLOSUM62")psimmat
## [,1] [,2] [,3] [,4] [,5]## [1,] 1.00000000 0.11825938 0.10236985 0.04921696 0.03943488## [2,] 0.11825938 1.00000000 0.18858241 0.12124217 0.06391103## [3,] 0.10236985 0.18858241 1.00000000 0.05819984 0.06175942## [4,] 0.04921696 0.12124217 0.05819984 1.00000000 0.05714638## [5,] 0.03943488 0.06391103 0.06175942 0.05714638 1.00000000

Similarly, the function crossSetSim() calculates thepairwise similarity between two sets of protein sequences inparallel.

Note that for a small number of proteins, calculating their pairwisesimilarity scores derived by sequence alignment in parallel may notsignificantly reduce the overall computation time since each task onlyrequires a relatively short time to finish. Thus, computationaloverheads may exist and affect the performance. In our benchmarks, weused about 1,000 protein sequences on 64 CPU cores and observedsignificant performance improvement compared to the sequentialcomputation.

Scaling up for a large number of sequences

batches. A useful argument for reducingmemory usage is batches. It splits the similaritycomputations into smaller batches. A larger batch number reduces thememory need for each parallel process but also adds the overhead offorking new processes. This is a trade-off between time (CPU) and space(memory). Also, use verbose = TRUE to print the computationprogress.

parSeqSimDisk(). For similaritycomputations over an even larger number of protein sequences, please tryparSeqSimDisk() and set batches to a largenumber. Unlike the in-memory version, this disk-based function cachesthe partial results in each batch to the hard drive and merges all theresults together in the end, which further reduces memory usage.

Example. Let’s assume we have 20,000 sequences. Bysetting cores = 64 and batches = 10000 inparSeqSimDisk(), the workload will be around20000^2/2/64/10000 = 312.5 alignments per batch per core.This could give you relatively quick feedback about whether the memorysize is going to be sufficient, and how long the computation will takein total after the first few batches.

Similarity calculation by GO semantic similarity measures

The protr package also integrates the function for similarity scorecomputation derived by Gene Ontology (GO) semantic similarity measuresbetween a list of GO terms or Entrez Gene IDs.

The function twoGOSim() calculates the similarityderived by GO terms semantic similarity measures between two GO terms /Entrez Gene IDs, and the function parGOSim() calculates thepairwise similarity with a list of GO terms / Entrez Gene IDs:

# By GO Termsgo1 <- c( "GO:0005215", "GO:0005488", "GO:0005515", "GO:0005625", "GO:0005802", "GO:0005905") # AP4B1go2 <- c( "GO:0005515", "GO:0005634", "GO:0005681", "GO:0008380", "GO:0031202") # BCAS2go3 <- c( "GO:0003735", "GO:0005622", "GO:0005840", "GO:0006412") # PDE4DIPgolist <- list(go1, go2, go3)parGOSim(golist, type = "go", ont = "CC", measure = "Wang")
## [,1] [,2] [,3]## [1,] 1.000 0.344 0.17## [2,] 0.344 1.000 0.24## [3,] 0.170 0.240 1.00
# By Entrez gene idgenelist <- list(c("150", "151", "152", "1814", "1815", "1816"))parGOSim(genelist, type = "gene", ont = "BP", measure = "Wang")
## 150 151 152 1814 1815 1816## 150 1.000 0.702 0.725 0.496 0.570 0.455## 151 0.702 1.000 0.980 0.456 0.507 0.551## 152 0.725 0.980 1.000 0.460 0.511 0.538## 1814 0.496 0.456 0.460 1.000 0.687 0.473## 1815 0.570 0.507 0.511 0.687 1.000 0.505## 1816 0.455 0.551 0.538 0.473 0.505 1.000

Miscellaneous tools

In this section, we will briefly introduce some useful tools theprotr package provides.

Retrieve protein sequences from UniProt

This function getUniProt() gets protein sequences fromuniprot.org by protein ID(s). The input ID is a charactervector specifying the protein ID(s). The returned sequences are storedin a list:

ids <- c("P00750", "P00751", "P00752")prots <- getUniProt(ids)prots
## [[1]]## [1] "MDAMKRGLCCVLLLCGAVFVSPSQEIHARFRRGARSYQVICRDEKTQMIYQQHQSWLRPVLRSNRVEYCWCN## SGRAQCHSVPVKSCSEPRCFNGGTCQQALYFSDFVCQCPEGFAGKCCEIDTRATCYEDQGISYRGTWSTAESGAECT## NWNSSALAQKPYSGRRPDAIRLGLGNHNYCRNPDRDSKPWCYVFKAGKYSSEFCSTPACSEGNSDCYFGNGSAYRGT## HSLTESGASCLPWNSMILIGKVYTAQNPSAQALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCSTCGLRQ## YSQPQFRIKGGLFADIASHPWQAAIFAKHRRSPGERFLCGGILISSCWILSAAHCFQERFPPHHLTVILGRTYRVVP## GEEEQKFEVEKYIVHKEFDDDTYDNDIALLQLKSDSSRCAQESSVVRTVCLPPADLQLPDWTECELSGYGKHEALSP## FYSERLKEAHVRLYPSSRCTSQHLLNRTVTDNMLCAGDTRSGGPQANLHDACQGDSGGPLVCLNDGRMTLVGIISWG## LGCGQKDVPGVYTKVTNYLDWIRDNMRP"#### [[2]]## [1] "MGSNLSPQLCLMPFILGLLSGGVTTTPWSLARPQGSCSLEGVEIKGGSFRLLQEGQALEYVCPSGFYPYPVQ## TRTCRSTGSWSTLKTQDQKTVRKAECRAIHCPRPHDFENGEYWPRSPYYNVSDEISFHCYDGYTLRGSANRTCQVNG## RWSGQTAICDNGAGYCSNPGIPIGTRKVGSQYRLEDSVTYHCSRGLTLRGSQRRTCQEGGSWSGTEPSCQDSFMYDT## PQEVAEAFLSSLTETIEGVDAEDGHGPGEQQKRKIVLDPSGSMNIYLVLDGSDSIGASNFTGAKKCLVNLIEKVASY## GVKPRYGLVTYATYPKIWVKVSEADSSNADWVTKQLNEINYEDHKLKSGTNTKKALQAVYSMMSWPDDVPPEGWNRT## RHVIILMTDGLHNMGGDPITVIDEIRDLLYIGKDRKNPREDYLDVYVFGVGPLVNQVNINALASKKDNEQHVFKVKD## MENLEDVFYQMIDESQSLSLCGMVWEHRKGTDYHKQPWQAKISVIRPSKGHESCMGAVVSEYFVLTAAHCFTVDDKE## HSIKVSVGGEKRDLEIEVVLFHPNYNINGKKEAGIPEFYDYDVALIKLKNKLKYGQTIRPICLPCTEGTTRALRLPP## TTTCQQQKEELLPAQDIKALFVSEEEKKLTRKEVYIKNGDKKGSCERDAQYAPGYDKVKDISEVVTPRFLCTGGVSP## YADPNTCRGDSGGPLIVHKRSRFIQVGVISWGVVDVCKNQKRQKQVPAHARDFHINLFQVLPWLKEKLQDEDLGFL"#### [[3]]## [1] "APPIQSRIIGGRECEKNSHPWQVAIYHYSSFQCGGVLVNPKWVLTAAHCKNDNYEVWLGRHNLFENENTAQF## FGVTADFPHPGFNLSLLKXHTKADGKDYSHDLMLLRLQSPAKITDAVKVLELPTQEPELGSTCEASGWGSIEPGPDB## FEFPDEIQCVQLTLLQNTFCABAHPBKVTESMLCAGYLPGGKDTCMGDSGGPLICNGMWQGITSWGHTPCGSANKPS## IYTKLIFYLDWINDTITENP"

Read FASTA files

The function readFASTA() provides a convenient way toread protein sequences stored in FASTA format files. See?readFASTA for details. The returned sequences are storedin a named list, whose components are named with the protein sequences’names.

Read PDB files

The Protein Data Bank (PDB) file format is a text file format thatdescribes the three-dimensional structures of proteins. ThereadPDB() function allows you to read protein sequencesstored in PDB format files. See ?readPDB for details.

Sanity check for amino acid types

The protcheck() function checks if the proteinsequence’s amino acid types are in the 20 default types, which returnsTRUE if all the amino acids in the sequence belong to the20 default types:

x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]# A real sequenceprotcheck(x)#> [1] TRUE# An artificial sequenceprotcheck(paste(x, "Z", sep = ""))#> [1] FALSE

Remove gaps from sequences

The removeGaps() function allows us to remove/replacegaps (-) or any “irregular” characters from amino acidsequences to prepare them for feature extraction or sequencealignment-based similarity computation.

Protein sequence partition

The protseg() function partitions the protein sequencesto create sliding windows. This is usually required when creatingfeature vectors for machine learning tasks. Users can specify a sequencex, a character aa (one of the 20 amino acidtypes), and a positive integer k, which controls the windowsize (half of the window).

This function returns a named list. Each component contains one ofthe segmentations (a character string). Names of the list components arethe positions of the specified amino acid in the sequence.

protseg(x, aa = "M", k = 5)#> $`48`#> [1] "DEKTQMIYQQH"#> #> $`242`#> [1] "LPWNSMILIGK"#> #> $`490`#> [1] "TVTDNMLCAGD"#> #> $`525`#> [1] "LNDGRMTLVGI"

Auto cross covariance computation

Auto Cross Covariance (ACC) is extensively used in the scales-baseddescriptors computation, this approach calculates the auto covarianceand auto cross covariance for generating scale-based descriptors of thesame length. Users can write their own scales-based descriptor functionswith the help of acc() function in the protr package.

Pre-computed 2D and 3D descriptor sets for the 20 amino acids

The protr package ships with more than 20 pre-computed 2D and 3Ddescriptor sets for the 20 amino acids to use with the scales-baseddescriptors. Please use data(package = "protr") to list allthe datasets included in the protr package.

BLOSUM and PAM matrices for the 20 amino acids

The BLOSUM and PAM matrices for the 20 amino acids can be used tocalculate BLOSUM and PAM matrix-derived descriptors with functionextractBLOSUM(), the datasets are named inAABLOSUM45, AABLOSUM50,AABLOSUM62, AABLOSUM80,AABLOSUM100, AAPAM30, AAPAM40,AAPAM70, AAPAM120, andAAPAM250.

Metadata of the 20 amino acids

As the reference, the AAMetaInfo dataset containsmetadata of the 20 amino acids used for the 2D and 3D descriptorcalculation in the protr package. This dataset includes each aminoacid’s name, one-letter representation, three-letter representation,SMILE representation, PubChem CID, and PubChem link. Seedata(AAMetaInfo) for details.

ProtrWeb

The web application built on protr, namely ProtrWeb, is available athttp://protr.org.

ProtrWeb does not require the user to have any programming knowledge.It is a user-friendly web application that computes the protein sequencedescriptors in the protr package.

R package for generating various numerical representation schemes of protein sequences (8)

Figure 8: A screenshot of the web application ProtrWeb.

Source code repository for this Shiny web application: https://github.com/nanxstats/protrweb.

Summary

The summary of the descriptors in the protr package is listed inTable 3.

Table 3: List of commonly used descriptors in protr
Descriptor GroupDescriptor NameDescriptor DimensionFunction Name
Amino Acid CompositionAmino Acid Composition20extractAAC()
Dipeptide Composition400extractDC()
Tripeptide Composition8000extractTC()
AutocorrelationNormalized Moreau-Broto Autocorrelation240\(^1\)extractMoreauBroto()
Moran Autocorrelation240\(^1\)extractMoran()
Geary Autocorrelation240\(^1\)extractGeary()
CTDComposition21extractCTDC(), extractCTDCClass()
Transition21extractCTDT(), extractCTDTClass()
Distribution105extractCTDD(), extractCTDDClass()
Conjoint TriadConjoint Triad343extractCTriad(), extractCTriadClass()
Quasi-Sequence-OrderSequence-Order-Coupling Number60\(^2\)extractSOCN()
Quasi-Sequence-Order Descriptors100\(^2\)extractQSO()
Pseudo-Amino Acid CompositionPseudo-Amino Acid Composition50\(^3\)extractPAAC()
Amphiphilic Pseudo-Amino Acid Composition80\(^4\)extractAPAAC()

1 The number depends on the choice of the number ofproperties of amino acids and the choice of the maximum values of thelag. The default is to use 8 types of properties andlag = 30.

2 The number depends on the maximum value oflag. By default, lag = 30. Two distancematrices are used, so the descriptor dimension is \(30 \times 2 = 60\) and \((20 + 30) \times 2 = 100\).

3 The number depends on the choice of the number of the setof amino acid properties and the choice of the \(\lambda\) value. The default is to use 3types of properties proposed by Kuo-Chen Chou and \(\lambda = 30\).

4 The number depends on the choice of the \(\lambda\) vlaue. The default is \(\lambda = 30\).

The scales-based PCM descriptors in the protr package are listed inTable 4.

Table 4: List of PCM descriptors in protr
Derived byDescriptor ClassFunction Name
Principal Components AnalysisScales-based descriptors derived by PrincipalComponents AnalysisextractScales(), extractScalesGap()
Scales-based descriptors derived by amino acidproperties from AAindex (a.k.a. Protein Fingerprint)extractProtFP(), extractProtFPGap()
Scales-based descriptors derived by 2D and 3D moleculardescriptors (Topological, WHIM, VHSE, etc.)extractDescScales()
Factor AnalysisScales-based descriptors derived by FactorAnalysisextractFAScales()
Multidimensional ScalingScales-based descriptors derived by MultidimensionalScalingextractMDSScales()
Substitution MatrixBLOSUM and PAM matrix-derived descriptorsextractBLOSUM()

The amino acid descriptor sets used by scales-based descriptorsprovided by the protr package are listed in Table 5. Note that thenon-informative descriptors (like the descriptors have only one valueacross all the 20 amino acids) in these datasets have already beenfiltered out.

Table 5: List of the pre-calculated descriptor sets of the 20amino acids in protr
Dataset NameDescriptor Set NameDimensionalityCalculated by
AA2DACOR2D Autocorrelations Descriptors92Dragon
AA3DMoRSE3D-MoRSE Descriptors160Dragon
AAACFAtom-Centred Fragments Descriptors6Dragon
AABurdenBurden Eigenvalues Descriptors62Dragon
AAConnConnectivity Indices Descriptors33Dragon
AAConstConstitutional Descriptors23Dragon
AAEdgeAdjEdge Adjacency Indices Descriptors97Dragon
AAEigIdxEigenvalue-Based Indices Descriptors44Dragon
AAFGCFunctional Group Counts Descriptors5Dragon
AAGeomGeometrical Descriptors41Dragon
AAGETAWAYGETAWAY Descriptors194Dragon
AAInfoInformation Indices Descriptors47Dragon
AAMolPropMolecular Properties Descriptors12Dragon
AARandicRandic Molecular Profiles Descriptors41Dragon
AARDFRDF Descriptors82Dragon
AATopoTopological Descriptors78Dragon
AATopoChgTopological Charge Indices Descriptors15Dragon
AAWalkWalk and Path Counts Descriptors40Dragon
AAWHIMWHIM Descriptors99Dragon
AACPSACPSA Descriptors41Accelrys Discovery Studio
AADescAllAll the 2D Descriptors Calculated by Dragon1171Dragon
AAMOE2DAll the 2D Descriptors Calculated by MOE148MOE
AAMOE3DAll the 3D Descriptors Calculated by MOE143MOE

References

Chou, Kuo-Chen. 2000. “Prediction of Protein Subcellar Locationsby Incorporating Quasi-Sequence-Order Effect.” Biochemicaland Biophysical Research Communications 278: 477–83.

———. 2001. “Prediction of Protein Cellular Attributes UsingPseudo-Amino Acid Composition.” PROTEINS: Structure,Function, and Genetics 43: 246–55.

———. 2005. “Using Amphiphilic Pseudo Amino Acid Composition toPredict Enzyme Subfamily Classes.” Bioinformatics 21(1): 10–19.

Chou, Kuo-Chen, and Hong-Bin Shen. 2008.Cell-PLoc: A Package of Web Servers forPredicting Subcellular Localization of Proteins in VariousOrganisms.” Nature Protocols 3 (2): 153–62.

Dubchak, Inna, Ilya Muchink, Stephen R. Holbrook, and Sung-Hou Kim.1995. “Prediction of Protein Folding Class Using GlobalDescription of Amino Acid Sequence.” Proceedings of theNational Academy of Sciences 92: 8700–8704.

Dubchak, Inna, Ilya Muchink, Christopher Mayor, Igor Dralyuk, andSung-Hou Kim. 1999. “Recognition of a Protein Fold in the Contextof the SCOP Classification.” Proteins:Structure, Function and Genetics 35: 401–7.

Grantham, R. 1974. “Amino Acid Difference Formula to Help ExplainProtein Evolution.” Science 185: 862–64.

Kawashima, S., and M. Kanehisa. 2000. AAindex: AminoAcid Index Database.” Nucleic Acids Research 28: 374.

Kawashima, S., H. Ogata, and M. Kanehisa. 1999.AAindex: Amino Acid Index Database.”Nucleic Acids Research 27: 368–69.

Kawashima, S., P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama,and M. Kanehisa. 2008. AAindex: Amino Acid IndexDatabase (Progress Report).” Nucleic Acids Research 36:D202–5.

Rangwala, Huzefa, and George Karypis. 2005. “Profile-Based DirectKernels for Remote Homology Detection and Fold Recognition.”Bioinformatics 21 (23): 4239–47.

Schneider, Gisbert, and Paul Wrede. 1994. “The Rational Design ofAmino Acid Sequences by Artificial Neural Networks and SimulatedMolecular Evolution: Do Novo Design of an Idealized Leader CleavageSite.” Biophysical Journal 66: 335–44.

Shen, J. W., J. Zhang, X. M. Luo, W. L. Zhu, K. Q. Yu, K. X. Chen, Y. X.Li, and H. L. Jiang. 2007. “Predicting Protein-ProteinInteractions Based Only on Sequences Information.”Proceedings of the National Academy of Sciences 104: 4337–41.

Xiao, Nan, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. 2015.protr/ProtrWeb:R Package and Web Server for Generating Various NumericalRepresentation Schemes of Protein Sequences.”Bioinformatics 31 (11): 1857–59.

Ye, Xugang, Guoli Wang, and Stephen F Altschul. 2011. “AnAssessment of Substitution Scores for Protein Profile–ProfileComparison.” Bioinformatics 27 (24): 3356–63.

R package for generating various numerical representation schemes of protein sequences (2024)
Top Articles
Latest Posts
Recommended Articles
Article information

Author: Terence Hammes MD

Last Updated:

Views: 5363

Rating: 4.9 / 5 (69 voted)

Reviews: 84% of readers found this page helpful

Author information

Name: Terence Hammes MD

Birthday: 1992-04-11

Address: Suite 408 9446 Mercy Mews, West Roxie, CT 04904

Phone: +50312511349175

Job: Product Consulting Liaison

Hobby: Jogging, Motor sports, Nordic skating, Jigsaw puzzles, Bird watching, Nordic skating, Sculpting

Introduction: My name is Terence Hammes MD, I am a inexpensive, energetic, jolly, faithful, cheerful, proud, rich person who loves writing and wants to share my knowledge and understanding with you.