Evolution Analysis

Release Notes

v2.4.1 series come with new and improved sequence, structure, and dynamics analysis features. See release notes for details.

How to Cite

Bakan A, Meireles LM, Bahar I ProDy: Protein Dynamics Inferred from Theory and Experiments
Bioinformatics 2011 27(11):1575-1577.

Bakan A, Dutta A, Mao W, Liu Y, Chennubhotla C, Lezon TR, Bahar I Evol and ProDy for Bridging Protein Sequence Evolution and Structural Dynamics
Bioinformatics 2014 30(18):2681-2683.

Zhang S, Krieger JM, Zhang Y, Kaya C, Kaynak B, Mikulska-Ruminska K, Doruker P, Li H, Bahar I ProDy 2.0: Increased scale and scope after 10 years of protein dynamics modelling with Python
Bioinformatics 2021 37(20):3657-3659.

Evolution Analysis¶

This part follows from MSA Files. The aim of this part is to show how to calculate residue conservation and coevolution properties based on multiple sequence alignments (MSAs). MSA

First, we import everything from the ProDy package.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()  # turn interactive mode on

Get MSA data¶

Let’s download full MSA file for protein family RnaseA. We can do this by specifying the PDB ID of a protein in this family.

In [4]: searchPfam('1K2A').keys()
Out[4]: ['PF00074']

In [5]: fetchPfamMSA('PF00074')
Out[5]: 'PF00074_full.sth'

Let’s parse the downloaded file:

In [6]: msa = parseMSA('PF00074_full.sth')

Refine MSA¶

Here, we refine the MSA to decrease the number of gaps. We will remove any columns in the alignment for which there is a gap in the specified PDB file, and then remove any rows that have more than 20% gaps. refineMSA() does all of this and returns an MSA object.

In [7]: msa_refine = refineMSA(msa, label='RNAS2_HUMAN', rowocc=0.8, seqid=0.98)

In [8]: msa_refine
Out[8]: <MSA: PF00074_full refined (label=RNAS2_HUMAN, rowocc>=0.8, seqid>=0.98) (698 sequences, 126 residues)>

MSA is refined based on the sequence of RNAS2_HUMAN, corresponding to 1K2A.

Occupancy calculation¶

Evol plotting functions are prefixed with show. We can plot the occupancy for each column to see if there are any positions in the MSA that have a lot of gaps. We use the function showMSAOccupancy() that uses calcMSAOccupancy() to calculate occupancy for MSA.

In [9]: showMSAOccupancy(msa_refine, occ='res');

Let’s find the minimum:

In [10]: calcMSAOccupancy(msa_refine, occ='res').min()
Out[10]: 0.011461318051575931

We can also specify indices based on the PDB.

In [11]: indices = list(range(5,131))

In [12]: showMSAOccupancy(msa_refine, occ='res', indices=indices);

../../_images/msa_analysis_occ_res_indices.png

Further refining the MSA to remove positions that have low occupancy will change the start and end positions of the labels in the MSA. This is not corrected automatically on refinement. We can also plot occupancy based on rows for the sequences in the MSA.

Entropy Calculation¶

Here, we show how to calculate and plot Shannon Entropy. Entropy for each position in the MSA is calculated using calcShannonEntropy(). It takes MSA object or a numpy 2D array containg MSA as input and returns a 1D numpy array.

In [13]: entropy = calcShannonEntropy(msa_refine)

In [14]: entropy
Out[14]: 
array([-0.        ,  1.97536341,  1.6976014 ,  1.98686002,  2.04097927,
        0.33881322,  2.13977922,  2.00661895,  1.08055377,  0.47729866,
        1.58125405,  1.78890686,  2.51747686,  2.31397501,  2.38803313,
        2.40126663,  2.54295735,  0.09338611,  1.41579988,  2.69688922,
        1.8686074 ,  0.75606438,  2.34383631,  2.39720547,  1.47431149,
        1.88334663,  2.14950549,  2.02528004,  2.58853012,  2.45108208,
        2.61723915,  0.16813412,  0.71292546,  2.36093797,  2.32632049,
        0.61799546,  0.89021762,  0.36043225,  1.01850997,  1.33382837,
        2.05607945,  2.15459689,  2.3895889 ,  2.52272713,  2.26277362,
        1.07883485,  2.22209291,  2.10332436,  1.46075956,  0.10580037,
        2.5880002 ,  2.50810725,  2.35814643,  2.09300504,  2.29386782,
        2.5724491 ,  1.68877804,  1.07047455,  1.57433048,  1.03752672,
        2.0206458 ,  1.97820357,  2.60225   ,  2.62010838,  1.71632538,
        1.36403271,  1.79131851,  2.30781167,  0.35730727,  2.55844119,
        2.45589515,  2.30325997,  1.40702068,  2.42986786,  1.63284318,
        0.96868799,  2.39301813,  0.08317819,  2.15806993,  1.39390954,
        2.01888973,  1.48477641,  2.25039015,  2.071024  ,  1.80876533,
        2.21861345,  2.1495278 ,  2.26531537,  1.29842729,  2.2432391 ,
        0.16590041,  2.28183392,  0.10492724,  2.08455959,  2.15120138,
        2.36710596,  2.56442544,  2.51541107,  2.42719479,  1.80265513,
        2.34569871,  1.4964265 ,  1.96057778,  0.98329793,  1.51199785,
        0.05404462,  1.95389007,  1.24837514,  1.04713865, -0.        ,
        1.71085363,  1.68585224,  1.88812929,  1.64865814,  1.8563798 ,
        2.26222015,  2.28490399,  1.93257978,  2.34153018,  2.17263465,
        1.97322381,  0.05898735,  0.64841109,  0.79058895,  1.23695064,
        0.52622679])

entropy is a 1D NumPy array. Plotting is done using showShannonEntropy().

In [15]: showShannonEntropy(msa_refine, indices);

Mutual Information¶

We can calculate mutual information between the positions of the MSA using buildMutinfoMatrix() which also takes an MSA object or a numpy 2D array containing MSA as input.

In [16]: mutinfo = buildMutinfoMatrix(msa_refine)

In [17]: mutinfo
Out[17]: 
array([[0.        , 0.20631824, 0.16401097, ..., 0.06095202, 0.14566864,
        0.05380968],
       [0.20631824, 0.        , 0.60253758, ..., 0.22563382, 0.22440521,
        0.18667669],
       [0.16401097, 0.60253758, 0.        , ..., 0.35108633, 0.37773013,
        0.24761374],
       ...,
       [0.06095202, 0.22563382, 0.35108633, ..., 0.        , 0.4087536 ,
        0.22060682],
       [0.14566864, 0.22440521, 0.37773013, ..., 0.4087536 , 0.        ,
        0.27725374],
       [0.05380968, 0.18667669, 0.24761374, ..., 0.22060682, 0.27725374,
        0.        ]])

Result is a 2D NumPy array.

We can also apply normalization using applyMutinfoNorm() and correction using applyMutinfoCorr() to the mutual information matrix based on references [Martin05] and [Dunn08], respectively.

[Martin05]

Martin LC, Gloor GB, Dunn SD, Wahl LM. Using information theory to search for co-evolving residues in proteins. Bioinformatics 2005 21(22):4116-4124.

[Dunn08]

Dunn SD, Wahl LM, Gloor GB. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 2008 24(3):333-340.

In [18]: mutinfo_norm = applyMutinfoNorm(mutinfo, entropy, norm='minent')

In [19]: mutinfo_corr = applyMutinfoCorr(mutinfo, corr='apc')

Note that by default norm="sument" normalization is applied in applyMutinfoNorm and corr="prod" is applied in applyMutinfoCorr.

Now we plot the mutual information matrices that we obtained above and see the effects of different corrections and normalizations.

In [20]: showMutinfoMatrix(mutinfo);

In [21]: showMutinfoMatrix(mutinfo_corr, clim=[0, mutinfo_corr.max()],
   ....:    xlabel='1K2A: 4-131');
   ....: 

../../_images/msa_analysis_mutinfo_corr.png

Output Results¶

Here we show how to write the mutual information and entropy arrays to file. We use the writeArray() to write NumPy array data.

In [22]: writeArray('1K2A_MI.txt', mutinfo)
Out[22]: '1K2A_MI.txt'

This can be later loaded using parseArray().

Rank-ordering¶

Further analysis can also be done by rank ordering the matrix and analyzing the pairs with highest mutual information or the most co-evolving residues. This is done using calcRankorder(). A z-score normalization can also be applied to select coevolving pairs based on a z score cutoff.

In [23]: rank_row, rank_col, zscore_sort = calcRankorder(mutinfo, zscore=True)

In [24]: asarray(indices)[rank_row[:5]]
Out[24]: array([127,  31,  31, 128, 115])

In [25]: asarray(indices)[rank_col[:5]]
Out[25]: array([126,  26,  29, 126, 113])

In [26]: zscore_sort[:5]
Out[26]: array([4.50792926, 3.55425262, 3.42345689, 3.27298657, 3.21999586])