Evolution Analysis

This part follows from MSA Files. The aim of this part is to show how to calculate residue conservation and coevolution properties based on multiple sequence alignments (MSAs). MSA

First, we import everything from the ProDy package.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()  # turn interactive mode on

Get MSA data

Let’s download full MSA file for protein family RnaseA. We can do this by specifying the PDB ID of a protein in this family.

In [4]: searchPfam('1K2A').keys()
Out[4]: ['PF00074']
In [5]: fetchPfamMSA('PF00074')
Out[5]: 'PF00074_full.sth'

Let’s parse the downloaded file:

In [6]: msa = parseMSA('PF00074_full.sth')

Refine MSA

Here, we refine the MSA to decrease the number of gaps. We will remove any columns in the alignment for which there is a gap in the specified PDB file, and then remove any rows that have more than 20% gaps. refineMSA() does all of this and returns an MSA object.

In [7]: msa_refine = refineMSA(msa, label='RNAS2_HUMAN', rowocc=0.8, seqid=0.98)

In [8]: msa_refine
Out[8]: <MSA: PF00074_full refined (label=RNAS2_HUMAN, rowocc>=0.8, seqid>=0.98) (698 sequences, 126 residues)>

MSA is refined based on the sequence of RNAS2_HUMAN, corresponding to 1K2A.

Occupancy calculation

Evol plotting functions are prefixed with show. We can plot the occupancy for each column to see if there are any positions in the MSA that have a lot of gaps. We use the function showMSAOccupancy() that uses calcMSAOccupancy() to calculate occupancy for MSA.

In [9]: showMSAOccupancy(msa_refine, occ='res');
../../_images/msa_analysis_occ_res.png

Let’s find the minimum:

In [10]: calcMSAOccupancy(msa_refine, occ='res').min()
Out[10]: 0.011461318051575931

We can also specify indices based on the PDB.

In [11]: indices = list(range(5,131))

In [12]: showMSAOccupancy(msa_refine, occ='res', indices=indices);
../../_images/msa_analysis_occ_res_indices.png

Further refining the MSA to remove positions that have low occupancy will change the start and end positions of the labels in the MSA. This is not corrected automatically on refinement. We can also plot occupancy based on rows for the sequences in the MSA.

Entropy Calculation

Here, we show how to calculate and plot Shannon Entropy. Entropy for each position in the MSA is calculated using calcShannonEntropy(). It takes MSA object or a numpy 2D array containg MSA as input and returns a 1D numpy array.

In [13]: entropy = calcShannonEntropy(msa_refine)

In [14]: entropy
Out[14]: 
array([-0.        ,  1.97536341,  1.6976014 ,  1.98686002,  2.04097927,
        0.33881322,  2.13977922,  2.00661895,  1.08055377,  0.47729866,
        1.58125405,  1.78890686,  2.51747686,  2.31397501,  2.38803313,
        2.40126663,  2.54295735,  0.09338611,  1.41579988,  2.69688922,
        1.8686074 ,  0.75606438,  2.34383631,  2.39720547,  1.47431149,
        1.88334663,  2.14950549,  2.02528004,  2.58853012,  2.45108208,
        2.61723915,  0.16813412,  0.71292546,  2.36093797,  2.32632049,
        0.61799546,  0.89021762,  0.36043225,  1.01850997,  1.33382837,
        2.05607945,  2.15459689,  2.3895889 ,  2.52272713,  2.26277362,
        1.07883485,  2.22209291,  2.10332436,  1.46075956,  0.10580037,
        2.5880002 ,  2.50810725,  2.35814643,  2.09300504,  2.29386782,
        2.5724491 ,  1.68877804,  1.07047455,  1.57433048,  1.03752672,
        2.0206458 ,  1.97820357,  2.60225   ,  2.62010838,  1.71632538,
        1.36403271,  1.79131851,  2.30781167,  0.35730727,  2.55844119,
        2.45589515,  2.30325997,  1.40702068,  2.42986786,  1.63284318,
        0.96868799,  2.39301813,  0.08317819,  2.15806993,  1.39390954,
        2.01888973,  1.48477641,  2.25039015,  2.071024  ,  1.80876533,
        2.21861345,  2.1495278 ,  2.26531537,  1.29842729,  2.2432391 ,
        0.16590041,  2.28183392,  0.10492724,  2.08455959,  2.15120138,
        2.36710596,  2.56442544,  2.51541107,  2.42719479,  1.80265513,
        2.34569871,  1.4964265 ,  1.96057778,  0.98329793,  1.51199785,
        0.05404462,  1.95389007,  1.24837514,  1.04713865, -0.        ,
        1.71085363,  1.68585224,  1.88812929,  1.64865814,  1.8563798 ,
        2.26222015,  2.28490399,  1.93257978,  2.34153018,  2.17263465,
        1.97322381,  0.05898735,  0.64841109,  0.79058895,  1.23695064,
        0.52622679])

entropy is a 1D NumPy array. Plotting is done using showShannonEntropy().

In [15]: showShannonEntropy(msa_refine, indices);
../../_images/msa_analysis_entropy.png

Mutual Information

We can calculate mutual information between the positions of the MSA using buildMutinfoMatrix() which also takes an MSA object or a numpy 2D array containing MSA as input.

In [16]: mutinfo = buildMutinfoMatrix(msa_refine)

In [17]: mutinfo
Out[17]: 
array([[0.        , 0.20631824, 0.16401097, ..., 0.06095202, 0.14566864,
        0.05380968],
       [0.20631824, 0.        , 0.60253758, ..., 0.22563382, 0.22440521,
        0.18667669],
       [0.16401097, 0.60253758, 0.        , ..., 0.35108633, 0.37773013,
        0.24761374],
       ...,
       [0.06095202, 0.22563382, 0.35108633, ..., 0.        , 0.4087536 ,
        0.22060682],
       [0.14566864, 0.22440521, 0.37773013, ..., 0.4087536 , 0.        ,
        0.27725374],
       [0.05380968, 0.18667669, 0.24761374, ..., 0.22060682, 0.27725374,
        0.        ]])

Result is a 2D NumPy array.

We can also apply normalization using applyMutinfoNorm() and correction using applyMutinfoCorr() to the mutual information matrix based on references [Martin05] and [Dunn08], respectively.

[Martin05]Martin LC, Gloor GB, Dunn SD, Wahl LM. Using information theory to search for co-evolving residues in proteins. Bioinformatics 2005 21(22):4116-4124.
[Dunn08]Dunn SD, Wahl LM, Gloor GB. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 2008 24(3):333-340.
In [18]: mutinfo_norm = applyMutinfoNorm(mutinfo, entropy, norm='minent')

In [19]: mutinfo_corr = applyMutinfoCorr(mutinfo, corr='apc')

Note that by default norm="sument" normalization is applied in applyMutinfoNorm and corr="prod" is applied in applyMutinfoCorr.

Now we plot the mutual information matrices that we obtained above and see the effects of different corrections and normalizations.

In [20]: showMutinfoMatrix(mutinfo);
../../_images/msa_analysis_mutinfo.png
In [21]: showMutinfoMatrix(mutinfo_corr, clim=[0, mutinfo_corr.max()],
   ....:    xlabel='1K2A: 4-131');
   ....: 
../../_images/msa_analysis_mutinfo_corr.png

Output Results

Here we show how to write the mutual information and entropy arrays to file. We use the writeArray() to write NumPy array data.

In [22]: writeArray('1K2A_MI.txt', mutinfo)
Out[22]: '1K2A_MI.txt'

This can be later loaded using parseArray().

Rank-ordering

Further analysis can also be done by rank ordering the matrix and analyzing the pairs with highest mutual information or the most co-evolving residues. This is done using calcRankorder(). A z-score normalization can also be applied to select coevolving pairs based on a z score cutoff.

In [23]: rank_row, rank_col, zscore_sort = calcRankorder(mutinfo, zscore=True)

In [24]: asarray(indices)[rank_row[:5]]
Out[24]: array([127,  31,  31, 128, 115])

In [25]: asarray(indices)[rank_col[:5]]
Out[25]: array([126,  26,  29, 126, 113])

In [26]: zscore_sort[:5]
Out[26]: array([4.50792926, 3.55425262, 3.42345689, 3.27298657, 3.21999586])