Summary of apo-holo data

Introduction

In this notebook, we look at the data used to create the apo-holo comparisons, generating plots and summary statistics.

[1]:
import os
import subprocess
import tempfile

import logomaker
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import openchord
import pandas as pd
from IPython.display import display, Markdown
from pyxdameraulevenshtein import damerau_levenshtein_distance

from tcr_pmhc_interface_analysis.utils import mhc_slug_to_code
Matplotlib is building the font cache; this may take a moment.
[2]:
apo_holo_summary = pd.read_csv('../data/processed/apo-holo-tcr-pmhc-class-I/apo_holo_summary.csv')
apo_holo_summary
[2]:
file_name pdb_id structure_type state alpha_chain beta_chain antigen_chain mhc_chain1 mhc_chain2 cdr_sequences_collated peptide_sequence mhc_slug
0 1ao7_D-E-C-A-B_tcr_pmhc.pdb 1ao7 tcr_pmhc holo D E C A B DRGSQS-IYSNGD-AVTTDSWGKLQ-MNHEY-SVGAGI-ASRPGLA... LLFGYPVYV hla_a_02_01
1 1b0g_C-A-B_pmhc.pdb 1b0g pmhc apo NaN NaN C A B NaN ALWGFFPVL hla_a_02_01
2 1b0g_F-D-E_pmhc.pdb 1b0g pmhc apo NaN NaN F D E NaN ALWGFFPVL hla_a_02_01
3 1bd2_D-E-C-A-B_tcr_pmhc.pdb 1bd2 tcr_pmhc holo D E C A B NSMFDY-ISSIKDK-AAMEGAQKLV-MNHEY-SVGAGI-ASSYPGG... LLFGYPVYV hla_a_02_01
4 1bii_P-A-B_pmhc.pdb 1bii pmhc apo NaN NaN P A B NaN RGPGRAFVTI h2_dd
... ... ... ... ... ... ... ... ... ... ... ... ...
386 7rtd_C-A-B_pmhc.pdb 7rtd pmhc apo NaN NaN C A B NaN YLQPRTFLL hla_a_02_01
387 7rtr_D-E-C-A-B_tcr_pmhc.pdb 7rtr tcr_pmhc holo D E C A B DRGSQS-IYSNGD-AVNRDDKII-SEHNR-FQNEAQ-ASSPDIEQY YLQPRTFLL hla_a_02_01
388 8gvb_A-B-P-H-L_tcr_pmhc.pdb 8gvb tcr_pmhc holo A B P H L YGATPY-YFSGDTLV-AVGFTGGGNKLT-SEHNR-FQNEAQ-ASSD... RYPLTFGW hla_a_24_02
389 8gvg_A-B-P-H-L_tcr_pmhc.pdb 8gvg tcr_pmhc holo A B P H L YGATPY-YFSGDTLV-AVGFTGGGNKLT-SEHNR-FQNEAQ-ASSD... RFPLTFGW hla_a_24_02
390 8gvi_A-B-P-H-L_tcr_pmhc.pdb 8gvi tcr_pmhc holo A B P H L YGATPY-YFSGDTLV-AVVFTGGGNKLT-SEHNR-FQNEAQ-ASSL... RYPLTFGW hla_a_24_02

391 rows × 12 columns

Summary Numbers

[3]:
num_entries = len(apo_holo_summary)
num_pdb_ids = len(apo_holo_summary['pdb_id'].unique())

print('There are', num_entries, 'entries in the dataset spread across', num_pdb_ids, 'PDB structures.')
There are 391 entries in the dataset spread across 301 PDB structures.
[4]:
def both_apo_holo(group: pd.DataFrame, structure_type: str) -> bool:
    apo_structures = group.query("state == 'apo' and structure_type == @structure_type")
    holo_structures = group.query("state == 'holo'")

    return len(apo_structures) >= 1 and len(holo_structures) >= 1
[5]:
both_apo_holo_tcrs = apo_holo_summary.groupby('cdr_sequences_collated').apply(lambda group: both_apo_holo(group, 'tcr'))
apo_holo_tcrs = both_apo_holo_tcrs.index[both_apo_holo_tcrs].tolist()

num_unique_tcrs = len(apo_holo_tcrs)

display(Markdown((f'There are {num_unique_tcrs} TCRs with both *apo* and *holo* conformations in the dataset. '
                 'Here is the list of them based on their CDR sequences:')))

print('\n'.join(apo_holo_tcrs))

There are 22 TCRs with both apo and holo conformations in the dataset. Here is the list of them based on their CDR sequences:

ATGYPS-ATKADDK-ALSDPVNDMR-SGHAT-FQNNGV-ASSLRGRGDQPQH
DRGSQS-IYSNGD-ALTRGPGNQFY-SGHVS-FNYEAQ-ASSSPGGVSTEAF
DRGSQS-IYSNGD-AVNFGGGKLI-MRHNA-SNTAGT-ASSLSFGTEAF
DRGSQS-IYSNGD-AVNRDDKII-SEHNR-FQNEAQ-ASSPDIEQY
DRGSQS-IYSNGD-AVRTNSGYALN-QGHDT-YYEEEE-ASSDTVSYEQY
DRGSQS-IYSNGD-AVTTDSWGKLQ-MNHEY-SVGAGI-ASRPGLAGGRPEQY
DRGSQS-IYSNGD-GTYNQGGKLI-MNHEY-SMNVEV-ASSGASHEQY
DSAIYN-IQSSQRE-AQLNQAGTALI-MNHEY-SVGAGI-ASSYGTGINYGYT
DSAIYN-IQSSQRE-AVRMDSSYKLI-SEHNR-FQNEAQ-ASSSWDTGELF
DSAIYN-IQSSQRE-AVRPLLDGTYIPT-MNHEY-SVGAGT-ASSYLGNTGELF
DSAIYN-IQSSQRE-AVRPTSGGSYIPT-MNHEY-SVGAGI-ASSYVGNTGELF
FLGSQS-TYREGD-AVNDGGRLT-GTSNPN-WGPFG-AWSETGLGMGGWQ
NIATNDY-GYKTK-LVGEILDNFNKFY-MDHEN-SYDVKM-ASSQRQEGDTQY
NSAFDY-ILSVSNK-AASASFGDNSKLI-MSHET-SYDVDS-ASSLGHTEVF
NSAFQY-TYSSGN-AMRGDSSYKLI-SGHDY-FNNNVP-ASSLWEKLAKNIQY
NSASQS-VYSSG-VVQPGGYQKVT-MNHNS-SASEGT-ASSEGLWQVGDEQY
NSASQS-VYSSG-VVRAGKLI-MNHEY-SVGEGT-ASGQGNFDIQY
SVFSS-VVTGGEV-AGAGSQGNLI-LNHDA-SQIVND-ASSSRSSYEQY
TISGNEY-GLKNN-IVWGGYQKVT-SEHNR-FQNEAQ-ASRYRDDSYNEQF
TISGTDY-GLTSN-ILPLAGGTSYGKLT-SGHVS-FQNEAQ-ASSLGQAYEQY
YSATPY-YYSGDPVV-AVSGFASALT-NNHNN-SYGAGS-ASGGGGTLY
YSGSPE-HISR-ALSGFNNAGNMLT-SGHAT-FQNNGV-ASSLGGAGGADTQY
[6]:
both_apo_holo_pmhcs = apo_holo_summary.groupby(['mhc_slug', 'peptide_sequence']).apply(lambda group: both_apo_holo(group, 'pmhc'))
apo_holo_pmhcs = both_apo_holo_pmhcs.index[both_apo_holo_pmhcs].tolist()

mhc_slugs = set([slug for slug, _ in apo_holo_pmhcs])
peptide_sequences = set([peptide for _, peptide in apo_holo_pmhcs])

num_combinations = len(apo_holo_pmhcs)
num_unique_mhcs = len(mhc_slugs)
num_unique_peptides = len(peptide_sequences)

display(Markdown((f'There are {num_combinations} combinations of MHC alleles and peptides with '
                  'both *apo* and *holo* structures.'
                  f'These come from {num_unique_mhcs} MHC alleles and {num_unique_peptides} unique peptides.'
                  'Here are the combinations:')))

pd.DataFrame({
    'MHC Allele': [slug for slug, _ in apo_holo_pmhcs],
    'Peptide': [peptide for _, peptide in apo_holo_pmhcs],
}, index=pd.RangeIndex(start=1, stop=len(apo_holo_pmhcs) + 1, name='Index'))

There are 81 combinations of MHC alleles and peptides with both apo and holo structures.These come from 19 MHC alleles and 79 unique peptides.Here are the combinations:

[6]:
MHC Allele Peptide
Index
1 h2_db ASNENMETM
2 h2_db KAPANFATM
3 h2_db KAPFNFATM
4 h2_db KAPYDYAPI
5 h2_db KAPYNFATM
... ... ...
77 hla_b_51_01 TAFTIPSI
78 hla_b_53_01 QASQEVKNW
79 hla_b_81_01 TPQDLNTML
80 hla_c_08_02 GADGVGKSAL
81 hla_e_01_03 RLPAKAPLL

81 rows × 2 columns

Data plots

[7]:
stcrdab_summary = pd.read_csv('../data/raw/stcrdab/db_summary.dat', delimiter='\t')
[8]:
apo_holo_summary = apo_holo_summary.merge(stcrdab_summary,
                                          how='left',
                                          left_on=['pdb_id', 'alpha_chain', 'beta_chain'],
                                          right_on=['pdb', 'Achain', 'Bchain'])
[9]:
apo_holo_summary_non_redundant_tcrs = apo_holo_summary.drop_duplicates('cdr_sequences_collated')
apo_holo_summary_non_redundant_pmhcs = apo_holo_summary.drop_duplicates(['mhc_slug', 'peptide_sequence'])
[10]:
fig, axes = plt.subplots(ncols=2, figsize=(20, 20))

apo_holo_summary_non_redundant_tcrs['alpha_subgroup'].dropna().value_counts().plot.pie(title=r'$\alpha$-chain gene usage',
                                                                                  autopct='%1.2f%%',
                                                                                  ax=axes[0])
apo_holo_summary_non_redundant_tcrs['beta_subgroup'].dropna().value_counts().plot.pie(title=r'$\beta$-chain gene usage',
                                                                                 autopct='%1.2f%%',
                                                                                 ax=axes[1])

for ax in axes:
    ax.set_ylabel('')
../_images/source_Summary_of_apo_holo_data_13_0.png
[11]:
counts = apo_holo_summary_non_redundant_tcrs.value_counts(['alpha_subgroup', 'beta_subgroup'])
labels = sorted(np.concatenate([apo_holo_summary_non_redundant_tcrs['alpha_subgroup'].dropna().unique(),
                                apo_holo_summary_non_redundant_tcrs['beta_subgroup'].dropna().unique()]).tolist())

num_labels = len(labels)
adjacency_matrix = np.zeros((num_labels, num_labels), dtype=int)

for i in range(num_labels):
    for j in range(i + 1, num_labels):
        try:
            adjacency_matrix[i, j] = counts[labels[i]][labels[j]]
        except KeyError:
            adjacency_matrix[i, j] = 0

adjacency_matrix = np.maximum(adjacency_matrix, adjacency_matrix.transpose())
[12]:
fig = openchord.Chord(adjacency_matrix, labels)

fig.colormap = ['#636EFA',
                '#EF553B',
                '#00CC96',
                '#AB63FA',
                '#FFA15A',
                '#19D3F3',
                '#FF6692',
                '#B6E880',
                '#FF97FF',
                '#FECB52']

fig.radius = 300
fig.padding = 150
fig.font_size = 12

fig.show()
[12]:
../_images/source_Summary_of_apo_holo_data_15_0.svg
[13]:
(apo_holo_summary_non_redundant_pmhcs['mhc_slug'].dropna()
                                                 .map(mhc_slug_to_code)
                                                 .value_counts()
                                                 .plot.pie(figsize=(15, 15), autopct='%1.2f%%'))
plt.title('Breakdown of the different MHC alleles in the data')
plt.ylabel('')

plt.show()
../_images/source_Summary_of_apo_holo_data_16_0.png
[14]:
sequences = apo_holo_summary_non_redundant_pmhcs['peptide_sequence'].dropna().values
[15]:
with tempfile.TemporaryDirectory() as directory:
    input_file = os.path.join(directory, 'input.fasta')
    output_file = os.path.join(directory, 'output.fasta')

    with open(input_file, 'w') as fh:
        fh.write('\n'.join([f'>{num}\n{seq}\n' for num, seq in enumerate(sequences, 1)]))

    cmd = f'clustalw2 -INFILE={input_file} -OUTFILE={output_file} -OUTPUT=FASTA'
    subprocess.run(cmd.split(), check=True, stdout=subprocess.DEVNULL)

    with open(output_file, 'r') as fh:
        aligned_sequences = [line.strip() for line in fh.readlines() if not line.startswith('>')]
[16]:
probs = logomaker.alignment_to_matrix(aligned_sequences, to_type='information')
logomaker.Logo(probs, color_scheme='chemistry')

plt.ylabel('Bits')
plt.xticks([])

plt.show()
/project/koohylab/bmcmaste/miniconda3/envs/tcr-pmhc-interface-analysis/lib/python3.10/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
../_images/source_Summary_of_apo_holo_data_19_1.png
[17]:
distance_matrix = np.zeros((sequences.shape[0], sequences.shape[0]), dtype=int)

for i in range(sequences.shape[0]):
    for j in range(i + 1, sequences.shape[0]):
        distance_matrix[i, j] = damerau_levenshtein_distance(sequences[i], sequences[j])

distance_matrix = np.maximum(distance_matrix, distance_matrix.T)

print(f'The average levenshtein distance between all peptide sequences is {np.average(distance_matrix):.2f} edits',
      f'with an standard deviation of {np.std(distance_matrix):.2f} edits.')
The average levenshtein distance between all peptide sequences is 8.38 edits with an standard deviation of 1.78 edits.
[18]:
CUTOFF_PERCENT = 70

sequence_lengths = np.array([len(seq) for seq in sequences])
grid1, grid2 = np.meshgrid(sequence_lengths, sequence_lengths)

max_lengths = np.maximum(grid1, grid2)
sequence_percent_identities = ((max_lengths - distance_matrix) / max_lengths) * 100

similar_peptides = sequence_percent_identities > CUTOFF_PERCENT


[19]:
G = nx.Graph()
G.add_nodes_from(range(sequences.shape[0]))

for i in range(1, sequences.shape[0]):
    for j in range(i + 1, sequences.shape[0]):
        if similar_peptides[i, j]:
            G.add_edge(i, j, weight=sequence_percent_identities[i, j] / 100)
[20]:
mhc_slugs_for_sequences = apo_holo_summary_non_redundant_pmhcs[
    ['peptide_sequence', 'mhc_slug']
].query('peptide_sequence.notnull()')['mhc_slug'].values

allele_options = sorted(list(set(list(mhc_slugs_for_sequences))))
[21]:
matplotlib_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
n = int(np.ceil(len(allele_options) / len(matplotlib_colors)))

colour_palette = dict(zip(allele_options, matplotlib_colors * n))
[22]:
node_colours = [colour_palette[slug] for slug in mhc_slugs_for_sequences]
[23]:
positions = nx.nx_agraph.graphviz_layout(G, prog="neato")

nx.draw(G, positions, node_color=node_colours, node_size=50)

plt.title(f'Peptide repertoire of the TCR:pMHCs grouped at {CUTOFF_PERCENT}% identity')

plt.show()
../_images/source_Summary_of_apo_holo_data_26_0.png

Conclusion

Based on the plots and summary statistics in this notebook, it is evident that the dataset is quite biased towards certain TCR genes and MHC Alleles. The peptides are moderately well balanced. This will need to be taken into account in the conclusions drawn throughout the rest of the project.