Ascertaining the generalisability of the structure data

Introduction

In this notebook, we set out to ascertain the generalisability of our data and analysis to the broader TCR space. We compared sequence properties of our dataset to a background of TCR sequences sampled from OTS and also look at how the selected structures fit in the overall available TCR structures space.

[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
[2]:
def rmsd(x: np.ndarray) -> float:
    '''Compute the RMSD of an array.'''
    x_bar = np.mean(x)
    n = len(x)
    return np.sqrt(np.sum((x - x_bar) ** 2) / n)

Load Data

Load OTS data

[3]:
ots_sample = pd.read_csv('../data/interim/ots_sample.csv')
ots_sample
[3]:
cdr1_aa_alpha cdr2_aa_alpha cdr3_aa_alpha cdr1_aa_beta cdr2_aa_beta cdr3_aa_beta v_call_alpha v_call_beta j_call_alpha j_call_beta sample_num
0 TISGNEY GLKNN IVRVVWGGGADGLT SGHDN FVKESK ASSLLGVSTDTQY TRAV26-1*01 TRBV14*01 TRAJ45*01 TRBJ2-3*01 1
1 NSMFDY ISSIKDK AASAQGGSYIPT LGHDT YNNKEL ASSRRPTDTQY TRAV29/DV5*04 TRBV3-1*01 TRAJ6*01 TRBJ2-3*01 1
2 SSNFYA MTLNGDE ALGRNSGNTPLV SGHAT FQNNGV ASNLAGAYEQY TRAV24*01 TRBV11-2*01 TRAJ29*01 TRBJ2-7*01 1
3 SSVPPY YTTGATLV AVSEPGSQGNLI DFQATT SNEGSKA SALGQPLGETQY TRAV8-4*01 TRBV20-1*02 TRAJ42*01 TRBJ2-5*01 1
4 TSGFNG NVLDGL AVRDLRGSQGNLI MGHRA YSYEKL ASSQAPQGADTQY TRAV1-2*01 TRBV4-1*01 TRAJ42*01 TRBJ2-3*01 1
... ... ... ... ... ... ... ... ... ... ... ...
9995 SSVSVY YLSGSTLV AVSVRGSQGNLI MNHNY SVGAGI ASSYGNREGYT TRAV8-6*01 TRBV6-6*01 TRAJ42*01 TRBJ1-2*01 10
9996 NSAFQY TYSSGN AMRRNTNAGKST DFQATT SNEGSKA SAPTDPAGTEAF TRAV12-3*01 TRBV20-1*01 TRAJ27*01 TRBJ1-1*01 10
9997 SVFSS VVTGGEV AAIIQGAQKLV MDHEN SYDVKM ASSYQYYEQY TRAV27*01 TRBV28*01 TRAJ54*01 TRBJ2-7*01 10
9998 DSAIYN IQSSQRE APYSGGGADGLT SEHNR FQNEAQ ASSSGTGGVEHGYT TRAV21*02 TRBV7-9*01 TRAJ45*01 TRBJ1-2*01 10
9999 NSAFQY TYSSGN AMTGAGSYQLT LGHNA YSLEER ASSQELLGAELHYGYT TRAV12-3*01 TRBV4-3*01 TRAJ28*01 TRBJ1-2*01 10

10000 rows × 11 columns

[4]:
ots_sample_genes = ots_sample[['v_call_alpha', 'v_call_beta', 'j_call_alpha', 'j_call_beta', 'sample_num']].copy()

ots_sample_genes['alpha_subgroup'] = ots_sample_genes['v_call_alpha'].str.extract(r'^(TRAV\d+)')
ots_sample_genes['beta_subgroup'] = ots_sample_genes['v_call_beta'].str.extract(r'^(TRBV\d+)')

ots_sample_genes
[4]:
v_call_alpha v_call_beta j_call_alpha j_call_beta sample_num alpha_subgroup beta_subgroup
0 TRAV26-1*01 TRBV14*01 TRAJ45*01 TRBJ2-3*01 1 TRAV26 TRBV14
1 TRAV29/DV5*04 TRBV3-1*01 TRAJ6*01 TRBJ2-3*01 1 TRAV29 TRBV3
2 TRAV24*01 TRBV11-2*01 TRAJ29*01 TRBJ2-7*01 1 TRAV24 TRBV11
3 TRAV8-4*01 TRBV20-1*02 TRAJ42*01 TRBJ2-5*01 1 TRAV8 TRBV20
4 TRAV1-2*01 TRBV4-1*01 TRAJ42*01 TRBJ2-3*01 1 TRAV1 TRBV4
... ... ... ... ... ... ... ...
9995 TRAV8-6*01 TRBV6-6*01 TRAJ42*01 TRBJ1-2*01 10 TRAV8 TRBV6
9996 TRAV12-3*01 TRBV20-1*01 TRAJ27*01 TRBJ1-1*01 10 TRAV12 TRBV20
9997 TRAV27*01 TRBV28*01 TRAJ54*01 TRBJ2-7*01 10 TRAV27 TRBV28
9998 TRAV21*02 TRBV7-9*01 TRAJ45*01 TRBJ1-2*01 10 TRAV21 TRBV7
9999 TRAV12-3*01 TRBV4-3*01 TRAJ28*01 TRBJ1-2*01 10 TRAV12 TRBV4

10000 rows × 7 columns

[5]:
ots_sample_cdrs = ots_sample.filter(regex=r'^cdr\d_aa_[alpha|beta]', axis=1)
cdr_cols = ots_sample_cdrs.columns.tolist()

ots_sample_cdrs = pd.concat([ots_sample_cdrs, ots_sample['sample_num']], axis='columns')

ots_sample_cdrs = ots_sample_cdrs.melt(id_vars='sample_num',
                                       value_vars=cdr_cols,
                                       var_name='cdr_type',
                                       value_name='sequence')
ots_sample_cdrs[['cdr', 'chain_type']] = ots_sample_cdrs['cdr_type'].str.extract('cdr(\d)_aa_(alpha|beta)')

ots_sample_cdrs['length'] = ots_sample_cdrs['sequence'].str.len()

ots_sample_cdrs
[5]:
sample_num cdr_type sequence cdr chain_type length
0 1 cdr1_aa_alpha TISGNEY 1 alpha 7
1 1 cdr1_aa_alpha NSMFDY 1 alpha 6
2 1 cdr1_aa_alpha SSNFYA 1 alpha 6
3 1 cdr1_aa_alpha SSVPPY 1 alpha 6
4 1 cdr1_aa_alpha TSGFNG 1 alpha 6
... ... ... ... ... ... ...
59995 10 cdr3_aa_beta ASSYGNREGYT 3 beta 11
59996 10 cdr3_aa_beta SAPTDPAGTEAF 3 beta 12
59997 10 cdr3_aa_beta ASSYQYYEQY 3 beta 10
59998 10 cdr3_aa_beta ASSSGTGGVEHGYT 3 beta 14
59999 10 cdr3_aa_beta ASSQELLGAELHYGYT 3 beta 16

60000 rows × 6 columns

Load STCRDab data

[6]:
stcrdab_summary = pd.read_csv('../data/raw/stcrdab/db_summary.dat', delimiter='\t')
stcrdab_summary
[6]:
pdb Bchain Achain Dchain Gchain TCRtype model antigen_chain antigen_type antigen_name ... authors resolution method r_free r_factor affinity affinity_method affinity_temperature affinity_pmid engineered
0 7zt2 E D NaN NaN abTCR 0 A | A protein | Hapten major histocompatibility complex class i-relat... ... Karuppiah, V., Srikannathasan, V., Robinson, R.A. 2.4 X-RAY DIFFRACTION 0.276 0.215 NaN NaN NaN NaN True
1 7zt3 E D NaN NaN abTCR 0 A protein major histocompatibility complex class i-relat... ... Karuppiah, V., Srikannathasan, V., Robinson, R.A. 2.4 X-RAY DIFFRACTION 0.236 0.191 NaN NaN NaN NaN True
2 7zt4 E D NaN NaN abTCR 0 A protein major histocompatibility complex class i-relat... ... Karuppiah, V., Robinson, R.A. 2.02 X-RAY DIFFRACTION 0.268 0.234 NaN NaN NaN NaN True
3 7zt5 E D NaN NaN abTCR 0 A protein major histocompatibility complex class i-relat... ... Karuppiah, V., Robinson, R.A. 2.09 X-RAY DIFFRACTION 0.266 0.225 NaN NaN NaN NaN True
4 7zt7 E D NaN NaN abTCR 0 A protein major histocompatibility complex class i-relat... ... Karuppiah, V., Robinson, R.A. 1.84 X-RAY DIFFRACTION 0.255 0.207 NaN NaN NaN NaN True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
931 3rtq D C NaN NaN abTCR 0 A Hapten N-[(2S,3S,4R)-3,4-DIHYDROXY-1-{[(1S,2S,3R,4R,5... ... Yu, E.D., Zajonc, D.M. 2.8 X-RAY DIFFRACTION 0.268 0.227 NaN NaN NaN NaN True
932 3dxa O N NaN NaN abTCR 0 M peptide ebv decapeptide epitope ... Archbold, J.K., Macdonald, W.A., Gras, S., Ros... 3.5 X-RAY DIFFRACTION 0.330 0.286 NaN NaN NaN NaN True
933 1d9k B A NaN NaN abTCR 0 P peptide conalbumin peptide ... Reinherz, E.L., Tan, K., Tang, L., Kern, P., L... 3.2 X-RAY DIFFRACTION 0.293 0.247 NaN NaN NaN NaN True
934 4gg6 H G NaN NaN abTCR 0 J peptide peptide from alpha/beta-gliadin mm1 ... Broughton, S.E., Theodossis, A., Petersen, J.,... 3.2 X-RAY DIFFRACTION 0.285 0.246 NaN NaN NaN NaN True
935 2apf A NaN NaN NaN NaN 0 NaN NaN NaN ... Cho, S., Swaminathan, C.P., Yang, J., Kerzic, ... 1.8 X-RAY DIFFRACTION 0.263 0.197 NaN NaN NaN NaN True

936 rows × 39 columns

Load apo-holo data

[7]:
apo_holo_summary = pd.read_csv('../data/processed/apo-holo-tcr-pmhc-class-I/apo_holo_summary.csv')
apo_holo_summary
[7]:
file_name pdb_id structure_type state alpha_chain beta_chain antigen_chain mhc_chain1 mhc_chain2 cdr_sequences_collated peptide_sequence mhc_slug
0 1ao7_D-E-C-A-B_tcr_pmhc.pdb 1ao7 tcr_pmhc holo D E C A B DRGSQS-IYSNGD-AVTTDSWGKLQ-MNHEY-SVGAGI-ASRPGLA... LLFGYPVYV hla_a_02_01
1 1bd2_D-E-C-A-B_tcr_pmhc.pdb 1bd2 tcr_pmhc holo D E C A B NSMFDY-ISSIKDK-AAMEGAQKLV-MNHEY-SVGAGI-ASSYPGG... LLFGYPVYV hla_a_02_01
2 1bii_A-B-P_pmhc.pdb 1bii pmhc apo NaN NaN P A B NaN RGPGRAFVTI h2_dd
3 1ddh_A-B-P_pmhc.pdb 1ddh pmhc apo NaN NaN P A B NaN RGPGRAFVTI h2_dd
4 1duz_A-B-C_pmhc.pdb 1duz pmhc apo NaN NaN C A B NaN LLFGYPVYV hla_a_02_01
... ... ... ... ... ... ... ... ... ... ... ... ...
353 8gon_D-E-C-A-B_tcr_pmhc.pdb 8gon tcr_pmhc holo D E C A B TSESDYY-QEAYKQQN-ASSGNTPLV-SGHNS-FNNNVP-ASTWGR... NaN NaN
354 8gop_A-B_tcr.pdb 8gop tcr apo A B NaN NaN NaN TSESDYY-QEAYKQQN-ASSGNTPLV-SGHNS-FNNNVP-ASTWGR... NaN NaN
355 8gvb_A-B-P-H-L_tcr_pmhc.pdb 8gvb tcr_pmhc holo A B P H L YGATPY-YFSGDTLV-AVGFTGGGNKLT-SEHNR-FQNEAQ-ASSD... RYPLTFGW hla_a_24_02
356 8gvg_A-B-P-H-L_tcr_pmhc.pdb 8gvg tcr_pmhc holo A B P H L YGATPY-YFSGDTLV-AVGFTGGGNKLT-SEHNR-FQNEAQ-ASSD... RFPLTFGW hla_a_24_02
357 8gvi_A-B-P-H-L_tcr_pmhc.pdb 8gvi tcr_pmhc holo A B P H L YGATPY-YFSGDTLV-AVVFTGGGNKLT-SEHNR-FQNEAQ-ASSL... RYPLTFGW hla_a_24_02

358 rows × 12 columns

[8]:
apo_holo_genes = apo_holo_summary.copy().merge(
    stcrdab_summary,
    how='left',
    left_on=['pdb_id', 'alpha_chain', 'beta_chain'],
    right_on=['pdb', 'Achain', 'Bchain'],
).drop_duplicates(['cdr_sequences_collated'])[['alpha_subgroup', 'beta_subgroup']]
[9]:
apo_holo_cdr_sequences = apo_holo_summary['cdr_sequences_collated'].str.split('-').apply(pd.Series)
apo_holo_cdr_sequences.columns = ['cdr1_alpha', 'cdr2_alpha', 'cdr3_alpha', 'cdr1_beta', 'cdr2_beta', 'cdr3_beta']

apo_holo_cdr_sequences = apo_holo_cdr_sequences.drop_duplicates()
apo_holo_cdr_sequences = apo_holo_cdr_sequences.dropna()

apo_holo_cdr_sequences
apo_holo_cdr_sequences = apo_holo_cdr_sequences.melt(value_vars=apo_holo_cdr_sequences.columns,
                                                     var_name='cdr_type',
                                                     value_name='sequence')
apo_holo_cdr_sequences[['cdr', 'chain_type']] = apo_holo_cdr_sequences['cdr_type'].str.extract('cdr(\d)_(alpha|beta)')

apo_holo_cdr_sequences['length'] = apo_holo_cdr_sequences['sequence'].str.len()

apo_holo_cdr_sequences
[9]:
cdr_type sequence cdr chain_type length
0 cdr1_alpha DRGSQS 1 alpha 6
1 cdr1_alpha NSMFDY 1 alpha 6
2 cdr1_alpha TQDSSYF 1 alpha 7
3 cdr1_alpha YSATPY 1 alpha 6
4 cdr1_alpha TISGTDY 1 alpha 7
... ... ... ... ... ...
385 cdr3_beta ASKVGPGQHNSPLH 3 beta 14
386 cdr3_beta ASSYGTGINYGYT 3 beta 13
387 cdr3_beta ASTWGRASTDTQY 3 beta 13
388 cdr3_beta ASSDRDRVPETQY 3 beta 13
389 cdr3_beta ASSLRDRVPETQY 3 beta 13

390 rows × 5 columns

Compare CDR Lengths

[10]:
ots_sample_cdrs['source'] = 'OTS'
apo_holo_cdr_sequences['source'] = 'STCRDab'

cdr_lengths = pd.concat([ots_sample_cdrs, apo_holo_cdr_sequences])
cdr_lengths
[10]:
sample_num cdr_type sequence cdr chain_type length source
0 1.0 cdr1_aa_alpha TISGNEY 1 alpha 7 OTS
1 1.0 cdr1_aa_alpha NSMFDY 1 alpha 6 OTS
2 1.0 cdr1_aa_alpha SSNFYA 1 alpha 6 OTS
3 1.0 cdr1_aa_alpha SSVPPY 1 alpha 6 OTS
4 1.0 cdr1_aa_alpha TSGFNG 1 alpha 6 OTS
... ... ... ... ... ... ... ...
385 NaN cdr3_beta ASKVGPGQHNSPLH 3 beta 14 STCRDab
386 NaN cdr3_beta ASSYGTGINYGYT 3 beta 13 STCRDab
387 NaN cdr3_beta ASTWGRASTDTQY 3 beta 13 STCRDab
388 NaN cdr3_beta ASSDRDRVPETQY 3 beta 13 STCRDab
389 NaN cdr3_beta ASSLRDRVPETQY 3 beta 13 STCRDab

60390 rows × 7 columns

[11]:
ots_sample_cdrs_proptions = (ots_sample_cdrs.groupby(['chain_type', 'cdr', 'sample_num'])
                                            ['length']
                                            .value_counts(normalize=True)
                                            * 100)
ots_sample_cdrs_proptions.name = 'percentage'
ots_sample_cdrs_proptions = ots_sample_cdrs_proptions.reset_index()
[12]:
ots_sample_cdrs_proptions_agg = (ots_sample_cdrs_proptions.groupby(['chain_type', 'cdr', 'length'])
                                                          .agg({'percentage': ['mean', 'sem']})
                                                          .fillna(0.0)
                                                          .reset_index())
ots_sample_cdrs_proptions_agg.columns = ['_'.join(col).rstrip('_').replace('_mean', '')
                                         for col in ots_sample_cdrs_proptions_agg.columns]
[13]:
apo_holo_cdr_length_proportions = (apo_holo_cdr_sequences.groupby(['chain_type', 'cdr'])
                                   ['length']
                                   .value_counts(normalize=True)
                                   * 100)
apo_holo_cdr_length_proportions.name = 'percentage'
apo_holo_cdr_length_proportions = apo_holo_cdr_length_proportions.reset_index()
[14]:
cdr_length_proportions = ots_sample_cdrs_proptions_agg.merge(apo_holo_cdr_length_proportions,
                                                             how='outer',
                                                             on=['chain_type', 'cdr', 'length'],
                                                             suffixes=['_ots', '_stcrdab']).fillna(0)


cdr_length_proportions['difference'] = (cdr_length_proportions['percentage_stcrdab']
                                        - cdr_length_proportions['percentage_ots'])
[15]:
cdr_length_percentages = (cdr_lengths.groupby(['chain_type', 'cdr', 'sample_num', 'source'], dropna=False)
                                     ['length']
                                     .value_counts(normalize=True)
                                     * 100)
cdr_length_percentages.name = 'percentage'
cdr_length_percentages = cdr_length_percentages.reset_index()
[16]:
for (chain_type, cdr), group in cdr_length_percentages.groupby(['chain_type', 'cdr']):
    length_min = group['length'].min()
    length_max = group['length'].max()
    expected_length_values = set(range(length_min, length_max + 1))

    actual_length_values = set(group['length'].unique())
    missing_length_values = expected_length_values - actual_length_values
    missing_length_values = sorted(list(missing_length_values))

    if len(missing_length_values) > 0:
        missing_entries = pd.DataFrame({'chain_type': [chain_type for _ in missing_length_values],
                                        'cdr': [cdr for _ in missing_length_values],
                                        'source': ['OTS' for _ in missing_length_values],
                                        'length': missing_length_values,
                                        'percentage': [0.0 for _ in missing_length_values]})

        cdr_length_percentages = pd.concat([cdr_length_percentages, missing_entries])

cdr_length_percentages = cdr_length_percentages.reset_index(drop=True)
[17]:
grid = sns.catplot(cdr_length_percentages,
            x='length',
            y='percentage',
            hue='source',
            row='chain_type', col='cdr',
            kind='bar', sharex=False)

plt.show()
../_images/source_Ascertaining_the_generalisability_of_the_structure_data_23_0.png
[18]:
cdr_lengths.groupby(['chain_type', 'cdr', 'source'])['length'].agg([pd.Series.mode, 'mean', 'std'])
[18]:
mode mean std
chain_type cdr source
alpha 1 OTS 6 6.003800 0.570279
STCRDab 6 6.153846 0.565430
2 OTS 7 6.805900 1.031664
STCRDab 7 6.615385 1.026102
3 OTS 12 11.730200 1.692954
STCRDab 11 11.169231 1.616026
beta 1 OTS 5 5.116500 0.320840
STCRDab 5 5.061538 0.242186
2 OTS 6 6.092400 0.363147
STCRDab 6 5.984615 0.215950
3 OTS 13 12.568400 1.753918
STCRDab 13 11.846154 1.533285
[19]:
fig, axes = plt.subplots(ncols=3, nrows=2, tight_layout=True, sharey=True)
plt.ylim([-100, 100])
fig.suptitle('Relative frequency of structure CDR lengths compared to OTS')

for i, chain_type in enumerate(('alpha', 'beta')):
    for j, cdr in enumerate(('1', '2', '3')):
        proportion_data = cdr_length_proportions.query("chain_type == @chain_type and cdr == @cdr")
        colors = proportion_data['difference'].map(lambda diff: 'lightblue' if diff < 0 else 'orange')

        axes[i, j].bar(proportion_data['length'], proportion_data['difference'],
                       yerr=proportion_data['percentage_sem'],
                       color=colors, edgecolor='black')

        axes[i, j].set_title(f'CDR{cdr}$\\{chain_type}$')
        axes[i, j].set_xlabel('Loop Length')
        axes[i, j].set_ylabel('Percentage (%)')

plt.show()
../_images/source_Ascertaining_the_generalisability_of_the_structure_data_25_0.png
[20]:
for chain_type in ('alpha', 'beta'):
    for cdr in ('1', '2', '3'):
        proportion_data = cdr_length_proportions.query("chain_type == @chain_type and cdr == @cdr")
        print(f'CDR{cdr}{chain_type} RMSD:', round(rmsd(proportion_data['difference']), 2))

print()
print('Overall RMSD:', round(rmsd(cdr_length_proportions['difference']), 2))
CDR1alpha RMSD: 5.44
CDR2alpha RMSD: 6.25
CDR3alpha RMSD: 4.23
CDR1beta RMSD: 5.5
CDR2beta RMSD: 7.99
CDR3beta RMSD: 2.48

Overall RMSD: 4.47

Overall, it seems that the distribution of lengths is similar between the apo-holo structures taken from the STCRDab and the background TCRs from OTS. The distributions look similar between the data sources, with a slightly larger variety for the OTS TCRs. Also, the mode length for each CDR type is the same between the OTS background and the dataset used in this analysis (6 for CDR1α, 7 for CDR2α, 11 for CDR3α, 5 for CDR1β, 6 for CDR2β, and 13 for CDR3β).

Compare TCR V Gene Usage

[21]:
ots_sample_genes['source'] = 'OTS'
apo_holo_genes['source'] = 'STCRDab'

gene_usage = pd.concat([ots_sample_genes, apo_holo_genes])
gene_usage = gene_usage.dropna(subset=['alpha_subgroup', 'beta_subgroup'])
gene_usage = gene_usage.melt(id_vars=['sample_num', 'source'],
                             value_vars=['alpha_subgroup', 'beta_subgroup'],
                             var_name='subgroup', value_name='gene')

gene_usage['subgroup_num'] = gene_usage['gene'].str.extract(r'TR[A|B]V(\d+)', expand=False).apply(int)

gene_usage
[21]:
sample_num source subgroup gene subgroup_num
0 1.0 OTS alpha_subgroup TRAV26 26
1 1.0 OTS alpha_subgroup TRAV29 29
2 1.0 OTS alpha_subgroup TRAV24 24
3 1.0 OTS alpha_subgroup TRAV8 8
4 1.0 OTS alpha_subgroup TRAV1 1
... ... ... ... ... ...
20121 NaN STCRDab beta_subgroup TRBV7 7
20122 NaN STCRDab beta_subgroup TRBV6 6
20123 NaN STCRDab beta_subgroup TRBV6 6
20124 NaN STCRDab beta_subgroup TRBV7 7
20125 NaN STCRDab beta_subgroup TRBV7 7

20126 rows × 5 columns

[22]:
gene_usage_percentages = (gene_usage.groupby(['subgroup', 'source', 'sample_num'], dropna=False)
                                    [['gene', 'subgroup_num']]
                                    .value_counts(normalize=True)
                                    * 100)
gene_usage_percentages.name = 'percentage'

gene_usage_percentages = gene_usage_percentages.reset_index()
[23]:
grid = sns.catplot(gene_usage_percentages.sort_values(['subgroup', 'subgroup_num']),
            row='subgroup',
            x='gene', y='percentage',
            hue='source',
            kind='bar',
            sharex=False)

for ax in grid.axes.flat:
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

grid.tight_layout(pad=3.0)

plt.show()
/var/scratch/bmcmaste/1915160/ipykernel_3725429/351373078.py:9: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
../_images/source_Ascertaining_the_generalisability_of_the_structure_data_31_1.png
[24]:
ots_sample_gene_usage_percentages = gene_usage_percentages.query("source == 'OTS'")
ots_sample_gene_usage_percentages_agg = (ots_sample_gene_usage_percentages.groupby(['gene', 'subgroup', 'subgroup_num'])
                                                                          .agg({'percentage': ['mean', 'sem']})
                                                                          .reset_index()
                                                                          .fillna(0.0))

ots_sample_gene_usage_percentages_agg.columns = ['_'.join(col).rstrip('_').replace('_mean', '')
                                                 for col in ots_sample_gene_usage_percentages_agg.columns]
[25]:
apo_holo_gene_usage_percentages = gene_usage_percentages.query("source == 'STCRDab'")
gene_usage_percentages_compare = apo_holo_gene_usage_percentages.merge(ots_sample_gene_usage_percentages_agg,
                                                                       how='outer',
                                                                       on=['gene', 'subgroup', 'subgroup_num'],
                                                                       suffixes=('_stcrdab', '_ots'))

gene_usage_percentages_compare = gene_usage_percentages_compare.drop(['source', 'sample_num'], axis='columns')
gene_usage_percentages_compare = gene_usage_percentages_compare.fillna(0.0)

gene_usage_percentages_compare['difference'] = (gene_usage_percentages_compare['percentage_stcrdab']
                                                - gene_usage_percentages_compare['percentage_ots'])
[26]:
fig, axes = plt.subplots(nrows=2, constrained_layout=True, sharey=True)
plt.ylim([-100, 100])
fig.suptitle('Relative frequency of structure V gene usage compared to OTS')

for i, subgroup in enumerate(('alpha_subgroup', 'beta_subgroup')):
    plot_data = gene_usage_percentages_compare.query('subgroup == @subgroup').sort_values('subgroup_num')

    colors = plot_data['difference'].map(lambda diff: 'lightblue' if diff < 0 else 'orange')

    axes[i].bar(plot_data['gene'],
               plot_data['difference'],
               yerr=plot_data['percentage_sem'],
               color=colors,
               edgecolor='black')

    axes[i].set_xlabel(f"TR{'A' if subgroup == 'alpha_subgroup' else 'B'}V Gene Usage")
    axes[i].set_ylabel('Percentage (%)')
    axes[i].tick_params(axis='x', labelrotation=90)

plt.show()
../_images/source_Ascertaining_the_generalisability_of_the_structure_data_34_0.png
[27]:
print('TRAV Gene Usage RMSD:',
      round(rmsd(gene_usage_percentages_compare.query("subgroup == 'alpha_subgroup'")['difference']), 2))

print('TRBV Gene Usage RMSD:',
      round(rmsd(gene_usage_percentages_compare.query("subgroup == 'beta_subgroup'")['difference']), 2))
TRAV Gene Usage RMSD: 4.18
TRBV Gene Usage RMSD: 4.5

The gene usage between apo-holo TCRs from STCRDab and the background TCRs from OTS is largely similar. There are a few exceptions in the STCRDab structures that show an over-representation of certain gene usages. For exampe, TRAV12, TRAV14, TRAV21, and TRBV6 all seem to be in higher abundance in the STCRDab data compared with the OTS background.

Comparison of amino acide types in loops

[28]:
AMINO_ACIDS = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

def count_amino_acids(residues: pd.Series, normalize = True) -> pd.Series:
    counts = {res: 0 for res in AMINO_ACIDS}

    for res in residues.tolist():
        counts[res] += 1

    if normalize:
        num_residues = len(residues)
        counts = {res: count / num_residues for res, count in counts.items()}

    counts = pd.Series(counts)
    counts.name = 'count' if not normalize else 'proportion'
    counts.index.name = 'residue'

    return counts
[29]:
ots_sample_cdrs_residues = ots_sample_cdrs[['sample_num', 'chain_type', 'cdr', 'sequence']].copy()
ots_sample_cdrs_residues['residue'] = ots_sample_cdrs_residues['sequence'].map(list)
ots_sample_cdrs_residues = ots_sample_cdrs_residues.explode('residue')

ots_sample_cdrs_residues_counts = (ots_sample_cdrs_residues.groupby(['sample_num', 'chain_type', 'cdr'])
                                                            ['residue']
                                                            .apply(count_amino_acids))
ots_sample_cdrs_residues_counts.name = 'proportion'
ots_sample_cdrs_residues_counts = ots_sample_cdrs_residues_counts.to_frame()
ots_sample_cdrs_residues_counts = ots_sample_cdrs_residues_counts.reset_index()

ots_sample_cdrs_residues_counts
[29]:
sample_num chain_type cdr residue proportion
0 1 alpha 1 A 0.067791
1 1 alpha 1 C 0.000000
2 1 alpha 1 D 0.062615
3 1 alpha 1 E 0.014527
4 1 alpha 1 F 0.033228
... ... ... ... ... ...
1195 10 beta 3 S 0.181695
1196 10 beta 3 T 0.073013
1197 10 beta 3 V 0.021305
1198 10 beta 3 W 0.006623
1199 10 beta 3 Y 0.066949

1200 rows × 5 columns

[30]:
ots_sample_cdrs_residues_counts_agg = (ots_sample_cdrs_residues_counts.groupby(['chain_type', 'cdr', 'residue'])
                                                                      .agg({'proportion': ['mean', 'sem']})
                                                                      .reset_index())

ots_sample_cdrs_residues_counts_agg.columns = ['_'.join(col).rstrip('_').replace('_mean', '')
                                               for col in ots_sample_cdrs_residues_counts_agg.columns]
[31]:
apo_holo_cdr_sequences_residues = apo_holo_cdr_sequences[['chain_type', 'cdr', 'sequence']].copy()
apo_holo_cdr_sequences_residues['residue'] = apo_holo_cdr_sequences_residues['sequence'].map(list)
apo_holo_cdr_sequences_residues = apo_holo_cdr_sequences_residues.explode('residue')

apo_holo_cdr_sequences_residues_counts = (apo_holo_cdr_sequences_residues.groupby(['chain_type', 'cdr'])
                                          ['residue']
                                          .apply(count_amino_acids))

apo_holo_cdr_sequences_residues_counts.name = 'proportion'
apo_holo_cdr_sequences_residues_counts = apo_holo_cdr_sequences_residues_counts.to_frame()
apo_holo_cdr_sequences_residues_counts = apo_holo_cdr_sequences_residues_counts.reset_index()

apo_holo_cdr_sequences_residues_counts
[31]:
chain_type cdr residue proportion
0 alpha 1 A 0.057500
1 alpha 1 C 0.000000
2 alpha 1 D 0.092500
3 alpha 1 E 0.022500
4 alpha 1 F 0.035000
... ... ... ... ...
115 beta 3 S 0.180519
116 beta 3 T 0.059740
117 beta 3 V 0.019481
118 beta 3 W 0.011688
119 beta 3 Y 0.077922

120 rows × 4 columns

[32]:
ots_sample_cdrs_residues_counts_agg['source'] = 'ots'
apo_holo_cdr_sequences_residues_counts['source'] = 'stcrdab'

cdr_residue_counts = pd.concat([ots_sample_cdrs_residues_counts_agg, apo_holo_cdr_sequences_residues_counts])
[33]:
angles = np.linspace(0, 2 * np.pi, len(AMINO_ACIDS), endpoint=False).tolist()
amino_acid_angles = dict(zip(AMINO_ACIDS, angles))
[34]:
fig, axes = plt.subplots(nrows=2, ncols=3,
                         figsize=(15, 12),
                         sharey=True,
                         subplot_kw={'projection': 'polar'})

lines = []
for i, chain_type in enumerate(('alpha', 'beta')):
    for j, cdr in enumerate(('1', '2', '3')):
        for source in 'ots', 'stcrdab':
            residue_counts = cdr_residue_counts[(cdr_residue_counts['source'] == source)
                                                & (cdr_residue_counts['chain_type'] == chain_type)
                                                & (cdr_residue_counts['cdr'] == cdr)].sort_values('residue')

            proportions = residue_counts['proportion'].tolist()
            label_angles = residue_counts['residue'].map(amino_acid_angles).tolist()

            # Join ends
            proportions.append(proportions[0])
            label_angles.append(label_angles[0])

            line, = axes[i, j].plot(label_angles, proportions, marker='o', markersize=3, label=source)
            axes[i, j].bar(label_angles, proportions, alpha=0.25, width=0.3)

            lines.append(line)

        axes[i, j].set_xticks(angles)
        axes[i, j].set_xticklabels(AMINO_ACIDS)
        axes[i, j].grid(which='major', axis='x', visible=False)

        axes[i, j].set_title(f'CDR{cdr}$\\{chain_type}$')

unique_handles_labels = {}
for line in lines:
    label = line.get_label()
    if label not in unique_handles_labels:
        unique_handles_labels[label] = line

labels, handles = zip(*unique_handles_labels.items())

legend = fig.legend(handles, labels, title='Source', fontsize='large')
legend.get_title().set_fontsize('large')

fig.suptitle('Comparison of amino acids in CDR loops between selected structures and OTS',
             fontsize=20)

plt.show()
../_images/source_Ascertaining_the_generalisability_of_the_structure_data_44_0.png
[35]:
cdr_residue_counts_compare = cdr_residue_counts.pivot(index=['chain_type', 'cdr', 'residue'],
                                                      columns='source',
                                                      values='proportion').reset_index()

cdr_residue_counts_compare['difference'] = cdr_residue_counts_compare['stcrdab'] - cdr_residue_counts_compare['ots']

print(cdr_residue_counts_compare.groupby(['chain_type', 'cdr'])['difference'].agg(rmsd))

print(f"Overall RMSD: {rmsd(cdr_residue_counts_compare['difference']):.2e}")
chain_type  cdr
alpha       1      0.016588
            2      0.014409
            3      0.007707
beta        1      0.028133
            2      0.021649
            3      0.007354
Name: difference, dtype: float64
Overall RMSD: 1.76e-02

Selected structures coverage of canonical classes

[36]:
cdr_clusters = pd.read_csv('../data/processed/stcrdab_clusters.csv')
cdr_clusters
[36]:
name cluster chain_type cdr sequence cluster_type
0 7zt2_DE 12 alpha_chain 1 TSGFNG pseudo
1 7zt3_DE 12 alpha_chain 1 TSGFNG pseudo
2 7zt4_DE 12 alpha_chain 1 TSGFNG pseudo
3 7zt5_DE 12 alpha_chain 1 TSGFNG pseudo
4 7zt7_DE 12 alpha_chain 1 TSGFNG pseudo
... ... ... ... ... ... ...
4807 6miv_CD 22 beta_chain 3 ASGDEGYTQY canonical
4808 3rtq_CD 22 beta_chain 3 ASGDEGYTQY canonical
4809 3dxa_NO noise beta_chain 3 ASRYRDDSYNEQF NaN
4810 1d9k_AB noise beta_chain 3 ASGGQGRAEQF NaN
4811 4gg6_GH noise beta_chain 3 ASSVAVSAGTYEQY NaN

4812 rows × 6 columns

[37]:
cdr_clusters[['pdb_id', 'chains']] = cdr_clusters['name'].str.split('_').apply(pd.Series)
cdr_clusters[['alpha_chain', 'beta_chain']] = cdr_clusters['chains'].apply(list).apply(pd.Series)
[38]:
common_columns = ['pdb_id', 'alpha_chain', 'beta_chain']
apo_holo_index = pd.MultiIndex.from_frame(apo_holo_summary[common_columns])

cdr_clusters['in_apo_holo'] = cdr_clusters[common_columns].apply(tuple, axis=1).isin(apo_holo_index)
[39]:
canonical_cdr_clusters = cdr_clusters.query("cluster_type == 'canonical'").copy().reset_index(drop=True)
canonical_cdr_clusters['cluster_num'] = canonical_cdr_clusters['cluster'].map(int)
[40]:
grouped = (canonical_cdr_clusters.groupby(['chain_type', 'cdr', 'cluster', 'in_apo_holo'])
                                 .size()
                                 .reset_index(name='count'))
total_counts = (canonical_cdr_clusters.groupby(['chain_type', 'cdr'])
                                      .size()
                                      .reset_index(name='total_count'))

grouped = grouped.merge(total_counts, on=['chain_type', 'cdr'])
grouped['percent'] = grouped['count'] / grouped['total_count'] * 100
[41]:
grouped['cluster_num'] = grouped['cluster'].apply(int)
grouped = grouped.sort_values('cluster_num')
[43]:
grid = sns.displot(grouped,
                   row='chain_type', col='cdr',
                   x='cluster',
                   weights='percent',
                   hue='in_apo_holo', multiple='stack',
                   facet_kws=dict(sharex=False))

grid.set_axis_labels("Cluster", "Percentage (%)")
grid.set(ylim=(0, 100))

plt.show()
../_images/source_Ascertaining_the_generalisability_of_the_structure_data_53_0.png
[42]:
apo_holo_coverage = (canonical_cdr_clusters.groupby(['chain_type', 'cdr', 'cluster'])
                                           .apply(lambda group: group['in_apo_holo'].any())
                                           .value_counts(normalize=True) * 100)[True]

print(f'There is {apo_holo_coverage:.2f}% coverage of canonical clusters in the apo-holo dataset.')
There is 78.26% coverage of canonical clusters in the apo-holo dataset.

Conclusion

Our analysis here shows that the conclusions drawn about TCRs from our structural analysis in this project should generalise well to other TCR:pMHC-I interactions where structure information is not currently available. The comparisons here show that the distribution of CDR lengths, amino acid compisition of CDR loops, and generally the distribution of V gene usage is comparable between the selected STCRDab structures and the background of TCRs sampled from OTS. However, it does seem that there is some over representation bias of certain V genes which may affect results. The analysis also shows that canonical CDR loop clusters are well represented in the selected analysis. We did not look at amino acid composition and other possible comparable properties from these sequences that could be done.