Working with PDB Structures in Pandas

Let’s start by loading up a PDB structure from my favourite OPIG tool, the STCRDab! Here we will look at PDB ID 6eqb. It is a crystal structure of a T cell receptor interacting with a peptide presented by an MHC class I molecule.

[1]:

import nglview

[2]:

view = nglview.show_file('data/6eqb.pdb')
view

Now let’s load this molecules into a pandas dataframe and do some analysis. The PDB data is loaded into columns in a similar way that PDB files are formatted as columns.

[3]:

from python_pdb.parsers import parse_pdb_to_pandas

[4]:

with open('data/6eqb.pdb', 'r') as fh:
    df = parse_pdb_to_pandas(fh.read())

df

[4]:

	record_type	atom_number	atom_name	alt_loc	residue_name	chain_id	residue_seq_id	residue_insert_code	pos_x	pos_y	pos_z	occupancy	b_factor	element	charge
0	ATOM	1	N	A	ALA	C	2	None	48.681	-11.013	29.600	0.5	30.86	N	None
1	ATOM	2	CA	A	ALA	C	2	None	49.343	-9.708	29.330	0.5	29.82	C	None
2	ATOM	3	C	A	ALA	C	2	None	49.310	-8.792	30.562	0.5	29.16	C	None
3	ATOM	4	O	A	ALA	C	2	None	48.668	-9.107	31.537	0.5	29.75	O	None
4	ATOM	5	CB	A	ALA	C	2	None	48.679	-9.044	28.146	0.5	29.57	C	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6644	HETATM	6645	O	None	HOH	B	128	None	63.467	-9.052	4.232	1.0	46.86	O	None
6645	HETATM	6646	O	None	HOH	B	129	None	64.998	-3.887	10.735	1.0	48.05	O	None
6646	HETATM	6647	O	None	HOH	B	130	None	69.089	-37.773	1.079	1.0	63.55	O	None
6647	HETATM	6648	O	None	HOH	B	131	None	70.137	-40.643	1.067	1.0	59.75	O	None
6648	HETATM	6649	O	None	HOH	B	132	None	66.479	4.782	14.497	1.0	64.56	O	None

6649 rows × 15 columns

To start things off, let’s clean up this structure by highlighting one of the most powerful aspects of this approach: querying. As you can seen from the data frame above- there are water molecules in the structure that we might not care about. Let’s remove them…

[5]:

df_clean = df.query("record_type == 'ATOM'")  # or "residue_name != 'HOH'" would have worked as well
df_clean.tail()

[5]:

	record_type	atom_number	atom_name	alt_loc	residue_name	chain_id	residue_seq_id	residue_insert_code	pos_x	pos_y	pos_z	occupancy	b_factor	element	charge
6633	ATOM	6634	CB	None	MET	B	125	None	74.125	-32.337	9.293	1.0	121.02	C	None
6634	ATOM	6635	CG	None	MET	B	125	None	73.143	-31.632	10.263	1.0	116.33	C	None
6635	ATOM	6636	SD	None	MET	B	125	None	71.379	-32.103	10.373	1.0	113.91	S	None
6636	ATOM	6637	CE	None	MET	B	125	None	70.626	-31.028	9.134	1.0	109.06	C	None
6637	ATOM	6638	OXT	None	MET	B	125	None	74.839	-34.197	6.246	1.0	130.70	O	None

By using pandas’ built in querying function- we can easily get rid of the HETATMs in the file that we might not care about. We can also use this querying to just select the TCR or pMHC to perform analysis on this molecule seperately. In this example, the TCR \(\alpha\)- and \(\beta\) chain is labelled as chains D and E repectively. The MHC molecule is chain A (B- representing the \(\beta_2\)m) and C the peptide.

[6]:

!head -n5 data/6eqb.pdb

REMARK   5 IMGT RENUMBERED STRUCTURE 6EQB GENERATED BY STCRDAB
REMARK   5 TCR CHAINS ARE RENUMBERED IN THE VARIABLE REGIONS ONLY
REMARK   5 MHC CHAINS ARE RENUMBERED IN THE G DOMAINS OR FOR B2M-GLOBULIN
REMARK   5 NON-TCR AND NON-MHC CHAINS ARE LEFT WITH RESIDUE IDS AS IN PDB
REMARK   5 PAIRED_ABTCR BCHAIN=E ACHAIN=D MHCCHAINS=AB AGCHAIN=C AGTYPE=PEPTIDE

[7]:

tcr_df = df_clean.query("chain_id == 'D' or chain_id == 'E'")
tcr_df

[7]:

	record_type	atom_number	atom_name	alt_loc	residue_name	chain_id	residue_seq_id	residue_insert_code	pos_x	pos_y	pos_z	occupancy	b_factor	element	charge
57	ATOM	58	N	None	SER	E	1	None	44.786	19.936	31.694	1.0	91.71	N	None
58	ATOM	59	CA	None	SER	E	1	None	43.328	20.085	31.879	1.0	88.14	C	None
59	ATOM	60	C	None	SER	E	1	None	42.582	19.234	30.860	1.0	80.31	C	None
60	ATOM	61	O	None	SER	E	1	None	41.944	19.762	29.940	1.0	81.48	O	None
61	ATOM	62	CB	None	SER	E	1	None	42.912	21.571	31.778	1.0	93.53	C	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3519	ATOM	3520	CD1	None	PHE	D	215	None	-5.784	33.229	67.928	1.0	154.28	C	None
3520	ATOM	3521	CD2	None	PHE	D	215	None	-6.202	32.551	65.649	1.0	145.03	C	None
3521	ATOM	3522	CE1	None	PHE	D	215	None	-6.684	34.292	67.786	1.0	160.08	C	None
3522	ATOM	3523	CE2	None	PHE	D	215	None	-7.092	33.613	65.506	1.0	151.52	C	None
3523	ATOM	3524	CZ	None	PHE	D	215	None	-7.336	34.483	66.576	1.0	159.15	C	None

3461 rows × 15 columns

[8]:

mhc_df = df_clean.query("chain_id == 'A'")
mhc_df

[8]:

	record_type	atom_number	atom_name	alt_loc	residue_name	chain_id	residue_seq_id	residue_insert_code	pos_x	pos_y	pos_z	occupancy	b_factor	element	charge
3541	ATOM	3542	N	None	GLY	A	1	None	63.937	-26.599	37.997	1.0	88.67	N	None
3542	ATOM	3543	CA	None	GLY	A	1	None	64.891	-25.601	38.583	1.0	88.18	C	None
3543	ATOM	3544	C	None	GLY	A	1	None	64.208	-24.275	38.448	1.0	82.85	C	None
3544	ATOM	3545	O	None	GLY	A	1	None	63.079	-24.143	38.876	1.0	81.66	O	None
3545	ATOM	3546	N	None	SER	A	2	None	64.856	-23.315	37.808	1.0	80.55	N	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5790	ATOM	5791	O	None	PRO	A	1186	None	77.326	-47.026	15.302	1.0	147.25	O	None
5791	ATOM	5792	CB	None	PRO	A	1186	None	79.797	-49.588	14.420	1.0	159.44	C	None
5792	ATOM	5793	CG	None	PRO	A	1186	None	81.295	-49.731	14.585	1.0	160.29	C	None
5793	ATOM	5794	CD	None	PRO	A	1186	None	81.803	-48.494	15.300	1.0	158.37	C	None
5794	ATOM	5795	OXT	None	PRO	A	1186	None	77.441	-48.931	16.082	1.0	151.45	O	None

2254 rows × 15 columns

[9]:

peptide_df = df_clean.query("chain_id == 'C'")

peptide_residues_df = peptide_df.groupby(['residue_seq_id', 'residue_insert_code'], dropna=False)
print(f'The peptide is a {len(peptide_residues_df)}-mer!')

The peptide is a 9-mer!

Another advantage of using pandas is we can add new columns to annotate properties in the structure that we care about. In this example, since the TCR is from STCRDab and has been renumbered using the IMGT numbering convention we can easily identify the complimentary determining regions based on their residue_seq_id property.

[10]:

IMGT_CDR1 = set(range(27, 38 + 1))
IMGT_CDR2 = set(range(56, 65 + 1))
IMGT_CDR3 = set(range(105, 117 + 1))


def assign_cdr_number(seq_id: int) -> int | None:
    '''
    Map imgt_id to CDR domains, return number associated with domain or return None if input is not in a CDR
    domain.
    '''
    if seq_id in IMGT_CDR1:
        return 1

    if seq_id in IMGT_CDR2:
        return 2

    if seq_id in IMGT_CDR3:
        return 3

    return None

tcr_df = tcr_df.copy()  # Doing this on a copy of the dataframe since it is originally a slice of df!
tcr_df['cdr'] = tcr_df['residue_seq_id'].map(assign_cdr_number)

We can also annotations for the \(\alpha\) and \(\beta\) chain since these are defined by the STCRDab header.

[11]:

tcr_df['chain_type'] = tcr_df['chain_id'].map(lambda chain_id: 'alpha' if chain_id == 'D' else 'beta')

Now we can easily get rich information about the TCR CDR loops with ease.

[12]:

tcr_cdrs_df = tcr_df.query('cdr.notnull()')

cdr_lengths = tcr_cdrs_df[
    ['chain_type', 'cdr', 'residue_seq_id', 'residue_insert_code']
].drop_duplicates().groupby(['chain_type', 'cdr'], dropna=False).size()

cdr_lengths

[12]:

chain_type  cdr
alpha       1.0     6
            2.0     6
            3.0     9
beta        1.0     6
            2.0     5
            3.0    13
dtype: int64

We can also easily properties such as b-factors in the TCR variable domain.

[13]:

import seaborn as sns

[14]:

tcr_variable_df = tcr_df.query('residue_seq_id <= 129')
sns.lineplot(x=tcr_variable_df['residue_seq_id'], y=tcr_variable_df['b_factor'])

[14]:

<Axes: xlabel='residue_seq_id', ylabel='b_factor'>

../_images/source_Working_with_PDB_Structures_in_Pandas_21_1.png

Computing things like contacting residues between the TCR CDR loops and the peptide is a breeze.

[15]:

import numpy as np

[16]:

CONTACT_DISTANCE = 5  # Angstroms (Å)

def euclidean_distance(x1, y1, z1, x2, y2, z2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2  + (z2 - z1)**2)


interface = tcr_cdrs_df.merge(peptide_df, how='cross', suffixes=('_tcr', '_peptide'))
interface['atom_distances'] = euclidean_distance(interface['pos_x_tcr'], interface['pos_y_tcr'], interface['pos_z_tcr'],
                                                 interface['pos_x_peptide'], interface['pos_y_peptide'], interface['pos_z_peptide'])

contacting_atoms = interface[interface['atom_distances'] <= CONTACT_DISTANCE]
contacting_residues = contacting_atoms[['chain_id_tcr', 'residue_seq_id_tcr', 'residue_insert_code_tcr', 'cdr', 'chain_type',
                                        'residue_seq_id_peptide', 'residue_insert_code_peptide']].drop_duplicates()

contacting_residues

[16]:

	chain_id_tcr	residue_seq_id_tcr	residue_insert_code_tcr	cdr	chain_type	residue_seq_id_peptide	residue_insert_code_peptide
6424	E	108	None	3.0	beta	8	None
6603	E	109	None	3.0	beta	9	None
6641	E	109	None	3.0	beta	7	None
6704	E	109	None	3.0	beta	8	None
6926	E	110	None	3.0	beta	7	None
7151	E	111	None	3.0	beta	7	None
7194	E	111	None	3.0	beta	4	None
7253	E	111	None	3.0	beta	5	None
7261	E	111	None	3.0	beta	6	None
7280	E	111	None	3.0	beta	8	None
7361	E	111	None	3.0	beta	3	None
7594	E	112	None	3.0	beta	4	None
7596	E	112	None	3.0	beta	5	None
7603	E	112	None	3.0	beta	6	None
7607	E	112	None	3.0	beta	7	None
7622	E	112	None	3.0	beta	8	None
8114	E	113	None	3.0	beta	5	None
11763	D	37	None	1.0	alpha	5	None
11923	D	37	None	1.0	alpha	4	None
12030	D	37	None	1.0	alpha	2	None
12033	D	37	None	1.0	alpha	3	None
12218	D	38	None	1.0	alpha	5	None
13131	D	57	None	2.0	alpha	5	None

Finally, if we want to save part of the data frame as a PDB file, we can convert it back to a Structure object and save it to a file.

[17]:

import warnings

from python_pdb.entities import Structure, StructureConstructionWarning

[18]:

# Suppressing warnings here because there are alternate locations specified in this PDB file
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', category=StructureConstructionWarning)
    tcr_structure = Structure.from_pandas(tcr_df)

with open('tcr.pdb', 'w') as fh:
    fh.write(str(tcr_structure))

[19]:

!ls | grep '.pdb'

tcr.pdb

[20]:

!head tcr.pdb

ATOM     58  N   SER E   1      44.786  19.936  31.694  1.00 91.71           N
ATOM     59  CA  SER E   1      43.328  20.085  31.879  1.00 88.14           C
ATOM     60  C   SER E   1      42.582  19.234  30.860  1.00 80.31           C
ATOM     61  O   SER E   1      41.944  19.762  29.940  1.00 81.48           O
ATOM     62  CB  SER E   1      42.912  21.571  31.778  1.00 93.53           C
ATOM     63  OG  SER E   1      41.483  21.690  31.841  1.00 93.62           O
ATOM     64  N   GLN E   2      42.634  17.921  31.043  1.00 72.67           N
ATOM     65  CA  GLN E   2      41.716  17.039  30.308  1.00 69.30           C
ATOM     66  C   GLN E   2      40.246  17.307  30.674  1.00 66.83           C
ATOM     67  O   GLN E   2      39.999  17.911  31.668  1.00 70.99           O