residues with no coordinates in pdb structures

Hello there,

I have been creating Jalview feature files to annotate structural
features of PDB structures. So that PDB residue number and sequence
position match up I have been inserting 'U' residues in the sequence
where residues are missing in the structure. I realise this is somewhat
unusual but it clearly shows (usually) disordered residues when
viewing an alignment. It also has the advantage that alignment programs
'know' there is a residue there so that the correct spacing is
preserved. All this works nicely until I try and "Fetch DB References".
When I do this, my sequence with 'U's in it doesn't match the DB
sequence and no match is found. Can you help me find a way round this
problem?

thanks,

Will

--
Dr Will Pitt
Visiting Research Associate
Cambridge University
Department of Biochemistry
80 Tennis Court Road
Cambridge
CB2 1GA
UK

44(0)1223 766028

Hi Will.

We can certainly try. Unfortunately, Jalview behaves in a similar fashion to the alignment programs; that is, it knows that U is a residue (selenocysteine) and treats it accordingly. It would be possible to get the Jalview sequence matcher to ignore selenocysteines when comparing the database sequence against the sequence in the alignment, but it's probably not something that most people would want to have enabled by default.

The obvious route is to change the way that you work - but I'm not sure how invested you are in your current approach - and so how much effort it will take to fix or revise your existing annotation files and alignments. However, normally, the way to achieve what you ask would be to simply use the real sequences in your alignments, and let Jalview deal with residues missing coordinates automatically (by aligning residues with coordinate data with the sequence). Jalview annotates any residue with coordinates using the 'PDBRESNUM' feature, so you can see which ones are not found in the pdb structure by virtue of the fact that no PDBRESNUM annotation is present at that position.
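As a rough illustration of the mapping step described above, the sketch below aligns the residues that have coordinates against the full sequence and lists the positions left unmapped. It uses Python's `difflib.SequenceMatcher` as a stand-in for a real alignment; this is not how Jalview does it internally, and the sequences and the function name `unmapped_positions` are made up for the example.

```python
from difflib import SequenceMatcher


def unmapped_positions(full_seq, observed_seq):
    """Return 1-based positions in full_seq with no partner in observed_seq.

    difflib is only a stand-in for a proper pairwise alignment (an assumption
    of this sketch). It behaves sensibly when observed_seq is full_seq with
    whole stretches deleted, which is the missing-coordinates case.
    """
    matcher = SequenceMatcher(None, full_seq, observed_seq, autojunk=False)
    mapped = set()
    for block in matcher.get_matching_blocks():
        # block.a is the 0-based start in full_seq, block.size the run length
        mapped.update(range(block.a + 1, block.a + block.size + 1))
    return [pos for pos in range(1, len(full_seq) + 1) if pos not in mapped]


# Hypothetical example: residues 7-11 (AKQRQ) have no coordinates.
full = "MKTAYIAKQRQISFVKSH"
observed = "MKTAYIISFVKSH"
missing = unmapped_positions(full, observed)
print(missing)
```

The positions returned here correspond to the ones that would carry no PDBRESNUM annotation.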

If the above is not sufficient, then do you want to preserve the utility of your original approach by having Jalview automatically highlight regions that have no structure coordinates?

It would be possible to have Jalview automatically add a complement to the PDBRESNUM annotated sequence positions. Personally, though, I would also want some geometry checks to make sure that there really is a chain break in the model before indicating that un-mapped residues correspond to disordered regions.
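One way to picture the "complement" idea, outside Jalview, is to group the positions without coordinates into runs and write them out as feature rows. The sketch below does that in Python. The function names, the `NO_COORDS` feature type and the sequence id `1ABC_A` are hypothetical, and the tab-separated column order (description, sequence id, sequence index, start, end, feature type) is an assumption that should be checked against the Jalview features file documentation. It performs no geometry checks, so it cannot tell a genuine chain break from a numbering quirk.

```python
def missing_runs(seq_length, observed_positions):
    """Group 1-based positions without coordinates into (start, end) runs."""
    observed = set(observed_positions)
    runs = []
    start = None
    for pos in range(1, seq_length + 1):
        if pos not in observed:
            if start is None:
                start = pos  # a new run of missing residues begins here
        elif start is not None:
            runs.append((start, pos - 1))
            start = None
    if start is not None:
        runs.append((start, seq_length))  # run extends to the C-terminus
    return runs


def feature_lines(seq_id, runs, feature_type="NO_COORDS"):
    """Format runs as tab-separated rows (assumed Jalview features layout)."""
    lines = [f"{feature_type}\tc0c0c0"]  # colour definition for the type
    for start, end in runs:
        lines.append(
            f"No coordinates\t{seq_id}\t-1\t{start}\t{end}\t{feature_type}"
        )
    return lines


# Hypothetical 18-residue chain observed at positions 1-6 and 12-18.
observed = list(range(1, 7)) + list(range(12, 19))
runs = missing_runs(18, observed)
print("\n".join(feature_lines("1ABC_A", runs)))
```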

I also have another question about the coordinate space of the structural annotation that you are generating. If you are already working in the 'expressed sequence' coordinate system rather than the PDB numbering, then you shouldn't need to change your existing feature files if you simply use the real sequence rather than the ones with chain breaks replaced by U's. Is that the case, or do you also need to 'lift over' your structure annotation onto the 'expressed sequence' coordinate space?
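For readers unfamiliar with the term, a 'lift over' here just means translating feature coordinates from PDB residue numbers into positions in the full sequence. A minimal Python sketch, assuming a per-residue mapping is already available as a dictionary; the numbers, labels and the name `lift_over` are invented for illustration, and insertion codes are ignored:

```python
def lift_over(features, resnum_to_seqpos):
    """Translate (start, end, label) features from PDB residue numbers to
    1-based sequence positions.

    Features whose start or end has no mapped position are returned
    separately rather than guessed at.
    """
    lifted, skipped = [], []
    for start, end, label in features:
        if start in resnum_to_seqpos and end in resnum_to_seqpos:
            lifted.append(
                (resnum_to_seqpos[start], resnum_to_seqpos[end], label)
            )
        else:
            skipped.append((start, end, label))
    return lifted, skipped


# Hypothetical chain: PDB numbering starts at 24 and residues 30-34 are not
# observed, so residue numbers 24-29 map to positions 1-6 and 35-41 to 12-18.
mapping = {24 + i: 1 + i for i in range(6)}
mapping.update({35 + i: 12 + i for i in range(7)})

features = [
    (26, 28, "ligand contact"),
    (29, 35, "interface"),
    (31, 33, "loop"),  # lies entirely in the unobserved stretch
]
lifted, skipped = lift_over(features, mapping)
print(lifted)
print(skipped)
```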

Jim.

On 24/11/2010 10:15, William Ross Pitt wrote:

--
-------------------------------------------------------------------
J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

Hi Jim,

I am still developing my code and so I am open to any suggestions. I am
using residue numbers extracted from ATOM records using Biopython. I am
linking to other databases we have in-house (Credo for protein-small
molecule interactions and Piccolo for protein-protein interactions)
which are also built upon residue numbers from ATOM records so it's
important for me to be consistent with them. My impression is that
matching sequences from other sources (even the SEQRES records) to PDB
structures can be problematic. Any advice on this would be very useful.
For instance, where would you get the 'expressed sequence' from?
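To make the ATOM-record numbering concrete, here is a small dependency-free Python sketch that reads residue numbers straight from the fixed columns of ATOM lines and reports jumps in the numbering. It deliberately uses plain string slicing rather than Biopython so that it runs anywhere; the helper `_ca_line` only fabricates sample lines, and all names and values are invented. A jump in numbering is only a hint: some structures are numbered after a reference protein, so a gap does not always mean residues are missing.

```python
def _ca_line(serial, resname, chain, resseq):
    """Build a sample ATOM line for a CA atom (test data only)."""
    return (
        f"ATOM  {serial:>5}  CA  {resname:>3} {chain}{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}"
    )


def atom_residue_numbers(pdb_lines, chain_id):
    """Return ordered (resseq, icode, resname) tuples for one chain."""
    residues = []
    for line in pdb_lines:
        if not line.startswith("ATOM  ") or len(line) < 27:
            continue
        if line[21] != chain_id:  # column 22 holds the chain identifier
            continue
        # columns 23-26: residue number, 27: insertion code, 18-20: name
        key = (int(line[22:26]), line[26].strip(), line[17:20].strip())
        if not residues or residues[-1] != key:
            residues.append(key)  # collapse consecutive atoms of one residue
    return residues


def numbering_gaps(residues):
    """Return (first_missing, last_missing) for each jump in numbering."""
    gaps = []
    for (prev, _, _), (curr, _, _) in zip(residues, residues[1:]):
        if curr - prev > 1:
            gaps.append((prev + 1, curr - 1))
    return gaps


sample = [
    _ca_line(1, "MET", "A", 24),
    _ca_line(2, "MET", "A", 24),  # second atom of the same residue
    _ca_line(3, "LYS", "A", 25),
    _ca_line(4, "THR", "A", 26),
    _ca_line(5, "ILE", "A", 30),
    _ca_line(6, "SER", "A", 31),
    _ca_line(7, "GLY", "B", 1),
]
residues = atom_residue_numbers(sample, "A")
print(numbering_gaps(residues))
```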

thanks,

Will


Will,

I'm sure Jim will answer this as well, but your best source of cross-referencing between PDB and other resources at the residue level is SIFTS (SIFTS < PDBe < EMBL-EBI). This is maintained at EBI and is used by both UniProt and the wwPDB to keep structures synchronised with protein sequences.
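A residue-level cross-reference of this kind can be turned into a simple lookup table. The Python sketch below parses a small hand-written XML fragment that is only loosely modelled on SIFTS per-residue files. The element and attribute names, the `null` marker for unobserved residues and the accession `P00000` are assumptions made for illustration; real files use an XML namespace and carry more detail, so the layout should be checked against the SIFTS documentation before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment (not a real SIFTS file).
SAMPLE = """
<entity entityId="A">
  <residue dbResNum="1" dbResName="MET">
    <crossRefDb dbSource="PDB" dbResNum="24" dbChainId="A"/>
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="30"/>
  </residue>
  <residue dbResNum="2" dbResName="LYS">
    <crossRefDb dbSource="PDB" dbResNum="null" dbChainId="A"/>
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="31"/>
  </residue>
  <residue dbResNum="3" dbResName="THR">
    <crossRefDb dbSource="PDB" dbResNum="26" dbChainId="A"/>
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="32"/>
  </residue>
</entity>
"""


def pdb_to_uniprot(xml_text):
    """Map (chain, PDB residue number) to (accession, UniProt position)."""
    mapping = {}
    root = ET.fromstring(xml_text.strip())
    for residue in root.iter("residue"):
        refs = {r.get("dbSource"): r for r in residue.findall("crossRefDb")}
        pdb, uniprot = refs.get("PDB"), refs.get("UniProt")
        if pdb is None or uniprot is None:
            continue
        if pdb.get("dbResNum") in (None, "null"):
            continue  # assumed marker for a residue without coordinates
        # The PDB residue number stays a string: it may carry an insertion code.
        key = (pdb.get("dbChainId"), pdb.get("dbResNum"))
        mapping[key] = (uniprot.get("dbAccessionId"), int(uniprot.get("dbResNum")))
    return mapping


mapping = pdb_to_uniprot(SAMPLE)
print(mapping)
```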

All the best,

Geoff.

On 24/11/2010 13:24, William Ross Pitt wrote:

--
Geoff Barton, Professor of Bioinformatics, College of Life Sciences
University of Dundee, Scotland, UK. g.j.barton@dundee.ac.uk
Tel:+44 1382 385860/388731 (Fax:385764) www.compbio.dundee.ac.uk

The University of Dundee is registered Scottish charity: No.SC015096

Thanks Geoff and Jim. I will give SIFTS a go.

Kind regards,

Will
