PDB FORMAT Version 2.3
Main Index
Basic Notions of the Format Description
Record Format
Types of Records
PDB Format Change Policy
Order of Records
Sections of an Entry
Field Formats

Introduction

The Protein Data Bank (PDB) is an archive of experimentally determined three-dimensional structures of biological macromolecules that serves a global community of researchers, educators, and students. The data contained in the archive include atomic coordinates, bibliographic citations, primary and secondary structure information, crystallographic structure factors, and NMR experimental data.

This guide describes the "PDB format" used by the members of the worldwide Protein Data Bank (Berman, H.M., Henrick, K. and Nakamura, H. (2003) Announcing the worldwide Protein Data Bank. Nat Struct Biol, 10, 980). Questions should be sent to info@wwpdb.org.

This version of the PDB file format has been in use since July 9, 1998. Please note that as of July 1, 2002, models are available on the FTP site from a directory separate from the main archive. As of October 15, 2006, theoretical models are no longer accepted for deposition.


Basic Notions of the Format Description

Character Set

Only non-control ASCII characters, as well as the space and end-of-line indicator, appear in a PDB coordinate entry file. Namely:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
` - = [ ] \ ; ' , . / ~ ! @ # $ % ^ & * ( ) _ + { } | : " < > ?

We discourage use of punctuation characters in the place of alphanumeric characters.

The space and the end-of-line indicator are also permitted. The end-of-line indicator is system-specific: Unix uses a line feed character; other systems may use a carriage return followed by a line feed.

Special Characters

Greek letters are spelled out, e.g., alpha, beta, gamma.
Bullets are represented as (DOT).
Right arrow is represented as -->.
Left arrow is represented as <--.
If "=" is surrounded by at least one space on each side, then it is assumed to be an equal sign, e.g., 2 + 4 = 6.

Commas, colons, and semi-colons are used as list delimiters in records that have one of the following data types:

  • List
  • SList
  • Specification List
  • Specification

If a comma, colon, or semi-colon is used in any context other than as a delimiting character, then the character must be escaped, i.e., immediately preceded by a backslash, "\". Examples of this use are found in line 4 of each of the following:

    COMPND        MOL_ID: 1;
    COMPND      2 MOLECULE: GLUTATHIONE SYNTHETASE;
    COMPND      3 CHAIN: A;
    COMPND      4 SYNONYM: GAMMA-L-GLUTAMYL-L-CYSTEINE\:GLYCINE LIGASE
    COMPND      5 (ADP-FORMING);
    COMPND      6 EC: 6.3.2.3;
    COMPND      7 ENGINEERED: YES
    
    COMPND        MOL_ID: 1;
    COMPND      2 MOLECULE: S-ADENOSYLMETHIONINE SYNTHETASE;
    COMPND      3 CHAIN: A, B;
    COMPND      4 SYNONYM: MAT, ATP\:L-METHIONINE S-ADENOSYLTRANSFERASE;
    COMPND      5 EC: 2.5.1.6;
    COMPND      6 ENGINEERED: YES;
    COMPND      7 BIOLOGICAL_UNIT: TETRAMER;
    COMPND      8 OTHER_DETAILS: TETRAGONAL MODIFICATION
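
The escaping rule above can be sketched in Python. This is a minimal illustration, not part of the format description: the helper names `split_unescaped` and `unescape` are hypothetical, and the sketch assumes that a backslash is only ever used to escape a comma, colon, or semi-colon (an escaped backslash is not handled).

```python
import re


def split_unescaped(text, delimiter, maxsplit=0):
    """Split text at each delimiter that is not immediately preceded by a backslash."""
    # Negative lookbehind: the delimiter only counts when no backslash precedes it.
    pattern = r"(?<!\\)" + re.escape(delimiter)
    return [part.strip() for part in re.split(pattern, text, maxsplit=maxsplit)]


def unescape(text):
    """Remove the backslash in front of an escaped comma, colon, or semi-colon."""
    return re.sub(r"\\([,:;])", r"\1", text)


# Field text taken from line 4 of the second COMPND example.
field = r"SYNONYM: MAT, ATP\:L-METHIONINE S-ADENOSYLTRANSFERASE;"

# A Specification List is separated by semi-colons.
specification = split_unescaped(field, ";")[0]
# A Specification is a token and a value separated by a colon.
token, value = split_unescaped(specification, ":", maxsplit=1)
# The value is itself a List, separated by commas.
synonyms = [unescape(item) for item in split_unescaped(value, ",")]

print(token)
print(synonyms)
```

Splitting is done on the still-escaped text and the backslashes are removed only at the end, so an escaped colon inside a synonym is never mistaken for the token separator.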
    


    Record Format

    Every PDB file may be broken into a number of lines terminated by an end-of-line indicator. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator.

    Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, left-justified and blank-filled. This must be an exact match to one of the stated record names.
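
As a rough sketch of how a reader might apply these two rules, the Python below extracts the record name from the first six columns and reports lines that are not 80 columns wide. The function names and the `KNOWN_RECORD_NAMES` set are assumptions made for illustration; the set is a small subset, not the full list of record names.

```python
# Illustrative subset only; a real checker would list every record name
# defined in the format description.
KNOWN_RECORD_NAMES = {"HEADER", "TITLE", "ATOM", "HETATM", "TER", "END"}


def record_name(line):
    """Return the record name stored in columns 1-6 of a line."""
    # Remove only the end-of-line indicator (LF or CR LF), never other blanks.
    body = line.rstrip("\r\n")
    # The name is left-justified and blank-filled, so drop the trailing blanks.
    return body[:6].rstrip(" ")


def check_line(line, known_names=KNOWN_RECORD_NAMES):
    """Return a list of problems found in one line (empty if none)."""
    body = line.rstrip("\r\n")
    problems = []
    if len(body) != 80:
        problems.append(f"expected 80 columns, found {len(body)}")
    name = record_name(line)
    if name not in known_names:
        problems.append(f"unknown record name {name!r}")
    return problems


print(record_name("ATOM      1  N   MET A   1".ljust(80) + "\n"))
print(check_line("END\n"))
```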

    The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines.

    Each record type is further divided into fields.

    Each record type is detailed in this document. The description of each record type includes the following sections:

  • Overview
  • Record Format
  • Details
  • Verification/Validation/Value Authority Control
  • Relationship to Other Record Types
  • Example
  • Known Problems

    For records that are fully described in fixed column format, columns not assigned to fields must be left blank.


    Types of Records

    It is possible to group records into categories based upon how often the record type appears in an entry.

    Single: There are records that may only appear one time (without continuations) in a file. Listed alphabetically, these are:

    RECORD TYPE      DESCRIPTION
    --------------------------------------------------------------------
    CRYST1           Unit cell parameters, space group, and Z.
    END              Last record in the file.
    HEADER           First line of the entry, contains PDB ID code,
                     classification, and date of deposition.
    MASTER           Control record for bookkeeping.
    ORIGXn           Transformation from orthogonal coordinates to the
                     submitted coordinates (n = 1, 2, or 3).
    SCALEn           Transformation from orthogonal coordinates to fractional
                     crystallographic coordinates (n = 1, 2, or 3).
    

    It is an error for a duplicate of any of these records to appear in an entry.

    There are records that conceptually exist only once in an entry, but the information content may exceed the number of columns available. These records are therefore continued on subsequent lines. Listed alphabetically, these are:

    RECORD TYPE      DESCRIPTION
    ---------------------------------------------------------------------
    AUTHOR           List of contributors.
    CAVEAT           Severe error indicator.
    COMPND           Description of macromolecular contents of the entry.
    EXPDTA           Experimental technique used for the structure
                     determination.
    KEYWDS           List of keywords describing the macromolecule.
    OBSLTE           Statement that the entry has been removed from
                     distribution and list of the ID code(s) which
                     replaced it.
    SOURCE           Biological source of macromolecules in the entry.
    SPRSDE           List of entries withdrawn from release and replaced by
                     current entry.
    TITLE            Description of the experiment represented in the entry.
    

    The second and subsequent lines contain a continuation field, which is a right-justified integer. This number increments by one for each additional line of the record, and is followed by a blank character.

    Multiple: Most record types appear multiple times, often in groups where the information is not logically concatenated but is presented in the form of a list. Many of these record types have a custom serialization that may be used not only to order the records, but also to connect to other record types. Listed alphabetically, these are:

    RECORD TYPE      DESCRIPTION
    -----------------------------------------------------------------------
    ANISOU           Anisotropic temperature factors.
    ATOM             Atomic coordinate records for standard groups.
    CISPEP           Identification of peptide residues in cis conformation.
    CONECT           Connectivity records.
    DBREF            Reference to the entry in the sequence database(s).
    HELIX            Identification of helical substructures.
    HET              Identification of non-standard groups or residues
                     (heterogens)
    HETSYN           Synonymous compound names for heterogens.
    LINK             Identification of inter-residue bonds.
    MODRES           Identification of modifications to standard residues.
    MTRIXn           Transformations expressing non-crystallographic symmetry
                     (n = 1, 2, or 3). There may be multiple sets of these
                     records.
    REVDAT           Revision date and related information.
    SEQADV           Identification of conflicts between PDB and the named
                     sequence database.
    SEQRES           Primary sequence of backbone residues.
    SHEET            Identification of sheet substructures.
    SIGATM           Standard deviations of atomic parameters.
    SIGUIJ           Standard deviations of anisotropic temperature factors.
    SITE             Identification of groups comprising important sites.
    SSBOND           Identification of disulfide bonds.
    TVECT            Translation vector for infinite covalently connected
                     structures.
    

    There are records that conceptually exist multiple times in an entry, but the information content may exceed the number of columns available. These records are therefore continued on subsequent lines. Listed alphabetically, these are:

    RECORD TYPE     DESCRIPTION
    ---------------------------------------------------------------------
    FORMUL          Chemical formula of non-standard groups.
    HETATM          Atomic coordinate records for heterogens.
    HETNAM          Compound name of the heterogens.
    

    The second and subsequent lines contain a continuation field which is a right-justified integer. This number increments by one for each additional line of the record, and is followed by a blank character.

    Grouping: There are three record types used to group other records. Listed alphabetically, these are:

    RECORD TYPE    DESCRIPTION
    ---------------------------------------------------------------------
    ENDMDL         End-of-model record for multiple structures in a single
                   coordinate entry.
    MODEL          Specification of model number for multiple structures in a
                   single coordinate entry.
    TER            Chain terminator.
    

    The MODEL/ENDMDL records surround groups of ATOM, HETATM, SIGATM, ANISOU, SIGUIJ, and TER records. TER records indicate the end of a chain.
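
The grouping described above can be sketched as follows. This is an illustrative reading rather than a reference implementation: `group_models` is a hypothetical name, and treating coordinate records that appear outside any MODEL/ENDMDL pair as a single implicit model is an assumption of the sketch.

```python
# Record types that the text says are surrounded by MODEL/ENDMDL.
COORDINATE_RECORDS = {"ATOM", "HETATM", "SIGATM", "ANISOU", "SIGUIJ", "TER"}


def group_models(lines):
    """Group coordinate lines into one list per model."""
    models = []
    current = None  # list for the model being read, or None outside a model
    for line in lines:
        name = line[:6].rstrip()
        if name == "MODEL":
            current = []
            models.append(current)
        elif name == "ENDMDL":
            current = None
        elif name in COORDINATE_RECORDS:
            if current is None:
                # Assumption: no enclosing MODEL means a single implicit model.
                current = []
                models.append(current)
            current.append(line)
    return models


entry = [
    "MODEL        1",
    "ATOM      1  N   MET A   1",
    "TER       2      MET A   1",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  N   MET A   1",
    "TER       2      MET A   1",
    "ENDMDL",
    "END",
]
print([len(model) for model in group_models(entry)])
```

Within each returned model, a TER line marks the end of a chain, so a further pass could split the model into chains at those lines.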

    Other: The remaining record types have a detailed inner structure. Listed alphabetically, these are:

    RECORD TYPE    DESCRIPTION
    ---------------------------------------------------------------------
    JRNL           Literature citation that defines the coordinate set.
    REMARK         General remarks, some are structured and some are free
                   form.


    PDB Format Change Policy

    The PDB will use the following protocol in making changes to the way PDB coordinate entries are represented and archived. The purpose of the new policy is to allow ample time for everyone to understand these changes and to assess their impact on existing programs. These modifications are necessary to address the changing needs of our users as well as the changing nature of the data that is archived.

  • Comments and suggestions will be solicited from the community on specific problems and data representation issues as they arise.
  • Proposed format changes will be disseminated through pdb-l@rcsb.org and https://www.rcsb.org
  • A sixty-day discussion period will follow the announcement of proposed changes. Comments and suggestions must be received within this time period. Major changes that are not upwardly compatible will be allotted up to twice the standard amount of discussion time.
  • The PDB will then work in consultation with the wwPDB Advisory Committee and the equivalent partner Scientific Advisory Committees to evaluate and reconcile all suggestions. The final decision will be officially announced via pdb-l@rcsb.org and https://www.rcsb.org
  • Implementation will follow official announcement of the format change. Major changes will not appear in PDB files earlier than sixty days after the announcement, allowing sufficient time to modify files and programs.

    Order of Records

    All records in a PDB coordinate entry must appear in a defined order. Mandatory record types are present in all entries. When mandatory data are not provided, the record name must appear in the entry with a NULL indicator. Optional items become mandatory when certain conditions exist. Record order and existence are described in the following table:

    RECORD TYPE            EXISTENCE   CONDITIONS IF OPTIONAL
    --------------------------------------------------------------
    HEADER                 Mandatory
    OBSLTE                 Optional    Mandatory in withdrawn entries.
    TITLE                  Mandatory
    CAVEAT                 Optional    Typically included if there are chirality errors
    COMPND                 Mandatory
    SOURCE                 Mandatory
    KEYWDS                 Mandatory
    EXPDTA                 Mandatory
    AUTHOR                 Mandatory
    REVDAT                 Mandatory
    SPRSDE                 Optional    Mandatory if a replacement entry.
    JRNL                   Optional    Mandatory if a publication describes the experiment.
    REMARK 1               Optional
    REMARK 2               Mandatory
    REMARK 3               Mandatory
    REMARK N               Optional    Mandatory under certain conditions, as noted in the
                                       remark descriptions.
    DBREF                  Optional    Mandatory for each peptide chain with a length greater
                                       than ten (10) residues, and for nucleic acid entries
                                       that exist in the NDB.
    SEQADV                 Optional    Mandatory if sequence conflict exists.
    SEQRES                 Optional    Mandatory if ATOM records exist.
    MODRES                 Optional    Mandatory if modified group exists
                                       within the coordinates.
    HET                    Optional    Mandatory if non-standard group other
                                       than water appears in the entry.
    HETNAM                 Optional    Mandatory if non-standard group other
                                       than water appears in the entry.
    HETSYN                 Optional
    FORMUL                 Optional    Mandatory if non-standard group or water appears.
    HELIX                  Optional
    SHEET                  Optional
    TURN                   Optional    Deprecated.
    SSBOND                 Optional    Mandatory if disulfide bond is present.
    LINK                   Optional
    HYDBND                 Optional    Deprecated.
    SLTBRG                 Optional    Deprecated.
    CISPEP                 Optional
    SITE                   Optional
    CRYST1                 Mandatory
    ORIGX1 ORIGX2 ORIGX3   Mandatory
    SCALE1 SCALE2 SCALE3   Mandatory
    MTRIX1 MTRIX2 MTRIX3   Optional    Mandatory if the complete asymmetric
                                       unit must be generated from the given
                                       coordinates using non-crystallographic symmetry.
    TVECT                  Optional
    MODEL                  Optional    Mandatory if more than one model
                                       is present in the entry.
    ATOM                   Optional    Mandatory if standard residues exist.
    SIGATM                 Optional
    ANISOU                 Optional
    SIGUIJ                 Optional
    TER                    Optional    Mandatory if ATOM records exist.
    HETATM                 Optional    Mandatory if non-standard group
                                       appears.
    ENDMDL                 Optional    Mandatory if MODEL appears.
    CONECT                 Optional    Mandatory if non-standard group
                                       appears.
    MASTER                 Mandatory
    END                    Mandatory
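
Part of this table can be turned into a simple presence check. The Python below is a hedged sketch: the names `record_key`, `missing_mandatory`, and `missing_conditional` are hypothetical, only three of the conditional rules are encoded, and the sketch assumes the remark number is the first blank-delimited item after the REMARK record name. It does not verify record order.

```python
# Record types marked "Mandatory" in the table, in table order.
MANDATORY = [
    "HEADER", "TITLE", "COMPND", "SOURCE", "KEYWDS", "EXPDTA", "AUTHOR",
    "REVDAT", "REMARK 2", "REMARK 3", "CRYST1",
    "ORIGX1", "ORIGX2", "ORIGX3", "SCALE1", "SCALE2", "SCALE3",
    "MASTER", "END",
]

# A few conditional rules from the table: (trigger, then-required record).
CONDITIONAL = [("ATOM", "SEQRES"), ("ATOM", "TER"), ("MODEL", "ENDMDL")]


def record_key(line):
    """Return the record name, with the remark number appended for REMARK."""
    name = line[:6].rstrip()
    if name == "REMARK":
        # Assumption: the remark number is the first item after the name.
        fields = line[6:].split()
        if fields and fields[0].isdigit():
            return f"REMARK {int(fields[0])}"
    return name


def missing_mandatory(lines):
    """List mandatory record types that do not occur in the entry."""
    present = {record_key(line) for line in lines}
    return [name for name in MANDATORY if name not in present]


def missing_conditional(lines):
    """List records required by a trigger record that is present."""
    present = {record_key(line) for line in lines}
    return [needed for trigger, needed in CONDITIONAL
            if trigger in present and needed not in present]


entry = ["HEADER", "TITLE", "COMPND", "SOURCE", "KEYWDS", "EXPDTA", "AUTHOR",
         "REVDAT", "REMARK   2", "REMARK   3", "ORIGX1", "ORIGX2", "ORIGX3",
         "SCALE1", "SCALE2", "SCALE3", "ATOM      1", "TER", "MASTER", "END"]
print(missing_mandatory(entry))
print(missing_conditional(entry))
```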
    


    Sections of an Entry

    The following table lists the various sections of a PDB coordinate entry and the records comprising them:

    SECTION              DESCRIPTION                       RECORD TYPE
    ----------------------------------------------------------------------------
    Title                Summary descriptive remarks       HEADER, OBSLTE, TITLE,
                                                           CAVEAT, COMPND, SOURCE,
                                                           KEYWDS, EXPDTA, AUTHOR,
                                                           REVDAT, SPRSDE, JRNL
    Remark               Bibliography, refinement          REMARKs 1, 2, 3 & annotations
    Primary structure    Peptide and/or nucleotide         DBREF, SEQADV, SEQRES, MODRES
                            sequence and the
                            relationship between the PDB
                            sequence and that found in
                            the sequence database(s)
    Heterogen            Description of non-standard       HET, HETNAM, HETSYN, FORMUL
                            groups
    Secondary structure  Description of secondary          HELIX, SHEET, TURN
                            structure
    Connectivity         Chemical connectivity             SSBOND, LINK, HYDBND,
      annotation                                           SLTBRG, CISPEP
    Miscellaneous        Features within the               SITE
      features              macromolecule
    Crystallographic     Description of the                CRYST1
                            crystallographic cell
    Coordinate           Coordinate transformation         ORIGXn, SCALEn, MTRIXn, TVECT
      transformation        operators
    Coordinate           Atomic coordinate data            MODEL, ATOM, SIGATM, ANISOU,
                                                           SIGUIJ, TER, HETATM, ENDMDL
    Connectivity         Chemical connectivity             CONECT
    Bookkeeping          Summary information,              MASTER, END
                            end-of-file marker
    


    Field Formats

    Each record type is presented in a table which contains the division of the records into fields by column number, defined data type, field name or a quoted string which must appear in the field, and field definition. Any column not specified must be left blank.

    Each field contains an identified data type that can be validated by a program. These are:

    DATA TYPE                  DESCRIPTION
    ----------------------------------------------------------------------------------
    AChar                    An alphabetic character (A-Z, a-z).
    Atom                     Atom name
    Character                Any non-control character in the ASCII character set or a
                             space.
    Continuation             A two-character field that is either blank (for the first
                             record of a set) or contains a two digit number
                             right-justified and blank-filled which counts continuation
                             records starting with 2. The continuation number must be
                             followed by a blank.
    Date                     A 9-character string in the form DD-MMM-YY where DD is the
                             day of the month, zero-filled on the left (e.g., 04); MMM is
                             the common English 3-letter abbreviation of the month; and
                             YY is a year in the 20th century. This must represent a
                             valid date.
    IDcode                   A PDB identification code which consists of 4 characters,
                             the first of which is a digit in the range 0 - 9; the
                             remaining 3 are alpha-numeric, and letters are upper case
                             only. Entries with a 0 as the first character do not
                             contain coordinate data.
    Integer                  Right-justified blank-filled integer value.
    Token                    A sequence of non-space characters followed by a colon and a
                             space.
    List                     A String that is composed of text separated with commas.
    LString                  A literal string of characters. All spacing is significant
                             and must be preserved.
    LString(n)               An LString with exactly n characters.
    Real(n,m)                Real (floating point) number in the FORTRAN format Fn.m.
    Record name              The name of the record: 6 characters, left-justified and
                             blank-filled.
    Residue name             One of the standard amino acids or nucleic acids, as listed
                             below, or the non-standard group designation as defined in
                             the HET dictionary. Field is right-justified.
    SList                    A String that is composed of text separated with semi-colons.
    Specification            A String composed of a token and its associated value
                             separated by a colon.
    Specification List       A sequence of Specifications, separated by semi-colons.
    String                   A sequence of characters. These characters may have
                             arbitrary spacing, but should be interpreted as directed
                             below.
    String(n)                A String with exactly n characters.
    SymOP                    An integer field of 4 to 6 digits, right-justified, of
                             the form nnnMMM where nnn is the symmetry operator number and
                             MMM is the translation vector.
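
Several of these data types lend themselves to small validators. The Python below is a sketch under stated assumptions, not an official checker: all function names are hypothetical, the Date parser follows the table in placing YY in the 20th century, and the SymOP parser only separates the nnn and MMM parts without interpreting the translation digits.

```python
import re
from datetime import date

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
# One digit followed by three upper-case alphanumeric characters.
IDCODE = re.compile(r"[0-9][0-9A-Z]{3}")


def is_idcode(text):
    """Check the IDcode rule: a digit, then three digits or upper-case letters."""
    return IDCODE.fullmatch(text) is not None


def parse_date(text):
    """Parse a 9-character DD-MMM-YY field; raise ValueError if it is invalid."""
    if len(text) != 9 or text[2] != "-" or text[6] != "-":
        raise ValueError(f"not in DD-MMM-YY form: {text!r}")
    day, month, year = text[0:2], text[3:6], text[7:9]
    if not (day.isdigit() and year.isdigit()):
        raise ValueError(f"day and year must be digits: {text!r}")
    # MONTHS.index and date() both raise ValueError on bad input.
    return date(1900 + int(year), MONTHS.index(month.upper()) + 1, int(day))


def parse_symop(text):
    """Split a right-justified nnnMMM field into (operator number, MMM digits)."""
    digits = text.strip()
    if not (digits.isdigit() and 4 <= len(digits) <= 6):
        raise ValueError(f"SymOP must be 4 to 6 digits: {text!r}")
    return int(digits[:-3]), digits[-3:]


def format_real(value, n, m):
    """Format a number like FORTRAN Fn.m: width n, m decimals, right-justified."""
    text = f"{value:{n}.{m}f}"
    if len(text) > n:
        raise ValueError(f"{value} does not fit in F{n}.{m}")
    return text


print(is_idcode("1ABC"), parse_date("09-JUL-98"), parse_symop("  1555"))
print(repr(format_real(1.5, 8, 3)))
```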
    

    To interpret a String, concatenate the contents of all continued fields together, collapse all sequences of multiple blanks to a single blank, and remove any leading and trailing blanks. This permits very long strings to be properly reconstructed.
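
The three steps in this rule map directly onto a few lines of Python. The sketch below assumes the caller has already cut the String field out of each line at its fixed column range, keeping the blanks, and passes the pieces in order; `interpret_string` is a hypothetical name.

```python
def interpret_string(fields):
    """Rebuild a String from the field contents of a record and its continuations."""
    # Step 1: concatenate the contents of all continued fields.
    joined = "".join(fields)
    # Steps 2 and 3: split() with no argument breaks on runs of whitespace and
    # ignores leading and trailing whitespace, so re-joining with one blank
    # collapses repeated blanks and trims both ends.
    return " ".join(joined.split())


pieces = [
    " SYNONYM: GAMMA-L-GLUTAMYL-L-CYSTEINE\\:GLYCINE LIGASE   ",
    " (ADP-FORMING);   ",
]
print(interpret_string(pieces))
```

Because the pieces are joined without an added separator, the blanks inside each fixed-width field are what keep words on adjacent lines apart; stripping the fields before calling the function could merge words.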


    © 2007 wwPDB