Preparing PDBx/mmCIF files for Depositing Structures Large and Small

To better support the increasing complexity and size of data submitted to the PDB archive, the wwPDB Deposition & Annotation system is based on the PDBx/mmCIF data dictionary and file format. The system accepts, processes and distributes PDBx/mmCIF data files.

This document describes available tools for generating PDBx/mmCIF format files and how to prepare "large structures" for deposition. A large structure is defined as having more than 99,999 atoms and/or more than 62 polymer chains, which are the restrictions of the traditional PDB format.
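The two limits above can be turned into a quick pre-deposition check. The sketch below is illustrative only and is not part of any wwPDB tool: the function names are hypothetical, every ATOM or HETATM line is counted as one atom, and polymer chains are assumed to be the distinct chain IDs found in columns 21-22 of ATOM records.

```python
PDB_MAX_ATOMS = 99_999   # largest count the traditional PDB format can hold
PDB_MAX_CHAINS = 62      # A-Z, a-z and 0-9 as single-character chain IDs


def count_atoms_and_chains(lines):
    """Count coordinate records and distinct chain IDs on ATOM records."""
    n_atoms = 0
    chains = set()
    for line in lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            n_atoms += 1
        if record == "ATOM":
            # Assumption: chain ID sits in columns 21-22 (0-based slice 20:22)
            chains.add(line[20:22].strip())
    return n_atoms, len(chains)


def is_large_structure(n_atoms, n_polymer_chains):
    """True if either count exceeds a traditional PDB format limit."""
    return n_atoms > PDB_MAX_ATOMS or n_polymer_chains > PDB_MAX_CHAINS
```

A structure for which is_large_structure returns True cannot be represented in a single traditional PDB format file.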

Depositors are encouraged to use the PDBx/mmCIF format for coordinate files whenever possible.

Generating PDBx/mmCIF format files automatically

PDBx/mmCIF is the official working format of the wwPDB for coordinate files. It is flexible, extensible, and can accommodate structures of any size. PDBx/mmCIF files ready for deposition can be generated using the program pdb_extract or selected structure refinement programs. Additional information about the PDBx/mmCIF format can be found in this FAQ.

PDBx/mmCIF format is especially useful:

  • When a PDBx/mmCIF file is the output of a final refinement. The developers of REFMAC, Phenix, and Buster are involved in the development of the PDBx/mmCIF format, and these programs will output PDBx/mmCIF format files that can be deposited without additional modification.
  • When the structure to be deposited is large. In this context, a large structure is defined as having more than 99,999 atoms and/or more than 62 polymer chains. These are the restrictions of the traditional PDB format. PDBx/mmCIF has no atom number restriction and virtually no chain number restriction. Please consult the "Using pdb_extract" sections of this guide for more information on depositing large structures.
  • When it is useful to avoid manual entry of additional information via the deposition interface. In addition to converting PDB to PDBx/mmCIF format, pdb_extract can be used to add sequence and other information to a coordinate file prior to deposition.

a) Refinement packages

Recent versions of refinement packages Phenix (version 1.8.2+) and REFMAC (version 5.8+) generate PDBx/mmCIF files ready for deposition:

  • Phenix: Instructions are available at the Phenix website
  • REFMAC: To output a PDBx/mmCIF file from REFMAC, add a card that reads "pdbout format mmcif". REFMAC can also read a PDBx/mmCIF coordinate file when it is specified as the XYZIN argument.

b) pdb_extract

The pdb_extract program is available as an online interface and as a standalone command-line program. This powerful tool extracts and harvests data in PDBx/mmCIF format from structure determination programs.

Preparing PDB format data files for use with pdb_extract and the wwPDB Deposition Tool (strongly recommended for large structures)

PDBx/mmCIF is the best format for depositing your structure, particularly for large structures. While the traditional PDB file format may be relatively easy to manipulate manually, it cannot accommodate large structures that comprise more than 99,999 atoms and/or more than 62 chains.

The program pdb_extract can translate PDB format files with one- or two-letter polymer chain IDs into a single PDBx/mmCIF file if they meet the following requirements.

a) Polymer chain ID assignment rules

Each polymer chain in a coordinate file represents a biopolymer in the experimental sample. A protein chain with a stretch of unmodeled amino acids in the middle of its sequence is still a single chain, and both modeled portions should have the same chain ID (chain A, for example). A protein molecule that has been physically cut in half by proteolysis, however, should be represented as two chains (chain A and chain B, for example).

Each polymer chain must have a unique chain ID. Permissible characters are uppercase letters A-Z, lowercase letters a-z, and numbers 0-9. PDB format allows only single-character chain IDs, while PDBx/mmCIF can accommodate chain IDs of up to four characters. pdb_extract, which converts PDB to PDBx/mmCIF format, can read pseudo-PDB files containing two-character chain IDs.
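The chain ID rules above are easy to check before running pdb_extract. The following is a minimal sketch, not a wwPDB utility: the function name and the max_len parameter are hypothetical, and the input is assumed to be one chain ID per polymer chain.

```python
import re
from collections import Counter

# Permissible characters: A-Z, a-z, 0-9; PDBx/mmCIF allows up to four of them
CHAIN_ID_RE = re.compile(r"[A-Za-z0-9]{1,4}")


def check_chain_ids(chain_ids, max_len=4):
    """Return a list of problems found in a list of polymer chain IDs."""
    problems = []
    for chain_id, count in Counter(chain_ids).items():
        if not CHAIN_ID_RE.fullmatch(chain_id):
            problems.append(f"{chain_id!r}: must be 1-4 characters from A-Z, a-z, 0-9")
        elif len(chain_id) > max_len:
            problems.append(f"{chain_id!r}: longer than {max_len} characters")
        if count > 1:
            problems.append(f"{chain_id!r}: used for {count} chains, must be unique")
    return problems
```

Passing max_len=2 restricts the check to the one- and two-character IDs that pdb_extract can read from pseudo-PDB files.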

Ligands, ions, and solvent molecules can be deposited with any chain ID, but will have their chain IDs automatically re-assigned during processing to match the chain ID of the nearest polymer chain.

b) Two-letter chain ID format rules

To help convert multiple files into a single large structure file, pdb_extract accepts as input PDB files that bend the standard PDB format. An ATOM record in a properly formatted PDB file has a consecutive, right-justified atom number in the second field, which is five characters wide (columns 7-11 of the line), and a single-character chain ID in the fifth field (column 22). In the example below these are 91563 and A:

ATOM  91563  OE1 GLU A 373       4.449  58.856  -2.941  1.00 85.83           O  

pdb_extract will accept input in which the atom number, while still limited to the five-character width of its field, can be arbitrary and/or non-consecutive. In addition, the chain ID can be either a single character as above or two characters, as shown below:

ATOM  00A4B  OE1 GLUAA 373       4.449  58.856  -2.941  1.00 85.83           O  

pdb_extract will assign new atom numbers and separate residue names and chain IDs from each other where they run together (as in the above example). This allows pdb_extract to accept input with an unlimited number of atoms and up to 3844 (62*62) chains.
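To make the relaxed layout concrete, the sketch below reads the fields discussed above from a pseudo-PDB line and hands out fresh consecutive atom numbers. It is an illustration only and does not reproduce pdb_extract's actual implementation. The function names are hypothetical, and the code assumes that a two-character chain ID occupies columns 21-22, as in the example line above.

```python
import string

CHAIN_CHARS = string.ascii_uppercase + string.ascii_lowercase + string.digits
MAX_TWO_CHAR_CHAINS = len(CHAIN_CHARS) ** 2  # 62 * 62 = 3844


def split_relaxed_atom_line(line):
    """Extract atom number, residue name, chain ID and residue number."""
    return {
        "serial": line[6:11].strip(),     # may be arbitrary, e.g. "00A4B"
        "res_name": line[17:20].strip(),  # columns 18-20
        # Assumption: columns 21-22 hold the chain ID; column 21 is blank
        # when the ID is a single character.
        "chain_id": line[20:22].strip(),
        "res_seq": int(line[22:26]),      # columns 23-26
    }


def renumbered(lines):
    """Yield (new atom number, fields) for each ATOM/HETATM line, in order."""
    new_serial = 0
    for line in lines:
        if line[:6].strip() in ("ATOM", "HETATM"):
            new_serial += 1
            yield new_serial, split_relaxed_atom_line(line)
```

Because the new atom numbers are plain integers rather than five-character fields, they are not limited to 99,999.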

While the PDBx/mmCIF format can accommodate chain IDs of up to four characters in length, it is advisable, due to current limitations of pdb_extract and some visualization tools, to limit chain IDs to one or two characters.

c) Other format requirements

Observe all column restrictions. The PDB format has very rigid column-width and justification rules (see the example below and the PDB format documentation for details). Neither the wwPDB deposition interface nor pdb_extract will correctly read a PDB format file that does not observe the format's column restrictions. There is one exception to this rule, detailed in the section "Two-letter chain ID format rules" above.

123456789012345678901234567890123456789012345678901234567890123456789012345678
ATOM      1  N   MET A   0      67.840  45.068  47.509  1.00 70.12           N  
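A shifted column is hard to spot by eye, so a small check can help. The sketch below is an illustration under stated assumptions, not the validation performed by the deposition interface. It uses the standard PDB column positions for the residue number, coordinates, occupancy and B-factor, requires each to be right-justified in its field, and uses a hypothetical function name. The atom number and chain ID fields are deliberately left unchecked so that the relaxed two-letter chain ID layout still passes.

```python
import re

# (label, 0-based start, end-exclusive, pattern for a right-justified value)
FIELDS = [
    ("residue number (cols 23-26)", 22, 26, r" *-?\d+"),
    ("x (cols 31-38)", 30, 38, r" *-?\d+\.\d{3}"),
    ("y (cols 39-46)", 38, 46, r" *-?\d+\.\d{3}"),
    ("z (cols 47-54)", 46, 54, r" *-?\d+\.\d{3}"),
    ("occupancy (cols 55-60)", 54, 60, r" *-?\d+\.\d{2}"),
    ("B-factor (cols 61-66)", 60, 66, r" *-?\d+\.\d{2}"),
]


def check_atom_columns(line):
    """Return a list of column problems for one ATOM/HETATM line."""
    line = line.rstrip("\n")
    if line[:6] not in ("ATOM  ", "HETATM"):
        return ["columns 1-6 must contain ATOM or HETATM"]
    problems = []
    if len(line) > 80:
        problems.append("line is longer than 80 columns")
    for label, start, end, pattern in FIELDS:
        value = line[start:end]
        if not re.fullmatch(pattern, value):
            problems.append(f"{label}: unexpected content {value!r}")
    return problems
```

An empty list means the checked fields sit where the format expects them. A line that has lost or gained a single space will usually fail on the coordinate fields.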

Insert TER cards only at the ends of polymer chains. In a PDB file, a line starting with or containing only "TER" signifies the termination of a polymer chain. A TER card is required at the end of a polymer chain. A TER card should not be placed in the middle of a polymer chain (regardless of the size of the gap in the sequence), nor should a TER card be preceded by any HETATM records (ligands, ions, solvent atoms) that share the same chain ID. The following example shows correct TER card placement between polymer chains A and B:

ATOM   1563  OE1 GLU A 373       4.449  58.856  -2.941  1.00 85.83           O  
ATOM   1564  OE2 GLU A 373       4.119  57.934  -4.918  1.00 95.09           O  
ATOM   1565  OXT GLU A 373       8.013  55.105  -6.685  1.00 95.09           O  
TER    
HETATM 3133 ZN    ZN A 401      -2.320  35.058  -4.024  1.00 70.61          ZN  
ATOM   1567  N   ASN B 190     -28.191  85.252  -7.869  1.00 60.21           N  
ATOM   1568  CA  ASN B 190     -27.762  84.082  -7.010  1.00 68.43           C  
ATOM   1569  C   ASN B 190     -28.219  82.658  -7.477  1.00 72.07           C  

Programs (including the wwPDB deposition interface) that read PDB files rely on proper placement of TER cards to parse coordinates correctly. Improper placement of TER cards can result in numerous problems, including (but not limited to) inclusion of ligands within a protein sequence or omission of entire polymer chains during parsing.
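The TER rules above can be checked with a single pass over the file. The sketch below is a simplified illustration, not the parser used by the deposition interface. The function name is hypothetical, ATOM records are assumed to be polymer atoms, and every HETATM record is treated as a ligand, ion or solvent atom, so a modified residue stored as HETATM at the very end of a chain would be reported as a false positive.

```python
def check_ter_placement(lines):
    """Return a list of TER placement problems found in PDB format lines."""
    problems = []
    open_chain = None   # chain ID of the polymer chain currently being read
    closed = set()      # chain IDs already terminated by a TER card
    last_record = None  # record name of the most recent coordinate line
    for number, line in enumerate(lines, start=1):
        record = line[:6].strip()
        if record == "ATOM":
            chain = line[20:22].strip()  # one- or two-character chain ID
            if chain != open_chain:
                if open_chain is not None:
                    problems.append(
                        f"line {number}: chain {chain} starts but chain "
                        f"{open_chain} has no TER card")
                    closed.add(open_chain)
                if chain in closed:
                    problems.append(
                        f"line {number}: chain {chain} continues after a "
                        f"TER card (TER inside a chain?)")
                open_chain = chain
            last_record = record
        elif record == "HETATM":
            last_record = record
        elif record == "TER":
            if last_record == "HETATM":
                problems.append(
                    f"line {number}: TER card is preceded by HETATM records")
            if open_chain is not None:
                closed.add(open_chain)
            open_chain = None
            last_record = None
    if open_chain is not None:
        problems.append(f"chain {open_chain} has no TER card")
    return problems
```

Run on the example above, the check reports only that chain B has no TER card, which is expected because the excerpt stops before the end of chain B.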

Do not use MODEL (or ENDMDL) records. MODEL records and their accompanying ENDMDL records are designed for the representation of NMR ensembles (superimposed collections of chemically identical but conformationally diverse models) and should not be used in the representation of electron microscopy models (unless an NMR-style conformational ensemble is intended). Different polymer chains should have unique chain IDs and be terminated using TER cards, not bracketed between MODEL and ENDMDL records.

Remove header information from starting structures. If an existing PDB format file or files have been used as a starting point for fitting, remove any header information before starting, or any residual header information that might be left over after fitting. This includes everything from HEADER through SCALE3 inclusive, i.e., everything above the first ATOM record.
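Removing the header amounts to dropping every line above the first ATOM record. A minimal sketch with a hypothetical function name follows. It assumes the coordinate section starts with an ATOM record, as described above, and raises an error rather than guessing when no ATOM record is present.

```python
def strip_header(lines):
    """Return the lines from the first ATOM record to the end of the file."""
    lines = list(lines)
    for index, line in enumerate(lines):
        if line[:6].strip() == "ATOM":
            return lines[index:]
    raise ValueError("no ATOM record found")
```

For a file on disk, strip_header(open(path).read().splitlines()) returns the cleaned lines, where path is the name of your own file.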

There can be only one END card at the end of the file.
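The MODEL/ENDMDL and END rules can be checked together. The sketch below uses a hypothetical function name and only reports what it finds. It cannot tell whether MODEL records describe an intended NMR-style ensemble, and it does not treat a missing END card as an error, because the rule above only limits the file to one.

```python
def check_model_and_end(lines):
    """Report MODEL/ENDMDL records and misplaced or repeated END cards."""
    records = [line[:6].strip() for line in lines if line.strip()]
    problems = []
    n_model = sum(record in ("MODEL", "ENDMDL") for record in records)
    if n_model:
        problems.append(f"{n_model} MODEL/ENDMDL records found")
    n_end = records.count("END")
    if n_end > 1:
        problems.append(f"{n_end} END cards found, only one is allowed")
    elif n_end == 1 and records[-1] != "END":
        problems.append("END card is not the last record in the file")
    return problems
```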