PDBcle – An online tool for extracting chain structure and sequence of macromolecule and small molecule structure from the Protein Data Bank

Background: The Protein Data Bank (PDB) is the most popular structure database; it contains experimentally determined three-dimensional (3D) structures of biological macromolecules and small molecules. Its features include keyword-assisted advanced text search, structure search by sequence alignment, sequence motif search, ligand to target-ligand complex search through SMILES substructure search, JSON API query search, structure alignment, structure quality assessment, a genome viewer, and a 3D structure viewer. It is widely used in molecular modelling and computer-aided drug design (CADD). PDBcle is a simple tool that extracts the chain sequences of proteins and nucleotides and the 3D structures of proteins, nucleotides, and ligands from the PDB. Objectives: To construct an online tool that separates, molecule by molecule, the chain sequences and structures of the polymer and non-polymer components of a macromolecule, and that writes the separated sequences and structures to molecule-specific standard file formats. Methods: The graphical web interface of PDBcle was designed using PHP, CSS, and pure JavaScript. A PHP server script parses the atomic coordinate records and sequence records from the PDBML/XML and/or PDBx/mmJSON file obtained through the PDB API. Findings: PDBcle retrieves and generates separate structure/sequence files for each amino acid/RNA chain, and for each pair of chains forming DNA base pairs, with or without the ligand complex, from the PDB. The ligand molecules are separated from the chains, sorted, and written to an SDF file. Applications: PDBcle is publicly accessible at https://www.biogem.org/tool/pdbcle/.
Keywords: PDBcle; PDB; PDB Chain and Ligand Extractor; PDB Sequence Extractor; PDBML


Introduction
Theoretical and computational molecular modelling are among the most promising approaches used in computer-aided drug design (CADD) to reduce experimental cost and duration. The design and quality of a molecular model rely on the available 3D structures of biomolecules. In homology modelling, the templates are threaded with the target sequence and optimized to build a 3D structure model. The templates are fragments of experimental 3D structures retrieved from a structure database. The PDB is the worldwide repository for 3D structures of biomolecules determined using X-ray crystallography, NMR spectroscopy, and 3D electron microscopy. The collaborative members of the Worldwide Protein Data Bank (wwPDB) are RCSB PDB, PDBe, PDBj, and BMRB (1).
The PDB releases macromolecular structure data in three file formats, namely PDB (.pdb), PDBx/mmCIF (.cif), and PDBML/XML (.xml), for various purposes. The PDB file format is the standard format for displaying and analyzing macromolecules with 3D molecular visualization and modelling tools. Because of certain limitations of the PDB file format, the PDBx/mmCIF and PDBML/XML file formats were introduced to extend accessibility. Moreover, the wwPDB has stopped modifying or extending the data dictionary of the PDB file format. The PDBx/mmCIF file format is considered the standard PDB archive format, and the PDBML/XML file format is intended for programming purposes (1)(2)(3). PDBcle was designed to generate 3D template structures (biomolecule chains) in PDB file format and to retrieve compounds in SDF file format from the PDB server by parsing the PDBML/XML file.

Methods
The atomic coordinate records in the PDB file format start with an ATOM or HETATM identifier (Figure 1). Records for standard residues of proteins and nucleic acids begin with ATOM, whereas records for non-standard residues such as ions, solvent, cofactors, and inhibitors begin with HETATM. The atomic coordinate information in the PDB file is arranged in tabular format: each row represents an atom record and each column represents a data field. There are a total of 20 columns in a record, with a fixed width for each field. Because of the width of the field, the atom serial number cannot exceed 99999. A brief explanation of the fields in the atom records is given in Table 1. The entry format and data description follow the current specification of PDB DDL (version 3.3). Some of the fields are either empty or ignored because they were deprecated from the older version (4,5).
PDBcle retrieves the PDBML/XML and/or PDBx/mmJSON file from the PDB server and generates a standard PDB file for macromolecules, a FASTA file for sequences, and an SDF file for small molecules. For DNA molecules, a separate PDB file is generated by combining the atom records of the two paired chains, or of a group of chains, according to the molecule type. The molecules in the PDB repository are categorized as polypeptide, polydeoxyribonucleotide, cyclic-pseudo-peptide, polysaccharide, polydeoxyribonucleotide/polyribonucleotide hybrid, polyribonucleotide, and small molecule (2)(3)(4). PDBcle splits the chains in the molecule into three major categories: (1) macromolecule, (2) macromolecule and small molecule complex, and (3) small molecule (Table 2). Based on the category, a PDB file is generated for each chain by parsing the <PDBx:atom_site> elements and attributes/objects in the PDBML/XML file (Figure 2).
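The conversion from <PDBx:atom_site> elements to fixed-width atom records can be illustrated with a minimal Python sketch. It is not the PHP implementation used by PDBcle; the helper names (field, atom_site_to_pdb_line, split_by_chain), the choice of label_*/auth_* elements, and the two-atom sample with made-up coordinates are assumptions for illustration only. The column layout follows the ATOM/HETATM record of the PDB file format, with the charge columns omitted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace URI of the PDBx v50 schema (assumed to match the input file).
NS = {"PDBx": "http://pdbml.pdb.org/schema/pdbx-v50.xsd"}

# Made-up two-atom sample in PDBML layout, for illustration only.
SAMPLE = """<PDBx:datablock xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v50.xsd">
 <PDBx:atom_siteCategory>
  <PDBx:atom_site id="1">
   <PDBx:group_PDB>ATOM</PDBx:group_PDB>
   <PDBx:label_atom_id>N</PDBx:label_atom_id>
   <PDBx:label_comp_id>MET</PDBx:label_comp_id>
   <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
   <PDBx:auth_seq_id>1</PDBx:auth_seq_id>
   <PDBx:Cartn_x>10.000</PDBx:Cartn_x>
   <PDBx:Cartn_y>20.500</PDBx:Cartn_y>
   <PDBx:Cartn_z>-3.250</PDBx:Cartn_z>
   <PDBx:occupancy>1.00</PDBx:occupancy>
   <PDBx:B_iso_or_equiv>25.00</PDBx:B_iso_or_equiv>
   <PDBx:type_symbol>N</PDBx:type_symbol>
  </PDBx:atom_site>
  <PDBx:atom_site id="2">
   <PDBx:group_PDB>ATOM</PDBx:group_PDB>
   <PDBx:label_atom_id>P</PDBx:label_atom_id>
   <PDBx:label_comp_id>DA</PDBx:label_comp_id>
   <PDBx:auth_asym_id>P</PDBx:auth_asym_id>
   <PDBx:auth_seq_id>1</PDBx:auth_seq_id>
   <PDBx:Cartn_x>1.500</PDBx:Cartn_x>
   <PDBx:Cartn_y>2.000</PDBx:Cartn_y>
   <PDBx:Cartn_z>3.125</PDBx:Cartn_z>
   <PDBx:occupancy>1.00</PDBx:occupancy>
   <PDBx:B_iso_or_equiv>30.00</PDBx:B_iso_or_equiv>
   <PDBx:type_symbol>P</PDBx:type_symbol>
  </PDBx:atom_site>
 </PDBx:atom_siteCategory>
</PDBx:datablock>"""


def field(site, tag, default=""):
    """Return the text of a child element, or a default if missing/empty."""
    node = site.find(f"PDBx:{tag}", NS)
    return node.text if node is not None and node.text else default


def atom_site_to_pdb_line(site):
    """Format one atom_site element as a 78-character ATOM/HETATM record."""
    name = field(site, "label_atom_id")
    elem = field(site, "type_symbol")
    # Names of one-letter elements conventionally start in column 14.
    if len(name) < 4 and len(elem) == 1:
        name = " " + name
    return (
        f"{field(site, 'group_PDB'):<6}"             # cols 1-6: ATOM or HETATM
        f"{int(site.get('id')):>5} "                 # cols 7-11: serial number
        f"{name:<4}"                                 # cols 13-16: atom name
        f"{field(site, 'label_alt_id', ' '):1}"      # col 17: alternate location
        f"{field(site, 'label_comp_id'):>3} "        # cols 18-20: residue name
        f"{field(site, 'auth_asym_id'):1}"           # col 22: chain identifier
        f"{int(field(site, 'auth_seq_id')):>4}"      # cols 23-26: residue number
        f"{field(site, 'pdbx_PDB_ins_code', ' '):1}   "  # col 27: insertion code
        f"{float(field(site, 'Cartn_x')):8.3f}"      # cols 31-54: x, y, z
        f"{float(field(site, 'Cartn_y')):8.3f}"
        f"{float(field(site, 'Cartn_z')):8.3f}"
        f"{float(field(site, 'occupancy', '1.0')):6.2f}"       # cols 55-60
        f"{float(field(site, 'B_iso_or_equiv', '0.0')):6.2f}"  # cols 61-66
        f"{'':10}"                                   # cols 67-76: blank
        f"{elem:>2}"                                 # cols 77-78: element symbol
    )


def split_by_chain(xml_text):
    """Group the formatted atom records by chain identifier."""
    root = ET.fromstring(xml_text)
    chains = defaultdict(list)
    for site in root.iterfind(".//PDBx:atom_site", NS):
        chains[field(site, "auth_asym_id")].append(atom_site_to_pdb_line(site))
    return dict(chains)
```

Calling split_by_chain(SAMPLE) yields one list of records for chain A and one for chain P; writing each list to its own file gives one PDB file per chain. The sketch assumes single-character chain identifiers and serial numbers of at most five digits, which are the same field-width limits described above.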
A FASTA sequence file is generated for each chain by parsing the <PDBx:pdbx_seq_one_letter_code_can>, <PDBx:pdbx_gene_src_ncbi_taxonomy_id>, and <PDBx:pdbx_gene_src_scientific_name> elements and attributes/objects in the PDBML/XML file (5). Small molecules, in contrast, are retrieved from the PDB server through the API according to the HETATM records.
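The FASTA generation step can be sketched in Python as follows. This is an illustrative sketch rather than the PDBcle source: the function name fasta_records, the header layout (ID_chain|organism|taxid), the identifier 1ABC, and the sample sequences are hypothetical, and the sketch assumes that the sequence sits in an entity_poly element and the source organism in an entity_src_gen element linked by the entity_id attribute.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the PDBx v50 schema (assumed to match the input file).
NS = {"PDBx": "http://pdbml.pdb.org/schema/pdbx-v50.xsd"}

# Made-up sample with hypothetical sequences, for illustration only.
SAMPLE = """<PDBx:datablock xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v50.xsd">
 <PDBx:entity_polyCategory>
  <PDBx:entity_poly entity_id="1">
   <PDBx:pdbx_seq_one_letter_code_can>MKTAYIAK</PDBx:pdbx_seq_one_letter_code_can>
   <PDBx:pdbx_strand_id>A</PDBx:pdbx_strand_id>
  </PDBx:entity_poly>
  <PDBx:entity_poly entity_id="2">
   <PDBx:pdbx_seq_one_letter_code_can>ACGT
TTGA</PDBx:pdbx_seq_one_letter_code_can>
   <PDBx:pdbx_strand_id>P,T</PDBx:pdbx_strand_id>
  </PDBx:entity_poly>
 </PDBx:entity_polyCategory>
 <PDBx:entity_src_genCategory>
  <PDBx:entity_src_gen entity_id="1" pdbx_src_id="1">
   <PDBx:pdbx_gene_src_ncbi_taxonomy_id>9606</PDBx:pdbx_gene_src_ncbi_taxonomy_id>
   <PDBx:pdbx_gene_src_scientific_name>Homo sapiens</PDBx:pdbx_gene_src_scientific_name>
  </PDBx:entity_src_gen>
 </PDBx:entity_src_genCategory>
</PDBx:datablock>"""


def fasta_records(xml_text, pdb_id, width=60):
    """Return one FASTA record (header plus wrapped sequence) per chain."""
    root = ET.fromstring(xml_text)

    # Map entity_id -> (scientific name, taxonomy id) from the source records.
    source = {}
    for src in root.iterfind(".//PDBx:entity_src_gen", NS):
        name = src.findtext("PDBx:pdbx_gene_src_scientific_name", "", NS)
        taxid = src.findtext("PDBx:pdbx_gene_src_ncbi_taxonomy_id", "", NS)
        source[src.get("entity_id")] = (name.strip(), taxid.strip())

    records = []
    for poly in root.iterfind(".//PDBx:entity_poly", NS):
        raw = poly.findtext("PDBx:pdbx_seq_one_letter_code_can", "", NS)
        seq = "".join(raw.split())  # drop line breaks inside the sequence
        strands = poly.findtext("PDBx:pdbx_strand_id", "", NS)
        name, taxid = source.get(poly.get("entity_id"), ("", ""))
        body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        # One record per chain that shares this polymer entity.
        for chain in (c.strip() for c in strands.split(",") if c.strip()):
            parts = [f"{pdb_id}_{chain}"]
            if name:
                parts.append(name)
            if taxid:
                parts.append(f"taxid:{taxid}")
            records.append(">" + "|".join(parts) + "\n" + body)
    return records
```

For example, fasta_records(SAMPLE, "1ABC") returns three records, one for chain A and one each for chains P and T, which share the second entity. Chains without a source record simply get a header that carries only the identifier and chain.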

Results and Discussion
PDBML is a document markup language defined by the PDB; it consists of a set of DTDs framed according to the SGML protocol. The current stable release of the PDBML schema is PDBx v50 (https://pdbml.pdb.org/schema/pdbx-v50.xsd) for atomic coordinate data and wwPDB Validation v004 (https://www.wwpdb.org/validation/schema/wwpdb_validation_v004.xsd) for structure validation data. Each PDBML/XML file consists of different schemas referenced by the PDB exchange data object (xmlns:PDBx, xmlns:xsi, and xsi:schemaLocation) for atomic coordinates (5). PDBcle acts as a client that submits the query (PDB ID) to the PDB server through a RESTful service and retrieves the PDBML/XML archive data. The retrieved data are then parsed and converted to separate PDB, FASTA, and SDF files according to the molecule type and chain. The resulting chain and ligand hits are annotated with mouse-over tooltip text containing a short title of the molecule and hyperlinks to the original resource.
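The client-side retrieval step might look like the following Python sketch. It is not the PDBcle code: the function names are hypothetical, and the download URL pattern (a gzipped PDBML file under files.rcsb.org/download) is an assumption that should be checked against the current RCSB file download documentation before use. Only classic four-character PDB IDs are accepted.

```python
import gzip
import re
import urllib.request

# Assumed base URL for RCSB file downloads; verify before relying on it.
BASE_URL = "https://files.rcsb.org/download"


def pdbml_url(pdb_id):
    """Build the (assumed) download URL of the gzipped PDBML file for an ID."""
    pdb_id = pdb_id.strip().upper()
    # Classic PDB IDs: one digit followed by three alphanumeric characters.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"{BASE_URL}/{pdb_id}.xml.gz"


def fetch_pdbml(pdb_id, timeout=30):
    """Download and decompress the PDBML/XML document for a PDB ID."""
    with urllib.request.urlopen(pdbml_url(pdb_id), timeout=timeout) as resp:
        return gzip.decompress(resp.read()).decode("utf-8")
```

Validating the identifier before building the URL keeps malformed user input from ever reaching the server; the returned XML text can then be handed to a parser.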
A sample query for the macromolecule with PDB ID 6PL7 was run against the Protein Data Bank (Figure 3). Figure 4 shows the separated 3D structures of human DNA polymerase eta (Pol η) complexed with DNA, 1FZ, magnesium, glycerol, and water, extracted using PDBcle. The result consists of protein chain A; nucleotide chains P and T; the nucleotide base pair group (P, T); protein chain A in complex with 1FZ, magnesium, glycerol, and water; nucleotide chain P in complex with water; nucleotide chain T in complex with water; and the small molecules 1FZ, magnesium, glycerol, and water. The sequences of protein chain A, nucleotide chain P, and nucleotide chain T were retrieved in FASTA file format (6). PDBcle was tested with various molecular complexes from the PDB, including protein, DNA, RNA, hybrid, ions, solvent, cofactors, and inhibitors (Table 3). The separated chains and ligands can be downloaded in standard file formats.

Conclusion
PDBcle extracts structural and sequence information for a macromolecule retrieved from the PDB. It allows chain-wise, molecule-specific 3D structures and sequences to be downloaded. The PDB, SDF, and FASTA files generated by PDBcle follow the standard biological database file formats. In the future, a service to download macromolecules in the standard CIF file format will be made available.