Guessing atom attributes based on atom names can lead to misinterpretations #452

rbdavid · 2024-03-15T16:48:43Z

Depending on the files read into Blender to instantiate a Molecule object, various atom attributes that the MN Molecule object expects are likely to be missing. For example, most if not all input file types accepted by MN lack information about the vdW radii of atoms in the system. Because this and other attributes are used in MN visualization nodes, they must be present when the Molecule object is created. When that information is missing, guesses are made. And since atom naming and input file formatting vary enormously, those guesses can lead to incorrect atom attributes being assigned.

For example, a PDB file may or may not include element symbols in columns 77-78. If the element information is not present, it is guessed during the biotite.structure.io.pdb.get_structure() call:

MolecularNodes/molecularnodes/io/parse/pdb.py

Lines 20 to 28 in 0a9d54a

    
    def _get_structure(self):
        from biotite.structure.io import pdb
        from biotite.structure import BadStructureError
        # TODO: implement entity ID, sec_struct for PDB files
        array = pdb.get_structure(
            pdb_file=self.file,
            extra_fields=['b_factor', 'occupancy', 'charge', 'atom_id'],
            include_bonds=True
        )

Within the biotite code, they've implemented a guesser (https://github.com/biotite-dev/biotite/blob/123e5332bd78fe7189bd5cb6e3147742cf5d73fe/src/biotite/structure/io/general.py#L248-L271) that maps atom names to element symbols. Without getting too bogged down with examples, this guesser may fail in numerous cases, resulting in the MN Molecule object having incorrect attribute values. AFAIK, atomic_number, vdw_radii, and mass attributes are tied to the element assignment given by biotite. If the wrong element is guessed, then those attributes will be assigned incorrect values.
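One way to tell ahead of time whether element guessing will be triggered is to check whether the element field of the ATOM/HETATM records is populated. The sketch below is a minimal, standalone check that assumes the standard fixed-width PDB layout (element symbol in columns 77-78); the function name `count_missing_elements` is made up for illustration and is not part of MN or biotite.

```python
def count_missing_elements(pdb_lines):
    """Count ATOM/HETATM records whose element field is blank.

    Returns (missing, total). Assumes fixed-width PDB records where the
    element symbol occupies columns 77-78 (1-indexed).
    """
    missing = 0
    total = 0
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            total += 1
            # Columns 77-78 (1-indexed) correspond to the slice [76:78].
            # Short lines simply yield an empty slice, which counts as blank.
            if line[76:78].strip() == "":
                missing += 1
    return missing, total
```

A non-zero `missing` count means the elements for those records can only come from a guess, so the downstream atomic_number, vdw_radii, and mass values for those atoms deserve a second look.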

When the MD import method is used, MDAnalysis, rather than biotite, parses the input files and creates the Molecule object. As above, if information is missing from the input files, MDA falls back on its own suite of guesser functions: https://docs.mdanalysis.org/2.7.0/documentation_pages/topology/guessers.html#module-MDAnalysis.topology.guessers. There are numerous issues (e.g. MDAnalysis/mdanalysis#3704) and a recent pull request (MDAnalysis/mdanalysis#3753) aimed at improving the guesser functions. The PR is expected to be included in the upcoming (soon™) MDA 3.0 version.

All of this is to highlight that guesses are being made for attributes that propagate into visualizations. The use of guessers is a nuanced and complex cheminformatics problem made harder by a plethora of naming conventions, force fields, and coarse-grained models.

I definitely don't think it's on MolecularNodes to solve this issue. But potential improvements could be made in logging/reporting instances where guesses are made or suspected to have been made, so that users can check the results for themselves. Is there a message interface or log file that MN users can look at to see warning messages?

For example, biotite has implemented a warning

            warnings.warn(
                "{} elements were guessed from atom_name.".format(rep_num)
            )

that reports the number of instances where element guesses were made. I'm not sure where this warning would be visible when loading a structure using MN. And since element guesses propagate to the atomic_number, vdw_radii, and mass attributes in MN, there's an opportunity to warn users that guesses may have affected those attributes.
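Because the message is emitted through Python's `warnings` module, it can be collected by the calling code rather than left to go wherever stderr goes. The following is a minimal sketch using only the standard library; `load_with_warnings` and `fake_loader` are hypothetical names, and `fake_loader` merely stands in for whatever call actually reads the structure.

```python
import warnings


def load_with_warnings(loader, *args, **kwargs):
    """Call `loader` and return (result, list of warning messages it raised)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure repeated warnings are not suppressed by the default filter.
        warnings.simplefilter("always")
        result = loader(*args, **kwargs)
    return result, [str(w.message) for w in caught]


def fake_loader(path):
    # Hypothetical stand-in for a structure loader that guesses elements.
    warnings.warn("3 elements were guessed from atom_name.")
    return path


structure, messages = load_with_warnings(fake_loader, "example.pdb")
for msg in messages:
    print("WARNING:", msg)
```

The collected messages could then be surfaced however MN prefers, for example in the UI or in a log file; where that hook would live in MN is not something this sketch tries to answer.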

I think this is low priority but important to highlight, especially since incorrect guesses can propagate to incorrect visualizations.


rbdavid · 2024-03-15T16:50:53Z

I guess I shouldn't have used the bug issue... and now I can't seem to edit labels to mark it as a feature request. oops

BradyAJohnston · 2024-03-16T05:32:22Z

@tubiana you may want to have some input on this - as you use a lot of martini files

rbdavid · 2024-03-16T20:05:01Z

Thanks for adding the label, Brady.

To add a few examples:

A calcium ion might be named CA, CAL, or CA2+, as in a calmodulin structure (e.g. 1CKK). Without defined element information in the PDB file, the biotite guesser (here) would assign that atom as carbon. The same happens for a number of other non-carbon atom names. Similarly, a structure containing cadmium would presumably name that atom CD, which is ambiguous because CD is also used for the delta carbon in amino acid sidechains.

A similar assignment would happen in biotite for coarse-grained particles whose names begin with the letter C. Particles that start with 'N', 'O', 'S', or 'H' would be mapped to nitrogen, oxygen, sulfur, and hydrogen atoms, respectively; no consideration is given to the other characters in the atom name. And no coarse-grained particle names are considered by biotite if the particle's name gets past that first filter.
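The failure mode described in these examples can be reproduced with a toy guesser. The function below is a deliberately simplified stand-in that looks only at the first letter of the atom name, as described above; it is not biotite's actual code, and the bead-style names in the loop are only there for illustration.

```python
def first_letter_guess(atom_name):
    """Toy guesser: map an atom name to an element using only its first letter.

    Simplified illustration of first-letter matching, NOT biotite's code.
    Returns an empty string when the first letter is not recognised.
    """
    known = {"C", "N", "O", "S", "H"}
    first = atom_name.strip().upper()[:1]
    return first if first in known else ""


# Calcium-like, cadmium-like, and coarse-grained-style names.
for name in ["CA", "CAL", "CA2+", "CD", "NC3", "SC1"]:
    print(name, "->", first_letter_guess(name))
```

Every name here that begins with C comes back as carbon, and the bead-style names are mapped to nitrogen and sulfur, even though none of them necessarily represents a single atom of that element.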

If a trajectory is loaded into Blender via MN, then MDAnalysis is used, which has its own method of guessing missing information from atom names (https://github.com/MDAnalysis/mdanalysis/blob/0582265996b392da382f658b7f0805ca250e1233/package/MDAnalysis/topology/guessers.py#L184-L230 and other functions therein). The guesser is a bit better because MDA has a dictionary that acknowledges a few instances of ambiguity by mapping atom names to element symbols (https://github.com/MDAnalysis/mdanalysis/blob/0582265996b392da382f658b7f0805ca250e1233/package/MDAnalysis/topology/tables.py#L81-L173).

Would it be worth implementing a single guesser approach rather than two separate approaches depending on how files are loaded into Blender? Neither the biotite nor the MDA guessers require the objects associated with those modules. They take in strings and return strings. So, for example, when biotite is used to load PDB/CIF files, it will run its own guesser within the biotite.structure.io.pdb.get_structure() call (or the equivalent for CIF files). After that, a validation pass could run the MDA guesser to check for incorrect assignments.
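A validation pass of this kind could be sketched as follows. The function takes the atom names, the elements already assigned by the first guesser, and any string-to-string callable as a second opinion. `find_disagreements` and `table_guesser` are hypothetical names; `table_guesser` is a tiny lookup standing in for a real guesser. In MDAnalysis 2.7, `guess_atom_element` from `MDAnalysis.topology.guessers` could presumably be passed instead, though that is untested here.

```python
def find_disagreements(atom_names, elements, second_guesser):
    """Return (index, atom_name, assigned, second_opinion) where the two differ.

    `second_guesser` is any callable mapping an atom-name string to an
    element-symbol string; an empty result is treated as "no opinion".
    """
    flagged = []
    for i, (name, assigned) in enumerate(zip(atom_names, elements)):
        other = second_guesser(name)
        if other and other.upper() != assigned.upper():
            flagged.append((i, name, assigned, other))
    return flagged


def table_guesser(name):
    # Hypothetical lookup table standing in for a second, table-based guesser.
    table = {"CAL": "CA", "CA2+": "CA", "CA": "C", "CB": "C", "N": "N"}
    return table.get(name.upper(), "")


names = ["N", "CA", "CAL"]
assigned = ["N", "C", "C"]  # e.g. the result of a first-letter guess
for idx, name, first, second in find_disagreements(names, assigned, table_guesser):
    print(f"atom {idx} ({name}): assigned {first}, second guesser says {second}")
```

The pass only flags disagreements and does not decide which guess is right, so the choice of how to resolve them stays with the user.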

With regard to logging the guessing instances, I've found the Console Window for Blender, where warnings/logging could be printed. I'm not certain whether this is commonly used, though. Maybe a verbosity level could be set in MN preferences that creates a log file to check when structures are loaded in and manipulated via MN nodes/functions? Just an idea.
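As a rough sketch of the log-file idea, Python's standard `logging` module can redirect anything emitted via `warnings.warn` into a file. The function name `setup_load_log`, the `verbose` flag, and the log path are all assumptions for illustration; nothing here reflects an existing MN preference.

```python
import logging
import warnings


def setup_load_log(log_path, verbose=True):
    """Route Python warnings to a log file and return the logger used.

    `verbose` is a hypothetical stand-in for a user-facing verbosity setting.
    """
    # After this call, warnings.warn(...) messages are sent to the
    # "py.warnings" logger instead of being printed to stderr.
    logging.captureWarnings(True)
    logger = logging.getLogger("py.warnings")
    logger.setLevel(logging.WARNING if verbose else logging.ERROR)
    handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With this in place, a guess warning raised during loading ends up in the file at `log_path`, where users can check it without needing the Blender console window open.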

rbdavid added the bug Something isn't working label Mar 15, 2024

BradyAJohnston added the enhancement New feature or request label Mar 16, 2024

This was referenced Apr 12, 2024

TypeError: unsupported operand type(s) for *: 'NoneType' and 'float' #484

Open

fixing vdw_radii assignments for unexpected atom/particle names #485

Merged
