Springer Publishing

Wednesday 24 June 2015

SMILES notation: The Functional SMILES Perspective

SMILES Perspectives

SMILES notation is so much fun to play with, which is another reason why SMILES is an appropriate acronym. Because SMILES is a graph/connectivity language in string format, there are many ways to enumerate bond paths and subgraphs in molecules.


A general SMILES string usually follows the longest chain of atoms in a molecule and then connects the loose ends to form rings. Yet shorter paths can be found, and bonds can be connected in a great many ways, while still maintaining valid SMILES notation. Therefore, there are many "perspectives" one can take for generating valid SMILES strings.

For instance, the molecule N,N-diethylethylenediamine can be easily represented by the following SMILES (beginning at the primary amine N):

NCCN(CC)CC

There are, however, many other valid SMILES strings to represent this molecule:
CCN(CC)CCN
C(N)CN(CC)CC
N1.C12.C23.N345.C56.C6.C47.C7

and the list goes on. I call each of these valid SMILES strings "SMILES perspectives".
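To check that these strings really describe one molecule, each string can be parsed into an atom/bond graph and the graphs compared. The following is a minimal sketch in plain Python, not a full SMILES implementation: it assumes a small subset of the syntax (organic-subset atoms, bracket atoms treated as opaque labels, branches, ring-closure numbers, and "."), ignores implicit hydrogens, stereochemistry, and aromaticity perception, and the helper names parse_smiles and same_molecule are invented for illustration.

```python
import re

# Small SMILES subset: bracket atoms, organic-subset atoms, ring-closure
# labels, bond symbols, branches and dots. Anything else is rejected.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d\d|\d|[-=#:().]")


def parse_smiles(smiles):
    """Return (atoms, adj): atom labels and adj[i][j] = bond symbol."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax: {smiles!r}")
    atoms, adj = [], []
    prev, branches, open_rings, bond = None, [], {}, None
    for tok in tokens:
        if tok == "(":
            branches.append(prev)          # remember the branch point
        elif tok == ")":
            prev = branches.pop()          # return to the branch point
        elif tok == ".":
            prev = None                    # next atom is not bonded to prev
        elif tok in ("-", "=", "#", ":"):
            bond = tok
        elif tok[0].isdigit() or tok[0] == "%":
            key = tok.lstrip("%")
            if key in open_rings:          # second occurrence: make the bond
                other, other_bond = open_rings.pop(key)
                adj[other][prev] = adj[prev][other] = bond or other_bond or "-"
            else:                          # first occurrence: remember the atom
                open_rings[key] = (prev, bond)
            bond = None
        else:                              # an atom token
            atoms.append(tok)
            adj.append({})
            cur = len(atoms) - 1
            if prev is not None:
                # Simplification: every implicit bond is recorded as "-".
                adj[prev][cur] = adj[cur][prev] = bond or "-"
            prev, bond = cur, None
    if open_rings:
        raise ValueError(f"unclosed ring-closure number in {smiles!r}")
    return atoms, adj


def same_molecule(smiles_a, smiles_b):
    """Backtracking graph-isomorphism test on the two parsed graphs."""
    atoms_a, adj_a = parse_smiles(smiles_a)
    atoms_b, adj_b = parse_smiles(smiles_b)
    if sorted(atoms_a) != sorted(atoms_b):
        return False
    if sum(map(len, adj_a)) != sum(map(len, adj_b)):
        return False
    mapping, used = {}, set()

    def extend(i):
        if i == len(atoms_a):
            return True
        for j in range(len(atoms_b)):
            if j in used or atoms_a[i] != atoms_b[j]:
                continue
            if len(adj_a[i]) != len(adj_b[j]):
                continue
            # Every bond from atom i to an already mapped atom must exist in b.
            if all(adj_b[j].get(mapping[k]) == order
                   for k, order in adj_a[i].items() if k in mapping):
                mapping[i] = j
                used.add(j)
                if extend(i + 1):
                    return True
                del mapping[i]
                used.discard(j)
        return False

    return extend(0)


if __name__ == "__main__":
    reference = "NCCN(CC)CC"
    for other in ("CCN(CC)CCN", "C(N)CN(CC)CC", "N1.C12.C23.N345.C56.C6.C47.C7"):
        print(other, same_molecule(reference, other))
```

The brute-force matching is only practical for small molecules like the ones in this post; a cheminformatics toolkit would compare canonical SMILES instead.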

I have been finding that molecules can be effectively represented with valid SMILES strings in which the functional groups of the molecule are written as disconnected fragments and then reconnected. This "SMILES perspective" encodes different (and possibly more) information than general SMILES. I call this the "functional SMILES perspective". It can mirror IUPAC nomenclature, but it can also mirror the functional group perspectives of the individual chemist.

Let's look at the molecule propyl 5-chloro-3-fluoropentanoate (whatever that is...). The molecule is likely represented with general SMILES as:

CCCOC(=O)CC(F)CCCl

which looks like this [1,2]:

However, you can also represent the structure in a more verbose manner encoding each functional group from the name.

Propyl: CCC
Pentanoate: CCCCC(=O)O
Fluoro: F
Chloro: Cl

List all of these separated by periods (order does not matter): CCC.CCCCC(=O)O.F.Cl
Finally, connect the fragments using matching numbers, one number for each new bond: CCC1.C3CC2CC(=O)O1.F2.Cl3
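The two steps above (list the fragments, then connect them with numbers) can be sketched in Python. This is an illustration under stated assumptions rather than a general tool: join_fragments and ring_label are hypothetical names, each bond is given as a pair of (fragment index, atom index) positions counted from zero in order of appearance, and only a simple subset of SMILES tokens is recognised.

```python
import re

# Same small token subset as elsewhere in this post (an assumption, not the
# full SMILES grammar): atoms, ring-closure labels, bonds, branches, dots.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d\d|\d|[-=#:/\\().]")


def ring_label(n):
    """Write a ring-closure number; two-digit numbers need a '%' prefix."""
    return str(n) if n < 10 else f"%{n}"


def join_fragments(fragments, bonds):
    """Join fragment SMILES with '.' and add one ring-closure number per bond.

    bonds is a list of ((frag_a, atom_a), (frag_b, atom_b)) pairs.
    """
    all_pieces, all_slots, used = [], [], set()
    for frag in fragments:
        tokens = TOKEN.findall(frag)
        if "".join(tokens) != frag:
            raise ValueError(f"unsupported SMILES syntax: {frag!r}")
        pieces, slots = [], []   # slots[k] = index in pieces of the k-th atom
        for tok in tokens:
            if tok[0] == "[" or tok[0].isalpha():
                slots.append(len(pieces))
                pieces.append(tok)
            elif tok[0].isdigit() or tok[0] == "%":
                used.add(int(tok.lstrip("%")))      # number already taken
                if slots and slots[-1] == len(pieces) - 1:
                    pieces[-1] += tok               # keep it with its atom
                else:
                    pieces.append(tok)
            else:
                pieces.append(tok)
        all_pieces.append(pieces)
        all_slots.append(slots)

    label = 0
    for (frag_a, atom_a), (frag_b, atom_b) in bonds:
        label += 1
        while label in used:     # skip numbers the fragments already use
            label += 1
        if label > 99:
            raise ValueError("ran out of ring-closure numbers")
        for frag, atom in ((frag_a, atom_a), (frag_b, atom_b)):
            all_pieces[frag][all_slots[frag][atom]] += ring_label(label)
    return ".".join("".join(pieces) for pieces in all_pieces)


if __name__ == "__main__":
    fragments = ["CCC", "CCCCC(=O)O", "F", "Cl"]
    bonds = [
        ((0, 2), (1, 6)),  # propyl end carbon to the ester oxygen
        ((1, 2), (2, 0)),  # carbon 3 of the pentanoate chain to F
        ((1, 0), (3, 0)),  # carbon 5 of the pentanoate chain to Cl
    ]
    print(join_fragments(fragments, bonds))
```

Atom positions here follow the fragment string, not IUPAC locants, so in CCCCC(=O)O the carbonyl carbon is atom 4 and the chain end carrying the chlorine is atom 0.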

This produces a SMILES string whose molecule looks identical to the first structure:

Another example is biphenyl.
Biphenyl general SMILES
c1ccc(cc1)c2ccccc2

This general SMILES string takes the general perspective: follow the longest continuous chain of atoms and close it into the two rings at the appropriate locations.

Biphenyl Functional SMILES (example)
c1c3cccc1.c2c3cccc2

The perspective of this functional SMILES string is two phenyl groups, c1ccccc1 and c2ccccc2, separated by a "." and joined by a bond between one carbon of each ring (both marked with the number 3).

Chemical reactions are another important example. SMILES has a reaction notation that uses the ">" symbol to separate reactants, agents, and products. Reaction information can also be encoded with the "." symbol and numbers in the functional SMILES perspective. The following example, the esterification of a carboxylic acid with an alcohol (mediated by ethanol and HCl), is taken from the Daylight webpage [3,4].

CC(=O)O.OCC>[H+].[Cl-].OCC>CC(=O)OCC

A functional SMILES perspective might write the reaction product not as CC(=O)OCC but as CC1(=O).O.O1CC: a "." is inserted between the carboxylic C and the carboxylic hydroxyl O to denote the breaking of that bond, and a "1" is placed after the ethanol O and after the carboxylic C to denote the formation of a new bond between those two atoms. The resulting reaction SMILES is:

CC(=O)O.OCC>[H+].[Cl-].OCC>CC1(=O).O.O1CC

the product of which additionally includes the H2O byproduct (".O.").
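One practical consequence is that a "." in a functional SMILES string no longer always separates molecules, so splitting the product side on "." alone would cut the ester in two. The sketch below splits a reaction SMILES on ">" and then groups dot-separated parts that share a ring-closure number. It is a rough illustration: split_reaction and group_molecules are hypothetical names, and digits inside square brackets (isotopes, charges, hydrogen counts) are assumed not to be ring closures.

```python
import re

# Bracket atoms are matched first so that digits inside them are skipped.
RING = re.compile(r"\[[^\]]*\]|%\d\d|\d")


def group_molecules(side):
    """Group dot-separated parts that are bonded through ring-closure numbers."""
    parts = side.split(".") if side else []
    parent = list(range(len(parts)))      # union-find over the parts

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    open_rings = {}                       # number -> part where it was opened
    for idx, part in enumerate(parts):
        for tok in RING.findall(part):
            if tok.startswith("["):
                continue
            key = tok.lstrip("%")
            if key in open_rings:         # closing: the two parts are bonded
                parent[find(idx)] = find(open_rings.pop(key))
            else:
                open_rings[key] = idx
    groups = {}
    for idx, part in enumerate(parts):
        groups.setdefault(find(idx), []).append(part)
    return [".".join(group) for group in groups.values()]


def split_reaction(reaction):
    """Split 'reactants>agents>products' and group each side into molecules."""
    sides = reaction.split(">")
    if len(sides) != 3:
        raise ValueError("expected reactants>agents>products")
    names = ("reactants", "agents", "products")
    return {name: group_molecules(side) for name, side in zip(names, sides)}


if __name__ == "__main__":
    result = split_reaction("CC(=O)O.OCC>[H+].[Cl-].OCC>CC1(=O).O.O1CC")
    for name, molecules in result.items():
        print(name, molecules)
```

With this grouping, the product side is read as two molecules: the ester, written as the two linked parts CC1(=O) and O1CC, and the water byproduct O.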

The benefit of the functional SMILES perspective is that SMILES strings become more readable. Functional groups can be captured, as can bond-breaking/bond-making reaction histories. Inserting "." symbols and numbers is easy to automate for generating high-volume data. Writing SMILES strings this way may place a heavy burden on substructure searching, but, if desired, one could use Open Babel to canonicalize these SMILES strings, which may lighten the substructure search load.

References:

  1. "GIF/PNG-Creator for 2D Plots of Chemical Structures", NCI/CADD Group. http://cactus.nci.nih.gov/gifcreator/
  2. J. Chem. Inf. Comput. Sci. (1983) 23, 61-65. http://dx.doi.org/10.1021/ci00038a002
  3. "Reaction SMILES and SMIRKS", Daylight Chemical Information Systems Inc. http://www.daylight.com/meetings/summerschool01/course/basics/smirks.html
  4. J. Chem. Inf. Comput. Sci. (1999) 39(6), 1161-1172. http://dx.doi.org/10.1021/ci9904259
