SMILES and SMARTS dialects

The toolkit supports the complete range of the Daylight SMILES, SMARTS, Reaction SMILES and SMIRKS standards, including Recursive SMARTS.

The global control variable ::cactvs(smiles_version) can be set to a Daylight release number. The setting of this variable influences various aspects of encoding and decoding of SMARTS data. The default value is 4.9, the version best known for finally introducing the x ring bond count atom attribute. This is the most recent major update of the Daylight SMILES/SMARTS definition.

In SMARTS context a simple ’H’ atom attribute without a count is always interpreted by the toolkit as a hydrogen atom for explicit matching, not the hydrogen neighbour count. This behaviour is standard in Daylight tools since the 4.51 release.

Octahedral and bi-pyramidal stereochemistry in SMILES is read and written, but currently not checked by the substructure match routines. Allenes and square planar stereochemistry are fully supported.

Besides supporting the standard syntax and attributes of both atoms and bonds, a significant number of enhancements are also recognized:

Attribute ranges

In addition to a simple numerical count (as in ’[X2]’), bracketed open and closed ranges are supported, as in ’[X{1-}]’, ’[X{-3}]’ or ’[X{2-3}]’. This feature is available for every attribute which can take a count. It is also possible to use the Eli Lilly operator extensions for the same purpose, as in ’[X>1]’ or ’[X<=3]’. The exception is the closed range, which cannot be expressed in Lilly syntax.
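The range semantics described above can be illustrated with a small plain Tcl sketch. The procedure countInRange is a hypothetical helper written for this illustration. It is not a toolkit command, and it assumes that both range bounds are inclusive.

```tcl
# Hypothetical helper, not part of the toolkit: test whether an attribute
# count satisfies a specification such as "2", "{1-}", "{-3}" or "{2-3}".
# Assumption: both bounds of a range are inclusive.
proc countInRange {spec count} {
    # A plain number requires an exact match
    if {[string is integer -strict $spec]} {
        return [expr {$count == $spec}]
    }
    # A braced range; either bound may be omitted
    if {![regexp {^\{(\d*)-(\d*)\}$} $spec -> lo hi]} {
        error "unsupported range specification: $spec"
    }
    if {$lo eq "" && $hi eq ""} {
        error "range needs at least one bound: $spec"
    }
    if {$lo ne "" && $count < $lo} { return 0 }
    if {$hi ne "" && $count > $hi} { return 0 }
    return 1
}

puts [countInRange "{1-}" 4]
puts [countInRange "{-3}" 4]
puts [countInRange "{2-3}" 3]
```

Under these assumptions, a count of 4 satisfies the open range {1-} but not {-3}, and a count of 3 satisfies the closed range {2-3}.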

Match count prefixes

The SMARTS expression may be prefixed by a simple count, or an operator and a count. The SMARTS must then match the required number of times. The match mode is automatically adjusted if required. Example:

set ss [ens create {>4a-[F,Cl,Br,I]} smarts]

This matches compounds which contain more than 4 halogen substituents on aromatic rings.

set ss [ens create {0[R]} smarts]

This matches compounds which do not contain rings.
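As a rough illustration of how such a prefix relates to the number of matches found, the following plain Tcl sketch splits the prefix from the SMARTS string and compares it to a match count. Both procedures are hypothetical helpers, not toolkit commands. The sketch assumes that a bare count means exactly that many matches and that the operators are <, <=, > and >=; the set actually accepted by the toolkit may differ.

```tcl
# Hypothetical helper: split an optional match count prefix off a SMARTS string.
# Returns a list of operator, required count and remaining SMARTS.
# Assumption: a bare count is treated as "exactly this many matches".
proc splitCountPrefix {smarts} {
    if {[regexp {^(<=|>=|<|>)?(\d+)(.*)$} $smarts -> op count rest]} {
        if {$op eq ""} { set op == }
        return [list $op $count $rest]
    }
    # No prefix present: return the SMARTS unchanged
    return [list {} {} $smarts]
}

# Hypothetical helper: compare an observed match count with the prefix
proc countSatisfied {op required nmatches} {
    switch -- $op {
        ==      { return [expr {$nmatches == $required}] }
        <       { return [expr {$nmatches <  $required}] }
        <=      { return [expr {$nmatches <= $required}] }
        >       { return [expr {$nmatches >  $required}] }
        >=      { return [expr {$nmatches >= $required}] }
        default { error "unknown operator: $op" }
    }
}

lassign [splitCountPrefix {>4a-[F,Cl,Br,I]}] op required rest
puts "$op $required $rest"
puts [countSatisfied $op $required 5]
```

In the toolkit itself the prefix is evaluated as part of the substructure match, so no separate parsing step of this kind is needed.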

Strict interpretation suffix

The default SMARTS interpretation in Cactvs is more lenient than the original Daylight definition. Specifically, the aliphatic attribute of upper-case element symbols is not enforced by default. Most match commands provide options to fine-tune the interpretation, and it is also possible to switch the toolkit globally into a strict SMARTS interpretation mode.

As a convenience, it is possible to request strict interpretation of a SMARTS string regardless of command options and global configuration by appending an exclamation mark to the string.

Example:

set ss [ens create C1CCCCC1!]

This SMARTS does not match benzene. Without the suffix, in default toolkit mode, benzene is matched.

Additional atom attributes

Operator-chained matches

The toolkit provides limited support for the Eli Lilly extensions for chained matches. In these, multiple SMARTS fragments (each of which may consist of multiple dot-disconnected parts) are linked via the two-character operators &&, || or ^^. Each fragment is handled independently, as a separate structure object, without regard to match overlaps as in Recursive SMARTS or explicit setting of the fragment overlap mode in substructure matching.

Example:

set ss [ens create {[nD3]-S(=O)(=O)&&0[aD3]-[G0;CH>0,O,N]} smarts]

Unlike the original Lilly code, the current implementation does not take operator precedence into account. It is possible to combine, for example, || and && parts in one query string, but the fragments are checked in strict left-to-right order, without precedence for the && part.

Only those parts of the expression which are required to obtain the final match result are checked. In the case of an or expression, match processing stops after the first fragment match has been found.
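The effect of strict left-to-right evaluation can be sketched in plain Tcl. The procedures evalChain and frag are hypothetical and only model the logic: each script stands in for one fragment match and returns 0 or 1. How the toolkit decides which fragments to skip in mixed chains is an assumption here; only the stop after the first hit in an or expression is documented.

```tcl
# Illustrative sketch only, not the toolkit implementation.
# "terms" alternates scripts and operators: script op script op script ...
# Each script stands in for one fragment match and must return 0 or 1.
proc evalChain {terms} {
    set result [uplevel 1 [lindex $terms 0]]
    foreach {op script} [lrange $terms 1 end] {
        switch -- $op {
            && {
                # Left side already false: this fragment cannot change the result
                if {$result} { set result [uplevel 1 $script] }
            }
            || {
                # Left side already true: this fragment cannot change the result
                if {!$result} { set result [uplevel 1 $script] }
            }
            ^^ {
                # Exclusive or always needs both sides
                set result [expr {$result != [uplevel 1 $script]}]
            }
            default { error "unknown operator: $op" }
        }
    }
    return $result
}

# Stand-in for a fragment match which records that it was checked
proc frag {name value} { lappend ::checked $name; return $value }

set checked {}
set r [evalChain [list {frag A 1} || {frag B 0} && {frag C 0}]]
puts "result: $r, checked: $checked"
```

With conventional precedence, A || B && C would be grouped as A || (B && C) and succeed for these values. Evaluated left to right it is (A || B) && C, which fails, and fragment B is never checked.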

Eli Lilly extended SMARTS

As described above, the toolkit has near-complete support for the published Eli Lilly SMARTS extensions, including match count prefixes, custom attributes, attribute count operators and chained matches.

Extended hydrogen handling

The H symbol may be used as an explicit hydrogen atom outside brackets, even though it is not in the official organic subset element set.