The pubchem Command Extension

The pubchem command enables the toolkit to talk directly to the NCBI PubChem database server. It uses a direct database client connection rather than any of the public APIs. The command therefore only works in environments that have network access to the NCBI server farm or to an in-house clone.

This command is implemented as a Cactvs command extension, not a standard Tcl module, because it links into Cactvs-specific internal data structures and cannot be loaded into a standard Tcl interpreter.

The command extension can be loaded explicitly with

cmdx load pubchem

The command is also auto-loaded in standard interpreters if the command extension module can be found in the search path. It is a built-in command in some versions of the csweb CGI interpreter.
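
Because the extension is not always installed, a script may want to check for it before issuing queries. The following is a minimal sketch of such a check. The helper name load_pubchem_extension is hypothetical and not part of the toolkit; only the cmdx load pubchem call is taken from the text above.

```tcl
# Hypothetical helper: try to load the extension and report success.
# Returns 1 if "cmdx load pubchem" succeeded, 0 otherwise.
proc load_pubchem_extension {} {
    if {[catch {cmdx load pubchem} msg]} {
        puts stderr "pubchem extension not available: $msg"
        return 0
    }
    return 1
}
```

A script can then skip all database work when the helper returns 0.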

The command is currently not part of the standard toolkit distribution.

There is currently no Python interface for this extension.

These are the subcommands of the extension:

pubchem cigs

pubchem cigs cidvalue ?type?

Get structure identity group information of a CID. If no type parameter is given, or it is given as all, the full set of CIGs is returned as a list in the order tautomer, connectivity, stereo, isotope and exact.

If a type parameter is specified, as one of the allowed values all, tautomer, connectivity, isotope and exact or its abbreviation a, t, c, i or e, only the selected group identifier is reported.

If the CID is not found in the database, an error is reported.

Example:

echo [pubchem cigs $cid t]
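
The documented order of the full result list can be used to label the individual group identifiers. This is a minimal sketch under that assumption. The helper name cigs_to_dict and the CID value 2244 are illustrative only.

```tcl
# Hypothetical helper: label the five CIG values with their group names,
# using the documented order of the full result list.
proc cigs_to_dict {cigs} {
    set names {tautomer connectivity stereo isotope exact}
    if {[llength $cigs] != [llength $names]} {
        error "expected [llength $names] CIG values, got [llength $cigs]"
    }
    set result [dict create]
    foreach name $names value $cigs {
        dict set result $name $value
    }
    return $result
}

# Usage. The catch skips the query if the pubchem command is missing
# or the CID is not found in the database.
if {![catch {pubchem cigs 2244} cigs]} {
    set groups [cigs_to_dict $cigs]
    puts "connectivity group: [dict get $groups connectivity]"
}
```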

pubchem dump

pubchem dump ?updateonly? ?settimestamp? liveVar deadVar

This command is used to get information about changes in the main PCCompound database since the last query. By default, both the updateonly and settimestamp flags are unset. The command sets the two variables named in the arguments to a list of CID identifiers. If the updateonly flag is set, only CIDs which have changed (i.e. were added, modified, or deleted) since the last time stamp setting are reported. Records which are still in the database are stored in the live variable, deleted records are returned in the dead variable. If the updateonly flag is not set, all database identifiers are returned.

The settimestamp flag controls whether this query should automatically set the processing time stamp of the returned records to the current time, thus marking the database records as synchronized. Records with an updated time stamp are excluded from the result list of further pubchem dump commands with the updateonly flag set.

This command can take a couple of seconds to execute, and the result lists can be large (a million or more list elements).
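
An incremental check for changed records could look like the following sketch. It assumes that the updateonly and settimestamp flags are passed as positional boolean words, which the synopsis above does not spell out. The helper name summarize_dump and the variable names are hypothetical.

```tcl
# Hypothetical helper: summarize the two CID lists set by pubchem dump.
proc summarize_dump {live dead} {
    return [format "%d live, %d deleted" [llength $live] [llength $dead]]
}

# Assumption: both flags are given as boolean words.
# updateonly=1 restricts the result to changed records, settimestamp=0
# leaves the time stamps alone, so the same changes are reported again
# by the next query.
if {![catch {pubchem dump 1 0 liveCids deadCids} msg]} {
    puts [summarize_dump $liveCids $deadCids]
} else {
    puts stderr "pubchem dump failed: $msg"
}
```

Leaving settimestamp unset is useful for a dry run. Setting it marks the reported records as synchronized.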

pubchem fetchblob

pubchem fetchblob sid sidvalue
pubchem fetchblob cid cidvalue

This command retrieves a binary ASN.1 structure record blob from the database. The command comes in two variants. Retrieval via an SID returns the complete structure record, with all embedded structure forms and their CIDs, while access via a CID only yields that structure and its property data.

The raw blob data is returned as the result. If the queried SID or CID does not exist in the database, an error is raised.

For historical reasons, the command can also be used without the sid or cid access type identifier. This form is equivalent to access via an SID.

Example:

set blob [pubchem fetchblob cid 999]
# Load the ASN.1 I/O module, then read an ensemble from the blob
filex load asnb
set fh [molfile open $blob s]
set eh [molfile read $fh]

This command sequence creates an ensemble object from the ASN.1 blob. Note that in most cases the pubchem fetchens command is more convenient to use for this purpose. Structure processing options should be completely disabled in order to avoid any change when reading compound data from raw blobs, as in

molfile set $fh readflags {}
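
Put together, the steps above can be sketched as a single procedure that clears the read flags before the record is decoded. The helper name blob_to_ens is hypothetical. The molfile close call is an addition for handle cleanup and is not part of the original example.

```tcl
# Hypothetical helper: decode a raw ASN.1 blob into an ensemble
# without any structure processing.
proc blob_to_ens {blob} {
    filex load asnb
    set fh [molfile open $blob s]
    # Disable all read-time structure processing before decoding
    molfile set $fh readflags {}
    set eh [molfile read $fh]
    # Release the file handle; the ensemble handle remains valid
    molfile close $fh
    return $eh
}
```

A call such as blob_to_ens [pubchem fetchblob cid 999] then returns the ensemble handle.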

pubchem fetchens

pubchem fetchens sid sidvalue
pubchem fetchens cid cidvalue

This command retrieves an ensemble from the PubChem database via an SID or CID identifier. The return value of the command is a new ensemble handle. In case an SID or CID is not found in the database, an error is raised.

If the retrieval is made via a CID, only that CID and its associated data are returned. For access via an SID, the full content of the ASN.1 record is encoded in the returned ensemble. The structure whose handle is returned has the connectivity of the deposited structure. Standardized compounds and other structure variants of the deposited structure are attached to this base structure as one or more E_NCBI_COMPOUND properties of datatype ensemble. These property-encoded secondary ensembles store the data registered for them in the database in their own independent sets of properties. In theory, this could include further structure derivatives that are again stored as E_NCBI_COMPOUND properties.

For historical reasons, the command can also be used without the sid or cid access type identifier. This form is equivalent to access via an SID.

All structure processing options are disabled when decoding the blob into the ensemble, so the returned structure is a faithful representation of the original data, including all bond types, bond annotations and charges.

Example:

set eh [pubchem fetchens sid 999]
set ehstd [ens show $eh E_NCBI_COMPOUND]
set stdcid [ens show $ehstd E_NCBI_COMPOUND_ID(id)]

This example retrieves a full PubChem record via an SID, isolates the first structure variant encoded in the record (which is the default standardized form), and then reads out from that standardized form its CID.
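
Since an unknown identifier raises an error, batch scripts usually wrap the retrieval in catch. The following sketch repeats the example above with that guard. The helper name standard_cid_for_sid is hypothetical, and returning an empty string for a failed lookup is a design choice of this sketch, not behavior of the extension.

```tcl
# Hypothetical helper: return the CID of the first standardized form
# of an SID record, or an empty string if the record cannot be fetched.
# Note that catch does not distinguish a missing SID from other
# failures such as a lost database connection.
proc standard_cid_for_sid {sid} {
    if {[catch {pubchem fetchens sid $sid} eh]} {
        return ""
    }
    set ehstd [ens show $eh E_NCBI_COMPOUND]
    return [ens show $ehstd E_NCBI_COMPOUND_ID(id)]
}
```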

pubchem setdbhosts

pubchem setdbhosts hostlist

This command changes the default set of database cluster hosts from the compiled-in default. The hostlist parameter is a list of one or more host names in standard Tcl notation.

Example:

pubchem setdbhosts [list DDDSQL10 DDDSQL11]

pubchem sidlist

pubchem sidlist cidlist

This command returns a nested list of SIDs associated with CIDs. For each CID in the cidlist parameter a list element is returned which contains the list of associated SIDs.

Example:

set cidlist [list 1 2 3]
set sidlist [pubchem sidlist $cidlist]
foreach cid $cidlist sidset $sidlist {
	puts "CID $cid is associated with [llength $sidset] SIDs"
}
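
Because the result is ordered like the input list, the two lists can be combined into a lookup table. This is a minimal sketch using a standard Tcl dict. The helper name cid_sid_map is hypothetical.

```tcl
# Hypothetical helper: pair each CID with its list of SIDs, relying on
# the result of pubchem sidlist being in the same order as the input.
proc cid_sid_map {cidlist sidlist} {
    set map [dict create]
    foreach cid $cidlist sidset $sidlist {
        dict set map $cid $sidset
    }
    return $map
}

# Usage. The catch skips the query if the pubchem command is missing.
set cids [list 1 2 3]
if {![catch {pubchem sidlist $cids} sidsets]} {
    set map [cid_sid_map $cids $sidsets]
    puts "SIDs of CID 2: [dict get $map 2]"
}
```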

pubchem sids

pubchem sids cid

This command returns a list of all SIDs a CID is associated with. An error is raised if the CID is not found in the database.

pubchem synonyms

pubchem synonyms sid sidvalue
pubchem synonyms cid cidvalue

Get the list of synonyms associated with an SID or CID. The command returns a string list. If there are no synonyms, or the identifier is not found in the database, an error is raised.

The synonyms list contains only names registered in the global synonyms data block of the ASN.1 specification. It does not report any additional names which may be stored in property data areas of individual compounds of the record.
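
Since a record without synonyms raises an error rather than returning an empty list, callers that treat missing names as a normal case may want a small wrapper. The following is a sketch only. The helper name synonyms_or_empty is hypothetical, and the wrapper deliberately hides the difference between an unknown identifier and a record without names.

```tcl
# Hypothetical helper: return the synonym list, or an empty list if
# the query raises an error (no synonyms, unknown identifier, or the
# pubchem command itself is unavailable).
proc synonyms_or_empty {type id} {
    if {[catch {pubchem synonyms $type $id} names]} {
        return {}
    }
    return $names
}

# Usage with an arbitrary example CID
foreach name [synonyms_or_empty cid 2244] {
    puts $name
}
```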

pubchem subcommands

pubchem subcommands

Return a list of the subcommands of the pubchem command.
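
The returned list can be used to test whether a given subcommand exists in the installed version of the extension. A minimal sketch, assuming Tcl 8.5 or later for the in operator; the helper name pubchem_supports is hypothetical.

```tcl
# Hypothetical helper: return 1 if the named subcommand is reported by
# "pubchem subcommands", 0 if it is not or the command is unavailable.
proc pubchem_supports {name} {
    if {[catch {pubchem subcommands} cmds]} {
        return 0
    }
    return [expr {$name in $cmds}]
}
```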