[OTDev] Luca Settimo

Vedrin Jeliazkov vedrin.jeliazkov at gmail.com
Thu Aug 4 14:28:56 CEST 2011


Hi Luca,

> could you give me some more info on the databases that you collected for AMBIT?

The database dump that is available at
http://ambit.uni-plovdiv.bg/downloads/ambit2/db/ambit2-2011051401.7z
contains the following datasets:

ECHA list of pre-registered substances (143835 entries)
ChemIDplus (structures for 80468 chemicals from the ECHA list of
  pre-registered substances)
Chemical Identifier Resolver (structures for 72985 chemicals from the
  ECHA list of pre-registered substances)
ChemDraw (structures for 22519 chemicals from the ECHA list of
  pre-registered substances)
CPDBAS (1547 entries)
DBPCAN (209 entries)
EPAFHM (617 entries)
FDAMDD (1216 entries)
HPVCSI (3548 entries)
HPVISD (1006 entries)
IRISTR (544 entries)
KIERBL (278 entries)
NCTRER (232 entries)
NTPBSI (2330 entries)
NTPHTS (1408 entries)
ISSCAN (1150 entries)
ISSMIC (151 entries)
ISSSTY (232 entries)
TOXCST (320 entries)
TXCST2 (960 entries)
ECETOC Technical Report No. 66 Skin irritation and corrosion Reference
  Chemicals data base (1995) (176 entries)
Local Lymph Node Data for the Evaluation of Skin Sensitization -
  Compilation of historical data (Dermatitis Vol 16 No 4 2005)
  (209 entries)
Local Lymph Node Data for the Evaluation of Skin Sensitization -
  Second compilation (Dermatitis Vol 21 No 1 2010) (108 entries)
Bioconcentration factor (BCF) Gold Standard Database (1130 entries)
Benchmark Data Set for pKa Prediction of Monoprotic Small Molecules
  the SMARTS Way (185 entries)
Benchmark Data Set for In Silico Prediction of Ames Mutagenicity (6512 entries)
Bursi AMES Toxicity Dataset (4337 entries)
EPI_AOP (818 entries)
EPI_BCF (685 entries)
EPI_BioHC (175 entries)
EPI_Biowin (1263 entries)
EPI_Boil_Pt (5890 entries)
EPI_Henry (1829 entries)
EPI_KM (631 entries)
EPI_KOA (308 entries)
EPI_Kowwin (15809 entries)
EPI_Melt_Pt (10051 entries)
EPI_PCKOC (788 entries)
EPI_VP (3037 entries)
EPI_WaterFrag (5764 entries)
EPI_Wskowwin (2348 entries)
TOXCST_ACEA (320 entries)
TOXCST_Attagene (320 entries)
TOXCST_BioSeek (320 entries)
TOXCST_Cellumen (320 entries)
TOXCST_CellzDirect (320 entries)
TOXCST_Gentronix (320 entries)
TOXCST_NCGC (320 entries)
TOXCST_Novascreen (320 entries)
TOXCST_Solidus (320 entries)
TOXCST_ToxRefDB (320 entries)
ECBPRS (structures and data for 80410 chemicals from the ECHA list of
  pre-registered substances)
OPSIN (structures for 78458 chemicals from the ECHA list of
  pre-registered substances)

You can also access all of the above mentioned datasets at
https://ambit.uni-plovdiv.bg:8443/ambit2/dataset after you login with
your OpenTox username and password at
https://ambit.uni-plovdiv.bg:8443/ambit2/opentoxuser (You can register
as an OpenTox user at http://www.opentox.org/join_form if you haven't
already).

In addition to these datasets, you can access the PubChem Structures +
Assays dataset (473965 entries) at the same location. It is not
included in the downloadable MySQL dump, in order to keep the dump
compact.

Please note that some additional datasets (not listed above, but
available in the DB) are accessible only by OpenTox partners, due to
specific licensing requirements and agreements.

> Are you aware of this paper?

[http://dx.doi.org/10.1016/j.taap.2009.08.022]

> Perhaps you will find very useful Table 1 because it shows all databases for tox that are available in the literature. Which of these
> do you have?

As you can see from the list above, there's some degree of overlap
between the references in Table 1 of this paper and the datasets
included in the OpenTox DB, but both have entries that are absent in
the other list. One major obstacle for including some of the sources
that you mention is the lack of computer-readable bulk download for
them. In addition, the AMBIT database is evolving continuously (even
as I write these lines) and it can be somewhat hard to tell what's
included and what's not -- all registered users with sufficient
privileges can add datasets at any time. In general, the OpenTox
framework (and AMBIT as one particular implementation of the OpenTox
API) provides the infrastructure to store and process relevant data,
much as the Apache HTTP server provides the infrastructure for serving
web site content. It's up to the users to upload whatever datasets,
algorithms, models, etc. they would like to use or make
available to others. So, in essence, the OpenTox DB is a kind of
starting reference point, with particular emphasis on datasets that
are relevant to the European REACH legislation, mainly due to the
specific context of the OpenTox project. However, the OpenTox
framework was designed in a generic way, to enable its use in other
domains as well. It's up to the users to install, populate, run and
maintain their own instances of OpenTox services. Furthermore, due to
the common API, these services could be linked together and rely on
each other for executing specific tasks (e.g. an algorithm provided by
service A can be used to build a model by service B, using a training
dataset available at service C; the model at service B could be
validated by service D and used to predict properties for a dataset
hosted at service E, etc.). You can have all of these running on a
single box, or on a private cluster, or as (distributed) services that
you offer to the public to use.
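
To make the chaining idea above a bit more concrete, here is a minimal
Python sketch. It is not the actual client code of any OpenTox
service: the function names, the example URIs and the form parameter
names (dataset_uri, model_uri) are assumptions chosen for illustration,
and the HTTP POST is replaced by a stub so the sketch runs without any
network access. The only point it shows is that each step receives a
URI and returns a new URI, which is what lets services A to E be
combined freely.

```python
from typing import Callable, Dict, List, Tuple

# A "poster" sends form parameters to a service URI and returns the URI
# of the resource that the service created. In a real setup this would be
# an HTTP POST; here it is injected so that it can be stubbed out.
Poster = Callable[[str, Dict[str, str]], str]


def run_chain(post: Poster,
              algorithm_uri: str,
              training_dataset_uri: str,
              validation_uri: str,
              target_dataset_uri: str) -> Dict[str, str]:
    """Chain hypothetical services purely by passing URIs around."""
    # Algorithm (service A) is applied to the training dataset (service C);
    # the result is a model URI (hosted wherever the service decides).
    model_uri = post(algorithm_uri, {"dataset_uri": training_dataset_uri})
    # The validation service (service D) is pointed at the model.
    validation_report_uri = post(validation_uri, {"model_uri": model_uri})
    # The model is applied to a dataset hosted elsewhere (service E).
    prediction_uri = post(model_uri, {"dataset_uri": target_dataset_uri})
    return {
        "model": model_uri,
        "validation": validation_report_uri,
        "prediction": prediction_uri,
    }


def make_recording_post(log: List[Tuple[str, Dict[str, str]]]) -> Poster:
    """Build a stub poster that records each call and invents a result URI."""
    def post(uri: str, form: Dict[str, str]) -> str:
        log.append((uri, dict(form)))
        return f"{uri}/created/{len(log)}"
    return post


if __name__ == "__main__":
    calls: List[Tuple[str, Dict[str, str]]] = []
    result = run_chain(
        make_recording_post(calls),
        algorithm_uri="http://a.example/algorithm/1",      # hypothetical
        training_dataset_uri="http://c.example/dataset/1",  # hypothetical
        validation_uri="http://d.example/validation",       # hypothetical
        target_dataset_uri="http://e.example/dataset/2",    # hypothetical
    )
    for name, uri in result.items():
        print(name, uri)
```

Replacing the stub with a function that performs a real HTTP POST
(plus whatever authentication the service requires) would be the next
step; the chaining logic itself would not need to change.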

> So Barry told me that you have a linux version of tox-create/tox-predict? Is that true?

See my previous mail and Micha's mail for detailed answers to these
questions. The apps are platform independent and can run on any OS.
ToxPredict and its dependencies are Java-based, ToxCreate and its
dependencies are Ruby-based.

As a somewhat easier first step you might want to try the OpenTox
virtual appliance, which has all of these apps pre-installed for you
on a recent version of Linux:

http://ambit.uni-plovdiv.bg/downloads/opentox/Opentox%20Virtual%20Appliance%20DC.ova

Please note that this is a large file (2730474496 bytes). To make sure
that no errors have occurred while downloading it, you can check its
md5 checksum, which is: 1530bb83e88c3c646bcbac3183745bab
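
If you don't have an md5 tool at hand, the check can be done with a
few lines of Python. This is just a sketch: the local file name is an
assumption (whatever name you saved the download under), and the file
is read in chunks so that the 2.7 GB appliance doesn't have to fit in
memory.

```python
import hashlib

# Checksum quoted in this message for the virtual appliance download.
EXPECTED_MD5 = "1530bb83e88c3c646bcbac3183745bab"


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex md5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's md5 digest with the expected one, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to where the download was saved.
    local_path = "Opentox Virtual Appliance DC.ova"
    print("checksum OK" if matches_expected(local_path) else "checksum MISMATCH")
```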

You could import and run the appliance in VirtualBox
(http://www.virtualbox.org/).

Let us know if we can be of further assistance.

Kind regards,
Vedrin


