[OTDev] scripts to extract toxicity data from echa site

Vedrin Jeliazkov vedrin.jeliazkov at gmail.com
Tue Jun 7 15:10:46 CEST 2011


Hi Barry, All,

On 7 June 2011 15:29, Barry Hardy <barry.hardy at douglasconnect.com> wrote:
> Dear All:
> It might be worthwhile for the developer community to write scripts to
> extract public REACH dossier toxicity data from the ECHA website to make it
> available in a more suitable form for scientific purposes including model
> building, improving models etc.

While certainly feasible (at least to some extent), such scripts
wouldn't perform the foreseen task in an optimal way from a technical
point of view. The scripts would basically involve the following
steps:

-- mirror the relevant HTML content from ECHA site;
-- run some heavy post-processing in order to extract valuable bits of
information in a structured and machine readable format;
-- populate some DB backend with the interesting data and expose it
through an OT dataset (or similar) service.
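
To make the post-processing and DB steps above concrete, here is a
minimal sketch in Python (standard library only). It is an illustration
under stated assumptions, not a working ECHA scraper: the HTML fragment,
the table layout, the sample values and the names TableRowParser,
extract_records and load_records are all hypothetical. The mirroring
step is left out, and exposing the result through an OT dataset service
is not shown.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical stand-in for one mirrored dossier page. The real ECHA markup
# will differ, and the values below are made-up placeholders.
SAMPLE_HTML = """
<table>
<tr><th>Endpoint</th><th>Value</th><th>Unit</th></tr>
<tr><td>LD50 oral rat</td><td>2000</td><td>mg/kg bw</td></tr>
<tr><td>LC50 fish 96h</td><td>12.5</td><td>mg/L</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty
            # and are skipped here.
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def extract_records(html):
    """Post-processing step: turn HTML rows into (endpoint, value, unit)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [(name, float(value), unit) for name, value, unit in parser.rows]


def load_records(records):
    """DB step: populate an in-memory SQLite table with the records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE toxicity (endpoint TEXT, value REAL, unit TEXT)")
    conn.executemany("INSERT INTO toxicity VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = load_records(extract_records(SAMPLE_HTML))
    for row in conn.execute("SELECT endpoint, value, unit FROM toxicity"):
        print(row)
```

Even in this toy form, the parsing rules are tied to one specific page
layout, which is the fragility argued against below.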

Bearing in mind that:

1) ECHA already has this data in a DB (IUCLID5) and that this DB has a
well documented webservices interface,
2) AMBIT includes a module for data exchange with IUCLID5 through
this webservices interface,

IMHO a much more technically sound way of using this data would be
through such a setup. This would, however, require talking to the right
people at ECHA and convincing them to publish the data through IUCLID5
webservices in addition to the plain HTML they currently have.

> It should also be done in a way that is legal.

ECHA's disclaimer
(http://echa.europa.eu/disclaimer_en.asp#registration) includes the
following sentence:

"Use the information with care. Reproduction or further distribution
of the information is subject to copyright laws and might require the
permission of the owner of that information."

Obviously there are some open legal issues, apart from the technical
ones, that require further discussion with ECHA.

> What do you think?

I wouldn't bother doing anything before we have a clear statement from ECHA on:

-- whether they plan to publish dossier data through an IUCLID5 web service;
-- whether this data could be used for model building, validation or
any other cheminformatics purposes one could be interested in.

Just my two cents,
Vedrin


