[OTDev] Are there some sample dataset services available ?

Nina Jeliazkova nina at acad.bg
Tue Feb 16 07:57:20 CET 2010


Jörg,

Jörg Kurt Wegner wrote:
> Nina, Surajit,
>
>   
>> http://ambit.uni-plovdiv.bg:8080/ambit2/dataset
>> The formats  (RDF, MOL, SMILES, CSV, arff, CML) can be retrieved via
>> specifying the corresponding mime type.
>>     
> Nice, I admit I am not reading all the posts on this list and you might have answered this already earlier.
>   

> Anyway, I gotta ask:
>
> 1. Some of the data sets are simply empty, at least the first few in the list. Why?
>   
http://ambit.uni-plovdiv.bg:8080/ambit2/dataset is a testing site, and several partners have been testing various things there, including early and unstable implementations, plain data uploads, etc.

A new site is available at http://apps.ideaconsult.net:8180/ambit2, which is intended to be the stable one.
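
To illustrate retrieving a dataset in a given format by mime type, here is a minimal Python sketch. The mime type strings and the dataset id in the usage note are assumptions for illustration and may differ from what the service actually accepts; build_dataset_request is a hypothetical helper, not part of any OpenTox client.

```python
from urllib.request import Request

# Assumed mime types per format; the service may expect different values.
MIME_TYPES = {
    "rdf": "application/rdf+xml",
    "mol": "chemical/x-mdl-molfile",
    "smiles": "chemical/x-daylight-smiles",
    "csv": "text/csv",
    "arff": "text/x-arff",
    "cml": "chemical/x-cml",
}


def build_dataset_request(dataset_uri: str, fmt: str) -> Request:
    """Build a GET request that asks for the dataset in the given format.

    The format is selected through the HTTP Accept header, so the
    dataset URI itself stays the same for every representation.
    """
    try:
        mime = MIME_TYPES[fmt.lower()]
    except KeyError:
        raise ValueError(f"unknown format: {fmt!r}") from None
    return Request(dataset_uri, headers={"Accept": mime})
```

Passing the returned request to urllib.request.urlopen, for example for http://apps.ideaconsult.net:8180/ambit2/dataset/1 (a made-up dataset id) with format "csv", should then return the CSV representation if the service honours the header.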

> 2. Cross-indexing could be clearly enriched by enabling InChIKeys http://www.iupac.org/inchi/release102final.html
> and then using one of the services around for pulling more indices and data, e.g.
> http://inchis.chemspider.com/
> http://cactus.nci.nih.gov/chemical/structure
>   
Data retrieved from those services are already in ambit, with the
added value of being compared against other sources (we just circulated
a report on it among partners).  I do agree we need to extend the RDF
representation to include cross-linking based on InChI (and other
identifiers) as well.
> 3. In other words just in-case some structures might need curation I would rather prefer seeing the correct ones pulled from ChemSpider and you just host identifiers and tox endpoints ;-)
>   

We are doing curation, based on retrieving structures from 4-5
different sources and assigning quality labels.  In our experience,
different sources have an error rate of about 1-10%, and we would rather
not rely on a single source, whether it is ChemSpider or PubChem.

One current problem is that there is no slot in the API/RDF to expose
these labels.
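
As a rough illustration of the idea (not the actual AMBIT curation code), comparing the structures reported by several sources and assigning a quality label could look like the Python sketch below. The label names and the majority-vote rule are assumptions made for this example.

```python
from collections import Counter


def label_structure(structures_by_source: dict[str, str]) -> tuple[str, str]:
    """Return (most_common_structure, quality_label) for one compound.

    structures_by_source maps a source name to the structure that source
    reports, e.g. as an InChI string. The labels are hypothetical.
    """
    if not structures_by_source:
        raise ValueError("need at least one source")
    counts = Counter(structures_by_source.values())
    (best, votes), *rest = counts.most_common()
    total = len(structures_by_source)
    if total == 1:
        label = "unconfirmed"   # nothing to compare against
    elif votes == total:
        label = "consensus"     # every source agrees
    elif rest and rest[0][1] == votes:
        label = "ambiguous"     # tie between competing structures
    else:
        label = "majority"      # most, but not all, sources agree
    return best, label
```

With three sources where two report the same InChI and one differs, this returns the shared InChI with the label "majority"; the disagreeing source is then a candidate for manual inspection.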

Hosting just identifiers and endpoints would be less flexible and not
entirely aligned with OpenTox goals.  For example, it would prevent
hosting calculated descriptors for a particular 3D structure, and would
prevent users (in future) from uploading their own data with private or
public access.

> 4. Finally, are there json data fetching options, too? I guess this is easier for (me) linking multiple sources in a browser, scripting, or wrapper approach. Again, a universal chemistry ID like InChIKey or ChemSpiderID is much appreciated.
>   
No JSON yet; in the next API version we'll try to figure out how to
serialize RDF to JSON.
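
Just to make the question concrete, one possible shape (an assumption for discussion, not a decided OpenTox format) is to group triples as subject -> predicate -> list of objects. A minimal Python sketch, where triples_to_json is a hypothetical helper and the triples are plain string tuples:

```python
import json
from collections import defaultdict


def triples_to_json(triples) -> str:
    """Serialize (subject, predicate, object) string triples to JSON.

    The result is a nested object: subject -> predicate -> [objects].
    Literals and URIs are not distinguished in this simplified sketch.
    """
    graph = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)
    return json.dumps(graph, indent=2, sort_keys=True)
```

A real serialization would also need to carry datatypes, language tags and blank nodes, which this sketch ignores.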

InChIKeys are available for most of the compounds, but are not used as
the unique identifier.  Just to note, InChIKey is a hashed identifier and
theoretically not unique, so it was decided not to use it as a
compound identifier within OpenTox.  Links to ChemSpider, PubChem ID,
ChemIDplus, IUCLID5 and other possible sources will be exposed in future
releases.
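
To show why a hashed identifier cannot be guaranteed unique, here is a toy Python sketch. It is not the InChIKey algorithm: it truncates SHA-256 to only 16 bits (an assumption chosen so that a collision turns up quickly) and hashes made-up names instead of InChI strings. A real InChIKey keeps far more bits, so collisions are much rarer, but the principle is the same.

```python
import hashlib


def short_hash(text: str, bits: int = 16) -> int:
    """Return the top `bits` bits of the SHA-256 digest of text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int = 16) -> tuple[str, str]:
    """Find two different made-up names with the same truncated hash.

    By the pigeonhole principle this must stop after at most
    2**bits + 1 names, since only 2**bits hash values exist.
    """
    seen = {}
    n = 0
    while True:
        name = f"compound-{n}"
        h = short_hash(name, bits)
        if h in seen:
            return seen[h], name
        seen[h] = name
        n += 1
```

Calling find_collision() returns two distinct names that share the same 16-bit value, which is exactly the situation a unique compound identifier must never allow.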

Thanks for the feedback!

Nina
> Cheers, Joerg
>
> http://miningdrugs.blogspot.com/
> http://www.google.com/profiles/joergkurtwegner
>   



