[OTDev] Datasets with Features for multi entity relationships ?
Nina Jeliazkova jeliazkova.nina at gmail.com
Mon Nov 29 09:35:41 CET 2010

Dear Christoph, Surajit, All,

This discussion is very useful. As a result of myself trying to understand
both points of view, now we have MCSS algorithm as ambit service (thanks to
CDK SMSD package).

https://ambit.uni-plovdiv.bg:8443/ambit2/algorithm/mcss

It can be applied to a dataset and generates a model, where predicted
features (MCSS in this case) are available via ot:predictedVariables
(example https://ambit.uni-plovdiv.bg:8443/ambit2/model/26469/predicted )

The features use current API, without any change (although having
ot:Substructure subclass of ot:Feature will make it more clear). All the
MCSS substructures can be used by any learning algorithm, as they are
standard ot:Features.

Here are more details and proposal (start from *Substructure API proposal
heading*)
http://opentox.org/dev/apis/api-1.2/substructure-api-wishlist

Best regards,
Nina

P.S. Please note the /mcss algorithm might be slow for large datasets, there
are several improvements that we'll be applying performance wise, but this
will not change the API.

On 25 November 2010 18:13, Christoph Helma <helma at in-silico.ch> wrote:

> Excerpts from surajit ray's message of Thu Nov 25 14:49:19 +0100 2010:
> > >
> > > This type of representation (we are using it internally) has served well
> > > for our datasets which might contain also several (10-100) thousand
> > > substructures for a few thousands compounds. I also do not think, that
> > > the representation is redundant:
> > > - each compound is represented once
> > > - each substructure is represented once
> > > - each association between compound and substructure is represented once
> > > Please correct me, if I am missing something obvious.
> >
> > According to this representation each dataEntry for a compound will
> > have to have all substructure features that were found in them.
> > Therefore each dataEntry may have 1000-10000 feature/featureValue
> > pairs.
> > For 500 dataEntries that means on average
> > 500*5000 (assuming 5000 substructures) = 2,500,000 feature/featureValue
> > pairs - that's 2.5 million!
>
> In our case it is a lot less (not completely sure about your feature
> types), because only a very small subset of features occurs in a single
> compound.
>
> > versus just having a featureset with
> > 5000 feature entries. You can imagine the difference in cost of
> > bandwidth, computation etc.
>
> I am not sure if I get you right, but where do you want to store the
> relationships between features and compounds? If there are really 2.5
> million associations you have to assert them somewhere. And having features
> without compounds seems to be quite useless to me.
>
> > > Adding "false" occurrences would not violate the current API (but would
> > > add redundant information). Keep in mind that the dataset representation
> > > is mainly for exchanging datasets between services - internally you can
> > > use any data structure that is efficient for your purposes (we also do
> > > that in our services). So if you need fingerprints internally, extract
> > > them from the dataset.
> >
> > Internalizing an intermediate step completely serves the purpose but
> > leads to less flexible design paradigms. If we internalize the
> > workflow from substructure extraction to fingerprinting - we will lose
> > the ability to provide the data to a third party server for an
> > independent workflow. Of course the reasoning could be "who needs it?"
> > - well, you never know!!
>
> I am very interested in exchanging "fingerprints" with other services,
> but that can be done already with the current API. I see fingerprints as
> sets of features that are present in a compound (also using set
> operations to calculate similarities), and find it fairly
> straightforward to parse/serialize them to/from datasets.
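
The idea of treating fingerprints as sets of features extracted from a dataset can be sketched roughly as below. This is only an illustration, not code from any OpenTox service: the compound and feature names are made up, the dataset is assumed to be a plain list of compound/feature associations (one entry per association), and the Tanimoto coefficient is just one possible choice of set-based similarity, since the thread only mentions "set operations".

```python
from collections import defaultdict

# Hypothetical sparse dataset: one entry per compound/substructure association.
# Only "true" occurrences are stored; absent features are simply not listed.
associations = [
    ("compound1", "mcss1"),
    ("compound1", "mcss2"),
    ("compound2", "mcss2"),
    ("compound2", "mcss3"),
    ("compound3", "mcss3"),
]


def fingerprints(pairs):
    """Group associations into one set of features per compound."""
    fps = defaultdict(set)
    for compound, feature in pairs:
        fps[compound].add(feature)
    return dict(fps)


def tanimoto(a, b):
    """Set-based similarity: |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


fps = fingerprints(associations)
all_features = set().union(*fps.values())

# Sparse storage needs one entry per association; a dense table would need
# one feature/featureValue pair for every compound/feature combination.
sparse_pairs = len(associations)
dense_pairs = len(fps) * len(all_features)

similarity = tanimoto(fps["compound1"], fps["compound2"])
print(sparse_pairs, dense_pairs, similarity)
```

With real data the gap between the sparse and dense counts depends on how many substructures actually occur per compound, which is the point both sides of the thread are arguing about.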
>
> > >> I still suggest having a FeatureSet/SubstructureSet type object within
> > >> the API to make it convenient to club features without compound
> > >> representations.
> > >
> > > I prefer to keep the API as generic as possible and not to introduce
> > > ad-hoc objects (or optimizations) for special purposes - otherwise it
> > > will be difficult to maintain services in the long term. Why don't you
> > > use ontologies for grouping features?
> >
> > Grouping features using ontologies is clubbing the features, not the
> > feature values.
>
> But you cannot have feature values without relating features to
> compounds. If you use the representation I proposed, feature values are
> "true" anyway.
>
> > So how do we know mcss3 occurring in compound X is with respect to
> > which compound? As you said, we can have arbitrary fields in the feature
> > definitions (for MCSS) - but that would be outside API definitions.
>
> features:
>   mcss3:
>     ot:compounds:
>       - compound2
>       - compound3
>     ot:smarts: smarts3
>
> In my understanding you can add any annotation you want to a feature.
>

Yes, you can, but if this is not an agreed annotation, no other service
will understand it.

Best regards,
Nina

> Best regards,
> Christoph
> _______________________________________________
> Development mailing list
> Development at opentox.org
> http://www.opentox.org/mailman/listinfo/development
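
As a rough illustration of the last exchange, the sketch below turns a feature definition annotated with a list of compounds (as in the quoted mcss3 example) into per-compound entries where each listed feature has the value "true". Everything here is an assumption made for illustration: the `ot:compounds` key is the ad-hoc annotation from the quoted example and not an agreed part of the API, and `to_data_entries` is a hypothetical helper, not an existing OpenTox function.

```python
# Hypothetical feature definitions mirroring the quoted example.
# "ot:compounds" is an ad-hoc annotation, not an agreed API property.
features = {
    "mcss3": {
        "ot:compounds": ["compound2", "compound3"],
        "ot:smarts": "smarts3",
    },
}


def to_data_entries(feature_defs, key="ot:compounds"):
    """Invert feature -> compounds annotations into compound -> {feature: True}."""
    entries = {}
    for name, annotations in feature_defs.items():
        # A feature without the expected annotation contributes nothing.
        for compound in annotations.get(key, []):
            entries.setdefault(compound, {})[name] = True
    return entries


# A service that knows the annotation key recovers the associations.
entries = to_data_entries(features)

# A service expecting a different (equally hypothetical) key recovers nothing.
unaware = to_data_entries(features, key="ot:occursIn")

print(entries)
print(unaware)
```

The second call shows why an annotation that is not agreed upon is of limited use for exchange: the information is present in the feature definition, but a service that does not know the key cannot find it.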