On this blog you can follow my degree project (a master's thesis in Bioinformatics), which has the title Pharmaceutical knowledge retrieval through reasoning of ChEMBL RDF.
My supervisor is Egon Willighagen, http://chem-bla-ics.blogspot.com/.

Thursday, 18 March 2010

Interesting SPARQL queries for QSAR and PCM data!

The following two SPARQL queries are really interesting for QSAR and proteochemometric (PCM) projects. By accessing ChEMBL data via RDF with SPARQL, I can easily retrieve the data needed to build these kinds of projects.

For a QSAR project, the following query could be used:

var forQSAR = "\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?act ?ass ?conf ?mol ?SMILES ?val ?unit WHERE { \
?act chembl:type \"IC50\" ; \
chembl:onAssay ?ass; \
chembl:forMolecule ?mol;\
chembl:standardValue ?val;\
chembl:standardUnits ?unit.\
?mol blueobelisk:smiles ?SMILES. \
?ass chembl:hasTarget ; \
chembl:hasConfScore ?conf. \
}";


Since I run my queries through Bioclipse, the SPARQL query is stored in a named variable, which makes the following calls easier to write:
var qsar = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", forQSAR)
chembl.saveCsv("/QSAR/q",qsar)


The query returns unique IDs for activities (?act), molecules (?mol) and assays (?ass), SMILES (?SMILES) for the molecules, values (?val) and units (?unit) for the activities, and confidence values (?conf) for the assays. It is also really easy to expand the query to return more data!
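
To illustrate how the query could be expanded, here is a minimal sketch that assembles a SELECT query from a list of variables and a list of triple patterns, so that returning more data means adding one entry to each list. The helper `buildSelect` is hypothetical and not part of Bioclipse, and the prefix IRIs are placeholders, not the real ChEMBL or Blue Obelisk namespaces.

```javascript
// Sketch only: buildSelect is a hypothetical helper, not a Bioclipse API.
function buildSelect(prefixes, variables, patterns) {
  var lines = [];
  for (var name in prefixes) {
    if (prefixes.hasOwnProperty(name)) {
      lines.push("PREFIX " + name + ": <" + prefixes[name] + ">");
    }
  }
  lines.push("SELECT DISTINCT ?" + variables.join(" ?") + " WHERE {");
  for (var i = 0; i < patterns.length; i++) {
    lines.push("  " + patterns[i] + " .");
  }
  lines.push("}");
  return lines.join("\n");
}

var prefixes = {
  chembl: "http://example.org/chembl#",           // placeholder IRI (assumption)
  blueobelisk: "http://example.org/blueobelisk#"  // placeholder IRI (assumption)
};
var variables = ["act", "mol", "SMILES", "val", "unit"];
var patterns = [
  "?act chembl:type \"IC50\"",
  "?act chembl:forMolecule ?mol",
  "?act chembl:standardValue ?val",
  "?act chembl:standardUnits ?unit",
  "?mol blueobelisk:smiles ?SMILES"
];

// Expanding the query: one more variable plus the pattern that binds it.
variables.push("ass");
patterns.push("?act chembl:onAssay ?ass");

var expandedQuery = buildSelect(prefixes, variables, patterns);
```

The resulting string could then be passed to rdf.sparqlRemote in the same way as forQSAR, once the placeholder IRIs are replaced with the real ones.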

The query for PCM returns unique IDs for targets (?target), molecules (?mol) and PubMed references (?pubmed), SMILES (?SMILES), protein sequences (?seq), various classifications (?l4, ?l5, ?l6), activity types (?type) and activity values (?val).
The activities are narrowed down to IC50 and Ki only, and the ion channels to Na only (the last two lines in the query).

The query looks like the following:
var kic50na ="\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?type ?target ?pubmed ?l4 ?l5 ?l6 ?mol ?SMILES ?val ?seq \
WHERE {\
?act chembl:type ?type;\
chembl:onAssay ?ass;\
chembl:forMolecule ?mol;\
chembl:standardValue ?val.\
?ass chembl:hasTarget ?target;\
chembl:extractedFrom ?journal.\
?ass chembl:hasTargetCount 1 .\
?journal ?pubmed.\
?mol blueobelisk:smiles ?SMILES.\
?target a ;\
chembl:classL3 \"VGC\" ;\
chembl:classL4 ?l4 ;\
chembl:classL5 ?l5 ;\
chembl:classL6 ?l6 ;\
chembl:sequence ?seq.\
FILTER regex(?l6, \"NA\")\
FILTER (?type = \"Ki\" || ?type = \"IC50\")\
}";
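
The last two lines of the query above act as a row-level filter. As a rough sketch of what they do, the snippet below applies the same two conditions to a few rows in plain JavaScript. The rows and their l6 values are made up for illustration and are not taken from ChEMBL. Note that SPARQL's regex, like the JavaScript test used here, is case-sensitive by default and matches anywhere in the string, so a value that merely contains "NA" also passes.

```javascript
// Hypothetical rows; the type and l6 values are invented for illustration.
var rows = [
  { type: "IC50", l6: "NA" },
  { type: "Ki",   l6: "NA" },
  { type: "EC50", l6: "NA" },   // rejected: activity type is neither Ki nor IC50
  { type: "IC50", l6: "CA" },   // rejected: l6 does not contain "NA"
  { type: "Ki",   l6: "na" },   // rejected: the match is case-sensitive
  { type: "Ki",   l6: "xNAx" }  // kept: unanchored match, "NA" occurs inside
];

// Mirrors FILTER regex(?l6, "NA") and FILTER (?type = "Ki" || ?type = "IC50").
function passesFilters(row) {
  var isSodium = /NA/.test(row.l6);
  var isWantedType = row.type === "Ki" || row.type === "IC50";
  return isSodium && isWantedType;
}

var kept = rows.filter(passesFilters);
```

If only an exact "NA" value should pass, the pattern would need anchors, for example "^NA$".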


One problem encountered here was that an assay is not always specified for a single target but sometimes for several, which led to the same information being returned for different targets. Egon solved this by creating
?ass chembl:hasTargetCount 1. That line says that an assay should have exactly one target, which gives accurate data for PCM.
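
To make the effect of the target count restriction concrete, here is a small sketch on invented data. It joins activities to the targets of their assays, once without and once with a single-target restriction. The identifiers and values are hypothetical, and the function `joinRows` only mimics the idea in plain JavaScript; it is not how the RDF store evaluates the query.

```javascript
// Hypothetical data: assay A1 has one target, assay A2 has two.
var assayTargets = {
  A1: ["T1"],
  A2: ["T2", "T3"]
};
var activities = [
  { mol: "M1", assay: "A1", val: 12 },
  { mol: "M2", assay: "A2", val: 340 }
];

// Join activities to targets, optionally keeping only single-target assays.
function joinRows(activities, assayTargets, singleTargetOnly) {
  var rows = [];
  for (var i = 0; i < activities.length; i++) {
    var act = activities[i];
    var targets = assayTargets[act.assay] || [];
    if (singleTargetOnly && targets.length !== 1) {
      continue; // skip assays measured against several targets
    }
    for (var j = 0; j < targets.length; j++) {
      rows.push({ mol: act.mol, target: targets[j], val: act.val });
    }
  }
  return rows;
}

var allRows = joinRows(activities, assayTargets, false);
var singleRows = joinRows(activities, assayTargets, true);
```

Without the restriction, the value measured for M2 appears once for T2 and once for T3, although it cannot be attributed to either target alone. With the restriction, only the unambiguous M1/T1 row remains.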
