On this blog you can follow my degree project (a master's thesis in Bioinformatics), titled Pharmaceutical knowledge retrieval through reasoning of ChEMBL RDF.
My supervisor is Egon Willighagen, http://chem-bla-ics.blogspot.com/.

Wednesday, 24 March 2010

The things you can do with a wizard . . .

Now I have started to get a feel for SPARQL, but have you?
Well, I do not want to force anyone to learn new languages all the time, so I began to develop a wizard. The wizard is far from done, but it already manages some functions, which is really cool. As you type an id or keyword, SPARQL queries are sent to http://rdf.farmbio.uu.se/chembl/snorql/ and the values are returned to the wizard. If you change your search, the old data is deleted and the new data is displayed.

A search can now be done with keywords, SMILES or a ChEBI id to find information about compounds. This search will expand as I implement biological networking to other knowledge bases (http://chebi.bio2rdf.org/sparql, for example).
If the checkbox for target is checked, a search with protein ids, keywords, EC numbers etc. takes place instead.
As you type, the table fills up with various data depending on what you search for.


The upper picture searches for targets that have some connection to sodium channels. The bottom picture searches for a ChEBI id from a SMILES. Unfortunately I don't know yet how to distinguish between strings written in the box, so the line has to end with a # at the moment. Working on solving that...
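
The "#" terminator can be sketched in a few lines of JavaScript. This is only an illustration of the idea, not the wizard's actual code: the function name parseSearchInput and the shape of the returned object are made up for the example, and it assumes that a search string counts as finished only when it ends with "#".

```javascript
// Hypothetical helper: decide whether the text in the search box is complete.
// Assumption: a search string is finished only when it ends with '#'.
function parseSearchInput(text) {
  var trimmed = text.trim();
  if (trimmed.length < 2 || trimmed.charAt(trimmed.length - 1) !== "#") {
    // Still typing (or nothing in front of the '#'): do not search yet.
    return { ready: false, term: null };
  }
  // Drop the trailing '#' and hand the rest to the search.
  return { ready: true, term: trimmed.slice(0, -1).trim() };
}

var typing = parseSearchInput("sodium chan");
var done = parseSearchInput("sodium channel#");
```

One thing to keep in mind with this scheme: "#" is also the triple-bond symbol in SMILES, so while typing something like C#N# the helper would already report the input as ready at "C#".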

Thursday, 18 March 2010

Interesting SPARQL queries for QSAR and PCM data!

The following two SPARQL queries are really interesting for QSAR and proteochemometric (PCM) projects. By accessing ChEMBL data via RDF with SPARQL I can easily retrieve the data needed to build these kinds of projects.

For a QSAR project the following query could be used:

var forQSAR = "\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?act ?ass ?conf ?mol ?SMILES ?val ?unit WHERE { \
?act chembl:type \"IC50\" ; \
chembl:onAssay ?ass; \
chembl:forMolecule ?mol;\
chembl:standardValue ?val;\
chembl:standardUnits ?unit.\
?mol blueobelisk:smiles ?SMILES. \
?ass chembl:hasTarget ; \
chembl:hasConfScore ?conf. \
}";


Since I run my queries through Bioclipse, the SPARQL query is given a name to make the following run easier:
var qsar = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", forQSAR)
chembl.saveCsv("/QSAR/q",qsar)


The query returns unique ids for activities (?act), molecules (?mol) and assays (?ass), SMILES (?SMILES) for the molecules, values (?val) and units (?unit) for the activities, and confidence scores (?conf). And it is really easy to expand the query to return more data!

The query for PCM returns unique ids for targets (?target), molecules (?mol) and PubMed references (?pubmed), SMILES (?SMILES), protein sequences (?seq), various classifications (?l4, ?l5, ?l6), activity types (?type) and activity values (?val).
The activities are narrowed down to only IC50 and Ki, and the ion channels to only Na (the last two lines in the query).

The query looks like the following:
var kic50na ="\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?type ?target ?pubmed ?l4 ?l5 ?l6 ?mol ?SMILES ?val ?seq \
WHERE {\
?act chembl:type ?type;\
chembl:onAssay ?ass;\
chembl:forMolecule ?mol;\
chembl:standardValue ?val.\
?ass chembl:hasTarget ?target;\
chembl:extractedFrom ?journal.\
?ass chembl:hasTargetCount 1 .\
?journal ?pubmed.\
?mol blueobelisk:smiles ?SMILES.\
?target a ;\
chembl:classL3 \"VGC\" ;\
chembl:classL4 ?l4 ;\
chembl:classL5 ?l5 ;\
chembl:classL6 ?l6 ;\
chembl:sequence ?seq.\
FILTER regex(?l6, \"NA\")\
FILTER (?type = \"Ki\" || ?type = \"IC50\")\
}";
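
To make the effect of the two FILTER lines concrete, here is a small JavaScript sketch that applies the same two conditions to rows that have already been fetched. The row objects and their values are invented for illustration; only the two conditions mirror the query. As with SPARQL's regex without the "i" flag, the match on "NA" is case-sensitive and matches anywhere in the string.

```javascript
// Invented example rows; real rows would come from the SPARQL endpoint.
var rows = [
  { type: "IC50", l6: "NA" },
  { type: "Ki",   l6: "NA" },
  { type: "EC50", l6: "NA" },  // rejected: activity type is neither Ki nor IC50
  { type: "IC50", l6: "CA" }   // rejected: classification does not contain "NA"
];

// Mirrors: FILTER regex(?l6, "NA")  and  FILTER (?type = "Ki" || ?type = "IC50")
function keepRow(row) {
  var matchesNa = /NA/.test(row.l6);
  var isWantedType = row.type === "Ki" || row.type === "IC50";
  return matchesNa && isWantedType;
}

var kept = rows.filter(keepRow);
```

Because the regex is not anchored, any ?l6 value that merely contains "NA" passes the first condition.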


One problem encountered here was that assays are not always specified for one target but sometimes for many, which led to the same information being returned for different targets. Egon solved this by creating chembl:hasTargetCount. The line
?ass chembl:hasTargetCount 1 says that the assay should have exactly one target, which gives accurate data for PCM.
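
The duplication problem and the effect of the hasTargetCount 1 restriction can be illustrated on plain data. The sketch below is JavaScript over invented assay records (the ids and field names are hypothetical), not a query against ChEMBL.

```javascript
// Invented assay records; "targets" lists the targets an assay is annotated with.
var assays = [
  { id: "assay1", targets: ["targetA"] },
  { id: "assay2", targets: ["targetA", "targetB"] }
];

// Without the restriction every (assay, target) pair becomes its own row,
// so the measurements of assay2 show up once for each of its targets.
function pairRows(list) {
  var out = [];
  list.forEach(function (assay) {
    assay.targets.forEach(function (target) {
      out.push({ assay: assay.id, target: target });
    });
  });
  return out;
}

// Same idea as "?ass chembl:hasTargetCount 1": keep single-target assays only.
function singleTargetOnly(list) {
  return list.filter(function (assay) {
    return assay.targets.length === 1;
  });
}

var allRows = pairRows(assays);
var cleanRows = pairRows(singleTargetOnly(assays));
```

Here allRows contains assay2 twice, once per target, while cleanRows only keeps the unambiguous assay1 row.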

Monday, 8 March 2010

Background presentation

I gave this presentation to the department last week. It's basically a presentation about the background and progress of the project. Enjoy!

Monday, 1 March 2010

Update post

I have so many half-finished sub-projects that I don't have anything interesting to blog about, hence this update post!

My sub-projects:

Looking into other syntaxes, especially the Manchester OWL syntax. In Journal Club we read the article Towards pharmacogenomics knowledge discovery with the semantic web and came across the Manchester syntax. I will blog about it when I'm done. And speaking of Journal Club, I'm also writing a review together with Jonathan, and I have to find time to read the next article...

Moss Manager needs to be rearranged since the net.bioclipse.rdf plug-in no longer returns lists of ArrayLists. It now, amazingly, returns String Matrices, which will make things so much easier, especially when I'm only interested in the SMILES part of the SPARQL result.

I'm also working on a presentation that I'm going to give on Thursday 4/3. I will try to put it up here afterwards. It's about the background and status of this project. (Spent hours creating a Gantt chart in Excel... well, I'm not friends with Excel anymore.)

And last, I'm trying to structure a new Bioclipse plug-in for retrieving drug/compound, target and other valuable information, i.e. querying ChEMBL in an effective and powerful way with SPARQL.
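
As a small illustration of why a matrix-style result is convenient when only the SMILES part is needed, here is a JavaScript sketch. It uses a plain array of rows with a header row as a stand-in; it does not use the actual matrix type returned by net.bioclipse.rdf, whose methods are not shown in this post, and the sample values are invented.

```javascript
// Stand-in for a query result: the first row holds the column names.
var result = [
  ["mol", "SMILES"],
  ["m1", "CCO"],
  ["m2", "c1ccccc1"]
];

// Return all values of the named column, skipping the header row.
function column(matrix, name) {
  var index = matrix[0].indexOf(name);
  if (index === -1) {
    throw new Error("No such column: " + name);
  }
  return matrix.slice(1).map(function (row) {
    return row[index];
  });
}

var smiles = column(result, "SMILES");
```

With nested lists the same thing needs a cast or a lookup per row; with a table and named columns it is a single call.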

Lesson learned from this post: use TODO lists =0)