On this blog you can follow my degree project (master's thesis in Bioinformatics), titled "Pharmaceutical knowledge retrieval through reasoning of ChEMBL RDF".
My supervisor is Egon Willighagen, http://chem-bla-ics.blogspot.com/.


Monday, 14 June 2010

A small but wonderful add-on

Compare the following two scenarios:

(a)

> var camk = chembl.MossGetProtFamilyCompAct("camk", "IC50")

> chembl.MoSSViewHistogram(camk)

> var camkBounds = chembl.MossSetActivityBound(camk, 1, 1000000)

> camkBounds.getRowCount()

> chembl.MossSaveFormat("/ChEMBL-MoSS/Rapport/CAMKIC501", camkBounds)

(b)

> var camk = chembl.MossGetProtFamilyCompActBounds("CAMK", "IC50", 1, 1000000)

> camk.getRowCount()

> chembl.MossSaveFormat("/ChEMBL-MoSS/Rapport/CAMKIC502", camk)

(a)+(b) Scripts from the context of retrieving molecules for molecular substructure mining. (a) collects compounds that bind to proteins from the CAMK family with the activity type IC50. The activities of the compounds are inspected in a histogram, and the bounds are then set to keep only molecules with activities between 1 and 1,000,000. Lastly, the result is saved to a file in the MoSS input format.

(b) Let's say you have worked with this set a couple of times and know your parameters exactly. Then the script in (b) removes the unnecessary steps by passing the lower and upper values directly to the query, and finally saves the result to a MoSS input file.
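To make the difference between the two scenarios concrete, here is a minimal sketch in plain JavaScript, outside Bioclipse. The rows, the function names and the inclusive bounds are all assumptions made for illustration; they are not the chembl manager's API.

```javascript
// Hypothetical rows: each compound has a SMILES string and an activity value.
const rows = [
  { smiles: "CCO", value: 0.5 },
  { smiles: "c1ccccc1", value: 120 },
  { smiles: "CC(=O)O", value: 999999 },
  { smiles: "CCN", value: 2500000 },
];

// Keep rows whose activity lies within [lower, upper].
// Inclusive bounds are an assumption of this sketch.
function withinBounds(data, lower, upper) {
  return data.filter((row) => row.value >= lower && row.value <= upper);
}

// Scenario (a): fetch everything first, then narrow it down in a second step.
function fetchAll(data) {
  return data.slice();
}
const stepwise = withinBounds(fetchAll(rows), 1, 1000000);

// Scenario (b): fetch and filter in a single call.
function fetchWithinBounds(data, lower, upper) {
  return withinBounds(data, lower, upper);
}
const direct = fetchWithinBounds(rows, 1, 1000000);

console.log(stepwise.length, direct.length); // prints "2 2"
```

Both routes end with the same rows; (b) simply skips the intermediate variable and the extra call.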

A small step, but wonderful when you run scripts all day!

Monday, 7 June 2010

The ChEMBL-MoSS interaction in Bioclipse

There are two ways of accessing the ChEMBL-MoSS feature in Bioclipse: through JavaScript and through a wizard. I will present both here!

In both cases I work with the same example: retrieving molecules for the kinase protein family tyrosine kinase, also known as TK. I want to look at the compounds that bind to any protein in this family with the activity type Ki, and to specify the activity span my molecules should fall within.

Starting off with the wizard, this is what it looks like when it is first opened.

Only one box is accessible, the one for protein families. When a family is selected, a SPARQL query is run against the endpoint and returns the available activities for that family. Selecting a preferred activity triggers another SPARQL query that updates the table with compounds (limited to 50; the "add all" button, which has been pressed in the picture, will of course add them all =).
Now I would like to collect only the active compounds, so I first look at the graph displaying the activities.
Once I know which activity span I want to work with, I update the table using the lower and upper boxes and simply press "update table". When I now press finish, a file in a format that MoSS supports is produced.

Performing almost the same task through JavaScript looks like this.

> var tkki = chembl.MossGetProtFamilyCompAct("tk","ki",50)
> tkki.getRowCount()
Here I collect 50 compounds from the TK family with the activity type Ki.

> var tkki = chembl.MossGetProtFamilyCompAct("tk","ki")
> tkki.getRowCount()
Here I do the same as above but without a limit, which returns 976 compounds, the same number that was returned when "add all" was pressed in the wizard.

> var tkkiActBound = chembl.MossSetActivityBound(tkki, 1,15000)
> tkkiActBound.getRowCount()
> tkkiActBound

With an activity span between 1 and 15000 nM, the number of compounds is reduced to 850 (as in the wizard). If I type the name of the variable, a string matrix displays all the information. But in order to work with MoSS, the data has to be saved in a certain format. That's why we save the matrix to a file, just as we did when we pressed finish in the wizard.

> chembl.saveMossFormat("/chembl/Script/tkki",tkkiActBound)

Taken from the produced file(s)(they are exactly the same).



With this shown, I will soon let you know what MoSS can do with the saved data!

Monday, 17 May 2010

A moss-chembl application

After a month of traveling I'm now back to devote my time to what's left of my project, which is about 8-9 weeks. My work is progressing: much of my time goes to human-computer interaction, but I am also advancing the SPARQL queries and testing them for accuracy.

MoSS, as I have probably mentioned a couple of times before, is a molecular substructure mining program by Christian Borgelt, http://www.borgelt.net/moss.html. I implemented it as an application for Bioclipse in 2008, http://wiki.bioclipse.net/index.php?title=MoSS_in_Bioclipse, and I'm now making use of my own application.

As my ChEMBL work is coming along, I'm at the moment working on a specific workflow, "from ChEMBL to MoSS". Using SPARQL, I now retrieve compounds from various kinase protein families via Java methods. A method could look something like this:

public IStringMatrix MossProtFamilyCompounds(String fam, String actType)
        throws BioclipseException {

    // SPARQL query: SMILES of all molecules with an activity of the given
    // type measured against a target in the given protein family.
    String sparql =
        "PREFIX chembl: " +
        "PREFIX bo: " +
        "SELECT DISTINCT ?smiles where{ " +
        " ?target a chembl:Target;" +
        " chembl:classL5 ?fam. " +
        " ?assay chembl:hasTarget ?target . " +
        " ?activity chembl:onAssay ?assay ;" +
        " chembl:type ?actType ; " +
        " chembl:forMolecule ?mol ." +
        " ?mol bo:smiles ?smiles. " +
        // Anchored, case-insensitive matches on family and activity type.
        " FILTER regex(?fam, " + "\"^" + fam + "$\"" + ", \"i\")." +
        " FILTER regex(?actType, " + "\"^" + actType + "$\"" + ", \"i\")." +
        " }";

    // Run the query against the remote SPARQL endpoint.
    IStringMatrix matrix = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", sparql);

    return matrix;
}

Inside this Java method there is a SPARQL query, held in a string named sparql. Running a query like this is possible thanks to the rdf project done by Egon. I use that feature when I call rdf.sparqlRemote; what that command basically does is send my query, as a String, to the SPARQL endpoint (a URL). So for this to work there must be an internet connection.
I will try to find a way to check whether such a connection exists, to improve the use of the application (no connection -> no search).
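Because the family and activity type are pasted straight into a regex FILTER, a value containing regex characters or quotes could change the query. The following is a minimal JavaScript sketch of one way to build the same kind of query more defensively. The helper names are hypothetical, and the two namespace IRIs are passed in as parameters because they are assumptions of this sketch rather than values from the post.

```javascript
// Escape regex metacharacters so the user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Escape backslashes and double quotes for a double-quoted SPARQL string literal.
function escapeSparqlString(text) {
  return text.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Build an anchored, case-insensitive filter, e.g. FILTER regex(?fam, "^tk$", "i")
function exactMatchFilter(variable, value) {
  const pattern = "^" + escapeRegex(value) + "$";
  return "FILTER regex(" + variable + ', "' + escapeSparqlString(pattern) + '", "i")';
}

// Assemble the whole query as a single string.
// chemblIri and boIri are placeholders supplied by the caller (assumption).
function buildCompoundQuery(chemblIri, boIri, fam, actType) {
  return [
    "PREFIX chembl: <" + chemblIri + ">",
    "PREFIX bo: <" + boIri + ">",
    "SELECT DISTINCT ?smiles WHERE {",
    "  ?target a chembl:Target ;",
    "          chembl:classL5 ?fam .",
    "  ?assay chembl:hasTarget ?target .",
    "  ?activity chembl:onAssay ?assay ;",
    "            chembl:type ?actType ;",
    "            chembl:forMolecule ?mol .",
    "  ?mol bo:smiles ?smiles .",
    "  " + exactMatchFilter("?fam", fam),
    "  " + exactMatchFilter("?actType", actType),
    "}",
  ].join("\n");
}

// Example with made-up namespace IRIs.
console.log(buildCompoundQuery("http://example.org/chembl#", "http://example.org/bo#", "tk", "ki"));
```

The escaping happens in two layers on purpose: first for the regular expression, then for the string literal that carries it.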

The compounds are saved into a file in a format supported by MoSS. This makes it possible for MoSS to run on the compounds drawn from the ChEMBL database. A JavaScript environment is also available.

The pictures show (top) the moss-chembl wizard and (bottom) the moss wizard.

The moss-chembl application is dynamic, which means that you can search for the compounds you want and look at them directly. This eases the work a lot! It should also be mentioned that, at the moment, the compounds are only those that bind to a protein in a kinase family.

When a preferred data set is chosen, moss reads in the data and you can then perform substructure mining on it!

Next problem to manage: visualization...

Wednesday, 24 March 2010

The things you can do with a wizard . . .

Now I have started to get a feel for SPARQL, but have you?
Well, I do not want to force anyone to learn new languages all the time, so I began to develop a wizard. The wizard is far from done, but it already manages some functions, which is really cool. As you type an id or keyword, SPARQL queries are run against http://rdf.farmbio.uu.se/chembl/snorql/ and the values are returned to the wizard. If you change your search, the old data is removed and the new data is displayed.

A search can now be done with keywords, SMILES or a ChEBI id to find information about compounds. The search will expand as I implement biological networking with other knowledge bases (http://chebi.bio2rdf.org/sparql as an example).
If the checkbox for target is checked, a search on protein ids, keywords, EC numbers etc. takes place instead.
As you type, the table fills up with various data depending on what you search for.

The upper picture shows a search for targets that have some connection to sodium channels. The bottom picture shows a search for a ChEBI id from a SMILES. Unfortunately I don't yet know how to distinguish between strings written in the box, so for the moment the line has to end with a #. Working on solving that...
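The "#" convention can be sketched roughly like this in plain JavaScript. This is only an illustration of the idea, not the wizard's actual code; the function names and the returned object shape are made up for the example.

```javascript
// Return the search term once the input is terminated with "#", otherwise null.
function completedTerm(input) {
  const text = input.trim();
  if (!text.endsWith("#")) return null; // still typing: do not search yet
  const term = text.slice(0, -1).trim(); // drop the terminating "#"
  return term === "" ? null : term;
}

// Decide which kind of search to run; targetChecked mirrors the target checkbox.
function planSearch(input, targetChecked) {
  const term = completedTerm(input);
  if (term === null) return null;
  return { kind: targetChecked ? "target" : "compound", term: term };
}

console.log(planSearch("sodium channel#", true));
console.log(planSearch("C#N#", false));
```

One weakness of this approach: "#" is also the triple bond symbol in SMILES, so a half-typed SMILES such as "C#" would already count as a finished search for "C".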

Thursday, 18 March 2010

Interesting SPARQL queries for QSAR and PCM data!

The following two SPARQL queries are really interesting for QSAR projects and proteochemometric (PCM) projects. By accessing ChEMBL data via RDF with SPARQL, I can easily retrieve the data needed to build up these kinds of projects.

For a QSAR project the following query could be used:

var forQSAR = "\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?act ?ass ?conf ?mol ?SMILES ?val ?unit WHERE { \
?act chembl:type \"IC50\" ; \
chembl:onAssay ?ass; \
chembl:forMolecule ?mol;\
chembl:standardValue ?val;\
chembl:standardUnits ?unit.\
?mol blueobelisk:smiles ?SMILES. \
?ass chembl:hasTarget ; \
chembl:hasConfScore ?conf. \
}";

Since I run my queries through Bioclipse, the SPARQL query is stored in a variable to simplify the following call:
var qsar = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", forQSAR)

The query returns unique ids for activities (?act), molecules (?mol) and assays (?ass), SMILES (?SMILES) for the molecules, values (?val) and units (?unit) for the activities, and confidence scores (?conf). It is also really easy to expand the query to return more data!

The query for PCM returns unique ids for targets (?target), molecules (?mol) and PubMed entries (?pubmed), SMILES (?SMILES), protein sequences (?seq), various classifications (?l4, ?l5, ?l6), activity types (?type) and activity values (?val).
The activity types are narrowed down to IC50 and Ki only, and the ion channels are restricted to Na (the two FILTER lines at the end of the query).

The query looks like the following:
var kic50na = "\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?type ?target ?pubmed ?l4 ?l5 ?l6 ?mol ?SMILES ?val ?seq WHERE { \
?act chembl:type ?type;\
chembl:onAssay ?ass;\
chembl:forMolecule ?mol;\
chembl:standardValue ?val.\
?ass chembl:hasTarget ?target;\
chembl:extractedFrom ?journal.\
?ass chembl:hasTargetCount 1 .\
?journal ?pubmed.\
?mol blueobelisk:smiles ?SMILES.\
?target a ;\
chembl:classL3 \"VGC\" ;\
chembl:classL4 ?l4 ;\
chembl:classL5 ?l5 ;\
chembl:classL6 ?l6 ;\
chembl:sequence ?seq.\
FILTER regex(?l6, \"NA\")\
FILTER (?type = \"Ki\" || ?type = \"IC50\")\
}";

One problem encountered here was that assays are not always specified for a single target but sometimes for many, which led to the same information being returned for different targets. Egon solved this by creating
?ass chembl:hasTargetCount 1. That line says that an assay should have exactly one target, which gives accurate data for PCM.
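The effect of that restriction can be illustrated with a small in-memory sketch in JavaScript. The assay and activity records below are invented for the example, and the join function is hypothetical; it only mimics what the triple pattern does inside the query.

```javascript
// Hypothetical assay records; targets lists the targets an assay was run against.
const assays = [
  { id: "assay1", targets: ["targetA"] },
  { id: "assay2", targets: ["targetA", "targetB"] },
  { id: "assay3", targets: ["targetC"] },
];

// Hypothetical activities, each measured in one assay.
const activities = [
  { assay: "assay1", mol: "mol1", value: 12 },
  { assay: "assay2", mol: "mol2", value: 340 },
  { assay: "assay3", mol: "mol3", value: 5 },
];

// Join activities to targets: one output row per (activity, target) pair.
// With singleTargetOnly set, assays with more than one target are skipped.
function joinActivities(acts, assayList, singleTargetOnly) {
  const rows = [];
  for (const act of acts) {
    const assay = assayList.find((a) => a.id === act.assay);
    if (!assay) continue;
    if (singleTargetOnly && assay.targets.length !== 1) continue;
    for (const target of assay.targets) {
      rows.push({ target: target, mol: act.mol, value: act.value });
    }
  }
  return rows;
}

console.log(joinActivities(activities, assays, false).length); // 4
console.log(joinActivities(activities, assays, true).length); // 2
```

Without the restriction, the activity from the two-target assay shows up once per target, even though it is unclear which target the value belongs to. With it, only unambiguous molecule-target pairs remain.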

Monday, 8 March 2010

Background presentation

I gave this presentation to the department last week. It basically covers the background and progress of the project. Enjoy!