On this blog you can follow my degree project (master's thesis in Bioinformatics), which has the title "Pharmaceutical knowledge retrieval through reasoning of ChEMBL RDF".
My supervisor is Egon Willighagen, http://chem-bla-ics.blogspot.com/.


Monday 14 June 2010

A small but wonderful add-on

Look at the following scenarios:

(a)
> var camk = chembl.MossGetProtFamilyCompAct("camk", "IC50")
> chembl.MoSSViewHistogram(camk)
> var camkBounds = chembl.MossSetActivityBound(camk, 1, 1000000)
> camkBounds.getRowCount()
> chembl.MossSaveFormat("/ChEMBL-MoSS/Rapport/CAMKIC501", camk)

(b)
> var camk = chembl.MossGetProtFamilyCompActBounds("CAMK", "IC50", 1, 1000000)
> camk.getRowCount()
> chembl.MossSaveFormat("/ChEMBL-MoSS/Rapport/CAMKIC502", camk)

(a) and (b): Scripts taken from the context of retrieving molecules for molecular substructure mining. (a) collects compounds that bind to proteins from the CAMK family with the activity type IC50. The activities of the compounds are inspected in a histogram, and the bounds are then set to keep only molecules with activity values between 1 and 1,000,000. Finally, the result is saved to a file in MoSS input format.

(b) Let's say you have worked with this set a couple of times and know exactly which parameters you want. The script in (b) then skips the unnecessary steps by passing the lower and upper values directly to the query, and finally saves the result to a MoSS input file.
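To make the bound step concrete, here is a minimal sketch in plain JavaScript of what such a filter does. It is not the Bioclipse implementation: the row layout, the function name filterByActivityBound, the sample data and the choice of inclusive bounds are all assumptions made for illustration.

```javascript
// Hypothetical stand-in for the rows of a result matrix: one compound per row.
const activityRows = [
  { smiles: "CCO", value: 0.5 },
  { smiles: "c1ccccc1", value: 250 },
  { smiles: "CC(=O)O", value: 1000000 },
  { smiles: "CCN", value: 5000000 },
];

// Keep only rows whose activity value lies within [lower, upper].
// Inclusive bounds are an assumption, not something the post states.
function filterByActivityBound(rows, lower, upper) {
  return rows.filter((row) => row.value >= lower && row.value <= upper);
}

const bounded = filterByActivityBound(activityRows, 1, 1000000);
console.log(bounded.length + " of " + activityRows.length + " compounds kept");
```

Whether the filter runs on an already fetched matrix, as in (a), or inside the query, as in (b), the surviving rows should be the same; (b) simply avoids fetching rows that would be discarded anyway.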

A small step, but wonderful when you run scripts all day!

Monday 7 June 2010

The ChEMBL-MoSS interaction in Bioclipse

There are two ways of accessing the ChEMBL-MoSS feature in Bioclipse: through JavaScript and through a wizard. I will present both here!

In both cases I work with the same example: retrieving molecules for the kinase protein family tyrosine kinase, also known as TK. I want to look at the compounds that bind to any protein in this family with the activity type Ki, and I also want to specify the activity span my molecules should fall within.

Starting off with the wizard, this is what it looks like when it is first opened.

Only one box is accessible, the one for protein families. When a family is selected, a SPARQL query is run against the endpoint and returns the available activities for that family. Selecting a preferred activity triggers another SPARQL query that fills the table with compounds (limited to 50; the "add all" button, which has been pressed in the picture, will of course add them all =).
Now I would like to collect only the active compounds, so I first look at the graph displaying the activities.
Once I know which activity span I want to work with, I update the table using the lower and upper boxes and simply press "update table". When I then press finish, a file in MoSS input format is produced.

Performing almost the same task in JavaScript looks like this.

> var tkki = chembl.MossGetProtFamilyCompAct("tk","ki",50)
> tkki.getRowCount()
Here I collect 50 compounds from the TK family with the activity type Ki.

> var tkki = chembl.MossGetProtFamilyCompAct("tk","ki")
> tkki.getRowCount()
Here I do the same thing as above but without a limit, which returns 976 compounds, the same number that was returned when "add all" was pressed in the wizard.

> var tkkiActBound = chembl.MossSetActivityBound(tkki, 1,15000)
> tkkiActBound.getRowCount()
> tkkiActBound

With an activity span between 1 and 15000 nM, the number of compounds is reduced to 850 (as in the wizard). If I type the name of the variable, a string matrix displays all the information. But in order to work with MoSS, it has to be saved in a certain format. That's why we save the matrix to a file, just as we did when we pressed finish in the wizard.

> chembl.saveMossFormat("/chembl/Script/tkki",tkkiActBound)

Taken from the produced file(s)(they are exactly the same).



With this shown, I will soon let you know what MoSS can do with the saved data!

Monday 17 May 2010

A moss-chembl application

After a month of traveling I'm now back to devote my time to what's left of my project, which is about 8-9 weeks. My work is progressing; much of my time goes to human-computer interaction, but I'm also advancing the SPARQL queries and testing them for accuracy.

MoSS, as I have probably mentioned a couple of times before, is a molecular substructure mining software produced by Christian Borgelt, http://www.borgelt.net/moss.html. I implemented that application for Bioclipse in 2008, http://wiki.bioclipse.net/index.php?title=MoSS_in_Bioclipse, and I'm now making use of my own application.

As my ChEMBL work is coming along, I'm at the moment working on a specific workflow, "from ChEMBL to MoSS". Using SPARQL, I now retrieve compounds from various kinase protein families via Java methods. A method could look something like this:

public IStringMatrix MossProtFamilyCompounds(String fam, String actType)
        throws BioclipseException {

    // Build the SPARQL query; fam and actType are matched case-insensitively.
    String sparql =
        "PREFIX chembl: " +  // namespace IRI not preserved in this copy of the post
        "PREFIX bo: " +      // namespace IRI not preserved in this copy of the post
        "SELECT DISTINCT ?smiles WHERE { " +
        " ?target a chembl:Target ;" +
        " chembl:classL5 ?fam . " +
        " ?assay chembl:hasTarget ?target . " +
        " ?activity chembl:onAssay ?assay ;" +
        " chembl:type ?actType ; " +
        " chembl:forMolecule ?mol . " +
        " ?mol bo:smiles ?smiles . " +
        " FILTER regex(?fam, \"^" + fam + "$\", \"i\") . " +
        " FILTER regex(?actType, \"^" + actType + "$\", \"i\") . " +
        " }";

    // Send the query to the remote SPARQL endpoint (needs an internet connection).
    IStringMatrix matrix = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", sparql);

    return matrix;
}
Inside this Java method there is a SPARQL query, held in a string named sparql. Running a query like this is possible thanks to the rdf project done by Egon. I use that feature when I call rdf.sparqlRemote; what that command basically does is send my query, as a String, to the SPARQL endpoint (URL). So for this to work there must be an internet connection.
I will try to find something that can check whether such a connection exists, to improve the use of the application (no connection -> no search).
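One detail worth noting in a method like this is that the family and activity names are concatenated straight into a regular expression, so input containing characters such as "." or "+" would change the pattern. The sketch below, in plain JavaScript, shows one way to build such a FILTER clause with the input escaped first. The helper names escapeRegex, escapeSparqlString and exactMatchFilter are hypothetical and not part of Bioclipse or the rdf plug-in.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Escape backslashes and double quotes so the text can sit inside a
// double-quoted SPARQL string literal.
function escapeSparqlString(text) {
  return text.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Build a FILTER clause for an exact, case-insensitive match on a variable.
function exactMatchFilter(variable, userInput) {
  const pattern = "^" + escapeRegex(userInput) + "$";
  return "FILTER regex(?" + variable + ', "' + escapeSparqlString(pattern) + '", "i") .';
}

console.log(exactMatchFilter("fam", "tk"));
console.log(exactMatchFilter("actType", "IC50"));
```

For plain names such as "tk" or "IC50" the generated clause is the same as the one built by hand in the Java method; the escaping only matters for unusual input.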

The compounds are saved into a file in a format supported by MoSS. This makes it possible for MoSS to run on compounds drawn from the ChEMBL database. A JavaScript environment is also available.

The pictures show (top) the moss-chembl wizard and (bottom) the moss wizard.

The moss-chembl application is dynamic, which means that you can search for the compounds you want and look at them directly. This eases the work a lot! It should also be mentioned that, at the moment, the compounds are only those that bind to a protein in a kinase family.

When a preferred data set is chosen, MoSS reads in the data, and you are then able to perform substructure mining on it!

Next problem to manage: visualization...

Wednesday 24 March 2010

The things you can do with a wizard . . .

Now I have started to get a feeling for SPARQL, but do you have one?
Well, I do not want to force anyone to learn new languages all the time, so I began to develop a wizard. The wizard is far from done, but it does manage some functions at the moment, which is really cool. As you type an id or keyword, SPARQL queries are run against http://rdf.farmbio.uu.se/chembl/snorql/ and the values are returned to the wizard. If you change your search, the old data is deleted and the new data is displayed.

A search can now be done with keywords, SMILES or ChEBI id to find information about compounds. This search will expand as I implement biological networking to other knowledge bases (http://chebi.bio2rdf.org/sparql as an example).
If the checkbox for target is checked, a search with protein ids, keywords, EC numbers etc. takes place instead.
As you type, the table fills up with various data depending on what you search for.

The upper picture searches for targets that have some connection to sodium channels. The bottom picture searches for a ChEBI id from a SMILES. Unfortunately I don't know yet how to distinguish between the kinds of strings written in the box, so the line has to end with a # at the moment. Working on solving that...

Thursday 18 March 2010

Interesting SPARQL queries for QSAR and PCM data!

The following two SPARQL queries are really interesting for QSAR projects and proteochemometric (PCM) projects. By accessing ChEMBL data via RDF with SPARQL, I can easily retrieve the data needed to build up these kinds of projects.

For a QSAR project the following query could be used:

var forQSAR = "\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?act ?ass ?conf ?mol ?SMILES ?val ?unit WHERE { \
?act chembl:type \"IC50\" ; \
chembl:onAssay ?ass; \
chembl:forMolecule ?mol;\
chembl:standardValue ?val;\
chembl:standardUnits ?unit.\
?mol blueobelisk:smiles ?SMILES. \
?ass chembl:hasTarget ; \
chembl:hasConfScore ?conf. \
}";

Since I run my queries through Bioclipse, the SPARQL query is stored in a variable to make the following call easier:
var qsar = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", forQSAR)

The query returns unique ids for activities (?act), molecules (?mol) and assays (?ass), SMILES (?SMILES) for the molecules, values (?val) and units (?unit) for the activities, and confidence scores (?conf). And it is really easy to expand the query to return more data!

The query for PCM returns unique ids for targets (?target), molecules (?mol) and PubMed entries (?pubmed), SMILES (?SMILES), protein sequences (?seq), various classifications (?l4, ?l5, ?l6), activity types (?type) and activity values (?val).
The activities are narrowed down to only IC50 and Ki, and the ion channels to only Na (the two FILTER lines at the end of the query).

The query looks like the following:
var kic50na = "\
PREFIX chembl: \
PREFIX blueobelisk: \
SELECT DISTINCT ?type ?target ?pubmed ?l4 ?l5 ?l6 ?mol ?SMILES ?val ?seq WHERE { \
?act chembl:type ?type;\
chembl:onAssay ?ass;\
chembl:forMolecule ?mol;\
chembl:standardValue ?val.\
?ass chembl:hasTarget ?target;\
chembl:extractedFrom ?journal.\
?ass chembl:hasTargetCount 1 .\
?journal ?pubmed.\
?mol blueobelisk:smiles ?SMILES.\
?target a ;\
chembl:classL3 \"VGC\" ;\
chembl:classL4 ?l4 ;\
chembl:classL5 ?l5 ;\
chembl:classL6 ?l6 ;\
chembl:sequence ?seq.\
FILTER regex(?l6, \"NA\")\
FILTER (?type = \"Ki\" || ?type = \"IC50\")\
}";

One problem encountered here was that assays are not always specified for one target but for many, which leads to the same information being returned for different targets. Egon solved this by creating
?ass chembl:hasTargetCount 1. That line says that an assay should only have one target, which gives accurate data for PCM.
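The effect of that constraint can be illustrated outside SPARQL. The sketch below, in plain JavaScript with made-up sample rows, counts the distinct targets per assay and keeps only the rows from assays with exactly one target. The data and the function names targetsPerAssay and singleTargetRows are assumptions for illustration only; on the endpoint, the hasTargetCount property does this work.

```javascript
// Hypothetical sample: each row links a measured value to an assay and one of its targets.
const assayRows = [
  { assay: "A1", target: "T1", value: 12 },
  { assay: "A2", target: "T1", value: 30 },
  { assay: "A2", target: "T2", value: 30 }, // same measurement, repeated for a second target
  { assay: "A3", target: "T3", value: 7 },
];

// Collect the set of distinct targets for every assay.
function targetsPerAssay(rows) {
  const targets = new Map();
  for (const row of rows) {
    if (!targets.has(row.assay)) {
      targets.set(row.assay, new Set());
    }
    targets.get(row.assay).add(row.target);
  }
  return targets;
}

// Keep only rows that come from assays with exactly one target.
function singleTargetRows(rows) {
  const targets = targetsPerAssay(rows);
  return rows.filter((row) => targets.get(row.assay).size === 1);
}

const unambiguous = singleTargetRows(assayRows);
console.log(unambiguous.map((row) => row.assay + " -> " + row.target));
```

In the sample, the value 30 from assay A2 would otherwise be attributed to both T1 and T2, which is exactly the kind of duplicated information that distorts a PCM data set.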

Monday 8 March 2010

Background presentation

I held this presentation for the department last week. It's basically a presentation about the background and progress of the project. Enjoy!

Monday 1 March 2010

Update post

I have so many half-finished sub-projects that I don't have anything interesting to blog about, hence this update post!

My sub-projects:

Looking into other syntax languages, especially Manchester OWL syntax. In Journal Club we read the article "Towards pharmacogenomics knowledge discovery with the semantic web" and encountered the Manchester syntax. I will blog about it when I'm done. And speaking of Journal Club, I'm also writing a review together with Jonathan. And I have to find time to read the next article....

Moss Manager needs to be rearranged, since the net.bioclipse.rdf plug-in no longer returns lists of ArrayLists. It now, amazingly, returns string matrices, which will make things so much easier, especially when I'm only interested in the SMILES part of the SPARQL result.

I'm also working on a presentation that I'm going to give on Thursday 4/3. I will try to put it up here afterwards. It's about the background and status of this project. (Spent hours on creating a Gantt chart in Excel.. well, I'm not friends with Excel anymore..)

And last, I'm trying to structure a new Bioclipse plug-in for retrieving drug/compound, target and other valuable info, i.e. querying ChEMBL in an effective and powerful way with SPARQL.

Lesson learned from this post: use TODO lists =0)

Friday 19 February 2010

moss + manager = true

My goal this week was to have integrated MoSS into a Bioclipse manager.
Well, I'm almost there. =0)

So this is what I have done in the latter part of the week. I also managed to run some SPARQL queries and ran into some problems to figure out there, which will be the main focus next week.

Most parts of MoSS now work, although some settings that involve combining masks are not quite finished yet. I actually think that I've spent most of my hours on this and am still not done...grr

Since MoSS has over 30 different parameters, I found it important to have a method that shows them. But just now I realized that this is what man moss is for. Well, it's the same text, so no worries there.
Taken from the method, though, it will look something like this:
> moss.parameterDescription()
Example: moss.createParameters("aromatic", "always"),
moss.createParameters("minEmbed", 6)

aromatic: ("aromatic", "never"/"upgrade"/"downgrade") |"String"
canonic: ("canonic", false/true) |boolean
canonicEquiv: ("canonicEquiv", true/false) |boolean
carbonChainLength: ("carbonChainLength", true/false) |boolean
class: not for use
closed: ("closed", true/false) |boolean
exNode: ("exNode", "Atom") |"String"
exSeed: ("exSeed", "Atom") |"String"
extPrune: ("extPrune", "none"/"full"/"partial") |"String"
ignoreAtomTypes: ("ignoreAtomTypes", "never"/"always"/"in rings") |"String"
ignoreBond: ("ignoreBond", "never"/"always"/"in rings") |"String"
kekule: ("kekule", true/false) |boolean
limits: not for use
matchAromaticityAtoms: ("matchAromaticityAtoms", "never"/"always"/"in rings") |"String"
matchChargeOfAtoms: ("matchChargeOfAtoms", "match"/"no match") |"String"
matom: not for use
maxEmbMemory: ("maxEmbMemory", value) |integer
maxEmbed: ("maxEmbed",value) |integer
maxRing: ("maxRing", value) |integer
maximalSupport: ("maximalSupport", value) |double
mbond: not for use
minEmbed: ("minEmbed", value) |integer
minRing: ("minRing", value) |integer
minimalSupport: ("minimalSupport", value) |double
mode: not for use
mrgat: not for use
mrgbd: not for use
ringExtension: ("ringExtension", "none"/"full"/"merge"/"filter") |"String"
seed: ("seed", "Atom") |"String"
split: ("split", true/false) |boolean
threshold: ("threshold", value) |double
unembedSibling: ("unembedSibling", false/true) |boolean

Will immediately start working on the manager.

I figured that it would be nice to have one method that sets the parameters, in this case createParameters(). The first argument specifies what you want to set and the second provides the value. The arguments are handled by the following method:
public String createParameters(String propertyName, Object value) throws Exception {
    // Assumed guard: the original condition was lost in this copy of the post.
    // Numbers typed in the JavaScript console arrive as Double.
    if (value instanceof Double) {
        // Narrow the number to an int before passing it on.
        int values = ((Double) value).intValue();
        mossbean.setParameters(mossbean, propertyName, values);
    } else {
        // Strings and booleans are passed through unchanged.
        mossbean.setParameters(mossbean, propertyName, value);
    }
    return propertyName + " is set to " + value;
}

When trying out MoSS myself, I got irritated that I kept forgetting the values of my parameters, so the method parameterValues() was created. It returns the current values of all parameters:
> moss.parameterValues()
canonic: true
canonicEquiv: false
class: class net.bioclipse.moss.business.backbone.MossBean
closed: true
exNode: H
limits: 0.0
maxEmbMemory: 0
maxEmbed: 0
maxRing: 0
maximalSupport: 0.02
minEmbed: 0
minRing: 0
minimalSupport: 0.1
ringExtension: none
split: false
threshold: 0.5
unembedSibling: false

I will also create a method that restores the values to their defaults, since that is valuable to the end user.
I can't figure out, though, how to return an ArrayList in a smooth way. I returned it as a String; this is how I've done it:
public String parameterValues() throws Exception {
    ArrayList<String> name = mossbean.getPropertyNames(mossbean);
    String info = "";
    String names;
    // Build one "name: value" line per parameter.
    for (int i = 0; i < name.size(); i++) {
        names = name.get(i);
        info = info + names + ": " + mossbean.getProperty(mossbean, names) + " \n";
    }
    return info;
}

If you know something better, please tell!

Mostly polishing is left when it comes to MoSS, but (perhaps) also bigger mask-combination parts; it depends on the outcome of my MoSS tests (which I will do when it's not Friday afternoon and I have a sharp mind).

Next week the main focus is to develop SPARQL queries again!

Friday 12 February 2010

Approaching substructure mining

With a simple query like the one below, random compounds for kinases from the Tk family are collected. I would like to filter the standard value to be under a certain value, but I have some problems doing that in Bioclipse; directly against the SPARQL endpoint I've managed to create this filter. Will work on it.

var allsmiles = " \
PREFIX onto: \
PREFIX blueobelisk: \
?target a onto:Target . \
?target onto:classL5 \"Tk\" . \
?target onto:classL6 ?L6 . \
?assay onto:hasTarget ?target . \
?activity onto:onAssay ?assay . \
?activity onto:standardValue ?st . \
?activity onto:forMolecule ?mol . \
?mol blueobelisk:smiles ?smiles . \
} LIMIT 20";
var all = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", allsmiles)
The variable all now contains a list of molecules, which is saved to a file via the net.bioclipse.moss.business plug-in.

moss.saveMoss(String fileName, List all)
This creates a file in a format that MoSS supports (id, threshold value, description); it is not complete yet.

This file now has to be initialized, then parameters are added for the run, and when that is done you simply run it.

> moss.saveMoss("/Moss/Test/collected", all)
> moss.init("/Moss/Test/collected")
> moss.run("/Moss/Test/collectedOut", "/Moss/Test/collectedOutId")

Only two basic parameter settings work at the moment; more will be added as soon as possible. It will take time, though, since lots of parameters are set by combining flags, which I remember to be a crucial thing to get right.
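For readers who have not met flag combination before, the general idea can be sketched in plain JavaScript: each option gets its own bit, and several options are packed into one integer with bitwise operators. The flag names below borrow parameter names from this blog, but the bit values and the helpers setFlag, clearFlag and hasFlag are hypothetical and are not the constants MoSS actually uses.

```javascript
// Hypothetical flag values: each option occupies its own bit.
const FLAGS = {
  CLOSED: 1 << 0,
  KEKULE: 1 << 1,
  CANONIC: 1 << 2,
};

// Switch a flag on without touching the other bits.
function setFlag(mode, flag) {
  return mode | flag;
}

// Switch a flag off without touching the other bits.
function clearFlag(mode, flag) {
  return mode & ~flag;
}

// Check whether a flag is currently set.
function hasFlag(mode, flag) {
  return (mode & flag) !== 0;
}

let mode = 0;
mode = setFlag(mode, FLAGS.CLOSED);
mode = setFlag(mode, FLAGS.CANONIC);
mode = clearFlag(mode, FLAGS.CLOSED);
console.log(hasFlag(mode, FLAGS.CANONIC), hasFlag(mode, FLAGS.CLOSED));
```

The reason this is easy to get wrong is that setting one option carelessly (for example with plain assignment instead of a bitwise OR) silently wipes out every other option stored in the same integer.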

To read about how MoSS works, how to understand the output files etc., look at Christian Borgelt's homepage, http://www.borgelt.net/doc/moss/moss.html.
Output file(not complete):

Output file Id(not complete)

I want to be able to visualize the results in tables later on, perhaps together with the input and other information collected via SPARQL.

Tuesday 9 February 2010

Fun stuff with SPARQL

I will give you an example of a SPARQL query. I've been running them on the snorql interface http://rdf.farmbio.uu.se/chembl/snorql/ which is based on ChEMBL02.

Example 1.
This experiment started out with me wanting to know more about activities: their standard values, units and types. But then I kept going and looked at the molecules connected to a specific activity, which led me to collect their SMILES. Through the connection between activities and resources I managed to get their PubMed ids. Via the assay id I managed to get targets, and I filtered the organism to Homo sapiens. Figure 1 displays the result from the example code.
Example code:
PREFIX chemblt:
PREFIX onto:
PREFIX blueobelisk:
PREFIX dbpedia:

SELECT DISTINCT ?target ?organism ?activities ?smiles ?type ?unit ?sval ?res ?pubmed
WHERE {
# get activities with their data
?activities a onto:Activity .
?activities onto:standardValue ?sval .
?activities onto:type ?type .
?activities onto:standardUnits ?unit .

# get compounds for those activities
?activities onto:forMolecule ?mol .
?mol blueobelisk:smiles ?smiles .

# get resource id and pubmed article
?activities onto:extractedFrom ?res .
?res ?pubmed.

# get assay for activity
?activities onto:onAssay ?assay .
?assay onto:hasTarget ?target .
?target onto:hasTargetType chemblt:PROTEIN .
?target onto:organism ?organism .

FILTER regex(?organism, "Homo sapiens") .
FILTER regex(?type, "^Kd") .
}


Figure 1. The results for example 1.

I also began to run queries that are more suitable for my work: queries that can distinguish between different kinase protein families (http://www.sarfari.org/kinasesarfari/family). For instance:
Figure 2. An example of targets that belong to the same protein family Tk.

Monday 8 February 2010

A whole new world...

I've seen a completely new world when looking into the functions of Bio2RDF. I see great linking between knowledge. Being able to collect information from one knowledge base and link to another, obtaining more information and always extending knowledge, is great!

In my work running queries against ChEMBL to collect active (later on also inactive) compounds, I find this linking valuable. For example, DrugBank holds lots of great information about a compound: not only physical information but also ids, such as the ChEBI id, which make linking to ChEBI possible.

KEGG is another knowledge base that could be useful: Kegg:ligand, Kegg:drug and Kegg:compound. UniProt could provide article info, ChEBI compound info, PDB target protein data, etc.

I believe that users should be able to decide what kind of information they want or need. This could be achieved through interaction with Bioclipse. My aim is to use substructure mining on the drugs, but of course other aims should be possible (using Bioclipse plug-ins other than MoSS).

Perhaps a table representation is a nice way to display the data. And if lots of information about a drug is wanted, perhaps an info page is the way to display it.

DBpedia is another valuable source of resources: names, descriptions, InChIs, SMILES, images etc. One major disadvantage is that unknown compounds will not be found.

And now I got a link to chem2bio2rdf from my supervisor. It has collected all chemical URIs in one place. I will immediately look into it and run queries!

Wednesday 3 February 2010

As my project description changed a bit and there have been other obstacles to get past, I haven't got as far as I expected. But at least now I have a primary goal.

I'm now focusing on selecting two protein families to run substructure mining on. Well, actually I have narrowed that down to looking at one family (since they are big!). I want to find a protein that has ligands that are active. As there are many different types of activities, I need to dig deeper in this area. I think I also have to find a threshold for activity to reduce the number of ligands, as there are so many; the higher the affinity the better.

I also looked a bit at the substructure mining algorithm that I'm using, MoSS. I have been trying to run random ligands, but it takes forever; most of the time the run didn't finish. As I assume that there are bugs to fix, I will try to run them through another software to be sure that it is possible to run such complex structures with this kind of algorithm. If it works, great: then MoSS has some improvement steps to look forward to.

I also have to update MoSS to the current Bioclipse standard, such as implementing a manager.

Yesterday some parts of the SPARQL endpoint http://pele.farmbio.uu.se/chembl/snorql/ started to work again(!), which simplifies many things for me, as I have been able to run queries to find activities that are active and also to find their target ids (tid).

A question that runs through my mind is how to find family information. I can't seem to find any class..

Friday 29 January 2010

My first week

Well, this week mainly consisted of reading books, articles and tutorials. I'm really eager to start programming now! The semantic web touches on lots of things that I had never heard of before, so my reading was about getting to know RDF, SPARQL and also some OWL. I also got to know git; I really enjoyed http://learn.github.com/, a great tutorial. I also looked into ChEMBL, http://www.ebi.ac.uk/chembl/, trying to get to know the structure of its database.

I managed to check out Bioclipse and was also able to get my old project MoSS, http://wiki.bioclipse.net/index.php?title=MoSS_in_Bioclipse, to run. Bioclipse has changed a lot since I last worked with it, and I need to catch up on it before I start working with it, which I will do at the beginning of next week.

Thursday 28 January 2010

First post!

Finally my first post on the blog!