On this blog you can follow my degree project (master's thesis in Bioinformatics), titled "Pharmaceutical knowledge retrieval through reasoning of ChEMBL RDF".
My supervisor is Egon Willighagen, http://chem-bla-ics.blogspot.com/.


Friday 19 February 2010

moss + manager = true

My goal this week was to have MoSS integrated into a Bioclipse manager.
Well, I'm almost there. =0)

So this is what I have done during the latter part of the week. I also managed to run some SPARQL queries and ran into some problems that I need to figure out; that will be the main focus next week.

Most parts of MoSS now work, although some settings that involve combining masks are not quite finished yet. I actually think that I've spent most of my hours on this and I'm still not done... grr

Since MoSS has over 30 different parameters, I found it important to have a method that shows them. But just now I realized that this is what man moss is for. Well, it's the same text, so no worries there.
Taken from the method, though, it will look something like this:
> moss.parameterDescription()
Example: moss.createParameters("aromatic", "always"),
moss.createParameters("minEmbed", 6)

aromatic: ("aromatic", "never"/"upgrade"/"downgrade") |"String"
canonic: ("canonic", false/true) |boolean
canonicEquiv: ("canonicEquiv", true/false) |boolean
carbonChainLength: ("carbonChainLength", true/false) |boolean
class: not for use
closed: ("closed", true/false) |boolean
exNode: ("exNode", "Atom") |"String"
exSeed: ("exSeed", "Atom") |"String"
extPrune: ("extPrune", "none"/"full"/"partial") |"String"
ignoreAtomTypes: ("ignoreAtomTypes", "never"/"always"/"in rings") |"String"
ignoreBond: ("ignoreBond", "never"/"always"/"in rings") |"String"
kekule: ("kekule", true/false) |boolean
limits: not for use
matchAromaticityAtoms: ("matchAromaticityAtoms", "never"/"always"/"in rings") |"String"
matchChargeOfAtoms: ("matchChargeOfAtoms", "match"/"no match") |"String"
matom: not for use
maxEmbMemory: ("maxEmbMemory", value) |integer
maxEmbed: ("maxEmbed",value) |integer
maxRing: ("maxRing", value) |integer
maximalSupport: ("maximalSupport", value) |double
mbond: not for use
minEmbed: ("minEmbed", value) |integer
minRing: ("minRing", value) |integer
minimalSupport: ("minimalSupport", value) |double
mode: not for use
mrgat: not for use
mrgbd: not for use
ringExtension: ("ringExtension", "none"/"full"/"merge"/"filter") |"String"
seed: ("seed", "Atom") |"String"
split: ("split", true/false) |boolean
threshold: ("threshold", value) |double
unembedSibling: ("unembedSibling", false/true) |boolean


Will immediately start working on the manager.

I figured that it would be nice to have one method that sets the parameters, in this case createParameters(). The first argument specifies what you want to set and the second argument provides the value. The arguments are handled by the following method:
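public String createParameters(String propertyName, Object value) throws Exception {
    // A Double is converted to an int before it is passed on to the bean.
    // Note: this truncates every Double, including values meant for
    // double-typed parameters such as threshold.
    if (value instanceof Double) {
        int intValue = ((Double) value).intValue();
        mossbean.setParameters(mossbean, propertyName, intValue);
        return propertyName + " is set to " + intValue;
    }
    mossbean.setParameters(mossbean, propertyName, value);
    return propertyName + " is set to " + value;
}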

When trying out MoSS myself, I got irritated that I kept forgetting the values of my parameters, hence the method parameterValues() was created. It returns the current values of all parameters:
> moss.parameterValues()
aromatic:
canonic: true
canonicEquiv: false
carbonChainLength:
class: class net.bioclipse.moss.business.backbone.MossBean
closed: true
exNode: H
exSeed:
extPrune:
ignoreAtomTypes:
ignoreBond:
kekule:
limits: 0.0
matchAromaticityAtoms:
matchChargeOfAtoms:
maxEmbMemory: 0
maxEmbed: 0
maxRing: 0
maximalSupport: 0.02
minEmbed: 0
minRing: 0
minimalSupport: 0.1
ringExtension: none
seed:
split: false
threshold: 0.5
unembedSibling: false


I will also create a method that restores the values to their defaults, since that is valuable to the end user.
I can't figure out, though, how to return an ArrayList in a smooth way. I returned it as a String instead; this is how I've done it:
public String parameterValues() throws Exception {
    ArrayList<String> name = mossbean.getPropertyNames(mossbean);
    String info = "";
    String names;
    // Append one "name: value" line per bean property.
    for (int i = 0; i < name.size(); i++) {
        names = name.get(i);
        info = info + names + ": " + mossbean.getProperty(mossbean, names) + " \n";
    }
    return info;
}


If you know something better, please tell!
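One possible direction, as a rough sketch only: collect the name/value pairs in an ordered map and build the text with a StringBuilder instead of repeated string concatenation. The class ParameterReport and its format method are hypothetical names made up for this illustration; they are not part of the MoSS plug-in, and the map stands in for whatever the bean returns.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of net.bioclipse.moss.
final class ParameterReport {

    private ParameterReport() {}

    // Formats each entry as "name: value", one entry per line.
    // A null value is shown as an empty string after the colon.
    static String format(Map<String, ?> values) {
        StringBuilder info = new StringBuilder();
        for (Map.Entry<String, ?> entry : values.entrySet()) {
            Object value = entry.getValue();
            info.append(entry.getKey())
                .append(": ")
                .append(value == null ? "" : value)
                .append('\n');
        }
        return info.toString();
    }
}
```

A LinkedHashMap keeps the entries in insertion order, so the printed order would follow the order in which the properties were added. Whether a map can be returned directly to the scripting console is something I have not checked.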

Mostly polishing is left when it comes to MoSS, but (perhaps) also bigger parts of the mask combination. It depends on the outcome of my MoSS tests (which I will do when it's not Friday afternoon and I have a sharp mind).

Next week's main focus is to develop SPARQL queries again!

Friday 12 February 2010

Approaching substructure mining

With a simple query like the one below, random compounds for kinases from the Tk family are collected. I would like to filter the standard value to be under a certain value, but I have some problems doing that in Bioclipse; directly via the SPARQL endpoint I have managed to create this filter. I will work on it.

var allsmiles = " \
PREFIX onto: \
PREFIX blueobelisk: \
\
SELECT DISTINCT ?smiles \
WHERE { \
?target a onto:Target . \
?target onto:classL5 \"Tk\" . \
?target onto:classL6 ?L6 . \
?assay onto:hasTarget ?target . \
?activity onto:onAssay ?assay . \
?activity onto:standardValue ?st . \
?activity onto:forMolecule ?mol . \
?mol blueobelisk:smiles ?smiles . \
}LIMIT 20 \
";
var all = rdf.sparqlRemote("http://rdf.farmbio.uu.se/chembl/sparql", allsmiles)
The variable all now contains a list of molecules, which is saved to a file via the net.bioclipse.moss.business plug-in.

moss.saveMoss(String fileName, List all)
creates a file in the format that MoSS supports (id, threshold value, description). An excerpt (not complete):
0,0,Cc1nc(N)sc1c2ccnc(Nc3cccc(c3)[N+](=O)[O-])n2
1,0,COc1cc(Nc2c(cnc3cc(OCCC4CCN(C)CC4)c(OC)cc23)C#N)c(Cl)cc1Cl
2,0,CCOc1cc(Nc2c(cnc3cc(OCC4CCN(C)CC4)c(OC)cc23)C#N)c(Cl)cc1Cl
3,0,COc1ccc(C)c(Nc2c(cnc3cc(OCC4CCN(C)CC4)c(OC)cc23)C#N)c1
4,0,COc1ccc(Cl)c(Nc2c(cnc3cc(OCC4CCN(C)CC4)c(OC)cc23)C#N)c1
5,0,COc1cc2c(Nc3ccc(C)cc3C)c(cnc2cc1OCC4CCN(C)CC4)C#N
6,0,COc1cc(Nc2c(cnc3cc(OCC4CCN(C)CC4)c(OC)cc23)C#N)c(C)cc1C


This file now has to be initialized; then add the parameters for the run and, when done, simply run.

> moss.saveMoss("/Moss/Test/collected", all)
> moss.init("/Moss/Test/collected")
done
> moss.setLimits(10,2)
> moss.run("/Moss/Test/collectedOut", "/Moss/Test/collectedOutId")


Only two basic parameter settings work at the moment; the rest will be added as soon as possible. It will take time, though, since lots of parameters are set by combining flags, which I remember to be a crucial thing to get right.

To read about how MoSS works, how to understand the output files, etc., look at Christian Borgelt's homepage, http://www.borgelt.net/doc/moss/moss.html.
Output file (not complete):
id,description,nodes,edges,s_abs,s_rel,c_abs,c_rel
1,n1:c2:c(:c(-N-c3:c(-Cl):c:c(-Cl):c(-O-C):c:3):c(-C#N):c:1):c:c(-O-C):c(-O-C-C1-C-C-N(-C-C-1)-C):c:2,34,37,2,10.0,0,0.0
2,n1:c2:c(:c(-N-c3:c(-Cl):c:c:c(-O-C):c:3):c(-C#N):c:1):c:c(-O-C):c(-O-C-C1-C-C-N(-C-C-1)-C):c:2,33,36,3,15.0,0,0.0
3,n1:c2:c(:c(-N-c3:c(-Cl):c:c(-Cl):c(-O-C):c:3):c(-C#N):c:1):c:c(-O-C):c(-O-C-C(-C-C)-C):c:2,31,33,3,15.0,0,0.0
4,n1:c2:c(:c(-N-c3:c(-Cl):c:c:c(-O-C):c:3):c(-C#N):c:1):c:c(-O-C):c(-O-C-C(-C-C)-C):c:2,30,32,4,20.0,0,0.0


Output file Id (not complete):
id:list
1:2,10
2:2,4,10
3:2,9,10
4:2,4,9,10
5:2,7,10
6:2,4,7,10
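To work with this file later (for example in a table), the lines could be read into a map. The sketch below is only an illustration: it assumes that each line maps one substructure id to a comma-separated list of ids of the molecules that contain it, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for lines such as "4:2,4,9,10".
final class MossIdFileSketch {

    private MossIdFileSketch() {}

    // Skips blank lines and the "id:list" header; keeps file order.
    static Map<Integer, List<Integer>> parse(List<String> lines) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.equals("id:list")) {
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in line: " + line);
            }
            int id = Integer.parseInt(trimmed.substring(0, colon).trim());
            List<Integer> members = new ArrayList<>();
            String rest = trimmed.substring(colon + 1).trim();
            if (!rest.isEmpty()) {
                for (String part : rest.split(",")) {
                    members.add(Integer.parseInt(part.trim()));
                }
            }
            result.put(id, members);
        }
        return result;
    }
}
```

The returned map could then be joined with the substructure output file on the id column.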


I want to be able to visualize the results in tables later on, perhaps together with the input and other information collected via SPARQL.

Tuesday 9 February 2010

Fun stuff with SPARQL

I will give you an example of a SPARQL query. I've been running them on the snorql interface http://rdf.farmbio.uu.se/chembl/snorql/ which is based on ChEMBL02.

Example 1.
This experiment started out with me wanting to know more about activities: their standard values, units and types. But then I kept going, looking at the molecules connected to a specific activity, which led me to collect their SMILES. Through the connection between activities and resources I managed to get their PubMed IDs. Via the assay ID I managed to get targets, and I filtered the organism to Homo sapiens. Figure 1 displays the result from the example code.
Example code:
PREFIX chemblt:
PREFIX hmm:
PREFIX onto:
PREFIX blueobelisk:
PREFIX dbpedia:

SELECT DISTINCT ?target ?organism ?activities ?smiles ?type ?unit ?sval ?res ?pubmed
WHERE {
#get activities with its data
?activities a onto:Activity .
?activities onto:standardValue ?sval .
?activities onto:type ?type .

?activities onto:standardUnits ?unit .
#get compounds for those activities
?activities onto:forMolecule ?mol .
?mol blueobelisk:smiles ?smiles .


# get resource id and pubmed article
?activities onto:extractedFrom ?res .
?res ?pubmed.

#get assay for activity
?activities onto:onAssay ?assay .
?assay onto:hasTarget ?target .
?target onto:hasTargetType chemblt:PROTEIN .
?target onto:organism ?organism .

FILTER regex(?organism, "Homo sapiens") .
FILTER regex(?type, "^Kd") .

}LIMIT 5

Figure 1. The results for example 1.

I also began to run queries that are more suitable for my work: queries that are able to differentiate between kinase protein families (http://www.sarfari.org/kinasesarfari/family). For instance:
Figure 2. An example of targets that belong to the same protein family Tk.

Monday 8 February 2010

A whole new world...

I've seen a completely new world when looking into the functions of Bio2RDF. I see great linking between knowledge bases. Being able to collect information from one knowledge base and link to another, obtaining more information and always extending knowledge, is great!

In my work of running queries against ChEMBL to collect active (later on also inactive) compounds, I find this linking valuable. For example, DrugBank holds lots of great information about a compound: not only physical information but also IDs, such as the ChEBI ID, that make linking to ChEBI possible.

KEGG is another knowledge base that could be useful: Kegg:ligand, Kegg:drug and Kegg:compound. UniProt could provide article info, ChEBI compound info, PDB target protein data, etc.

I believe that users should be able to decide what kind of information they want or need. This could be achieved through interaction with Bioclipse. My aim is to use substructure mining on the drugs, but of course other aims should be possible (using other Bioclipse plug-ins than MoSS).

Perhaps a table representation is a nice way to display the data. And if lots of information about a drug is wanted, perhaps an info page is the way to display it.

DBpedia is another valuable source of resources: names, descriptions, InChIs, SMILES, images, etc. One major disadvantage is that unknown compounds will not be found.

And now I got a link to chem2bio2rdf from my supervisor. It has collected all chemical URIs in one place. I will immediately look into it and run queries!

Wednesday 3 February 2010

As my project description changed a bit and there have been other obstacles to get past, I haven't got as far as I expected. But at least now I have a primary goal.

I'm now focusing on selecting two protein families to run substructure mining on. Well, actually I have now narrowed that down to looking at one family (since they are big!). I want to find a protein that has ligands that are active. As there are many different types of activities, I need to dig deeper into this area. I think I also have to find a threshold for activity to reduce the number of ligands, as there are so many: the higher the affinity, the better.

I also looked a bit at the substructure mining algorithm that I'm using, MoSS. I have been trying to run random ligands, but it takes forever; most of the time the run didn't finish. As I assume that there are bugs to fix, I will try running the ligands through another piece of software to make sure that it is possible to process such complex structures with this kind of algorithm. If it works, great: MoSS then has some improvement steps to look forward to.

I also have to update MoSS to the current Bioclipse standard, such as implementing a manager.

Yesterday some parts of the SPARQL endpoint http://pele.farmbio.uu.se/chembl/snorql/ started to work again(!), which simplifies many things for me, as I have been able to run queries to find activities that are active and also to find their target IDs (tid).

A question on my mind is how to find family information. I can't seem to find any class for it...