This section focuses on the molecular targets on which the compounds active against BrCa models in the ChEMBL database have a significant impact. We approach the question both empirically (i.e., from the impact recorded in ChEMBL) and predictively, using predictions from machine learning algorithms for target identification. The two sets of results are then compared to determine whether the predictive analysis of the whole dataset adds anything new to the empirical annotations. Targets involved in killing particular BrCa cell lines are also determined by hierarchical clustering.
BRCA2 protein. By Filip em – self created from PDB entry with KiNG tool http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1n0w, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=3237307
Global Identification Process
The target identification process starts with the compounds screened in breast cancer phenotypic assays. The results for these compounds in molecular target-based assays are retrieved, and those with an active ChEMBL score are selected and pivoted to yield unique raw data by target. Each row contains the ratio between the number of events for the target that are also active in BrCa assays and the total count of events for that target (countRatio). In other words, countRatio is the proportion of times that molecules with a positive hit on a particular protein are also active in BrCa assays: 1 if always, 0 if never. In the chart below, this parameter is plotted against the total number of events per target (count(chemblActivityScore)) as a measure of confidence. Average potencies per target in non-BrCa molecular assays (Avg(chemblActivityScore)) and in BrCa assays (Avg(BrCaScore)) are depicted as color and size gradients, respectively. As an example, targets with countRatio >0.65 and total events >1 are marked in the chart, that is, targets where >65% of molecular events with a positive chemblActivityScore also have active events in BrCa phenotypic assays. This results in the identification of 61 putative BrCa targets.
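The pivot described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the (target, brca_active) tuple layout and the toy events are hypothetical and are not taken from the ChEMBL schema or from the tooling used to build the charts. Every event is assumed to already have a positive chemblActivityScore.

```python
from collections import defaultdict


def count_ratio_by_target(events):
    """Pivot molecular-assay events by target.

    `events` is an iterable of (target, brca_active) tuples (hypothetical layout),
    one per event with a positive chemblActivityScore. `brca_active` says whether
    the same compound is also active in BrCa phenotypic assays.
    Returns {target: (countRatio, total_events)}.
    """
    totals = defaultdict(int)
    brca_hits = defaultdict(int)
    for target, brca_active in events:
        totals[target] += 1
        if brca_active:
            brca_hits[target] += 1
    return {t: (brca_hits[t] / totals[t], totals[t]) for t in totals}


def select_targets(stats, min_ratio=0.65, min_events=1):
    """Keep targets with countRatio > min_ratio and more than min_events events."""
    return sorted(
        t for t, (ratio, n) in stats.items() if ratio > min_ratio and n > min_events
    )


# Toy data, invented for illustration only.
events = [
    ("Target A", True), ("Target A", True), ("Target A", False),
    ("Target B", True),
    ("Target C", False), ("Target C", False),
]
stats = count_ratio_by_target(events)
selected = select_targets(stats)
print(selected)
```

With these toy events only Target A is kept: Target B has a countRatio of 1 but a single event, so it fails the confidence filter, which is the reason the total event count is shown alongside the ratio.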
This selection is carried out on a dataset of 170k molecular assay events. However, the predictions from the BrCa activity prediction section can be used to look for potential targets among the 13M events that make up the entire database. In this case, the random forest regression output is used as an indicator of BrCaScore. Using predictions on this larger target dataset increases the chances of identifying unforeseen targets.
Some of the targets identified from the actual data also appear in the plot from predicted data, but few of the targets from the prediction are in the BrCa compound dataset. The results are plotted below in bar charts, sorted by potency.
From actual data…
From predicted data…
A small dashboard in which the targets marked on the left chart (actual) are displayed on the right one (prediction). Although most of the targets also appear in the right chart, there is an evident shift to lower countRatio values.
In this second dashboard, the targets identified from the predicted BrCa score are on the left, and marking them displays them in the right plot. Only the two most frequent appear in the plot of actual targets. Should we trust the predictions?
Breast cancer is one of the research subjects with the largest amount of published information, so the table below shows how the predicted targets with an average score >6 fare in the literature.
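The comparison made by the two dashboards can also be expressed numerically. The sketch below assumes two hypothetical dictionaries mapping each target to its countRatio, one built from actual BrCa scores and one from predicted scores. The names and values are invented for illustration.

```python
def compare_target_sets(actual, predicted):
    """Compare countRatio values for targets found by both approaches.

    `actual` and `predicted` map target name -> countRatio (hypothetical inputs).
    Returns the shared targets, the per-target shift (predicted - actual)
    and the mean shift over the shared targets.
    """
    shared = sorted(set(actual) & set(predicted))
    shifts = {t: predicted[t] - actual[t] for t in shared}
    mean_shift = sum(shifts.values()) / len(shifts) if shifts else 0.0
    return shared, shifts, mean_shift


# Toy values, invented for illustration only.
actual = {"Target A": 0.90, "Target B": 0.80, "Target C": 0.70}
predicted = {"Target A": 0.70, "Target B": 0.60, "Target D": 0.95}

shared, shifts, mean_shift = compare_target_sets(actual, predicted)
only_predicted = sorted(set(predicted) - set(actual))
print(shared, only_predicted, round(mean_shift, 3))
```

A negative mean shift corresponds to the move towards lower countRatio values seen in the first dashboard, and the `only_predicted` list holds the candidates that the empirical data alone would not have surfaced.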
proteinName | Avg(chemblScore) | countRatio | Organism(s) | Events | Ref
Serine/threonine protein phosphatase 2A- 56 kDa regulatory subunit- alpha isoform
We can see that most of the predicted targets have a literature reference linking them to breast cancer biomarkers or therapies. Those that do not belong to species other than human, e.g. E. coli, Pseudomonas, HIV or Plasmodium. The final target selection therefore includes proteins selected by either their actual or their predicted activity scores. The plot below shows the selection criterion for targets (>0.5 count ratio & > event of activity, predicted or actual).
Here are the corresponding bar charts for the average potencies and countRatios with the best values:
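One way to combine the two selections into a single labelled list is sketched below. Several points are assumptions rather than the method used here: the input layout (target mapped to a countRatio and an event count), the rule that a target qualifying in both sets is labelled "Actual", and the event threshold, which is passed explicitly because its value is not stated in the text above.

```python
def merge_selections(actual, predicted, *, min_ratio, min_events):
    """Merge targets that pass the criterion in either dataset.

    `actual` and `predicted` map target -> (countRatio, n_events); this layout
    is hypothetical. A target qualifying in both is labelled "Actual"
    (an assumption: measured data takes precedence over predictions).
    """
    def passes(stat):
        ratio, n_events = stat
        return ratio > min_ratio and n_events > min_events

    merged = {}
    for target, stat in predicted.items():
        if passes(stat):
            merged[target] = "Prediction"
    for target, stat in actual.items():
        if passes(stat):
            merged[target] = "Actual"  # overrides a "Prediction" label if present
    return merged


# Toy values, invented for illustration only.
actual = {"Target A": (0.8, 5), "Target B": (0.4, 10)}
predicted = {"Target A": (0.6, 50), "Target C": (1.0, 2), "Target D": (1.0, 1)}

# min_events=1 is a placeholder value chosen for this example.
merged = merge_selections(actual, predicted, min_ratio=0.5, min_events=1)
print(merged)
```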
And here, the table with the corresponding values:
Targets identified from predicted and actual BrCa scores.
proteinName | Avg(chemblScoreC) | Count(chemblScoreC) | countRatio | Organism(s) | proteinClassDescription | BrCaActivities
CDK6/cyclin D2 | 4.301029995664 | 4 | 1 | Homo sapiens | cytosolic other, enzyme kinase protein kinase cmgc | Prediction
DNA topoisomerase type IB small subunit | 4.39 | 3 | 1 | Leishmania major | enzyme | Actual
DNA topoisomerase | 4.3948239965312 | 5 | 0.6 | Bos taurus, Yersinia pestis | enzyme | Actual
Orexin receptor 2 | 4.397940008672 | 2 | 1 | Homo sapiens | membrane receptor 7tm1 peptide short peptide orexin receptor | Actual
Histone deacetylase 5 | 4.4098528577623 | 7 | 0.57142857142857 | Homo sapiens | epigenetic regulator eraser hdac hdac class iia | Actual
Heat shock protein HSP90 | 4.4317861297169 | 208 | 0.94230769230769 | Homo sapiens | cytosolic other | Actual
Folylpoly-gamma-glutamate synthetase | 4.5006826451167 | 19 | 0.68421052631579 | Homo sapiens, Mus musculus | enzyme | Actual
MAP kinase p38 | 4.5705908707523 | 160 | 0.6 | Homo sapiens | enzyme kinase protein kinase cmgc mapk p38 | Actual
Type 1 fimbiral adhesin FimH | 4.613 | 10 | 0.6 | Escherichia coli (strain UTI89 / UPEC) | unclassified | Prediction
Cytochrome P450 165B3 | 4.615 | 2 | 1 | Amycolatopsis orientalis | enzyme | Prediction
Histone deacetylase 10 | 4.625 | 2 | 1 | Homo sapiens | epigenetic regulator eraser hdac hdac class iib | Actual
Phosphate system positive regulatory protein PHO81 | | | | | enzyme kinase protein kinase cmgc cdk cdk5, unclassified | Prediction
Multidrug resistance-associated protein 7 | 4.9535565938864 | 2 | 1 | Homo sapiens | transporter ntpase atp binding cassette mrp | Actual
Adenosine deaminase | 4.9826503431423 | 7 | 0.57142857142857 | Bos taurus, Homo sapiens | enzyme, enzyme hydrolase | Actual
DNA topoisomerase I | 4.9861420070915 | 226 | 0.85398230088496 | Homo sapiens, Leishmania donovani donovani, Mus musculus | enzyme, enzyme isomerase, unclassified | Actual
DNA topoisomerase II | 4.9903630390779 | 101 | 0.8019801980198 | Drosophila melanogaster, Homo sapiens | enzyme, enzyme isomerase | Actual
Uroporphyrinogen-III synthase | 5 | 11 | 0.90909090909091 | Homo sapiens | enzyme | Prediction
Candidapepsin-1 | 5 | 3 | 0.66666666666667 | Candida albicans | enzyme protease aspartic aa a1a | Prediction
V-type proton ATPase subunit B- brain isoform | 5 | 2 | 1 | Homo sapiens | enzyme hydrolase, transporter ntpase f-type and v-type v-type atpase | Prediction
Complex of retinoic acid binding (CRABPII) and inhibitor of apoptosis (cIAP1) proteins | 5 | 36 | 0.77777777777778 | Homo sapiens | auxiliary transport protein fabp, enzyme | Prediction
Kallikrein 8 | 5 | 2 | 1 | Homo sapiens | enzyme protease serine pas s1a | Prediction
Thioredoxin reductase 2 | 5 | 2 | 1 | Homo sapiens | enzyme | Prediction
Pho80/Pho85 | 5 | 4 | 1 | Saccharomyces cerevisiae S288c | enzyme kinase protein kinase cmgc cdk cdk5, unclassified | Prediction
Chitinase-3-like protein 3 | 5 | 4 | 1 | Mus musculus | unclassified | Prediction
Alpha-1-antiproteinase | 5 | 2 | 1 | Mus caroli | unclassified | Prediction
Pro-cathepsin H | 5 | 2 | 1 | Mus musculus | enzyme | Prediction
Lipase | 5 | 9 | 0.55555555555556 | Thermomyces lanuginosus | enzyme | Prediction
Toll-like receptor 4/MD-2 | 5 | 120 | 1 | Homo sapiens | membrane receptor, surface antigen | Prediction
Toll-like receptor 4/MD-2/CD14 | 5 | 9 | 1 | Homo sapiens | membrane receptor, surface antigen | Prediction
Proteasome subunit beta type-2 | 5 | 2 | 1 | Mus musculus | enzyme | Prediction
JNK1/JNK2 | 5 | 4 | 1 | Mus musculus | enzyme | Prediction
Heme oxygenase 1 | 5 | 3 | 1 | Mus musculus | enzyme | Actual
Mitogen-activated protein kinase; ERK1/ERK2 | 5 | 8 | 0.5 | Homo sapiens | enzyme kinase protein kinase cmgc mapk erk | Actual
ORAI 1/2/3 | 5.0076190476191 | 567 | 0.76190476190476 | Homo sapiens | ion channel other misc crac-c | Prediction
Latent membrane protein 1 | 5.0106697555004 | 16 | 0.5625 | Human herpesvirus 4 (strain B95-8) | unclassified | Actual
Histone deacetylase | 5.0258640869132 | 10364 | 0.59108452335006 | Homo sapiens, Plasmodium falciparum, Rattus norvegicus | enzyme, epigenetic regulator eraser hdac hdac class i, epigenetic regulator eraser hdac hdac class iia, epigenetic regulator eraser hdac hdac class iib, epigenetic regulator eraser hdac hdac class iv |
Serine/threonine protein phosphatase 2A- 56 kDa regulatory subunit- alpha isoform | 10.013333333333 | 3 | 1 | Homo sapiens | enzyme phosphatase protein phosphatase reg | Prediction
All targets for which more than 50% of the events with a positive ChEMBL score inhibit BrCa cell line growth, in either real or virtual experiments.
Target Selection Upon Specific Cell Lines.
So far we have used the average BrCa scores from experiments carried out with 8 BrCa cell lines, but it is well known that the genotypic and phenotypic differences among them translate into differences in pharmacology. This may be relevant for the selective treatment of the tumors that these cell lines represent, and the procedure is applicable to clinical databases with extensive and up-to-date tumor treatment outcomes.
For this purpose, let’s do a simple classification of the tumors via hierarchical clustering performed on the global results of the BrCa compounds in the whole ChEMBL database, after removing all results from tumor or phenotypic (non-molecular-target) assays. The chart below shows the results of the clustering with a focus on three cell lines (MCF7, MDA-MB-435, and MDA-MB-231). The cell line specific clusters are areas where the activity on one cell line is the greatest (red) and the activity on the rest is minimal (yellow). Experiments with specific areas of activity are marked in red on the chart.
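To make the clustering step concrete, here is a minimal, dependency-free sketch of agglomerative hierarchical clustering. The linkage (average) and the distance (Euclidean) are assumptions, since the text does not say which were used, and the activity profiles are invented. Each profile holds hypothetical scores in MCF7, MDA-MB-435 and MDA-MB-231, in that order.

```python
import math
from itertools import combinations

CELL_LINES = ("MCF7", "MDA-MB-435", "MDA-MB-231")


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_clusters(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(points, clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]


def dominant_cell_line(points, cluster):
    """Cell line with the highest mean score inside a cluster."""
    means = [
        sum(points[i][k] for i in cluster) / len(cluster)
        for k in range(len(CELL_LINES))
    ]
    return CELL_LINES[means.index(max(means))]


# Toy activity profiles (one per experiment), invented for illustration only.
profiles = [
    (7.8, 4.1, 4.0), (7.5, 4.3, 4.2),  # high in MCF7 only
    (4.0, 7.6, 4.1), (4.2, 7.9, 3.9),  # high in MDA-MB-435 only
    (4.1, 4.0, 7.7), (3.9, 4.2, 7.4),  # high in MDA-MB-231 only
]
clusters = agglomerative_clusters(profiles, n_clusters=3)
labels = [dominant_cell_line(profiles, c) for c in clusters]
print(list(zip(labels, clusters)))
```

On real data a library routine would be the practical choice. The explicit loop is used here only to show what a cell line specific cluster means: a group of experiments whose profiles are close to each other and dominated by a single cell line.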
Once the results are collected, the cell line specific results are processed in the same way as in the global target identification procedure. Results are pivoted by target, and the average potency in the specific cell line is calculated alongside the number of experimental data points (countChemblScore). This is compared with the average BrCaScore of the same experimental events, calculated from all BrCa cell lines. To facilitate selection and interpretation, a selectivity index between the specific cell line and the average BrCaScore is added to the plots.
The charts below show the dashboards used for selection. On the left, the selectivity index is compared to the number of events in the DB. Entries with the highest indexes are then marked, which plots them in the right chart, where the potency in each particular cell line is compared to the average BrCaScore along with the y=x, y=x+1 and y=x-1 lines.
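The values tabulated in this section are consistent with a selectivity index computed as a simple difference, AvgMCF7Score - Avg(BrCaScore) (for example, 7.833 - 5.034 = 2.799 for folate receptor alpha). The sketch below applies that reading to a per-target pivot. The event layout, the function name and the toy numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def selectivity_by_target(events):
    """Pivot events by target and compute a selectivity index.

    `events` is an iterable of (target, cell_line_score, brca_score) tuples
    (hypothetical layout). The index is taken as
    mean(cell_line_score) - mean(brca_score), an assumption that matches
    the tabulated values in this section.
    """
    grouped = defaultdict(list)
    for target, cell_score, brca_score in events:
        grouped[target].append((cell_score, brca_score))

    result = {}
    for target, pairs in grouped.items():
        avg_cell = mean(p[0] for p in pairs)
        avg_brca = mean(p[1] for p in pairs)
        result[target] = {
            "count": len(pairs),
            "avg_cell": avg_cell,
            "avg_brca": avg_brca,
            "selectivity": avg_cell - avg_brca,
        }
    return result


# Toy events, invented for illustration only.
events = [
    ("Target A", 7.8, 4.0),
    ("Target A", 7.6, 4.4),
    ("Target B", 5.0, 5.2),
]
table = selectivity_by_target(events)
ranked = sorted(table, key=lambda t: table[t]["selectivity"], reverse=True)
print(ranked)
```

Since the scores are on a logarithmic potency scale, an index of 1 corresponds to roughly a tenfold difference in potency between the specific cell line and the BrCa average.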
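The reference lines in the right chart split the points into three bands. A small sketch of that reading follows, assuming the average BrCa score is on the x axis and the cell line score on the y axis (the text does not state the axis assignment). The function name and sample values are hypothetical.

```python
def band(avg_brca_score, cell_line_score):
    """Place a point relative to the y=x+1 and y=x-1 reference lines.

    Assumes x = average BrCa score and y = cell line score.
    """
    diff = cell_line_score - avg_brca_score
    if diff > 1:
        return "above y=x+1"  # over one log unit more potent in the cell line
    if diff < -1:
        return "below y=x-1"  # over one log unit less potent in the cell line
    return "within band"


# Toy points, invented for illustration only.
points = {"Target A": (4.0, 7.8), "Target B": (5.0, 5.5), "Target C": (6.0, 4.5)}
bands = {name: band(x, y) for name, (x, y) in points.items()}
print(bands)
```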
And the corresponding tables containing the selected targets:
PROTEIN_NAME | Avg(BrCaScore) | Count(ChemblActivityScore) | AvgMCF7Score | selectivityIndex
Dihydrofolate reductase | 3.9242332834135 | 7 | 7.833 | 3.9087667165865
Dihydrofolate reductase | 5.8513333333333 | 15 | 7.833 | 1.9816666666667
Dihydrofolate reductase | 5.6658823529412 | 17 | 7.833 | 2.1671176470588
Thymidylate synthase | 4.2058770028771 | 13 | 7.833 | 3.6271229971229
Thymidylate synthase | 5.0226639688564 | 22 | 7.833 | 2.8103360311436
Thymidylate synthase | 4.4166666666667 | 6 | 7.2349166666667 | 2.81825
GAR transformylase | 4.6822333069284 | 4 | 7.833 | 3.1507666930716
Dihydrofolate reductase | 4.1387041229868 | 4 | 7.833 | 3.6942958770132
6-O-methylguanine-DNA methyltransferase | 3 | 1 | 6.019 | 3.019
Thymidylate synthase | 4.7 | 1 | 7.833 | 3.133
Dihydrofolate reductase | 3.6382721639824 | 1 | 7.833 | 4.1947278360176
Acetylcholinesterase | 3.09087893653 | 11 | 5.8754615384615 | 2.7845826019316
GAR transformylase | 6.6521464575467 | 6 | 7.833 | 1.1808535424533
Folate transporter 1 | 5.7792182518111 | 5 | 7.833 | 2.0537817481889
Arachidonate 5-lipoxygenase | 3.9958037874723 | 2 | 6.251 | 2.2551962125276
Monoamine oxidase A | 2 | 1 | 6.146 | 4.146
Monoamine oxidase B | 2 | 2 | 4.2445 | 2.2445
Cytochrome P450 3A4 | 2.978947368421 | 19 | 6.14765 | 3.168702631579
DNA topoisomerase II alpha | 4.61 | 2 | 6.436 | 1.826
Indoleamine 2,3-dioxygenase | 3.8624693683042 | 2 | 5.6645 | 1.8020306316959
Proton-coupled folate transporter | 6.0120986299462 | 12 | 7.833 | 1.8209013700538
Folate receptor alpha | 5.034 | 5 | 7.833 | 2.799
Folate receptor beta | 5.2733333333333 | 3 | 7.833 | 2.5596666666667
Cytochrome P450 1A2 | 3.9766666666667 | 12 | 6.04671875 | 2.0700520833333
Quinone reductase 2 | 3.77 | 2 | 6.48 | 2.71
Menin/Histone-lysine N-methyltransferase MLL | 4.3625 | 16 | 6.85 | 2.4875
Thyroid hormone receptor beta-1 | 4.2374999514771 | 4 | 7.2075 | 2.9700000485229
Endoplasmic reticulum-associated amyloid beta-peptide-binding protein | 4.95 | 2 | 6.8235 | 1.8735
Lysine-specific demethylase 4D-like | 4.5 | 2 | 6.8235 | 2.3235
Signal transducer and activator of transcription 6 | 5.4666666666667 | 3 | 7.617 | 2.1503333333333
Ras-related protein Rab-9A | 4.25 | 2 | 6.8185 | 2.5685
Survival motor neuron protein | 2 | 8 | 5.997 | 3.997
Pyruvate kinase | 3.525 | 2 | 7.617 | 4.092
Pyruvate kinase isozymes M1/M2 | 3.425 | 2 | 7.617 | 4.192
Aldehyde dehydrogenase 1A1 | 4.75 | 4 | 6.65875 | 1.90875
Nonstructural protein 1 | 2.4664795276306 | 5 | 6.7508 | 4.2843204723694
MAP kinase ERK2 | 2.2181818181818 | 11 | 5.8644545454546 | 3.6462727272727
Caspase-1 | 2.3181818181818 | 11 | 5.8600909090909 | 3.5419090909091
Neuropeptide S receptor | 5 | 2 | 6.8185 | 1.8185
Cytochrome P450 2D6 | 2.8666666666667 | 15 | 5.9415666666667 | 3.0749
Cytochrome P450 2C9 | 3.2 | 16 | 6.04628125 | 2.84628125
Nitric oxide synthase, inducible | 2.1763001963315 | 11 | 5.7262727272727 | 3.5499725309412
Matrix metalloproteinase 9 | 2.3226974474426 | 11 | 5.7262727272727 | 3.4035752798302
Matrix metalloproteinase-1 | 2.3204329899232 | 11 | 5.7262727272727 | 3.4058397373496
Beta-glucocerebrosidase | 4.4000001860566 | 1 | 7.617 | 3.2169998139434
Lysine-specific demethylase 4A | 4.3549999847922 | 30 | 6.1392 | 1.7842000152078
Beta-2 adrenergic receptor | 2.2227273160387 | 11 | 5.8644545454546 | 3.6417272294159
Putative hexokinase HKDC1 | 5.45 | 2 | 8 | 2.55
Tumor susceptibility gene 101 protein | 4.6933316382564 | 2 | 7.08 | 2.3866683617436
Cytochrome P450 2C19 | 2.4285714285714 | 14 | 5.9393928571429 | 3.5108214285714
Serine hydroxymethyltransferase, cytosolic | 4.4253555070293 | 4 | 7.833 | 3.4076444929707
Thymidylate synthase (EC 2.1.1.45) (TS) (TSase) | 4.12 | 1 | 7.833 | 3.713
Chromobox protein homolog 1 | 4.3000000202531 | 7 | 6.5681428571429 | 2.2681428368898
Delta opioid receptor | 2.1849346390687 | 11 | 6.0444545454545 | 3.8595199063858
DNA polymerase iota | 4.5881903578905 | 26 | 6.4524230769231 | 1.8642327190326
Glucose-6-phosphate 1-dehydrogenase | 2 | 1 | 6.232 | 4.232
mRNA interferase MazF | 4.2882736040518 | 4 | 8 | 3.7117263959482
Thioredoxin reductase 1, cytoplasmic | 4.7692325381619 | 13 | 6.4742307692308 | 1.7049982310688
Regulator of G-protein signaling 4 | 4.3026465986627 | 9 | 6.2887777777778 | 1.9861311791151
Serotonin 5a (5-HT5a) receptor | 4.0342810297558 | 1 | 8 | 3.9657189702442
Opioid receptors; mu & delta | 5.01 | 2 | 8 | 2.99
DNA polymerase kappa | 4.6227151804016 | 13 | 6.5556923076923 | 1.9329771272907
Luciferin 4-monooxygenase | 4.9459800515598 | 2 | 6.818 | 1.8720199484402
Adenosine A1 receptor | 2 | 10 | 5.8489 | 3.8489
Adenosine A2a receptor | 2 | 10 | 5.8489 | 3.8489
Adenosine A3 receptor | 2.676 | 10 | 5.8489 | 3.1729
Alpha-1a adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Alpha-1b adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Alpha-1d adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Alpha-2a adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Alpha-2b adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Alpha-2c adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Beta-1 adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Beta-3 adrenergic receptor | 2 | 10 | 5.8489 | 3.8489
Norepinephrine transporter | 2.506 | 10 | 5.8489 | 3.3429
Aldose reductase | 2 | 10 | 5.8489 | 3.8489
Angiotensin II type 2 (AT-2) receptor | 2 | 10 | 5.8489 | 3.8489
Bradykinin B2 receptor | 2 | 10 | 5.8489 | 3.8489
Calcitonin receptor | 2 | 10 | 5.8489 | 3.8489
Cannabinoid CB1 receptor | 2 | 10 | 5.8489 | 3.8489
Carbonic anhydrase II | 2 | 10 | 5.8489 | 3.8489
C-C chemokine receptor type 2 | 2 | 10 | 5.8489 | 3.8489
C-C chemokine receptor type 4 | 2 | 10 | 5.8489 | 3.8489
C-C chemokine receptor type 5 | 2 | 10 | 5.8489 | 3.8489
Interleukin-8 receptor A | 2 | 10 | 5.8489 | 3.8489
Interleukin-8 receptor B | 2 | 10 | 5.8489 | 3.8489
Cholecystokinin A receptor | 2 | 10 | 5.8489 | 3.8489
Cyclooxygenase-1 | 2 | 10 | 5.8489 | 3.8489
Cyclooxygenase-2 | 2 | 10 | 5.8489 | 3.8489
Cytochrome P450 2A6 | 2 | 10 | 5.8489 | 3.8489
Cytochrome P450 2E1 | 2 | 10 | 5.8489 | 3.8489
Dopamine D1 receptor | 2 | 10 | 5.8489 | 3.8489
Dopamine D2 receptor | 2 | 10 | 5.8489 | 3.8489
Dopamine D3 receptor | 2 | 10 | 5.8489 | 3.8489
Dopamine D4 receptor | 2 | 10 | 5.8489 | 3.8489
Dopamine transporter | 2 | 10 | 5.8489 | 3.8489
Endothelin receptor ET-A | 2 | 10 | 5.8489 | 3.8489
Estrogen receptor alpha | 2 | 10 | 5.8489 | 3.8489
Estrogen receptor beta | 2 | 10 | 5.8489 | 3.8489
Glucocorticoid receptor | 2 | 10 | 5.8489 | 3.8489
Glycine receptor | 2 | 40 | 5.8489 | 3.8489
Histamine H1 receptor | 2 | 10 | 5.8489 | 3.8489
Histamine H2 receptor | 2 | 10 | 5.8489 | 3.8489
HMG-CoA reductase | 2 | 10 | 5.8489 | 3.8489
Insulin receptor | 2 | 10 | 5.8489 | 3.8489
Leukotriene C4 synthase | 2 | 10 | 5.8489 | 3.8489
Cysteinyl leukotriene receptor 1 | 2 | 10 | 5.8489 | 3.8489
Arachidonate 15-lipoxygenase | 2 | 10 | 5.8489 | 3.8489
Melanocortin receptor 3 | 2 | 10 | 5.8489 | 3.8489
Melanocortin receptor 4 | 2 | 10 | 5.8489 | 3.8489
Melanocortin receptor 5 | 2 | 10 | 5.8489 | 3.8489
Monoamine oxidase A | 2.3577777777778 | 9 | 5.8489 | 3.4911222222222
Muscarinic acetylcholine receptor M1 | 2 | 10 | 5.8489 | 3.8489
Muscarinic acetylcholine receptor M2 | 2 | 10 | 5.8489 | 3.8489
Muscarinic acetylcholine receptor M3 | 2 | 10 | 5.8489 | 3.8489
Muscarinic acetylcholine receptor M4 | 2 | 10 | 5.8489 | 3.8489
Muscarinic acetylcholine receptor M5 | 2 | 10 | 5.8489 | 3.8489
Neuropeptide Y receptor type 1 | 2 | 10 | 5.8489 | 3.8489
Neuropeptide Y receptor type 2 | 2 | 10 | 5.8489 | 3.8489
Nitric-oxide synthase, brain | 2 | 10 | 5.8489 | 3.8489
Kappa opioid receptor | 2 | 10 | 5.8489 | 3.8489
Mu opioid receptor | 2.1849346390687 | 11 | 6.0444545454545 | 3.8595199063858
Phosphodiesterase 5A | 2 | 10 | 5.8489 | 3.8489
Platelet activating factor receptor | 2 | 10 | 5.8489 | 3.8489
HERG | 2 | 10 | 5.8489 | 3.8489
Progesterone receptor | 2 | 10 | 5.8489 | 3.8489
Angiotensin-converting enzyme | 2 | 10 | 5.8489 | 3.8489
Cathepsin G | 2 | 10 | 5.8489 | 3.8489
Leukocyte elastase | 2 | 10 | 5.8489 | 3.8489
Protein kinase C alpha | 2 | 10 | 5.8489 | 3.8489
MAP kinase ERK1 | 2.535 | 12 | 5.87725 | 3.34225
MAP kinase p38 alpha | 2 | 10 | 5.8489 | 3.8489
Serine/threonine protein phosphatase 2B catalytic subunit, alpha isoform