Assignment 2 MATH1309
Worth 40%
Due date: June 7th 2020, 11.59pm
Put your SAS code, output and answers into ONE pdf and submit that single pdf in Canvas.
One pdf ONLY please.
Late submissions will incur a penalty of 10% per day.
Data file: MATH1309 Drug Bank DATA for Assignment2.xlsx
Rubric is at the end of the Assignment
Resources: uploaded to Canvas for Assignment 2 and guidance for SAS
SAS notes for Assignment 2 Iris DA and STEPDISCRIM.pdf (Week 10)
Irene’s SAS notes for Assignment 2 & Lab for PCA Week 8-9.pdf (sent in Week 8)
HINTS AND NOTES TO LEARN AND TO INTERPRET THE DISCRIM OUTPUT IN SAS
• By including pool=test, SAS tests whether the group variance-covariance matrices are equal
(Bartlett’s test) and decides what kind of discriminant analysis to carry out based on the result.
• If the test fails to reject, then SAS will automatically do a linear discriminant analysis (LDF).
• If the test rejects, then SAS will do a quadratic discriminant analysis (QDF).
• There are two other options also. If we put pool=yes then SAS will conduct a linear discriminant
analysis whether it is warranted or not. It will pool the variance-covariance matrices of the 2
classes/groups and do a linear discriminant analysis without reporting Bartlett’s test. If we put
pool=no then SAS keeps the within-group variance-covariance matrices separate and does a
quadratic discriminant analysis.
MATH1309 ASSIGNMENT 2 OVERVIEW
Assessing Druggability in drug discovery: A Bioinformatics study
Description
(refer to MATH1309 Drug Bank DATA for Assignment2.xlsx)
• Drug-likeness is not a precisely defined concept in drug discovery, yet predicting druggability is of
high practical relevance in pharmaceutical research. In vitro absorption, distribution, metabolism
and elimination (ADME) assays are now conducted throughout the drug discovery process,
but there is still a need to develop faster and better analytic methods to enhance the ‘developability’
of drug leads, and to formalise strategies for ADME assessment of good molecular candidates in
the drug discovery and pre-clinical stages.
• This study involves data on 1,279 small molecules retrieved from the DrugBank 3.0 database, a unique
chem-informatics resource analysed by Hudson et al. (2014, 2017, 2019, 2020).
• The data set contains 9 physico-chemical variables (MW, PSA, Log P, Log D, etc.) and the
molecule’s mode of delivery (oral versus non-oral). See Table 1 below.
Molecular Weight (MW) |
LogP |
HB donors |
HB acceptors |
Polar Surface Area (PSA) |
ROT BONDS |
Number of N,O atoms (NATOM) |
Rings number (NRING) |
Log D |
Table 1
In addition, the data set contains new druggability rules (score functions counting up violations for each
molecule on each of the 9 variables) developed by Hudson et al. These account for the molecule’s size,
permeability etc., but use new cutpoints for each of the 9 molecular parameters, different from those
conventionally used by the FDA (Lipinski’s rule). Both sets of cutpoints are given in Table 2.
Work by Hudson et al. based on the 9 molecular variables found distinct clusters of molecules, identified
as “poor” versus “good” druggables. The data set contains the 9 ADME variables and 1 scoring function,
denoted score9_LogD, along with the molecule’s mode of delivery (oral versus non-oral).
Note that score9_LogD is a continuous variable with range 0 to 9. It comprises the 4 traditional
parameters of the rule of five (Ro5) (Lipinski, 2016) (Table 1), plus 4 extra parameters (PSA, number of
rotatable bonds, number of rings, number of N and O atoms), plus an extra lipophilicity candidate, Log D.
Log D is the distribution coefficient, recently suggested as a possibly preferable predictor of permeation
to Lipinski’s traditional partition coefficient, Log P.
We dichotomise score9_LogD into 2 groups based on the cutpoint of 4 violations:
score9_LogD <=4: a non-violator molecule
score9_LogD >4: a violator (non-druggable) molecule
This is equivalent to:
score9_logD_group <=4 (non-violators) versus score9_logD_group >4 (violators)
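The dichotomisation can be sketched as follows. The assignment itself must be done in SAS; this Python fragment only illustrates the rule, and the function name `score_group` is hypothetical. The scores used are the Score9_logD values of the first 12 sample molecules listed in this assignment.

```python
def score_group(score9_logd):
    """Return (group, code, label) for a score9_LogD value, using cutpoint 4."""
    if not 0 <= score9_logd <= 9:
        raise ValueError("score9_LogD is expected to lie between 0 and 9")
    if score9_logd <= 4:
        return ("<=4", 1, "non-violator")
    return (">4", 2, "violator")


# Score9_logD of the first 12 sample molecules (Drug#Card 114 to 129)
sample_scores = [1, 7, 1, 0, 0, 1, 1, 2, 2, 4, 1, 1]
sample_groups = [score_group(s) for s in sample_scores]

for score, (group, code, label) in zip(sample_scores, sample_groups):
    print(score, group, code, label)
```

Note that a score of exactly 4 falls in the non-violator group, as for molecule 127 in the sample listing.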
Property | Ro5 Lipinski | Hudson’s cutpoint |
Molecular Weight (MW) | ≤ 500 | ≤ 305 |
LogP | ≤ 5 | ≤ 1.9 |
HB donors | ≤ 5 | ≤ 4 |
HB acceptors | ≤ 10 | ≤ 7 |
Polar Surface Area (PSA) | | ≤ 65 |
ROT BONDS | | ≤ 7 |
Number of N,O atoms (NATOM) | | ≤ 40 |
Rings number (NRING) | | ≤ 2 |
Log D | | ≤ 3.5 |
Table 2: values above the cutpoints score a 1.0
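To make the scoring rule concrete, the sketch below counts violations of Hudson’s cutpoints for two of the sample molecules listed in this assignment (Drug#Card 114 and 116). It is an illustrative Python fragment, not SAS; the names `HUDSON_CUTPOINTS` and `count_violations` are hypothetical, and it assumes the rule is applied literally as "value strictly above the cutpoint scores 1".

```python
# Hudson's cutpoints as listed in Table 2 (a value above the cutpoint scores 1)
HUDSON_CUTPOINTS = {
    "MW": 305, "LogP": 1.9, "LogD": 3.5, "Hdonors": 4, "Hacceptors": 7,
    "PSA": 65, "ROT": 7, "NATOM": 40, "NRING": 2,
}


def count_violations(molecule, cutpoints=HUDSON_CUTPOINTS):
    """Count how many of the 9 properties lie strictly above their cutpoint."""
    return sum(1 for name, cut in cutpoints.items() if molecule[name] > cut)


# Two rows copied from the sample listing in this assignment
mol_114 = {"MW": 247.1419, "LogP": -1.2, "LogD": -2.14174, "Hdonors": 3,
           "Hacceptors": 6, "PSA": 126.76, "ROT": 4, "NATOM": 26, "NRING": 1}
mol_116 = {"MW": 445.4292, "LogP": -2.7, "LogD": -3.28938, "Hdonors": 8,
           "Hacceptors": 12, "PSA": 207.27, "ROT": 9, "NATOM": 55, "NRING": 3}

scores = [count_violations(mol_114), count_violations(mol_116)]
print(scores)
```

For these two molecules the counts agree with the Score9_logD column in the sample listing. Molecules sitting exactly on a cutpoint (for example Hdonors = 4) may be handled differently in the supplied column, so treat the supplied Score9_logD values as authoritative.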
Description of the drug bank data set (N = 1,279 molecules)
Column | Description |
Drug#Card | |
MW, LogP, LogD, Hdonors, Hacceptors, PSA, ROT, NATOM, NRING | 9 molecular properties (continuous data) |
Oral#Corrected, oral_status | Oral or non-oral status |
Score9_logD | Score based on Log D, range 0 to 9 |
score9_logD_group | Log D score dichotomised as Cutpoint <=4 or >4 |
Coding of score9_logD_group:
score9_logD_group | score9_logD_group | |
<=4 | 1 | non-violator |
>4 | 2 | violator |
A sample of the first 12 molecules’ data is given below.
Drug#Card | MW | LogP | LogD | Hdonors | Hacceptors | PSA | ROT | NATOM | NRING |
114 | 247.1419 | -1.2 | -2.14174 | 3 | 6 | 126.76 | 4 | 26 | 1 |
116 | 445.4292 | -2.7 | -3.28938 | 8 | 12 | 207.27 | 9 | 55 | 3 |
117 | 155.1546 | -3.4 | -3.76809 | 3 | 4 | 92 | 3 | 20 | 1 |
119 | 88.0621 | -0.5 | 0.065874 | 1 | 3 | 54.37 | 1 | 10 | 0 |
120 | 165.1891 | -1.4 | -1.32103 | 2 | 3 | 63.32 | 3 | 23 | 1 |
121 | 244.311 | 0.5 | 0.319424 | 3 | 4 | 103.73 | 5 | 32 | 2 |
123 | 146.1876 | -2.9 | -3.7566 | 3 | 4 | 89.34 | 5 | 24 | 0 |
125 | 174.201 | -3.6 | -3.68594 | 4 | 6 | 127.72 | 5 | 26 | 0 |
126 | 176.1241 | -0.5 | -1.26274 | 4 | 6 | 107.22 | 2 | 20 | 1 |
127 | 202.3402 | -0.7 | -1.45401 | 4 | 4 | 76.1 | 11 | 40 | 0 |
128 | 133.1027 | -3.7 | -3.63921 | 3 | 5 | 100.62 | 3 | 16 | 0 |
129 | 132.161 | -3.3 | -4.01744 | 3 | 4 | 89.34 | 4 | 21 | 0 |
Drug#Card | Oral#Corrected | oral_status | Score9_logD | score9_logD_group | score9_logD_group | |
114 | 0 | non_oral | 1 | <=4 | 1 | non-violator |
116 | 0 | non_oral | 7 | >4 | 2 | violator |
117 | 1 | oral | 1 | <=4 | 1 | |
119 | 0 | non_oral | 0 | <=4 | 1 | |
120 | 0 | non_oral | 0 | <=4 | 1 | |
121 | 1 | oral | 1 | <=4 | 1 | |
123 | 0 | non_oral | 1 | <=4 | 1 | |
125 | 0 | non_oral | 2 | <=4 | 1 | |
126 | 1 | oral | 2 | <=4 | 1 | |
127 | 0 | non_oral | 4 | <=4 | 1 | |
128 | 0 | non_oral | 1 | <=4 | 1 | |
129 | 1 | oral | 1 | <=4 | 1 |
Resource to use and revise: Hudson’s SAS notes and code as extra notes (Week 8) about plots
you need for your PCA:
Table 2: Graphs Produced by PROC PRINCOMP
ODS Graph Name | Plot Description | Statement and Option |
PaintedScorePlot | Score plot of component i versus component j, painted by component k | PLOTS=SCORE when number of variables |
PatternPlot | Component pattern plot | PLOTS=PATTERN |
PatternProfilePlot | Component pattern profile plot | PLOTS=PATTERNPROFILE |
ScoreMatrixPlot | Matrix plot of component scores | PLOTS=MATRIX |
ScorePlot | Component score plot | PLOTS=SCORE |
ScreePlot | Scree and variance plots | Default and PLOTS=SCREE |
VariancePlot | Variance proportion explained plot | PLOTS=SCREE(UNPACKPANEL) |
Question 1. PCA analysis with 5 plots
Answer the following from your SAS output (be sure to include your code, outputs and justifications).
i. Prepare the dataset for input for a PCA via SAS. (2 marks)
ii. Using SAS, perform a principal component analysis on the correlation matrix for the p=9
variables, on the whole data set of molecules. Show your full SAS code and output. (6 marks)
iii. Also perform the procedures to obtain the following 5 plots from PROC PRINCOMP. (10 marks)
Refer to Irene’s SAS notes for Assignment 2 & Lab for PCA Week 8-9.pdf (sent in Week 8)
• Scree plot
• Profile plot
• Component Pattern plots
• Score plots
• Loading Plots
Using the plots, the SAS notes and your SAS outputs, report and answer the following (justify your
answers).
a) Report the eigenvalues and the eigenvectors. (2 marks)
b) What percentage of the total sample variation is accounted for by each PC, from the first PC
to the ninth PC? (5 marks)
c) What percentage of the total sample variation is accounted for by the first PC to the ninth PC? (1
mark)
d) Write out the formulation for the PCs. (5 marks)
e) Interpret the PCs via eigenvalues. (5 marks)
f) Interpret the PCs using your component pattern profiles from SAS. (4 marks)
g) Can the data be effectively summarised in fewer than 9 dimensions? Justify your answer using
BOTH relevant plots and eigenvalues. (5 marks)
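As a reminder of how the percentages in parts b) and c) relate to the eigenvalues, here is a deliberately small two-variable illustration in Python (the assignment itself requires SAS and all 9 variables). For a 2x2 correlation matrix with off-diagonal r, the eigenvalues are 1 + r and 1 - r, and each PC’s share of the total variation is its eigenvalue divided by the sum of the eigenvalues. The data are the MW and PSA values of the first 12 sample molecules listed in this assignment; the helper name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# MW and PSA of the first 12 sample molecules (Drug#Card 114 to 129)
mw = [247.1419, 445.4292, 155.1546, 88.0621, 165.1891, 244.311,
      146.1876, 174.201, 176.1241, 202.3402, 133.1027, 132.161]
psa = [126.76, 207.27, 92, 54.37, 63.32, 103.73,
       89.34, 127.72, 107.22, 76.1, 100.62, 89.34]

r = pearson_r(mw, psa)

# Eigenvalues of the correlation matrix [[1, r], [r, 1]], largest first
eigenvalues = sorted([1 + r, 1 - r], reverse=True)

# Percentage of total variation per PC, and the running (cumulative) total
percent = [100 * ev / sum(eigenvalues) for ev in eigenvalues]
cumulative = [sum(percent[: i + 1]) for i in range(len(percent))]

for i, (ev, pct, cum) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"PC{i}: eigenvalue={ev:.3f}  percent={pct:.1f}  cumulative={cum:.1f}")
```

With 9 standardised variables the same arithmetic applies, except that the eigenvalues sum to 9 and come from the PROC PRINCOMP output rather than a closed-form expression.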
Question 2: PCA with reduced k < p for plots
Choose the reduced dimensionality k < 9 that you think appropriate for data reduction from 9 to k,
based on your PCA findings in Question 1. Justify your choice of k carefully.
a) Recreate the 5 plots from PROC PRINCOMP for your given k. (5 marks)
b) Using the plots based on your reduced dimensionality k from part a) and outputs, interpret the
first to k-th PCs via eigenvalues. (10 marks)
c) Using the plots based on your reduced dimensionality k from part a) and outputs, interpret the
first to k-th PCs via the outputs (you choose the optimal k). (10 marks)
d) Which of the k PCs are skewed? Use your plots to answer this. (5 marks)
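Two common heuristics for choosing k can be sketched as below. Neither is prescribed by this assignment: the eigenvalue-greater-than-1 rule and the 80% cumulative-variance target are assumptions for illustration, and the eigenvalues shown are made-up placeholders that sum to 9, not results from the Drug Bank data. The function names are hypothetical.

```python
def kaiser_k(eigenvalues):
    """Number of components with eigenvalue above 1 (correlation-matrix PCA)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def cumulative_k(eigenvalues, target=0.80):
    """Smallest k whose cumulative proportion of variance reaches the target."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running / total >= target:
            return k
    return len(eigenvalues)


# Placeholder eigenvalues for illustration only (NOT from the assignment data)
hypothetical = [4.1, 1.8, 1.1, 0.8, 0.5, 0.3, 0.2, 0.15, 0.05]

k_kaiser = kaiser_k(hypothetical)
k_cumulative = cumulative_k(hypothetical)
print(k_kaiser, k_cumulative)
```

The two rules need not agree, which is one reason the question asks you to justify k using both the plots and the eigenvalues.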
Question 3: DISCRIM ON 2 GROUPS OF MOLECULES
1. Prepare the dataset for input for a Discriminant analysis via SAS. (1 mark)
2. Generate the means, standard deviations and the variance-covariance matrix of the data for
the violators. (1 mark)
3. Generate the means, standard deviations and the variance-covariance matrix of the data for
the non-violators (1 mark)
4. Produce the correlation matrix and an associated scatterplot of the inputted data for the
violators. (1 mark)
5. Produce the correlation matrix and an associated scatterplot of the inputted data for the
non-violators. (1 mark)
6. Using SAS PROC DISCRIM and your resultant outputs, answer the following questions. Use
priors “violators”=0.30 “non-violators”=0.70. (10 marks)
7. Is Σ1 = Σ2? Justify your answer. (5 marks)
8. How is a molecule with X0^T = (MW, LogP, LogD, Hdonors, Hacceptors, PSA, ROT,
NATOM, NRING) = (445.429, -2.7, -3.28938, 8, 12, 207.27, 9, 55, 3) allocated? That is,
allocate it to either the violators or the non-violators group. (5 marks)
9. Write down the resultant confusion matrix. (5 marks)
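The role of the priors and the layout of a confusion matrix can be illustrated with a single-predictor linear discriminant rule. This is a Python sketch under stated assumptions, not the SAS analysis: the group means, the pooled variance and the five test values are hypothetical, and only the priors 0.30 and 0.70 come from the question. Each group’s linear score is (mean / variance) * x - mean^2 / (2 * variance) + ln(prior), and a molecule is allocated to the group with the larger score.

```python
import math
from collections import Counter


def linear_score(x, mean, pooled_var, prior):
    """Linear discriminant score of one group for a single predictor x."""
    return (mean / pooled_var) * x - 0.5 * mean * mean / pooled_var + math.log(prior)


def allocate(x, groups, pooled_var):
    """Return the label of the group with the largest linear score."""
    return max(
        groups,
        key=lambda g: linear_score(x, groups[g]["mean"], pooled_var, groups[g]["prior"]),
    )


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs."""
    return Counter(zip(actual, predicted))


# Hypothetical one-variable summary; NOT estimates from the Drug Bank data
groups = {
    "violator": {"mean": 6.0, "prior": 0.30},
    "non-violator": {"mean": 2.0, "prior": 0.70},
}
pooled_var = 4.0

xs = [1.0, 3.0, 4.0, 5.0, 7.0]
actual = ["non-violator", "non-violator", "violator", "violator", "violator"]
predicted = [allocate(x, groups, pooled_var) for x in xs]
cm = confusion_matrix(actual, predicted)

for (a, p), count in sorted(cm.items()):
    print(f"actual={a:13s} predicted={p:13s} count={count}")
```

With equal priors the boundary in this toy example would sit at the midpoint of the two means (x = 4); the larger prior for non-violators pushes it towards the violator mean, so x = 4 is allocated to the non-violators.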
Question 4: STEPWISE DISCRIM ON 4 GROUPS OF MOLECULES
STEPWISE DISCRIM using the oral by violatory status groups defined below.
1. For Question 4 you will need to create the following variable, i.e. an interaction term between
oral status and score9_LogD violation status, at 4 levels as defined below: (3 marks)
oral_score | Oral status by violatory status |
1 | oral_violator |
2 | oral_nonviolator |
3 | nonoral_violator |
4 | nonoral_nonviolator |
2. Crosstabulate, in SAS or otherwise, oral by violatory status for the whole group. How many
molecules are in each of these 4 levels? Create a table or histogram. (2 marks)
3. Run a STEPWISE DISCRIM analysis using the above 4 level grouping variable. (20
marks)
4. Which variables best discriminate the 4 oral by violatory groups/classes? See notes on
STEPDISC below and extra SAS notes (Week 10). (10 marks)
5. Write a clear description of your conclusions, and include the SAS code and outputs. (10
marks)
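The 4-level grouping variable and its crosstabulation can be sketched as follows. This is an illustrative Python fragment (the assignment expects SAS); the function name `oral_score` mirrors the variable name in the table above, and the rows are the oral_status and group code of the first 12 sample molecules listed in this assignment, not the full data set.

```python
from collections import Counter

ORAL_SCORE_LABELS = {
    1: "oral_violator",
    2: "oral_nonviolator",
    3: "nonoral_violator",
    4: "nonoral_nonviolator",
}


def oral_score(oral_status, group_code):
    """Combine oral_status ('oral'/'non_oral') and group code (1 or 2) into 4 levels."""
    if oral_status not in ("oral", "non_oral"):
        raise ValueError(f"unexpected oral_status: {oral_status!r}")
    if group_code not in (1, 2):
        raise ValueError(f"unexpected group code: {group_code!r}")
    is_violator = group_code == 2  # code 2 means score9_LogD > 4
    if oral_status == "oral":
        return 1 if is_violator else 2
    return 3 if is_violator else 4


# (oral_status, score9_logD_group code) for Drug#Card 114 to 129
sample = [
    ("non_oral", 1), ("non_oral", 2), ("oral", 1), ("non_oral", 1),
    ("non_oral", 1), ("oral", 1), ("non_oral", 1), ("non_oral", 1),
    ("oral", 1), ("non_oral", 1), ("non_oral", 1), ("oral", 1),
]

counts = Counter(oral_score(status, code) for status, code in sample)
for level in sorted(ORAL_SCORE_LABELS):
    print(level, ORAL_SCORE_LABELS[level], counts[level])
```

The counts printed here describe only the 12 sample rows; the crosstabulation asked for in part 2 must be run on all 1,279 molecules.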
1.1. Overview: STEPDISC Procedure
Given a classification variable and several quantitative variables, the STEPDISC procedure performs a
stepwise discriminant analysis to select a subset of the quantitative variables for use in discriminating
among the classes. The set of variables that make up each class is assumed to be multivariate normal
with a common covariance matrix. The STEPDISC procedure can use forward selection, backward
elimination, or stepwise selection. The STEPDISC procedure is a useful prelude to further analyses with
the DISCRIM procedure.
With PROC STEPDISC, variables are chosen to enter or leave the model according to one of two
criteria:
• the significance level of an F test from an analysis of covariance, where the variables already
chosen act as covariates and the variable under consideration is the dependent variable
• the squared partial correlation for predicting the variable under consideration from the CLASS
variable, controlling for the effects of the variables already selected for the model
Forward selection begins with no variables in the model. At each step, PROC STEPDISC enters the
variable that contributes most to the discriminatory power of the model as measured by Wilks’ lambda,
the likelihood ratio criterion. When none of the unselected variables meet the entry criterion, the forward
selection process stops.
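The first step of forward selection can be illustrated with a one-variable Wilks’ lambda, which for a single variable is the within-group sum of squares divided by the total sum of squares; the variable with the smallest lambda separates the classes best. This Python sketch uses a made-up toy data set and hypothetical names, omits the entry-criterion check, and does not cover later steps, where PROC STEPDISC conditions on the variables already in the model.

```python
def wilks_lambda_1d(values, labels):
    """Wilks' lambda for one variable: within-group SS divided by total SS."""
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = 0.0
    for group in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == group]
        group_mean = sum(members) / len(members)
        within_ss += sum((v - group_mean) ** 2 for v in members)
    return within_ss / total_ss


# Toy data for illustration only: two classes, two candidate variables
labels = [1, 1, 2, 2]
toy = {
    "A": [1.0, 2.0, 5.0, 6.0],  # classes well separated
    "B": [1.0, 5.0, 2.0, 6.0],  # classes overlap heavily
}

lambdas = {name: wilks_lambda_1d(vals, labels) for name, vals in toy.items()}
first_entered = min(lambdas, key=lambdas.get)
print(lambdas, first_entered)
```

A lambda near 0 means almost all variation is between classes, while a lambda near 1 means the variable barely distinguishes them.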
Backward elimination begins with all variables in the model except those that are linearly dependent on
previous variables in the VAR statement. At each step, the variable that contributes least to the
discriminatory power of the model as measured by Wilks’ lambda is removed. When all remaining
variables meet the criterion to stay in the model, the backward elimination process stops.
Stepwise selection begins, like forward selection, with no variables in the model. At each step, the model
is examined. If the variable in the model that contributes least to the discriminatory power of the model
as measured by Wilks’ lambda fails to meet the criterion to stay, then that variable is removed.
Otherwise, the variable not in the model that contributes most to the discriminatory power of the model
is entered. When all variables in the model meet the criterion to stay and none of the other variables
meet the criterion to enter, the stepwise selection process stops. Stepwise selection is the default method
of variable selection.
It is important to realize that, in the selection of variables for entry, only one variable can be entered into
the model at each step. The selection process does not take into account the relationships between
variables that have not yet been selected. Thus, some important variables could be excluded in the
process. Also, Wilks’ lambda might not be the best measure of discriminatory power for your
application. However, if you use PROC STEPDISC carefully, in combination with your knowledge of
the data and careful cross validation, it can be a valuable aid in selecting variables for a discrimination
model.
As with any stepwise procedure, it is important to remember that when many significance tests are
performed, each at a level of, for example, 5% (0.05), the overall probability of rejecting at least one true
null hypothesis is much larger than 5%. If you want to prevent including any variables that do not
contribute to the discriminatory power of the model in the population, you should specify a very small
significance level. In most applications, all variables considered have some discriminatory power,
however small. To choose the model that provides the best discrimination by using the sample estimates,
you need only to guard against estimating more parameters than can be reliably estimated with the given
sample size.
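The size of the multiple-testing problem can be gauged with a short calculation. As a simplifying assumption (the tests in a stepwise procedure are not actually independent), suppose m independent tests are each run at level alpha with every null hypothesis true; the chance of at least one false rejection is then 1 - (1 - alpha)^m. The function name below is hypothetical.

```python
def prob_at_least_one_false_rejection(alpha, n_tests):
    """Chance of one or more false rejections across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Nine tests at the 5% level, one per candidate variable
fwer_9 = prob_at_least_one_false_rejection(0.05, 9)
print(f"{fwer_9:.3f}")
```

For 9 tests at the 5% level this gives roughly 37%, which is why a much smaller significance level is suggested when the aim is to exclude variables with no real discriminatory power.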
The significance level and the squared partial correlation criteria select variables in the same order,
although they might select different numbers of variables. Increasing the sample size tends to increase
the number of variables selected when you are using significance levels, but it has little effect on the
number selected by using squared partial correlations.
RUBRIC
Question | Marks poss. | Marks gained | Reason for marks lost | Marks lost |
Q1 | 45 marks | ||
(i) | 2 | ||
(ii) | 6 | ||
(iii) | 10 | ||
a) | 2 | ||
b) | 5 | ||
c) | 1 | ||
d) | 5 | ||
e) | 5 | ||
f) | 4 | ||
g) | 5 | ||
Q2 | 30 marks | ||
a) | 5 | ||
b) | 10 | ||
c) | 10 | ||
d) | 5 | ||
Q3 | 30 marks | ||
1 | 1 | ||
2 | 1 | ||
3 | 1 | ||
4 | 1 | ||
5 | 1 | ||
6 | 10 | ||
7 | 5 | ||
8 | 5 | ||
9 | 5 | ||
Q4 | 45 marks | ||
1 | 3 | ||
2 | 2 | ||
3 | 20 | ||
4 | 10 | ||
5 | 10 |