
Assignment 2 MATH1309

Worth: 40%
Due date: June 7th 2020, 11.59pm
Put your SAS code, output and answers into ONE pdf only, and submit that pdf in Canvas.
Late submissions will incur a penalty of 10% per day.
Data file: MATH1309 Drug Bank DATA for Assignment2.xlsx
The rubric is at the end of the Assignment.
Resources (uploaded to Canvas for Assignment 2, with guidance for SAS):
• SAS notes for Assignment 2 Iris DA and STEPDISCRIM.pdf (Week 10)
• Irene’s SAS notes for Assignment 2 & Lab for PCA Week 8-9.pdf (sent in Week 8)
Table 2: Graphs Produced by PROC PRINCOMP (PCA)

ODS Graph Name       Plot Description                         Statement and Option
PaintedScorePlot     Score plot of component i versus         PLOTS=SCORE when number of variables
                     component j, painted by component k
PatternPlot          Component pattern plot                   PLOTS=PATTERN
PatternProfilePlot   Component pattern profile plot           PLOTS=PATTERNPROFILE
ScoreMatrixPlot      Matrix plot of component scores          PLOTS=MATRIX
ScorePlot            Component score plot                     PLOTS=SCORE
ScreePlot            Scree and variance plots                 Default and PLOTS=SCREE
VariancePlot         Variance proportion explained plot       PLOTS=SCREE(UNPACKPANEL)

HINTS AND NOTES FOR INTERPRETING THE DISCRIM OUTPUT IN SAS:
• By including pool=test, SAS will decide what kind of discriminant analysis to carry out based
on the result of Bartlett’s test for equality of the group variance-covariance matrices.
• If the test fails to reject, then SAS will automatically do a linear discriminant analysis (LDF).
• If the test rejects, then SAS will do a quadratic discriminant analysis (QDF).
• There are two other options. If we put pool=yes then SAS will conduct a linear discriminant
analysis whether it is warranted or not. It will pool the variance-covariance matrices of the 2
classes/groups and do a linear discriminant analysis without reporting Bartlett’s test.
• If we put pool=no then SAS will use the separate within-group variance-covariance matrices and
do a quadratic discriminant analysis, again without carrying out the test.
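The decision logic behind the pool= option can be sketched in a few lines. This is only an illustration in Python, not SAS code: the function name choose_discriminant is hypothetical, pool=no is assumed to be the remaining option (separate covariance matrices, quadratic analysis), and the 0.10 significance level is an assumed default that you should check against your own SAS output.

```python
def choose_discriminant(pool, p_value=None, slpool=0.10):
    """Return 'linear' or 'quadratic' following the pool= logic in the hints.

    pool    : 'test', 'yes' or 'no'
    p_value : p-value of the homogeneity (Bartlett-type) test, needed for 'test'
    slpool  : assumed significance level for the test (illustrative value)
    """
    pool = pool.lower()
    if pool == "yes":
        return "linear"       # pooled covariance matrix, no test
    if pool == "no":
        return "quadratic"    # separate covariance matrices, no test
    if pool == "test":
        if p_value is None:
            raise ValueError("pool='test' needs the p-value of the homogeneity test")
        # Rejecting equality of covariance matrices leads to the quadratic rule.
        return "quadratic" if p_value < slpool else "linear"
    raise ValueError(f"unknown pool option: {pool!r}")


print(choose_discriminant("test", p_value=0.50))   # test fails to reject
print(choose_discriminant("test", p_value=0.001))  # test rejects
```

In the SAS output the same information appears as the chi-square statistic and p-value of the homogeneity test, followed by a note on which covariance matrices were used.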
MATH1309 ASSIGNMENT 2 OVERVIEW
Assessing Druggability in drug discovery: A Bioinformatics study
Description
(refer to MATH1309 Drug Bank DATA for Assignment2.xlsx)
• Drug-likeness is not a precisely defined concept in drug discovery. Predicting druggability is of
high practical relevance in pharmaceutical research. In vitro absorption, distribution, metabolism
and elimination (ADME) assays are now being conducted throughout the drug discovery process,
but there is still a need to develop faster and better analytic methods to enhance the ‘developability’
of drug leads, and to formalise strategies for ADME assessment of good molecular candidates in
the drug discovery and pre-clinical stages.
• This study involves data on 1,279 small molecules retrieved from the DrugBank 3.0 database, a unique
chem-informatics resource, analysed by Hudson et al. (2014, 2017, 2019, 2020).
• The data set contains 9 physico-chemical variables (MW, PSA, Log P, Log D, etc.), and the
molecule’s mode of delivery (oral versus non-oral). See Table 1 below.

Molecular Weight (MW)
LogP
HB donors
HB acceptors
Polar Surface Area (PSA)
ROT BONDS
Number of N,O atoms (NATOM)
Rings number (NRING)
Log D

Table 1
In addition, the data set contains new druggability rules (score functions counting up violations for each
molecule on each of the 9 variables) developed by Hudson et al. These account for the molecule’s size,
permeability etc., but use new cutpoints for each of the 9 molecular parameters (Table 2), different to those
conventionally used by the FDA (Lipinski’s rule, Table 2).
Work by Hudson et al. based on the 9 molecular variables found distinct clusters of the molecules identified
as “poor” versus “good” druggables. The data set contains the 9 ADME variables and 1 scoring function,
denoted score9_LogD, along with the molecule’s mode of delivery (oral versus non-oral).
Note that the function score9_LogD is a continuous variable with range 0 to 9. It comprises the 4 traditional
parameters of the rule of five (Ro5) (Lipinski, 2016) (Table 1) plus 4 extra parameters (PSA, number of
rotatable bonds, rings, N and O atoms), together with an extra lipophilicity candidate, Log P or Log D. The
latter is the distribution coefficient, recently suggested as a possibly preferable predictor of permeation to
Lipinski’s traditional partition coefficient, Log P.
We dichotomise score9_LogD into 2 groups based on the cutpoint of 4 violations:
Score <=4: a non-violator molecule
Score >4: a violator (non-druggable) molecule
This is equivalent to:
score9_logD_group <=4 (non-violators) versus score9_logD_group >4 (violators)
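The dichotomisation at the cutpoint of 4 can be sketched as follows. This is an illustrative Python version only (the assignment itself is to be done in SAS); the function name violation_group is hypothetical, and the scores are the Score9_logD values of four molecules from the sample listing later in this document.

```python
def violation_group(score, cutpoint=4):
    """Return (code, label): 1 / non-violator if score <= cutpoint, else 2 / violator."""
    if score <= cutpoint:
        return 1, "non-violator"
    return 2, "violator"


# Score9_logD for a few molecules in the sample listing (Drug#Card: score)
sample_scores = {114: 1, 116: 7, 117: 1, 127: 4}

groups = {card: violation_group(score) for card, score in sample_scores.items()}
for card, (code, label) in groups.items():
    print(card, code, label)
```

Note that a score exactly equal to 4 (molecule 127) falls in the non-violator group, because the rule is "<=4" versus ">4".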

Table 2
Property                        Ro5 Lipinski    Hudson’s cutpoint
Molecular Weight (MW)           ≤ 500           ≤ 305
LogP                            ≤ 5             ≤ 1.9
HB donors                       ≤ 5             ≤ 4
HB acceptors                    ≤ 10            ≤ 7
Polar Surface Area (PSA)                        ≤ 65
ROT BONDS                                       ≤ 7
Number of N,O atoms (NATOM)                     ≤ 40
Rings number (NRING)                            ≤ 2
Log D                                           ≤ 3.5

Table 2: values above the cutpoints score a 1.0
Description of the drug bank data set (N = 1,279 molecules)

Column                                  Description
Drug#Card
MW, LogP, LogD, Hdonors, Hacceptors,    9 molecular properties (continuous data)
PSA, ROT, NATOM, NRING
Oral#Corrected, oral_status             Oral or non-oral status
Score9_logD                             Score based on Log D, range 0 to 9
score9_logD_group                       Log D score dichotomised at the cutpoint: <=4 or >4

score9_logD_group    Code    Meaning
<=4                  1       non-violator
>4                   2       violator

A sample of the first 12 molecules’ data is given below.

Drug#Card MW LogP LogD Hdonors Hacceptors PSA ROT NATOM NRING
114 247.1419 -1.2 -2.14174 3 6 126.76 4 26 1
116 445.4292 -2.7 -3.28938 8 12 207.27 9 55 3
117 155.1546 -3.4 -3.76809 3 4 92 3 20 1
119 88.0621 -0.5 0.065874 1 3 54.37 1 10 0
120 165.1891 -1.4 -1.32103 2 3 63.32 3 23 1
121 244.311 0.5 0.319424 3 4 103.73 5 32 2
123 146.1876 -2.9 -3.7566 3 4 89.34 5 24 0
125 174.201 -3.6 -3.68594 4 6 127.72 5 26 0
126 176.1241 -0.5 -1.26274 4 6 107.22 2 20 1
127 202.3402 -0.7 -1.45401 4 4 76.1 11 40 0
128 133.1027 -3.7 -3.63921 3 5 100.62 3 16 0
129 132.161 -3.3 -4.01744 3 4 89.34 4 21 0
Drug#Card  Oral#Corrected  oral_status  Score9_logD  score9_logD_group  score9_logD_group
114        0               non_oral     1            <=4                1 non-violator
116        0               non_oral     7            >4                 2 violator
117        1               oral         1            <=4                1
119        0               non_oral     0            <=4                1
120        0               non_oral     0            <=4                1
121        1               oral         1            <=4                1
123        0               non_oral     1            <=4                1
125        0               non_oral     2            <=4                1
126        1               oral         2            <=4                1
127        0               non_oral     4            <=4                1
128        0               non_oral     1            <=4                1
129        1               oral         1            <=4                1

Resource to use and revise: Hudson’s SAS notes and code as extra notes (Week 8) about the plots
you need for your PCA. See Table 2: Graphs Produced by PROC PRINCOMP (PCA), listed with the
resources above.

Question 1. PCA analysis with 5 plots
Answer the following from your SAS output (be sure to include your code, outputs and justifications).
i. Prepare the dataset for input to a PCA via SAS. (2 marks)
ii. Perform a principal component analysis using SAS on the correlation matrix for the p=9
variables, using the whole data set of molecules. Show your full SAS code and output. (6 marks)
iii. Also perform the procedures to obtain the following 5 plots from PROC PRINCOMP.
Refer to Irene’s SAS notes for Assignment 2 & Lab for PCA Week 8-9.pdf (sent in Week 8)
• Scree plot
• Profile plot
• Component Pattern plots
• Score plots
• Loading Plots
Using the plots, the SAS notes and your SAS outputs, report and answer the following (justify your
answers).
a) Report the eigenvalues and the eigenvectors. (2 marks)
b) What percentage of the total sample variation is accounted for by each PC, from the first PC
to the ninth PC? (5 marks)
c) What percentage of the total sample variation is accounted for by the first PC to the ninth PC? (1
mark)
d) Write out the formulation for the PCs. (5 marks)
e) Interpret the PCs via eigenvalues. (5 marks)
f) Interpret the PCs using your component pattern profiles from SAS. (4 marks)
g) Can the data be effectively summarised in fewer than 9 dimensions? Justify your answer using
BOTH relevant plots and eigenvalues. (5 marks)
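For parts b), c) and g), the arithmetic linking eigenvalues to percentages of variation can be sketched as below. This is an illustrative Python sketch, not SAS output: the eigenvalues are hypothetical numbers chosen only so that they sum to p = 9 (as they must for a PCA on the correlation matrix of 9 variables), and the "eigenvalue greater than 1" count is one common rule of thumb, assumed here rather than prescribed by the assignment.

```python
def variance_explained(eigenvalues):
    """Return (proportions, cumulative proportions) of total variance per PC."""
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for prop in proportions:
        running += prop
        cumulative.append(running)
    return proportions, cumulative


def count_eigenvalues_above_one(eigenvalues):
    """Number of PCs with eigenvalue > 1 (a common rule of thumb for choosing k)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


# HYPOTHETICAL eigenvalues for illustration only; replace with your PROC PRINCOMP output.
hypothetical_eigenvalues = [4.5, 1.8, 1.2, 0.6, 0.4, 0.2, 0.15, 0.1, 0.05]

proportions, cumulative = variance_explained(hypothetical_eigenvalues)
for i, (prop, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: {100 * prop:5.1f}%  cumulative {100 * cum:5.1f}%")
print("PCs with eigenvalue > 1:", count_eigenvalues_above_one(hypothetical_eigenvalues))
```

PROC PRINCOMP reports the same quantities in the "Proportion" and "Cumulative" columns of its eigenvalue table, so this calculation is mainly a check on your reading of that table.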
Question 2: PCA with reduced k < p for plots
Choose the reduced dimensionality k < 9 that you think appropriate for data reduction from 9 to k,
based on your PCA findings in Question 1. Justify your choice of k carefully.
a) Recreate the 5 plots from PROC PRINCOMP for your chosen k. (5 marks)
b) Using the plots based on your reduced dimensionality k from part a) and your outputs, interpret the
first k PCs via eigenvalues. (10 marks)
c) Using the plots based on your reduced dimensionality k from part a) and your outputs, interpret the
first k PCs via the outputs (you choose the optimal k). (10 marks)
d) Which of the k PCs are skewed? Use your plots to answer this. (5 marks)
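Part d) asks for a judgement from the plots, but a numerical skewness coefficient for each set of PC scores can serve as a cross-check. The sketch below is an assumption-laden illustration in Python: the function name sample_skewness and the score values are hypothetical, and it uses the simple moment-based coefficient m3 / m2^1.5, which may differ slightly from the bias-adjusted skewness that SAS reports.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# HYPOTHETICAL PC scores, for illustration only.
symmetric_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed_scores = [0.0, 0.0, 0.0, 0.0, 10.0]

print(sample_skewness(symmetric_scores))
print(sample_skewness(right_skewed_scores))
```

A value near 0 suggests a roughly symmetric score distribution; a clearly positive or negative value points to a long right or left tail, which should also be visible in the score plots.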
Question 3: DISCRIM ON 2 GROUPS OF MOLECULES
1. Prepare the dataset for input for a Discriminant analysis via SAS. (1 mark)
2. Generate the means, standard deviations and the variance-covariance matrix of the data for
the violators. (1 mark)
3. Generate the means, standard deviations and the variance-covariance matrix of the data for
the non-violators (1 mark)
4. Produce the correlation matrix and an associated scatterplot of the inputted data for the
violators. (1 mark)
5. Produce the correlation matrix and an associated scatterplot of the inputted data for the
non-violators. (1 mark)
6. Using SAS PROC DISCRIM and your resultant outputs, answer the following questions. Use
priors “violators”=0.30 “non-violators”=0.70. (10 marks)
7. Is Σ1 = Σ2? Justify your answer. (5 marks)
8. How is a molecule with X0^T = (MW, LogP, LogD, Hdonors, Hacceptors, PSA, ROT,
NATOM, NRING) = (445.429, -2.7, -3.28938, 8, 12, 207.27, 9, 55, 3) allocated? i.e.
allocate it to either the violators or the non-violators group. (5 marks)
9. Write down the resultant confusion matrix. (5 marks)
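The roles of the priors (item 6), the allocation (item 8) and the confusion matrix (item 9) can be sketched as follows. This is an illustrative Python sketch, not PROC DISCRIM itself: it assumes you already have the log of each group's density at a point, combines it with the stated priors by Bayes' rule, and allocates to the group with the larger posterior. The function names and all numeric inputs other than the priors are hypothetical.

```python
import math
from collections import Counter

PRIORS = {"violator": 0.30, "non-violator": 0.70}  # priors stated in item 6


def posteriors(log_densities, priors=PRIORS):
    """Posterior probability of each group: prior * density, normalised to sum to 1."""
    log_joint = {g: math.log(priors[g]) + log_densities[g] for g in priors}
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp(v - top) for g, v in log_joint.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def allocate(log_densities, priors=PRIORS):
    """Allocate to the group with the largest posterior probability."""
    post = posteriors(log_densities, priors)
    return max(post, key=post.get)


def confusion_matrix(actual, predicted):
    """Count (actual group, allocated group) pairs."""
    return Counter(zip(actual, predicted))


# HYPOTHETICAL log densities: equal densities, so the priors decide the allocation.
print(allocate({"violator": 0.0, "non-violator": 0.0}))

# HYPOTHETICAL actual and allocated labels for five molecules.
actual = ["violator", "violator", "non-violator", "non-violator", "non-violator"]
predicted = ["violator", "non-violator", "non-violator", "non-violator", "violator"]
table = confusion_matrix(actual, predicted)
for pair, count in sorted(table.items()):
    print(pair, count)
```

In the SAS output the corresponding pieces are the posterior probabilities of membership for each observation and the "Number of Observations and Percent Classified" table.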
Question 4: STEPWISE DISCRIM ON 4 GROUPS OF MOLECULES
STEPWISE DISCRIM using the oral by violatory status groups defined below.
1. For Question 4 you will need to create the following variable, i.e. an interaction term between
oral status and score9_LogD violation status at 4 levels as defined below: (3 marks)

oral_score    Oral status by violatory status
1             oral_violator
2             oral_nonviolator
3             nonoral_violator
4             nonoral_nonviolator

2. Crosstabulate in SAS or otherwise oral by violatory status for the whole group. How many
molecules in each of these 4 levels? Create a table or histogram. (2 marks)
3. Run a STEPWISE DISCRIM analysis using the above 4 level grouping variable. (20
marks)
4. Which variables best discriminate the 4 oral by violatory groups/classes? See notes on
STEPDISC below and extra SAS notes (Week 10). (10 marks)
5. Write a clear description of your conclusions; include the SAS code and outputs. (10
marks)
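Items 1 and 2 (building the 4-level variable and cross-tabulating it) can be sketched as follows. This is an illustrative Python version only, since the assignment itself is to be done in SAS; the function name oral_score_level is hypothetical, and the rows are the 12 sample molecules listed earlier, so the counts shown are for that sample and not for the full data set of 1,279 molecules.

```python
from collections import Counter

# (oral_status, violation status) -> (oral_score code, label), as defined in the table above
LEVELS = {
    ("oral", "violator"): (1, "oral_violator"),
    ("oral", "non-violator"): (2, "oral_nonviolator"),
    ("non_oral", "violator"): (3, "nonoral_violator"),
    ("non_oral", "non-violator"): (4, "nonoral_nonviolator"),
}


def oral_score_level(oral_status, group_code):
    """Map oral_status and score9_logD_group code (1 or 2) to the 4-level variable."""
    if group_code not in (1, 2):
        raise ValueError("group_code must be 1 (non-violator) or 2 (violator)")
    violation = "non-violator" if group_code == 1 else "violator"
    return LEVELS[(oral_status, violation)]


# (Drug#Card, oral_status, score9_logD_group code) for the 12 sample molecules
sample_rows = [
    (114, "non_oral", 1), (116, "non_oral", 2), (117, "oral", 1), (119, "non_oral", 1),
    (120, "non_oral", 1), (121, "oral", 1), (123, "non_oral", 1), (125, "non_oral", 1),
    (126, "oral", 1), (127, "non_oral", 1), (128, "non_oral", 1), (129, "oral", 1),
]

counts = Counter(oral_score_level(status, code)[0] for _, status, code in sample_rows)
for level in (1, 2, 3, 4):
    print(level, counts[level])
```

For the sample rows, no molecule falls in level 1 (oral_violator); the full data set will give different counts.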
1.1. Overview: STEPDISC Procedure
Given a classification variable and several quantitative variables, the STEPDISC procedure performs a
stepwise discriminant analysis to select a subset of the quantitative variables for use in discriminating
among the classes. The set of variables that make up each class is assumed to be multivariate normal
with a common covariance matrix. The STEPDISC procedure can use forward selection, backward
elimination, or stepwise selection. The STEPDISC procedure is a useful prelude to further analyses with
the DISCRIM procedure.
With PROC STEPDISC, variables are chosen to enter or leave the model according to one of two
criteria:
• the significance level of an F test from an analysis of covariance, where the variables already
chosen act as covariates and the variable under consideration is the dependent variable
• the squared partial correlation for predicting the variable under consideration from the CLASS
variable, controlling for the effects of the variables already selected for the model
Forward selection begins with no variables in the model. At each step, PROC STEPDISC enters the
variable that contributes most to the discriminatory power of the model as measured by Wilks’ lambda,
the likelihood ratio criterion. When none of the unselected variables meet the entry criterion, the forward
selection process stops.
Backward elimination begins with all variables in the model except those that are linearly dependent on
previous variables in the VAR statement. At each step, the variable that contributes least to the
discriminatory power of the model as measured by Wilks’ lambda is removed. When all remaining
variables meet the criterion to stay in the model, the backward elimination process stops.
Stepwise selection begins, like forward selection, with no variables in the model. At each step, the model
is examined. If the variable in the model that contributes least to the discriminatory power of the model
as measured by Wilks’ lambda fails to meet the criterion to stay, then that variable is removed.
Otherwise, the variable not in the model that contributes most to the discriminatory power of the model
is entered. When all variables in the model meet the criterion to stay and none of the other variables
meet the criterion to enter, the stepwise selection process stops. Stepwise selection is the default method
of variable selection.
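The stepwise loop described above can be sketched as follows. This is a simplified Python illustration, not PROC STEPDISC: instead of computing Wilks’ lambda, it takes a hypothetical callback p_value(variable, others) that stands in for the significance of a variable given the other variables in the model, and the 0.15 entry and stay levels are illustrative thresholds.

```python
def stepwise_select(candidates, p_value, sl_entry=0.15, sl_stay=0.15, max_steps=100):
    """Stepwise selection: try to remove the weakest variable, otherwise enter the strongest.

    p_value(var, others) is a HYPOTHETICAL callback returning the significance of
    `var` given the variables in `others`; smaller means more discriminatory power.
    """
    model = []
    for _ in range(max_steps):  # guard against cycling
        # Removal step: does the weakest variable in the model still meet the stay criterion?
        if model:
            def stay_p(var):
                return p_value(var, [m for m in model if m != var])
            weakest = max(model, key=stay_p)
            if stay_p(weakest) > sl_stay:
                model.remove(weakest)
                continue
        # Entry step: does the strongest variable outside the model meet the entry criterion?
        outside = [v for v in candidates if v not in model]
        if not outside:
            break
        strongest = min(outside, key=lambda v: p_value(v, model))
        if p_value(strongest, model) > sl_entry:
            break  # nothing to remove and nothing to enter: stop
        model.append(strongest)
    return model


# HYPOTHETICAL p-values that ignore the current model, for illustration only.
toy_p = {"PSA": 0.001, "MW": 0.01, "NRING": 0.40}
selected = stepwise_select(list(toy_p), lambda var, others: toy_p[var])
print(selected)
```

Only one variable enters or leaves per step, which is why the order of selection in the STEPDISC summary table is informative when answering which variables discriminate best.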
It is important to realize that, in the selection of variables for entry, only one variable can be entered into
the model at each step. The selection process does not take into account the relationships between
variables that have not yet been selected. Thus, some important variables could be excluded in the
process. Also, Wilks’ lambda might not be the best measure of discriminatory power for your
application. However, if you use PROC STEPDISC carefully, in combination with your knowledge of
the data and careful cross validation, it can be a valuable aid in selecting variables for a discrimination
model.
As with any stepwise procedure, it is important to remember that when many significance tests are
performed, each at a level of, for example, 5% (0.05), the overall probability of rejecting at least one true
null hypothesis is much larger than 5%. If you want to prevent including any variables that do not
contribute to the discriminatory power of the model in the population, you should specify a very small
significance level. In most applications, all variables considered have some discriminatory power,
however small. To choose the model that provides the best discrimination by using the sample estimates,
you need only to guard against estimating more parameters than can be reliably estimated with the given
sample size.
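The point about many significance tests can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that the tests are independent (which is not true of the tests in a stepwise procedure), so the numbers are indicative only; the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false rejection) for n independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 9):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests} tests at 5%: overall rate about {rate:.3f}")
```

With 9 tests at the 5% level, one for each candidate variable, the overall rate under this assumption is already about 0.37, which is why a small significance level is recommended if false inclusions must be avoided.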
The significance level and the squared partial correlation criteria select variables in the same order,
although they might select different numbers of variables. Increasing the sample size tends to increase
the number of variables selected when you are using significance levels, but it has little effect on the
number selected by using squared partial correlations.
RUBRIC

          Marks poss.    Marks gained    Reason for marks lost    Marks lost
Q1        45 marks
(i)       2
(ii)      6
(iii)     10
a)        2
b)        5
c)        1
d)        5
e)        5
f)        4
g)        5
Q2        30 marks
a)        5
b)        10
c)        10
d)        5
Q3        30 marks
1         1
2         1
3         1
4         1
5         1
6         10
7         5
8         5
9         5
Q4        45 marks
1         3
2         2
3         20
4         10
5         10
