Assignment Rules: Please read carefully!
- Assignments are to be treated as “limited open-computer” take-home exams. That is, you must not discuss your assignment solutions with anyone else (including your classmates, paid/unpaid tutors, friends, parents, relatives, etc.) and the submission you make must be your own work. In addition, no member of the teaching team will assist you with any issues that are directly related to your assignment solutions.
- For other assignment Codes of Conduct, please refer to this web page on Canvas:
https://rmit.instructure.com/courses/67061/pages/assignments-summary-purpose-code-of-conduct-and-assessment-criteria
- You must document all your work in Jupyter notebook format. Please submit one Jupyter notebook file and one HTML file per question. Specifically, you must upload the following 4 files for this assignment:
- StudentID_A3_Q1.html (example: s1234567_A3_Q1.html)
- StudentID_A3_Q1.ipynb
- StudentID_A3_Q2_AUC.html (here, AUC needs to be the highest AUC you can get for Q2; example: s1234567_A3_Q2_0.632.html)
- StudentID_A3_Q2_AUC.ipynb
- Please put your Honour Code at the top of your answer to the first question. At least one of your HTML files must contain the Honour Code.
- Please make sure your online submission is consistent with the checklist below:
https://rmit.instructure.com/courses/67061/pages/online-submissions-checklist
- For full Assignment Instructions and Summary of Penalties, please see this web page on Canvas:
https://rmit.instructure.com/courses/67061/pages/instructions-for-online-submission-assessments
- Please note that there are penalties for any assignment instruction or specific question instruction that you do not follow.
Programming Language Instructions
You must use Python 3.6 or above throughout this entire Assignment 3. Use of Microsoft Excel is prohibited for any part of any question in this assignment. For plotting, you can use whatever Python module you like.
Question 1
(65 points)
This question is inspired by Exercise 5 in Chapter 6 of the textbook. Our problem is based on the US Census Income Dataset that we have been using in this course. Here, the annual_income target variable is binary: either high_income or low_income. As usual, high income will be the positive class for this problem.
For this question, you will use different variations of the Naive Bayes (NB) classifier for predicting the annual_income
target feature. You will present your results as Pandas data frames.
Bayesian classifiers are among the most popular machine learning algorithms. Your goal here is two-fold:
- To gain valuable skills in using popular variants of the Naive Bayes classifier with Scikit-Learn, and
- To be able to identify which variant to use for a given dataset.
Throughout this question,
- Use the “A3_Q1_train.csv” dataset (with 500 rows) to build NB models.
- Assume that the “A3_Q1_train.csv” dataset is clean in the sense that there are no outliers or any unusual values.
- Use accuracy as the evaluation metric to train models.
NOTE: In practice, you should never train and test on the same data. This is cheating (unless some sort of cross-validation is involved). However, throughout this entire Question 1, you are instructed to do just that to make coding easier. Besides, NB is a simple parametric model, and the chances that it will overfit for this particular problem are relatively small.
Part A (10 points): Data Preparation
TASK 1 (5 points):
Transform the 2 numerical features (age and education_years) into 2 (nominal) categorical features. Specifically, use equal-width binning with the following 3 bins for each numerical feature: low, mid, and high. Once you do that, all 5 descriptive features in your dataset will be categorical. Your dataset's name after Task 1 needs to be df_all_cat. Please make sure to run the following code for marking purposes:
HINT: You can use the cut() function in Pandas for equal-width binning.
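Following the hint, a minimal pd.cut() sketch on a toy column (the values below are made up; the real age column comes from "A3_Q1_train.csv"):

```python
import pandas as pd

# Hypothetical numeric column standing in for "age"
df = pd.DataFrame({"age": [18, 25, 33, 47, 52, 64]})

# Equal-width binning into 3 bins labeled low/mid/high;
# pd.cut with bins=3 splits the observed range into 3 equal-width intervals
df["age_binned"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])

print(df["age_binned"].tolist())  # → ['low', 'low', 'low', 'mid', 'high', 'high']
```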
TASK 2 (5 points):
Next, perform one-hot-encoding (OHE) on the dataset (after the equal-width binning above). Your dataset’s name after Task 2 needs to be df_all_cat_ohe. Please make sure to run the following code for marking purposes:
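As an illustration of the OHE step (the column names below are made up, and the marking code itself is not reproduced here), pandas get_dummies can one-hot-encode an all-categorical frame in one call:

```python
import pandas as pd

# Toy all-categorical frame standing in for df_all_cat (names are illustrative)
df_all_cat = pd.DataFrame({
    "age": ["low", "mid", "high"],
    "workclass": ["Private", "Gov", "Private"],
})

# One column per (feature, level) pair, named feature_level
df_all_cat_ohe = pd.get_dummies(df_all_cat)
print(sorted(df_all_cat_ohe.columns))
```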
You will provide your solutions for Parts B, C, and D below after you have taken care of the above two data preparation tasks.
MARKING NOTE: If your data preparation steps are incorrect, you will not get full credit for a correct follow-through.
Part B (5 points): Bernoulli NB
In the Chapter 6 PPT Presentation, we recently added some explanation of a useful variant of NB called Bernoulli NB. Please see the updated Chapter 6 PPT Presentation on Canvas.
For this part, train a Bernoulli NB model (with default parameters) using the train data and compute its accuracy on the same train data.
Official documentation on Bernoulli NB:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
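A minimal sketch of this step, run on synthetic binary data since the real df_all_cat_ohe is produced in Part A:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 5))  # stand-in for the one-hot-encoded features
y_train = rng.integers(0, 2, size=100)       # stand-in for the binary target

bnb = BernoulliNB()                          # default parameters, as instructed
bnb.fit(X_train, y_train)
train_acc = bnb.score(X_train, y_train)      # accuracy on the same (train) data
print(round(train_acc, 3))
```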
Part C (5 points): Gaussian NB
For this part, train a Gaussian NB model (with default parameters) using the train data and compute its accuracy on the same train data.
As you know, the Gaussian NB assumes that each descriptive feature follows a Gaussian probability distribution. However, this assumption no longer holds for this problem because all features will be binary after the data preparation tasks in Part A. Thus, the purpose of this part is to see what happens if you apply Gaussian NB on binary-encoded descriptive features.
Official documentation on Gaussian NB:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
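The same sketch with GaussianNB: scikit-learn will happily fit Gaussian densities to binary-coded columns, which is exactly the (mis)match this part asks you to observe. Again on synthetic stand-in data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 5)).astype(float)  # binary features fed to a Gaussian model
y_train = rng.integers(0, 2, size=100)

gnb = GaussianNB()                       # default parameters
gnb.fit(X_train, y_train)
train_acc = gnb.score(X_train, y_train)  # accuracy on the same (train) data
print(round(train_acc, 3))
```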
Part D (20 points): Tuning your Models
In this part, you will fine-tune the hyper-parameters of the Bernoulli and Gaussian NB models in the above two parts to see if you can squeeze out a bit of additional performance by hyper-parameter optimization.
TASK 1 (5 points each): Tuning:
Fine-tune the alpha parameter of the Bernoulli NB model and the var_smoothing parameter of the Gaussian NB model.
TASK 2 (5 points each): Plotting:
Display a plot (with appropriate axes labels and a title) that shows the tuning results. Specifically, you will need to include two plots:
- One plot for Bernoulli NB tuning results
- One plot for Gaussian NB tuning results
You must clearly state the respective optimal hyper-parameter values and the corresponding accuracy scores.
There are no hard rules for hyper-parameter fine-tuning here except that you should follow fine-tuning best practices.
HINT: You can perform these fine-tuning tasks in simple “for” loops.
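Following the hint, one possible shape for such a loop (the grid of alpha values here is only an illustrative guess, not a prescribed one, and the data is a synthetic stand-in):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))  # stand-in for df_all_cat_ohe
y = rng.integers(0, 2, size=100)

# Candidate smoothing values on a log scale (illustrative grid)
alphas = np.logspace(-3, 2, 20)
scores = [BernoulliNB(alpha=a).fit(X, y).score(X, y) for a in alphas]

best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha, round(max(scores), 3))
```

The (alphas, scores) pairs collected this way can then feed the required plot, e.g. with a log-scaled x-axis; the same pattern applies to var_smoothing for Gaussian NB.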
Part E (20 points): Hybrid NB
In the real world, you will usually work with datasets that contain a mix of categorical and numerical features. So far, however, we have covered two NB variants:
- Bernoulli NB that assumes all descriptive features are binary, and
- Gaussian NB that assumes all descriptive features are numerical and they follow a Gaussian probability distribution.
The purpose of this part is to implement a Hybrid NB Classifier on the “A3_Q1_train.csv” dataset that uses Bernoulli NB (with default parameters) for the categorical descriptive features and Gaussian NB (with default parameters) for the numerical descriptive features. You will specifically train your Hybrid NB model using the train data and compute its accuracy on the same train data. This part will require you to think about how NB classifiers work in general and how Bernoulli and Gaussian NB classifiers can be combined via the “naivety” assumption of the Naive Bayes classifier.
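One hedged sketch of the combining idea, on synthetic data with made-up feature blocks (this is one possible approach, not necessarily the intended solution): under the naivety assumption the class-conditional log-likelihoods of the two blocks simply add, so you can fit the two models separately and combine their log-posteriors, removing the class prior that would otherwise be counted twice.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
X_cat = rng.integers(0, 2, size=(200, 4))  # stand-in one-hot categorical block
X_num = rng.normal(size=(200, 2))          # stand-in numerical block
y = rng.integers(0, 2, size=200)

bnb = BernoulliNB().fit(X_cat, y)          # default parameters on the categorical block
gnb = GaussianNB().fit(X_num, y)           # default parameters on the numerical block

# Each model's predict_log_proba already includes the class prior, so summing
# the two counts the prior twice; subtract one copy before taking the argmax.
hybrid_log = (bnb.predict_log_proba(X_cat)
              + gnb.predict_log_proba(X_num)
              - bnb.class_log_prior_)
y_pred = hybrid_log.argmax(axis=1)
print(round((y_pred == y).mean(), 3))      # train accuracy of the hybrid model
```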
Part F (5 points): Wrapping Up
For this part, you will summarize your results as a Pandas data frame called df_summary with the following 2 columns:
- method
- accuracy (please round these accuracy results to 3 decimal places)
As for the method column, you will need to include the following methods in the order given below:
- Part B (Bernoulli NB)
- Part C (Gaussian NB)
- Part D (Tuned Bernoulli NB)
- Part D (Tuned Gaussian NB)
- Part E (Hybrid NB)
After displaying df_summary, please briefly explain the following:
(i) Whether hyper-parameter tuning improves the performance of the Bernoulli and Gaussian NB models respectively.
(ii) Whether your Hybrid NB model has more predictive power than the (untuned) Bernoulli and Gaussian NB models respectively.
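A sketch of the required summary frame (the accuracy values below are placeholders, not real results; substitute your own rounded scores):

```python
import pandas as pd

# Placeholder accuracies in the required order; replace with your actual results
results = {
    "Bernoulli NB": 0.801,
    "Gaussian NB": 0.779,
    "Tuned Bernoulli NB": 0.810,
    "Tuned Gaussian NB": 0.783,
    "Hybrid NB": 0.820,
}
df_summary = pd.DataFrame(
    {"method": list(results.keys()), "accuracy": list(results.values())}
)
print(df_summary)
```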
Question 2:
(35 points)
This question is actually a class competition.
The purpose of this question is to come up with a machine learning algorithm that maximizes the AUC (Area Under the Curve) score for a loan default prediction problem. You will use the “loan_default_train.csv” (with 40,000 rows) and “loan_default_test.csv” (with 20,000 rows) datasets for training and testing respectively, which you will read in from the Cloud. You will assume that these datasets are clean in the sense that there are no outliers or any unusual values.
A brief description of the features in these datasets is given below:
- loan_ID: ID of the loan
- loan_amount: amount of the loan in dollars
- log_annual_income: log of annual income in dollars
- delinq_2yrs: number of delinquent accounts in the past 2 years
- dti: debt-to-income ratio
- log_credit_age: log of the customer’s credit age in years
- emp_length: length of employment in years
- home_ownership: home ownership status
- purpose: purpose of the loan
- inq_last_6mths: number of credit inquiries on the customer’s accounts in the past 6 months
- open_accounts: number of open accounts
- total_accounts: total number of accounts
- log_inc_payment_ratio: log of the income-to-payment ratio
- log_revol_income_ratio: log of the revolving income ratio
- revolving_util_rate: revolving utilization rate
- term: term of the loan (36 or 60 months)
- target: loan status (Paid or Default), with Default being the positive class
(Data Source: Not disclosed)
Your goal here will be to use the training dataset to build a powerful ML algorithm that will give the highest AUC on the test dataset. Remember, by accurately identifying the customers who are likely to default (that is, people who will take the money and never come back), you can save your company millions of dollars.
For coming up with the best algorithm, you are free to choose WHATEVER algorithm you like, e.g., decision trees, Naive Bayes, random forests, SVMs, neural networks, deep learning, gradient boosting, ensemble methods, custom hybrid methods, whatever. The sky is the limit! If you like, you can also use customized feature selection/extraction/construction, hyper-parameter fine-tuning, pipelines, or whatever else.
For simplicity, you are hereby instructed to use the prepare_dataset() function below for preparing both the training and the test datasets for modeling. You can add additional data preparation steps, but these need to be done after running our prepare_dataset() function.
Part A (30 points): Your Model’s Test Performance
We will set the “lowest AUC” as the AUC of a decision tree classifier (with default values) built on the train data and evaluated on the test data. We will then identify the highest test AUC among student submissions, and use a linear scale in between for marking your model’s performance. For example, suppose the lowest test AUC is 0.55, the highest is 0.65, and your model’s test AUC is 0.63. Your mark for Part A will then be set as (0.63 - 0.55)/(0.65 - 0.55) = 0.08/0.10 = 0.8 of 30 points = 24 points.
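The linear scaling described above amounts to a one-line calculation (using the example numbers, which are illustrative only):

```python
# Linear scaling of the Part A mark between the lowest and highest test AUC
lowest_auc, highest_auc, your_auc = 0.55, 0.65, 0.63
max_points = 30

mark = (your_auc - lowest_auc) / (highest_auc - lowest_auc) * max_points
print(round(mark, 1))  # → 24.0
```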
Once we release the marks, we will announce the best algorithm without mentioning the name of the winning student (unless he/she is OK with sharing this information).
PART B (5 points): Documentation of Your Model
As part of your submission, you will need to document your algorithm with sufficient detail. You need to explain any additional data preparation steps, feature selection (if any), your pipeline (if any), any other relevant details, and your actual model. You also need to include any relevant code that you have written. The idea here is that any other student in this class should be able to replicate your results based on your documentation & your code.
For this part, please keep it short and sweet! We do not need to know how you came up with your algorithm. So, please do not document the other algorithms you tried, or how you fine-tuned your algorithm, etc. We just would like to know what worked (and we are not interested in what didn’t work).
As a clarification, your documentation will be marked separately from your model’s performance. That is, your model’s performance can be terrible, but you can still get the full mark for this part if your documentation is done properly.