Homework 2

Problem 1: (20 points)

Consider a binary classification problem where there is a single feature X ∈ R and the dependent variable Y ∈ {0, 1}. Let P_{X,Y} denote the joint distribution over pairs (X, Y), and let h : R → {0, 1} denote a generic classifier. We define the error rate of h as R(h) := Pr(Y ≠ h(X)), where the probability is computed from the joint distribution P_{X,Y}. Suppose that we collect data (x_1, y_1), ..., (x_n, y_n), which are assumed to be independent and identically distributed from the distribution P_{X,Y}, and that we train a classifier ĥ using this data. Below we consider two precise specifications of the joint distribution P_{X,Y}. In both cases, derive the largest numerical value ε* that you can for which it holds that R(ĥ) ≥ ε*. Carefully explain how you arrived at your specific value of ε* in both cases.

a) (10 points) X is restricted to the set 1 (i.e., is categorical) and PX,Y is given by the distribution in Table 1.
b) (10 points) The marginal distribution of Y is given by Pr(Y = 0) = 0.3 and Pr(Y = 1) = 0.7. Given that Y = 0, the distribution of X is normal with mean 5 and variance 2. Given that Y = 1, the distribution of X is normal with mean −3 and variance 2. (It is OK for the final answer to be written in terms of the CDF of a normal distribution and/or to be approximated numerically to 5 digits.)
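
One way to sanity-check a hand-derived bound for either case is to compute the error rate of the classifier that, at each x, predicts the more probable label. The Python sketch below is illustrative only and is not the intended written solution. The function names are hypothetical, the discrete table is made-up placeholder data (Table 1 is not reproduced here), and the Gaussian helper assumes two classes with a shared variance and distinct means.

```python
import math


def bayes_error_discrete(joint):
    """Error rate of the 'predict the more probable label' rule on a finite table.

    `joint` maps (x, y) -> Pr(X = x, Y = y) with y in {0, 1}. At each x the
    rule errs with the smaller of the two joint probabilities.
    """
    xs = {x for (x, _y) in joint}
    return sum(
        min(joint.get((x, 0), 0.0), joint.get((x, 1), 0.0)) for x in xs
    )


def normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error_gaussian(pi0, mu0, pi1, mu1, var):
    """Same rule for X | Y=0 ~ N(mu0, var) and X | Y=1 ~ N(mu1, var).

    With a shared variance the two weighted densities cross exactly once,
    at the point t solving pi0 * N(t; mu0, var) = pi1 * N(t; mu1, var).
    """
    if mu0 == mu1:
        raise ValueError("this sketch assumes distinct class means")
    sd = math.sqrt(var)
    t = (mu0 + mu1) / 2.0 + var * math.log(pi1 / pi0) / (mu0 - mu1)
    below_given_0 = normal_cdf((t - mu0) / sd)
    below_given_1 = normal_cdf((t - mu1) / sd)
    if mu0 > mu1:
        # Class 0 is predicted above t, class 1 below t.
        return pi0 * below_given_0 + pi1 * (1.0 - below_given_1)
    # Class 0 is predicted below t, class 1 above t.
    return pi0 * (1.0 - below_given_0) + pi1 * below_given_1


# Placeholder table for illustration only; NOT the values from Table 1.
hypothetical_table = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
print(bayes_error_discrete(hypothetical_table))

# Parameters as stated in part b).
print(bayes_error_gaussian(0.3, 5.0, 0.7, -3.0, 2.0))
```

The printed numbers are only a check; the written answer should still explain why no classifier trained on the data can have a smaller error rate than this rule.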

Problem 2: (10 points)

Consider a binary classification problem where X ∈ R^p and Y ∈ {0, 1}. For a fixed x ∈ R^p, suppose that Pr(Y = 1 | X = x) = p for some p ∈ [0, 1]. Consider a prediction problem where there is a loss L_FN > 0 associated with predicting Y = 0 when the actual outcome is Y = 1, and another loss L_FP > 0 associated with predicting Y = 1 when the actual outcome is Y = 0. There is no loss associated with true positives or true negatives (i.e., a correct answer). Show that there is a threshold value p̄ such that following the expected loss criterion for making a prediction is equivalent to predicting Y = 1 if p ≥ p̄ and predicting Y = 0 otherwise. What is the value of p̄?
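
The expected loss criterion can be checked numerically before writing the proof. The Python sketch below is a minimal illustration with hypothetical function names. It compares the expected loss of each prediction directly, and `candidate_threshold` is the value obtained by setting the two expected losses equal; the tie-breaking choice (predict 1 on ties) is an assumption made to match the "p ≥ p̄" convention in the problem.

```python
def expected_losses(p, loss_fn, loss_fp):
    """Return (expected loss of predicting 0, expected loss of predicting 1).

    Predicting 0 is wrong with probability p and then costs loss_fn.
    Predicting 1 is wrong with probability 1 - p and then costs loss_fp.
    """
    return p * loss_fn, (1.0 - p) * loss_fp


def predict_min_expected_loss(p, loss_fn, loss_fp):
    """Predict the label with the smaller expected loss (ties go to 1)."""
    loss_if_0, loss_if_1 = expected_losses(p, loss_fn, loss_fp)
    return 1 if loss_if_1 <= loss_if_0 else 0


def candidate_threshold(loss_fn, loss_fp):
    """Value of p at which the two expected losses are equal."""
    return loss_fp / (loss_fp + loss_fn)


def predict_with_threshold(p, loss_fn, loss_fp):
    """Predict 1 exactly when p is at least the candidate threshold."""
    return 1 if p >= candidate_threshold(loss_fn, loss_fp) else 0


# Illustrative loss values only; they are not part of the problem statement.
example_fn, example_fp = 4.0, 1.0
grid = [(i + 0.5) / 100 for i in range(100)]
agreement = all(
    predict_min_expected_loss(p, example_fn, example_fp)
    == predict_with_threshold(p, example_fn, example_fp)
    for p in grid
)
print(agreement)
```

A numerical agreement check like this does not replace the algebraic argument, which should hold for every p ∈ [0, 1] and every positive pair of losses.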

Problem 3: Framingham Heart Study (Adapted from Bertsimas Chapter 7) (70 points)

Heart disease is one of the leading causes of death worldwide. Over 8 million people died from coronary heart disease (CHD) in 2019, making it the leading cause of death that year. In the late 1940s, the U.S. government took steps to study cardiovascular disease. In order to develop high-quality data for their study, they decided to track a large cohort of initially healthy people over time. The town of Framingham, Massachusetts (a suburb of Boston) was selected as the site for the study, which commenced in 1948. The study enrolled 5,209 participants aged 30-62. Participants were given a questionnaire and a medical exam every two years. The study also collected data on the participants' physical and behavioral characteristics, in addition to the medical test data. Over the years, the study has expanded to include multiple generations and has collected many more factors, including genetic information. This data is now famous and is known simply as the Framingham Heart Study.

In this exercise, you are asked to build models using Framingham Heart Study data in order to predict CHD and to make recommendations to better prevent heart disease. There are 3,658 total observations in our data, with each observation representing the data from a particular study participant. There are 16 variables in the dataset, which are described in Table 2. You will be asked to predict TenYearCHD (whether the patient experiences coronary heart disease within 10 years of their first examination). As a consequence of your modeling efforts, you should be able to identify risk factors, which are the variables that increase the risk of CHD.

a) (40 points) To lower the risk of CHD, physicians can prescribe preventive medication such as blood-pressure-lowering or cholesterol-lowering medications. Many policy makers, when recommending certain preventive medications to patients at risk of developing CHD, rely on evidence-based analysis that weighs the pros and cons of such interventions. Health economic evaluation is a commonly applied methodology for decision-making that takes both medical costs and health benefits (a monetized version of improved life longevity) into consideration. In fact, many countries establish clinical practice guidelines using such formalized health economic evaluation methodologies (the National Institute for Health and Clinical Excellence in England, for example).

As prior work, let us suppose that a colleague of yours has completed a health economics study analyzing the costs and benefits of a recently approved medication aimed at preventing CHD. The colleague determined that patients who experience CHD within the next 10 years are expected to incur a lifetime cost of $955,000 associated with the disease. This cost includes both the cost of treatment for CHD, $330,000, and a cost intended to capture the decreased quality and length of life experienced by patients with CHD, $625,000. Your colleague has also determined that patients who take the preventive medication being studied will have their probability of developing CHD within the next 10 years reduced by 90%. In other words, if their current 10-year risk (probability) of developing CHD is p without taking the medication, then their 10-year risk (probability) with the medication would instead be 0.10 × p. Regardless of whether a patient eventually develops CHD, there is a $95,000 cost associated with taking this recently approved medication. A decision tree capturing your colleague's analysis is shown in Figure 1 (below).
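
The cost figures above can be turned into a small expected-cost comparison. The Python sketch below is one possible reading of the colleague's analysis, not a reproduction of Figure 1: it assumes a decision-maker who minimizes expected cost, that the $95,000 medication cost is paid in both branches of the "prescribe" decision, and that the $955,000 cost is incurred only if CHD occurs. All function and constant names are hypothetical.

```python
CHD_COST = 955_000         # lifetime cost if CHD occurs ($330,000 + $625,000)
MEDICATION_COST = 95_000   # paid whenever the medication is prescribed
RISK_MULTIPLIER = 0.10     # medicated risk is 0.10 * p (a 90% reduction)


def expected_cost_without_medication(p):
    """Expected cost when no medication is prescribed and CHD risk is p."""
    return p * CHD_COST


def expected_cost_with_medication(p):
    """Expected cost when medication is prescribed and CHD risk drops to 0.10 * p."""
    return MEDICATION_COST + RISK_MULTIPLIER * p * CHD_COST


def break_even_risk():
    """Baseline risk p at which both decisions have the same expected cost."""
    return MEDICATION_COST / ((1.0 - RISK_MULTIPLIER) * CHD_COST)


def should_prescribe(p):
    """True when prescribing has the strictly lower expected cost."""
    return expected_cost_with_medication(p) < expected_cost_without_medication(p)


for risk in (0.05, 0.10, 0.20):
    print(risk, should_prescribe(risk))
```

Under these assumptions the break-even risk plays the same role as the threshold p̄ in Problem 2: patients whose predicted risk exceeds it would be recommended the medication.
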
Using all of the provided independent variables, build a logistic regression model to predict the probability that a patient will experience CHD within the next 10 years. Use dataset framingham train.csv to train your model. This training set has 2,560 data points, a random sample of about 70% of the original framingham.csv dataset. Use dataset framingham test.csv to test your model. This test set has the remaining 1,098 data points. Please answer the following questions concerning your model:
- i) What is the fitted logistic regression model? Do not simply copy the results of your code, but instead state the equation used by the model to make predictions. Use all features from Table 2 to build your model.
- ii) What are the most important risk factors for 10-year CHD risk identified by the model? Pick one of these variables and describe its impact on a patient's predicted odds of developing CHD in the next 10 years.
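
For questions i) and ii), it may help to see how a fitted logistic regression equation turns features into a probability, and how a single coefficient translates into a change in odds. The Python sketch below uses made-up coefficients and placeholder feature names (`x1`, `x2`), which are not variables from Table 2 and not results from the Framingham data. In practice the coefficients would come from whatever fitting routine is used on the training set.

```python
import math


def predicted_probability(intercept, coefs, features):
    """Logistic model: Pr(Y = 1 | x) = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    logit = intercept + sum(coefs[name] * features[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-logit))


def odds(prob):
    """Convert a probability into odds, p / (1 - p)."""
    return prob / (1.0 - prob)


def odds_ratio(beta, delta=1.0):
    """Multiplicative change in odds when a feature rises by `delta`, all else fixed."""
    return math.exp(beta * delta)


# Hypothetical fitted values, for illustration only.
example_intercept = -2.0
example_coefs = {"x1": 0.5, "x2": -0.3}

patient = {"x1": 1.0, "x2": 2.0}
patient_plus_one = {"x1": 2.0, "x2": 2.0}  # same patient, x1 raised by one unit

p_before = predicted_probability(example_intercept, example_coefs, patient)
p_after = predicted_probability(example_intercept, example_coefs, patient_plus_one)

# The ratio of odds equals exp(coefficient of x1), whatever the other features are.
observed_ratio = odds(p_after) / odds(p_before)
print(observed_ratio, odds_ratio(example_coefs["x1"]))
```

The point of the comparison is that a one-unit increase in a feature multiplies the predicted odds by exp(β), which is the usual way to describe a variable's impact in part ii).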

...
