Assignment 2: Machine Learning Model Training 1
• Two multi-part, multiple-choice questions.
• AI in Healthcare with Phase 2 data set (HTML file)
• Details of the Q1 & Q2 m/c questions are shown in the attached question files.
• Lecture notes on Machine Learning in Healthcare for your reference
Phase 2: Model Training, Part 1
Welcome to Phase 2 of the capstone project. This section is the first of two parts concerning the model training stage of the model development cycle. You continue to play the role of a bioinformatics professor. The questions relate to the various challenges faced by the teams working on the two projects introduced in the first section.
Your two research teams have begun working on the projects, and have some preliminary results. Both teams have e-mailed you summaries of progress thus far, which are shown below.
Project 1: CXR-based COVID-19 Detector
Hi,
We are super excited to get this project kicked off! We have implemented the data pipeline and trained a few preliminary models, but there is still lots of room for improvement. Here is what we’ve done so far:
We split the data randomly into a training and test set. We are placing 90% of the data into the training set and 10% of the data into the test set. Additionally, the images were initially massive, on the order of 3000 by 3000 pixels. So, we re-sized the images to 224 by 224 pixels.
We are using the ResNet-50 CNN architecture. During training, we are applying data augmentation. Concretely, on a given image, with 50% probability, we are zooming in on a small, randomly selected region before feeding it to the model. Here is an example:
So far, we have seen the following training curves from our model. The loss for neither the training set nor the test set goes down very much.
As you can see, there is plenty of room for improvement. We’ll keep working on it, but let us know if you have any suggestions. Thanks.
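The augmentation the team describes can be sketched in a few lines, which makes it easier to see how much of each image is discarded. This is a minimal illustration, not the team's actual pipeline: the function name `random_zoom`, the `crop_frac` parameter, and the nearest-neighbour upsampling are all assumptions, and the image is represented as a plain nested list so that the example needs no external libraries.

```python
import random


def random_zoom(image, rng, p=0.5, crop_frac=0.25):
    """With probability p, zoom in on a randomly selected region.

    image     -- 2D list of pixel values (rows of equal length)
    rng       -- a random.Random instance, so results are reproducible
    crop_frac -- hypothetical parameter: side of the crop relative to the image
    """
    height, width = len(image), len(image[0])
    if rng.random() >= p:
        return image  # image is passed through unchanged

    # Pick the size and position of the region to zoom in on.
    crop_h = max(1, int(height * crop_frac))
    crop_w = max(1, int(width * crop_frac))
    top = rng.randrange(height - crop_h + 1)
    left = rng.randrange(width - crop_w + 1)

    # Upsample the crop back to the original size (nearest neighbour).
    return [
        [image[top + (r * crop_h) // height][left + (c * crop_w) // width]
         for c in range(width)]
        for r in range(height)
    ]


# Toy 8x8 "image" whose pixel values are all distinct.
toy = [[r * 8 + c for c in range(8)] for r in range(8)]
zoomed = random_zoom(toy, random.Random(0), p=1.0, crop_frac=0.25)
```

With `crop_frac=0.25`, each zoomed image keeps only one sixteenth of the original pixels, so a small crop region means the model frequently sees only a fragment of the chest.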
Project 2: EHR-based Intubation Predictor
Hello,
We are in the process of cleaning up the COVID EHR data, and expect to get a model training soon. We attempted to train a set of preliminary models, but ran into some data issues. We were wondering if you could take a look at some of the problems we’ve found in the data and let us know what you think.
First, we noticed that the EHR data is actually quite sparse relative to what we thought we had. We only have about 3,000 EHR records, not 30,000 as we originally thought. This leaves us with about 300 COVID-positive and 2,700 COVID-negative exams. We might not be able to train a model on this data alone.
We are noticing some very strange patterns in the data, particularly in the lab values. For example, see the following histogram of D-DIMER lab values found for each exam across the entire dataset. The x-axis is the D-DIMER lab value, and the y-axis is the number of exams with that value. We use a log scale on the y-axis to improve readability.
We saw this in several CSV columns, including the Ferritin and Procalcitonin lab values. **We suspect that some underlying phenomenon is affecting all three lab values.**
Another issue we ran into was missing column values. We can’t create a feature vector for Logistic Regression if we are missing some values. How do you suggest we proceed regarding both the large outlier values and the missing values? Below is an example of the data once again, this time a sample of 30 exams (with the observed symptoms excluded). Note that
in the CSV means that the value is missing. Please take a look and let us know if you see something that we might have missed.
In [1]:
pd.read_csv('COVID_19_sample_data.csv')[[...]].iloc[5:35]
Out[1]:
8f9539f2-e6ad-4e00-ad45-4abc2bff2214 | 2020-03-04 | 2020-03-14 | Clinic B | 1965-11-11 | non | hispanic | white | 1.0 | 51.6 | 232.4 | 0.0 | 400000.0 | 503900.0 | 16.4 | 0.8 | 88.1 | 136.9
d5dd13c4-c31e-419c-8c02-47e4ca1ac5e2 | 2020-03-02 | 2020-03-20 | 2018-08-16 | nonhispanic | 1.1 | 52.9 | 30 | 0.7 | 0.0030 | 700000.0 | 579 | 600.0 | 100.0 | 1 | 0.3 | 16.2 | 0.83 | 8 | 0.6 | 32.8 | 140.0
91369e11-b944-4132-be0f-af46e880936b | 2020-03-21 | Clinic C | 1972-09-22 | 1.2 | 45.0 | 0.0047 | 562.3 | 0.2 | 12.8 | 10.7 | 0.85 | 84.4 | 141.4
c70992c9-ff13-467b-9032-1901506edeef | 2020-02-29 | 2020-03-05 | 1959-06-17 | 2020-03-11 | 84.2 | 321.8 | 0.0092 | 1100.0 | 87 | 7.6 | 13.5 | 88.6 | 88.3 | 141.0
2020-03-12 | 0.5 | 29.7 | 391.9 | 0.0553 | 17900000.0 | 1786 | 400.0 | 200.0 | 12.7 | 11.5 | 0.9 | 75.6 | 143.2
9ec7d743-96e7-47c8-b2ee-6336633beb39 | 2020-03-10 | 2020-03-23 | 1969-03-22 | 35.7 | 3 | 12.5 | 0.0045 | 600000.0 | 615900.0 | 11.1 | 15.6 | 0.98 | 83.7 | 143.8
2020-03-25 | 24.4 | 254.3 | 0.0021 | 406.2 | 0.1 | 11.8 | 0.80
a527bcf0-3746-476c-90f2-dbab8868385e | 2020-03-03 | 2020-03-17 | 1978-11-27 | 25.7 | 228.1 | 0.0038 | 500.0 | 459.5 | 1.14 | 7 | 7.7 | 13 | 8.4
7f4ef129-1511-47a9-a9b7-8b0b2d02ad50 | 1952-05-06 | 321.5 | 700.0 | 450.1 | 12.1 | 15.0 | 1.20 | 75.8 | 139.6
7078ae9a-4c79-4b30-b127-f76aabb6763e | 2020-02-17 | 2020-03-07 | 1968-04-26 | 48.7 | 306.5 | 509000.0 | 12.0 | 18.4 | 77.3
a5c39700-6bf3-4984-af46-31344695e21b | 2020-03-13 | Clinic A | 1940-01-09 | 2020-03-15 | 85.5 | 312.9 | 0.0257 | 1552.8 | 14.9 | 1.23 | 7 | 9.2 | 64.5
2020-03-16 | 170.6 | 390.5 | 0.0238 | 15000.0 | 1755.9 | 14.0 | 1.15 | 40.9
ddb2d5e2-643e-4374-ac19-f6ca3c0d16f5 | 2020-02-25 | 2020-03-09 | 1967-12-24 | 51.5 | 234.4 | 0.0029 | 527.4 | 0.90 | 88.9 | 42.0 | 138.8
21505aac-f219-43a8-ab3c-f57c6d8f1d1f | 2020-03-08 | 1940-05-03 | 38.7 | 229.3 | 536500.0 | 12.6 | 1.07 | 77.0 | 137.4
7992bf94-feee-4728-9187-2c911df2819b | 2004-07-04 | 27.3 | 238.9 | 0.0022 | 11.0 | 10.9 | 138.4
d2f6d528-39db-4b7e-8389-abd27af9a710 | 1996-06-26 | 31.5 | 0.0034 | 455300.0 | 10.3 | 14.6 | 1.08 | 76.6
fa0b58e6-6817-4d49-8211-1dd34abf0c15 | 2020-03-28 | 2008-11-21 | 23.0 | 232.0 | 542.0 | 10.8 | 10.6 | 0.88 | 86.3 | 141.6
b83237f3-9ff5-491e-aab4-d63ccff85f85 | 2020-03-30 | 2012-11-17 | 47.2 | 32 | 9.8 | 535.0 | 12.4 | 10.1 | 1.13 | 75.3 | 137.3
46988a9c-9c86-429a-bc4a-b3d14ff321b0 | 1957-03-13 | asian | 37.0 | 235.7 | 535000.0 | 10.2 | 19.1 | 0.86 | 79.1 | 137.5
2020-03-24 | 29.6 | 304.2 | 0.0044 | 500000.0 | 553400.0 | 98.7 | 86.8 | 143.5
785b484d-7060-4d17-bf18-ef8bbafc6f04 | 2020-02-28 | 1942-08-24 | 2 | 1.3 | 226.3 | 0.0023 | 468600.0 | 17.2 | 0.91 | 81.7 | 138.2
edad31f3-5a08-4678-8d31-271a41a2aad5 | 2020-03-19 | 78.6 | 306.4 | 0.0256 | 2600.0 | 1764.2 | 13.8 | 18.0 | 83.0 | 62.0 | 141.2
184.4 | 370.1 | 0.0639 | 20600.0 | 1804.0 | 142.2
4607a669-4a97-4f0a-9661-856569905047 | 1993-11-26 | 0.0037 | 590.9 | 13.7 | 0.97 | 81.1 | 142.4
c1800ba1-7cba-45d7-bdc4-0e0b583932e4 | 2020-02-23 | 2018-01-20 | 464.0 | 77.8 | 141.7
d2718050-2e9c-4d5b-842e-52d910c1563f | 1997-06-01 | 50.6 | 339.8 | 634.5 | 13.6 | 81.5 | 137.9
2020-03-22 | 40.5 | 186.6 | 322.6 | 16.5 | 99.6
818566cb-c89b-42d8-a6af-1a1ef13ed7cf | 1984-10-11 | 27.0 | 239.3 | 474.7 | 0.94 | 85.7
000e7adf-cbaa-4fad-ab2f-658c32f7d4d3 | 1959-01-03 | 175.1 | 326.5 | 0.0063 | 1600000.0 | 1010100.0 | 1.37 | 81.4 | 79.6 | 143.0
5a2f02ce-0286-45ae-b992-05331cb88379 | 2020-03-29 | 1973-06-30 | 27.7 | 349.5 | 14.8 | 87.6 | 143.7
In the following quiz, you will answer questions examining the issues of Team 1 and Team 2.
Phase II Model Training Question Sheet
Click on the HTML file attached to read the scenario. After reading through the case, please review the two questions in this assignment.
Keep the HTML file open so that it is easier to find the information referred to in the questions. Select all the correct answers and explain each choice in 2 or 3 sentences.
Q 1 Part 1.
The team split the data into two partitions: the training set and the test set. It is considered best practice to have a third partition, the validation set. What added utility is there in having a validation set?
Check all that apply.
The validation set can be used for tuning hyperparameters
The validation set can be used for early stopping
The test set should be used for final evaluation only
The validation set can be used for updating the model directly
Part 2.
The team split the data randomly, without accounting for the patient to whom each exam belongs. Why would this be a problem?
Recall: “The COVID dataset consists of 30,000 exams across 21,000 patients (some patients may be associated with multiple exams)”
Patient overlap between the training and test sets may lead to problems with model convergence due to exposure to the test set
Patient overlap between the training and test sets may lead to problems with model bias because of the underrepresentation of certain patient demographics in the training set
Patient overlap between the training and test sets may lead to the leakage of PHI or other sensitive data
Patient overlap between the training and test sets may lead to inflated model performance due to unrealistic evaluation conditions
Part 3.
The team downsized the images to 224 by 224 pixels. Why might this lead to worse model performance?
The discriminative features in the image may be too small to identify without a higher resolution
Many publicly available models use 224 by 224 pixel images
Memory constraints may limit the model’s ability to process high-resolution images
224 by 224 pixel chest x-rays are easier to classify than 3000 by 3000 pixel chest x-rays
Part 4.
Why are Convolutional Neural Networks (CNN) particularly well suited for image classification tasks?
Check all that apply.
CNN architectures take advantage of feature locality through the use of filters
CNN architectures leverage multiple decision trees in order to make their predictions more robust
CNN architectures are parameter-efficient because they use the same set of weights on each region of the image
CNN architectures can condition on previous timesteps, which it takes as input in addition to the images themselves
Part 5.
What learning phenomenon is the team observing?
Convergence
Overfitting
Underfitting
Generalization
Part 6
(i)
A colleague approaches you and suggests that it would be better if you created a model that relied only on observable features and exam metadata (patient age, gender, ethnicity, etc.). What trade-offs must be considered when using lab values as features?
Answer in 3-5 sentences.
(ii)
Before using the new public COVID dataset, you want to verify that there is no PHI in the data. What are some privacy issues that could come into play with imaging data?
Answer in 3-5 sentences.
Q 2 Part 1
The D-DIMER values are highly concentrated <1k, but there are many samples that are several orders of magnitude apart from the rest of the samples. What is the most likely explanation for this? (Hint: look at the data samples, particularly the exam metadata.)
There is a large disparity in D-DIMER lab values across patient gender
There is a large disparity in D-DIMER lab values across patient age
The data collected from one of the clinics may use different units
The data was collected from two cohorts from two different time periods
Part 2.
Which of the following strategies can be used to accommodate the missing values in the EHR dataset?
Check all that apply.
A logistic regression model can be trained after the missing values are synthetically generated, using a process known as imputation
A tree-based model, such as random forest, can be trained directly on the data with missing values
A tree-based model, such as random forest, can be trained after the missing values are synthetically generated, using a process known as imputation
A logistic regression model can be trained directly on the data with missing values
Part 3.
Which of the following is FALSE regarding logistic regression models?
Logistic regression uses the sigmoid activation function
Logistic regression can take unstructured inputs, such as images or text
Logistic regression produces values between 0 and 1, regardless of the scale of the features
Logistic regression is commonly used for classification problems
Part 4.
Which of the following is FALSE regarding random forest models?
Random forest models are a type of decision tree algorithm
Random forest models are highly interpretable
Random forest models learn multiple decision trees that each learn on a subset of the available features
Random forest models require feature normalization (i.e. scaling the features such that they are between 0 and 1) in order to work effectively
Class notes on Machine Learning and AI in a healthcare setting
Artificial intelligence (AI) has transformed industries around the world, and has the potential to radically alter the field of healthcare. Imagine being able to analyze data on patient visits to the clinic, medications prescribed, lab tests, and procedures performed, together with data from outside the health system (social media, purchases made using credit cards, census records, and Internet search activity logs that contain valuable health information), and you’ll get a sense of how AI could transform patient care and diagnoses.
In this course, we’ll discuss the current and future applications of AI in healthcare with the goal of learning to bring AI technologies into the clinic safely and ethically. Here is a list of the learning objectives for a quick reference.
1) Solving the problems and challenges within the U.S. healthcare system requires a deep understanding of how the system works. Successful solutions and strategies must take into account the realities of the current system.
This course explores the fundamentals of the U.S. healthcare system. It will introduce the principal institutions and participants in healthcare systems, explain what they do, and discuss the interactions between them. The course will cover physician practices, hospitals, pharmaceuticals, and insurance and financing arrangements. We will also discuss the challenges of healthcare cost management, quality of care, and access to care. While the course focuses on the U.S. healthcare system, we will also refer to healthcare systems in other developed countries.
AI in healthcare use case: Natural language processing
When subject matter experts help train AI algorithms to detect and categorize data patterns that reflect how language is actually used in their part of the health industry, natural language processing (NLP) enables the algorithm to isolate meaningful data. This gives decision-makers the information they need to make informed care or business decisions quickly.
Healthcare payers
For healthcare payers, this NLP capability can take the form of a virtual agent using conversational AI to help connect health plan members with personalized answers at scale.
Government health and human service professionals
For government health and human service professionals, a case worker can use AI solutions to quickly mine case notes for key concepts and concerns to support an individual’s care.
Clinical operations and data managers
Clinical operations and data managers executing clinical trials can use AI functionality to accelerate searches and validation of medical coding, which can help reduce the cycle time to start, amend, and manage clinical studies.
2) This course introduces you to a framework for successful and ethical medical data mining. We will explore the variety of clinical data collected during the delivery of healthcare. You will learn to construct analysis-ready datasets and apply computational procedures to answer clinical questions. We will also explore issues of fairness and bias that may arise when we leverage healthcare data to make decisions about patient care. Only by training AI to correctly perceive information and make accurate decisions based on the information provided can you ensure your AI will perform the way it’s intended.
3) Machine learning and artificial intelligence hold the potential to transform healthcare and open up a world of incredible promise. But we will never realize the potential of these technologies unless all stakeholders have basic competencies in both healthcare and machine learning concepts and principles.
This course will introduce the fundamental concepts and principles of machine learning as it applies to medicine and healthcare. We will explore machine learning approaches, medical use cases, metrics unique to healthcare, as well as best practices for designing, building, and evaluating machine learning applications in healthcare. The course will empower those with non-engineering backgrounds in healthcare, health policy, pharmaceutical development, as well as data science with the knowledge to critically evaluate and use these technologies.
4) With artificial intelligence applications proliferating throughout the healthcare system, stakeholders are faced with both opportunities and challenges of these evolving technologies. This course explores the principles of AI deployment in healthcare and the framework used to evaluate downstream effects of AI healthcare solutions.
5) This last course includes a project that takes you on a guided tour exploring all the concepts we have covered in the different classes so far. We have organized this experience around the journey of a patient who develops some respiratory symptoms and, given the concerns around COVID-19, seeks care with a primary care provider. We will follow the patient’s journey through the lens of the data that are created at each encounter, which will bring us to a unique de-identified dataset created specially for this specialization. The dataset spans EHR as well as image data, and using this dataset we will build models that enable risk-stratification decisions for our patient. We will review how the different choices you make (such as those around feature construction, the data types to use, how the model evaluation is set up, and how you handle the patient timeline) affect the care that would be recommended by the model. During this exploration, we will also discuss the regulatory as well as ethical issues that come up as we attempt to use AI to help us make better care decisions for our patient. This course will be a hands-on experience in the day of a medical data miner.
6) How does AI training work?
AI training starts with data. While the actual size of the dataset needed is dependent on the project, all machine learning projects require high-quality, well-annotated data in order to be successful. It’s the old GIGO rule of computer science — garbage in, garbage out. If you train your AI using poor-quality or incorrectly tagged data, you’ll end up with poor-quality AI.
Once the quality assurance phase is complete, the AI training process has three key stages:
1. Training
2. Validation
3. Testing
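The three stages above correspond to three data partitions: a training set, a validation set, and a test set. The following is a minimal sketch of one way to create them, not a prescribed method. It assumes, as in the capstone scenario, that a patient can have several exams, so it assigns whole patients to a partition rather than individual exams. The function name `split_by_patient`, the 80/10/10 fractions, and the `(patient_id, exam_id)` record layout are illustrative assumptions.

```python
import random


def split_by_patient(exams, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (patient_id, exam_id) records into train/val/test partitions.

    All exams of a patient land in the same partition, so no patient
    appears in more than one of them.
    """
    # Shuffle the unique patient IDs reproducibly.
    patients = sorted({patient_id for patient_id, _ in exams})
    random.Random(seed).shuffle(patients)

    # Reserve the first patients for test, the next ones for validation.
    n_test = int(len(patients) * test_frac)
    n_val = int(len(patients) * val_frac)
    test_ids = set(patients[:n_test])
    val_ids = set(patients[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for patient_id, exam_id in exams:
        if patient_id in test_ids:
            splits["test"].append((patient_id, exam_id))
        elif patient_id in val_ids:
            splits["val"].append((patient_id, exam_id))
        else:
            splits["train"].append((patient_id, exam_id))
    return splits


# Toy data: 100 exams spread over 40 hypothetical patients.
toy_exams = [(f"patient-{i % 40}", f"exam-{i}") for i in range(100)]
toy_splits = split_by_patient(toy_exams)
```

In this arrangement the training set is used to fit the model, the validation set to compare configurations and decide when to stop training, and the test set only once, for the final evaluation.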
Keys to successful AI training
You need three ingredients to train AI well: high-quality data, accurate data annotation, and a culture of experimentation.
High-quality data
Bad data skews AI’s judgment and produces undesirable results. It can even create AI that is biased.
Accurate data annotation
Not only do you need to have plenty of high-quality data, but you must also accurately annotate it. Otherwise, your AI will have no contextual guidance to help it properly interpret the data, let alone learn from it. For example, correctly annotated images can help teach AI programs to tell the difference between suspected skin cancer and benign birthmarks.
A culture of experimentation
7) Why is it important to evaluate your machine learning algorithm?
Evaluating your machine learning algorithm is an essential part of any project. Your model may give satisfying results when evaluated using one metric, such as accuracy, but poor results when evaluated against other metrics, such as logarithmic loss.
Choose an evaluation metric depending on your use case. Different metrics work better for different purposes. Selecting the appropriate metrics also allows you to be more confident in your model when presenting your data and findings to others. On the flip side, using the wrong evaluation metric can be detrimental to a machine learning use case. A common example is focusing on accuracy, with an imbalanced dataset.
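The accuracy pitfall can be shown with a few lines of arithmetic. The sketch below uses the class balance from the capstone scenario (about 300 positive and 2,700 negative exams) and a deliberately useless "model" that predicts negative for every exam. The helper functions are written out by hand for illustration and are not taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that the model identified."""
    positives = sum(1 for t in y_true if t == 1)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives if positives else 0.0


# 300 positive and 2,700 negative exams, as in the scenario.
y_true = [1] * 300 + [0] * 2700
# A trivial model that always predicts "negative".
y_pred = [0] * 3000

print(accuracy(y_true, y_pred))  # 0.9
print(recall(y_true, y_pred))    # 0.0
```

The model scores 90% accuracy while missing every positive case, which is why a metric such as recall is worth reporting alongside accuracy when the classes are imbalanced.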
Healthcare privacy is a central ethical concern involving the use of big data in healthcare, with vast amounts of personal information widely accessible electronically.
Medical records and prescription data are being used, and even sold, for a variety of purposes. As long as the data are de-identified, patients’ permission isn’t needed. However, some medical ethicists argue, “There is a need for sound regulation to guide and oversee this brave new world of algorithmic-based healthcare.”