Assignment 4: Machine Learning Model Evaluation
• Three multi-part, multiple-choice questions.
• AI in Healthcare with Phase 2 data set (HTML file)
• Details of the Q1 & Q2 multiple-choice questions are shown in the attached question files.
• Lecture notes on Machine Learning in Healthcare for your reference
Phase 4: Model Evaluation
Welcome to Phase 4 of the capstone project. This section focuses on the evaluation of the trained models. The questions relate to the challenges faced by the teams working on the two projects introduced in the first section.
You have made recommendations (based on your answers in the prior phase) to both of the research teams. They have taken your suggestions into account, and have since refined their results. Both teams have e-mailed you summaries of recent progress, which are shown below.
Project 1: CXR-based COVID-19 Detector
Hi,
Thank you for your excellent suggestions. We addressed the overfitting problem by following your recommendations: we applied dropout with probability 0.5 and added small random rotations to our data augmentation pipeline. We now have a couple of models, and we are interested in your opinion on which model we should proceed with.
We trained models with early stopping twice: once using validation AUROC as the stopping criterion and once using validation loss as the stopping criterion. We then plotted the ROC curves of both models.
The first model, model A, was chosen because it had the highest validation AUROC. The second model, model B, was chosen because it had the lowest validation loss. Let us know which model you think we should use.
We understand that this is meant to be a COVID detector for the purposes of worklist prioritization. Could you help us pick a model given this use case? Let us know what you think, thank you.
Project 2: EHR-based Intubation Predictor
Hello!
We are happy to report that our new method of training on non-COVID samples has given us strong results on the COVID dataset. Our results are very promising and we are excited to bring this project to its conclusion. In order to do that, we need to determine an operating point.
Given that this model will be used for automated triage, how should we reason about choosing a threshold? Below is the precision-recall curve for this model.
Let us know your thoughts, thank you.
In the following quiz, you will answer questions that examine the model use cases of both Team 1 and Team 2.
Phase IV Model Evaluation and Deployment Question Sheet
Open the HTML file attached to read the scenario. After reading through the case, please review the three questions in this assignment.
Keep the HTML file open so that you can easily refer back to the scenario while answering. For each question, select the correct answer and explain your choice in 2 to 3 sentences.
Q1
Part 1.
Which of the following definitions best describes the AUROC metric?
How many of the samples predicted positive are actually positive
How well the model is able to segment objects
How many positive samples are correctly predicted as positive
How well separated the predicted probabilities of negatives are from positives
Part 2.
Why might model B be more useful than model A in worklist prioritization?
Model B is not more useful than model A
Model B places more abnormal exams early in the worklist than model A
Model B places more normal exams early in the worklist than model A
Model B places fewer abnormal exams at the end of the worklist than model A
Part 3.
How might both models be leveraged to immediately produce better overall performance?
Each model can be deployed at separate clinics
For new exams, the predictions of both models can be averaged together
The models can be re-trained using the same hyperparameters
The trained models can be used to train other models
Part 4.
When models are evaluated, they are often trained multiple times. Why might this be the case?
Q2
Part 1.
For automated triage, which of the following is the most important metric to satisfy when choosing an operating point?
Specificity, as it measures how many negative samples are correctly predicted as negative
PPV / Precision, as it measures how many of the samples predicted positive are actually positive
Sensitivity / Recall, as it measures how many positive samples are correctly predicted as positive
Intersection Over Union (IOU), as it measures how well the model is able to segment the lesions
Part 2.
Who is the beneficiary of the EHR-based invasive mechanical ventilation predictor?
1/ The provider – the output helps the clinician manage their patients
2/ The patient – the output will help the patient make informed decisions
3/ The hospital – the output will identify the need for additional ICU rooms
4/ None of the above
Part 3.
Hypoxemia is an important feature in the EHR-based invasive mechanical ventilation predictor model that was validated in the independent sample. This can be used as a component of the:
Valid clinical association
Analytical validation
Clinical validation
None of the above
Part 4.
The model has identified that patient X is at high risk of invasive mechanical ventilation within the next 48 hours, and this information was sent to the clinical team. What action can be taken based on the prediction, together with any mitigation strategy?
Reserve a ventilator and bed in ICU to ensure the patient has the resources they need when they begin to clinically decline
Administration of an experimental drug currently under investigation for COVID deterioration
Notify the insurance company to ensure payment of the anticipated extended length of stay
None of the above
Q3
This question is not tied to the HTML file, as it applies more broadly to many deployment settings.
It is a multi-part, multiple-choice question that deals with issues and hypothetical scenarios related to model deployment in the clinical setting.
Part 1.
Before you deploy the model, you want to test the fairness of the model based on anti-classification. What variables would you use to test fairness?
Gender
Race
County of residence
Gender and Race
All of the above
Part 2.
You provide the model output to the clinical team. What would be the risk category?
Category I
Category II
Category III
Category IV
Part 3.
Given the novelty of COVID, what do you think could be used for a valid clinical association argument?
Performance of a clinical trial based on your AI solution
Literature Searches
Examples of how your model can generate new evidence
All of the above
Part 4.
Your model uses past symptoms as a predictor for invasive mechanical ventilation; however, 40% of your population are on public insurance and likely do not have the same access to care as those on private insurance. How would this bias your results?
Underreporting of symptoms in publicly insured patients
Patients on public insurance are likely to have more symptoms than patients on private insurance because they are known to have more comorbidities
This will not bias the outcome (need for invasive mechanical ventilation) because symptoms were not a top predictor of the outcome
None of the above
Part 5.
In real time, it takes 24 hours to obtain and process the images needed for the CXR-based COVID detector. Would this time lag affect the clinical utility of the AI solution? Why or why not?
Class notes on Machine Learning and AI in a healthcare setting
Artificial intelligence (AI) has transformed industries around the world, and has the potential to radically alter the field of healthcare. Imagine being able to analyze data on patient visits to the clinic, medications prescribed, lab tests, and procedures performed, as well as data from outside the health system (such as social media, credit card purchases, census records, and Internet search activity logs that contain valuable health information), and you’ll get a sense of how AI could transform patient care and diagnoses.
In this course, we’ll discuss the current and future applications of AI in healthcare with the goal of learning to bring AI technologies into the clinic safely and ethically. Here is a list of the learning objectives for a quick reference.
1) Solving the problems and challenges within the U.S. healthcare system requires a deep understanding of how the system works. Successful solutions and strategies must take into account the realities of the current system.
This course explores the fundamentals of the U.S. healthcare system. It will introduce the principal institutions and participants in healthcare systems, explain what they do, and discuss the interactions between them. The course will cover physician practices, hospitals, pharmaceuticals, and insurance and financing arrangements. We will also discuss the challenges of healthcare cost management, quality of care, and access to care. While the course focuses on the U.S. healthcare system, we will also refer to healthcare systems in other developed countries.
AI in healthcare use case: Natural language processing
When subject matter experts help train AI algorithms to detect and categorize data patterns that reflect how language is actually used in their part of the health industry, natural language processing (NLP) enables the algorithm to isolate meaningful data. This gives decision-makers the information they need to make informed care or business decisions quickly.
Healthcare payers
For healthcare payers, this NLP capability can take the form of a virtual agent using conversational AI to help connect health plan members with personalized answers at scale.
Government health and human service professionals
For government health and human service professionals, a case worker can use AI solutions to quickly mine case notes for key concepts and concerns to support an individual’s care.
Clinical operations and data managers
Clinical operations and data managers executing clinical trials can use AI functionality to accelerate searches and validation of medical coding, which can help reduce the cycle time to start, amend, and manage clinical studies.
2) This course introduces you to a framework for successful and ethical medical data mining. We will explore the variety of clinical data collected during the delivery of healthcare. You will learn to construct analysis-ready datasets and apply computational procedures to answer clinical questions. We will also explore issues of fairness and bias that may arise when we leverage healthcare data to make decisions about patient care. Only by training AI to correctly perceive information and make accurate decisions based on the information provided can you ensure your AI will perform the way it’s intended.
3) Machine learning and artificial intelligence hold the potential to transform healthcare and open up a world of incredible promise. But we will never realize the potential of these technologies unless all stakeholders have basic competencies in both healthcare and machine learning concepts and principles.
This course will introduce the fundamental concepts and principles of machine learning as it applies to medicine and healthcare. We will explore machine learning approaches, medical use cases, metrics unique to healthcare, as well as best practices for designing, building, and evaluating machine learning applications in healthcare. The course will empower those with non-engineering backgrounds in healthcare, health policy, pharmaceutical development, as well as data science with the knowledge to critically evaluate and use these technologies.
4) With artificial intelligence applications proliferating throughout the healthcare system, stakeholders are faced with both the opportunities and the challenges of these evolving technologies. This course explores the principles of AI deployment in healthcare and the framework used to evaluate downstream effects of AI healthcare solutions.
5) This last course includes a project that takes you on a guided tour of the concepts covered in the earlier courses. We have organized this experience around the journey of a patient who develops respiratory symptoms and, given the concerns around COVID-19, seeks care with a primary care provider. We will follow the patient’s journey through the lens of the data that are created at each encounter, which will bring us to a unique de-identified dataset created specially for this specialization. The dataset spans EHR as well as image data, and using it we will build models that enable risk-stratification decisions for our patient. We will review how the different choices you make (such as those around feature construction, the data types to use, how the model evaluation is set up, and how you handle the patient timeline) affect the care that would be recommended by the model. During this exploration, we will also discuss the regulatory and ethical issues that come up as we attempt to use AI to help us make better care decisions for our patient. This course will be a hands-on experience of a day in the life of a medical data miner.
6) How does AI training work?
AI training starts with data. While the actual size of the dataset needed is dependent on the project, all machine learning projects require high-quality, well-annotated data in order to be successful. It’s the old GIGO rule of computer science — garbage in, garbage out. If you train your AI using poor-quality or incorrectly tagged data, you’ll end up with poor-quality AI.
Once the quality assurance phase is complete, the AI training process has three key stages:
1. Training
2. Validation
3. Testing
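The three stages map onto three separate partitions of the data. The snippet below is a minimal sketch of one way to create them, not a prescribed procedure: the function name three_way_split, the 70/15/15 proportions, and the integer patient IDs are all assumptions made for illustration. It splits at the patient level, on the assumption that every record belonging to one patient should stay in a single partition so that the test data remain truly unseen.

```python
import random


def three_way_split(patient_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle patient IDs and split them into train, validation, and test lists.

    The name and the default fractions are illustrative assumptions.
    """
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]               # held out until the final evaluation
    val = shuffled[n_test:n_test + n_val]  # used for tuning and early stopping
    train = shuffled[n_test + n_val:]      # used to fit the model parameters
    return train, val, test


# Hypothetical cohort of 100 patients identified by integers.
train_ids, val_ids, test_ids = three_way_split(range(100))
print(len(train_ids), len(val_ids), len(test_ids))
```

Fixing the seed keeps the partitions stable across runs, which makes results from repeated training runs comparable.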
Keys to successful AI training
You need three ingredients to train AI well: high-quality data, accurate data annotation, and a culture of experimentation.
High-quality data
Bad data skews AI’s judgment and produces undesirable results. It can even create AI that is biased.
Accurate data annotation
Not only do you need to have plenty of high-quality data, but you must also accurately annotate it. Otherwise, your AI will have no contextual guidance to help it properly interpret the data, let alone learn from it. For example, correctly annotated images can help teach AI programs to tell the difference between suspected skin cancer and benign birthmarks.
A culture of experimentation
7) Why is it important to evaluate your machine learning algorithm?
Evaluating your machine learning algorithm is an essential part of any project. Your model may give satisfying results when evaluated using one metric, say accuracy, but poor results when evaluated against other metrics, such as logarithmic loss.
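To make the difference between accuracy and logarithmic loss concrete, here is a small sketch using made-up predicted probabilities. The two "models" (confident and hesitant), the four labels, and the 0.5 threshold are hypothetical values chosen only for illustration. Both sets of predictions classify every sample correctly, yet their log loss differs because log loss also penalizes probabilities that sit close to the decision threshold.

```python
import math


def accuracy(y_true, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(t == p) for t, p in zip(y_true, preds)) / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [1, 1, 0, 0]
confident = [0.90, 0.80, 0.10, 0.20]
hesitant = [0.55, 0.51, 0.49, 0.45]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(y_true, probs), round(log_loss(y_true, probs), 3))
```

The hesitant predictions score the same accuracy as the confident ones but a noticeably higher (worse) log loss, which is the kind of disagreement between metrics described above.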
Choose an evaluation metric depending on your use case. Different metrics work better for different purposes. Selecting the appropriate metrics also allows you to be more confident in your model when presenting your data and findings to others. On the flip side, using the wrong evaluation metric can be detrimental to a machine learning use case. A common example is relying on accuracy when the dataset is imbalanced.
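The accuracy pitfall on imbalanced data can be illustrated with a short sketch. The numbers are invented for the example: a hypothetical cohort with 5 positive and 95 negative cases, and a trivial "model" that predicts negative for everyone. The helper names (confusion_counts, summarize) are likewise assumptions, not part of any course material.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Divide, returning None when the metric is undefined (zero denominator)."""
    return numerator / denominator if denominator else None


def summarize(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": ratio(tp, tp + fn),  # recall: positives correctly found
        "specificity": ratio(tn, tn + fp),  # negatives correctly found
        "ppv": ratio(tp, tp + fp),          # precision: predicted positives that are real
    }


# Hypothetical imbalanced cohort: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a "model" that never flags anyone

metrics = summarize(y_true, always_negative)
print(metrics)
```

Here accuracy looks high even though the model finds none of the positive cases, so sensitivity is zero and PPV is undefined. Reporting several metrics side by side exposes the problem that accuracy alone hides.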
Healthcare privacy is a central ethical concern involving the use of big data in healthcare, with vast amounts of personal information widely accessible electronically.
Medical records and prescription data are being used, and even sold, for a variety of purposes. As long as the data are de-identified, patients’ permission isn’t needed. However, some medical ethicists argue, “There is a need for sound regulation to guide and oversee this brave new world of algorithmic-based healthcare.”