Google AI Claims 99 Percent Accuracy In Metastatic Breast Cancer Detection

Posted by BeauHD on Friday October 12, 2018 @01:00PM from the promising-solutions dept.

Researchers at the Naval Medical Center San Diego and Google AI, a division within Google dedicated to artificial intelligence research, are using cancer-detecting algorithms to detect metastatic tumors by autonomously evaluating lymph node biopsies. VentureBeat reports: Their AI system -- dubbed Lymph Node Assistant, or LYNA -- is described in a paper titled "Artificial Intelligence-Based Breast Cancer Nodal Metastasis Detection," published in The American Journal of Surgical Pathology. In tests, it achieved an area under the receiver operating characteristic (AUC) -- a measure of detection accuracy -- of 99 percent. That's superior to human pathologists, who according to one recent assessment miss small metastases on individual slides as much as 62 percent of the time when under time constraints. LYNA is based on Inception-v3, an open source image recognition deep learning model that's been shown to achieve greater than 78.1 percent accuracy on Stanford's ImageNet dataset. As the researchers explained, it takes as input a 299-pixel image (Inception-v3's default input size), outlines tumors at the pixel level, and, in the course of training, extracts labels -- i.e., predictions -- of the tissue patch ("benign" or "tumor") and adjusts the model's algorithmic weights to reduce error.

In tests, LYNA achieved 99.3 percent slide-level accuracy. When the model's sensitivity threshold was adjusted to detect all tumors on every slide, it exhibited 69 percent sensitivity, accurately identifying all 40 metastases in the evaluation dataset without any false positives. Moreover, it was unaffected by artifacts in the slides such as air bubbles, poor processing, hemorrhage, and overstaining. LYNA wasn't perfect -- it occasionally misidentified giant cells, germinal cancers, and bone marrow-derived white blood cells known as histiocytes -- but managed to perform better than a practicing pathologist tasked with evaluating the same slides. And in a second paper published by Google AI and Verily, Google parent company Alphabet's life sciences subsidiary, the model halved the amount of time it took for a six-person team of board-certified pathologists to detect metastases in lymph nodes.

Re:Why focus on breasts? by PPH · 2018-10-12 14:14 · Score: 2

Very large training data set available on line.

--
Have gnu, will travel.
statistics by bigtreeman · 2018-10-12 16:02 · Score: 2, Informative

1 in 99 is really bad
Of 1000 women, about 120 will get breast cancer; if we misdiagnose 10 of those cases, that could be as bad as an 8% failure rate
fuck statistics

--
Go well
1. Re:statistics by religionofpeas · 2018-10-12 20:05 · Score: 4, Insightful
  
  99% is pretty good for a notoriously difficult problem.
  Yeah, sucks if you're part of the 1%, but you'd be part of the 100% if there wasn't any test.
2. Re:statistics by Anonymous Coward · 2018-10-13 02:19 · Score: 2, Insightful
  
  I wanted to comment on a few things about medical statistics that are easy to misunderstand. Unfortunately, the summary of the article misuses some terminology which further obfuscates the issue.
  Some basic measures of a test are its sensitivity and specificity.
  1. Sensitivity is a measure of false negatives. It means that if you have 100 people with the disease, the test catches this percentage. So a test with a sensitivity of 99% would be positive on 99/100 patients with the disease.
  2. Specificity is a measure of how many false positives there are. So if you use a test with 99% specificity on 100 people without the disease, you will get 1 false positive.
  3. An ROC curve plots sensitivity on one axis against 1-specificity on the other, one point per test-cutoff threshold (because most tests give a continuous, not binary, result). The area under it is a general measure of the accuracy of a test but not the be-all-end-all. A perfect test has an area under the curve (AUC) of 1; a useless test has an AUC of 0.5 (no better than chance).
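
The three definitions above can be sketched in a few lines of Python. This is only an illustration on made-up scores, not anything from the paper: the function names and the toy data are hypothetical, and the AUC is computed by the pairwise-ranking shortcut (the chance that a random diseased case scores higher than a random healthy one), which gives the same value as the area under the ROC curve.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call a case positive when its score is >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Fraction of (diseased, healthy) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease, 0 = no disease.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for t in (0.75, 0.50, 0.25):
    sens, spec = sensitivity_specificity(scores, labels, t)
    print(f"threshold {t:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"AUC: {auc(scores, labels):.3f}")
```

On this toy set, lowering the threshold from 0.75 to 0.25 raises sensitivity from 0.50 to 1.00 while specificity falls from 1.00 to 0.50, which is the trade-off an ROC curve traces out.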
  Usually if you increase the sensitivity (to lower false negatives) you must choose a threshold that corresponds to more false positives, and vice versa. Where you put your thresholds depends on what you want to use the test for. This is because while we doctors look at sensitivity and specificity to get a general idea of how a test works, what we really care about is the positive and negative predictive values (PPV and NPV). The positive predictive value is the percentage of positive tests that mean the patient has the disease; the negative predictive value is the percentage of negative tests that mean the patient does not. To calculate these you need to know what percentage of the tested population has the disease (the disease prevalence), which is hard to know exactly, but we try to eyeball it to choose the right test. While the sensitivity and specificity are test characteristics at a specific threshold, the PPV and NPV apply to a specific patient population, for example 40-50yo women with a painless breast lump. 120/1000 is NOT the prevalence in the population tested, to refer to a previous poster (aside: in the article they talk about PPV and NPV, but they are of course using the prevalence of cancer in the sample slide set to calculate these).
  An example - let's say you have a test for kryptonite sensitivity, a rare disease seen only in comic book heroes. Let's say the test has 99% sensitivity and 99% specificity, and you are testing 1,000,000 people, only one of whom has the disease. You will get about 10,000 positives (99% specificity means 1% of the healthy people test positive anyway), but you'll have a 99% chance (99% sensitivity) that Superman will be among those 10,000. Your positive predictive value is quite low, about 1/10,000 or 0.01%, so you're not at all confident that a given person with a positive test has kryptonite sensitivity. Your negative predictive value is quite good, approaching 100%, so you're fairly confident that a negative test means that person is an earthling. This is just an example that shows that sensitivity and specificity must be interpreted in the context of the population being tested. What we usually end up doing for problems like this is starting with a very sensitive test and then following it up with a more specific test.
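
For anyone who wants to check the kryptonite arithmetic, here is a rough sketch using Bayes' rule. The function name `predictive_values` is made up for illustration; the inputs are the ones in the example (99% sensitivity, 99% specificity, one case in 1,000,000). It works with expected proportions rather than simulating individual people.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from test characteristics and disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


population = 1_000_000
prevalence = 1 / population  # one Superman in a million
ppv, npv = predictive_values(0.99, 0.99, prevalence)

# Expected number of healthy people who test positive anyway.
expected_false_positives = (1 - 0.99) * (population - 1)

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"PPV: {ppv:.4%}")
print(f"NPV: {npv:.6%}")
```

The expected false positives come out near 10,000 and the PPV near 0.01%, while the NPV is within a hair of 100%. Plugging in a higher prevalence (say 0.1 for a symptomatic clinic population) shows how quickly the PPV improves with the same test.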
  The article (not the summary) says they were able to get 100% sensitivity with the algorithm but they had only 84% specificity.
  When they reoptimized the algorithm for specificity of 100% it had a sensitivity of 69%.
  Human pathologists resemble the specific but less sensitive algorithm.
  Thus, the likely utility of this would be in the sensitive mode, flagging the slides for manual review. OK?
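
A minimal sketch of the two operating modes described above, on made-up scores. Nothing here comes from LYNA itself: `operating_points` and the toy data are hypothetical, and a slide is treated as flagged when its score is at or above the cutoff.

```python
def operating_points(scores, labels):
    """Return (sensitive_cutoff, specific_cutoff).

    sensitive_cutoff: highest cutoff that still flags every tumor slide.
    specific_cutoff: lowest tumor score above every benign score, so that
        no benign slide is flagged (None if no such score exists).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    sensitive_cutoff = min(pos)
    above_all_benign = [s for s in pos if s > max(neg)]
    specific_cutoff = min(above_all_benign) if above_all_benign else None
    return sensitive_cutoff, specific_cutoff


# Hypothetical slide scores: 1 = tumor, 0 = benign.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

sensitive_cutoff, specific_cutoff = operating_points(scores, labels)

# Sensitive mode: flag generously, then send flagged slides to a human.
flagged_for_review = [i for i, s in enumerate(scores) if s >= sensitive_cutoff]

print("sensitive cutoff:", sensitive_cutoff)
print("specific cutoff:", specific_cutoff)
print("slides flagged for manual review:", flagged_for_review)
```

In this toy set the sensitive cutoff catches all four tumor slides at the cost of one benign slide going to review, while the specific cutoff flags no benign slides but misses two tumors. The pathologist then only has to look at the flagged slides instead of all eight.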