Extractive vs Abstractive Summarization in Healthcare
There are two approaches to summarizing text: extractive summarization, which copies the most relevant sentences from a text, and abstractive summarization, which generates new sentences. Abstractive summarization is the most promising method for automated text summarization and has only recently become practical thanks to advances in transformer-based NLP models.
Nov 28, 2022
Summarizing text is surprisingly hard. While there are countless ways to distill a text down to its most important parts, doing it well requires mastering conciseness, coherence, and comprehension. We create summaries regularly for numerous activities: academic papers, Wikipedia entries, movies, books, legal documents, business ideas, and even ourselves (the “tell me about yourself” in interviews). It’s no wonder that automated summarization has been pursued for over 70 years in the fields of statistics and computer science, and for 20 years within healthcare. Recently, the results from automated summarization systems have been impressive. And for the first time, a good abstractive summary is now possible in healthcare.
So what is abstractive summarization? In the field of summarization, there are two approaches: extractive and abstractive methods. Extractive summarization copies (or extracts) the most important words or phrases from a text and concatenates them into a summary: imagine selecting the top 3 sentences in a document and presenting those as the summary. Abstractive summarization generates entirely new sentences by synthesizing the salient points of the original text: think of paraphrasing the central idea in your own words.
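To make the contrast concrete, here is a minimal, purely illustrative sketch of the extractive idea in Python: score every sentence by simple word frequency and keep the top three. Real extractive systems use far more sophisticated scoring, and this toy example does not reflect any particular product.

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 3) -> str:
    """Naive extractive summary: score each sentence by how frequent its
    words are in the whole document, then keep the top-scoring sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Pick the top-N sentences, but present them in their original order,
    # since reordering would make the summary even less readable.
    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    return " ".join(s for s in sentences if s in top)
```

An abstractive summarizer, by contrast, would not reuse any of these sentences verbatim; it would write new ones, as shown later with a transformer model.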

An obvious problem with extractive summarization is that it lacks fluency; the sentences don’t flow naturally from one to the next. The result is generally jarring, since there are no transitions between one topic and the next. Second, and most importantly, the main idea of the text might be spread throughout the original source rather than captured in any single sentence, so comprehension can suffer. Extractive summarization generally works well for a structured source text, like a news article, where the author presents the most important content to the reader in a key thesis sentence (that topic statement we were trained to write for the five-paragraph essay). Where extractive methods fail is on more artistic and unstructured text, where the main idea builds as a crescendo over numerous pages, such as when we read a great novel and only come to understand the main idea as we reach the denouement. For example, extraction would work poorly for a novel like Moby Dick, which opens with the iconic line “Call me Ishmael.” While a beautiful and popular line, the sentence by itself provides little indication that the novel is ultimately about the destructive nature of Ahab’s obsessive pursuit of a gigantic sperm whale.
In healthcare, extractive summarization is great for pulling out the high-level diagnoses, allergies, and past procedures for a patient, and then structuring that content with a simple rule-based algorithm. A general weakness of this approach is that the summary reads like a computer wrote it, and maintaining all those rules quickly becomes difficult. The most glaring weakness is that these summaries lack context for how the patient is progressing over their treatment. For example, about 10% of the US population has type 2 diabetes, so it’s a very common disease; but it can be life-threatening if not managed properly. A typical extractive summary would inform you that a patient has the ICD-10 code for diabetes, but it provides no context on whether the patient has been managing their blood sugar levels well or is at risk for hospitalization. The course of their treatment is not captured in the extractive summary, and the physician is still left to search through hundreds of notes to understand their patient. This is where abstractive summarization techniques excel.
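Before turning to abstractive methods, it helps to see why rule-based extractive summaries read the way they do. The sketch below is a caricature with made-up record fields, not any real system; the point is that nothing in its output says whether the diabetes is actually under control.

```python
# Hypothetical structured patient record; field names and values are
# illustrative only, not drawn from any real EHR schema.
patient = {
    "diagnoses": ["E11.9 Type 2 diabetes mellitus without complications"],
    "allergies": ["penicillin"],
    "procedures": ["colonoscopy (2019)"],
}

def rule_based_summary(record: dict) -> str:
    """Template-style summary of coded data: accurate but flat, with no
    sense of how the patient's treatment has progressed over time."""
    parts = []
    if record["diagnoses"]:
        parts.append("Diagnoses: " + "; ".join(record["diagnoses"]) + ".")
    if record["allergies"]:
        parts.append("Allergies: " + ", ".join(record["allergies"]) + ".")
    if record["procedures"]:
        parts.append("Past procedures: " + ", ".join(record["procedures"]) + ".")
    return " ".join(parts)

print(rule_based_summary(patient))
```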
Abstractive summarization is relatively new in healthcare and has coincided with the advancement of NLP transformer models, which have taken off since 2017 (the introduction of the Transformer architecture, followed by models like BERT in 2018). Because healthcare is particularly challenging, the only commercial applications to date automate the impression section of radiology reports. The impression section summarizes the key findings of a radiology report, so a computerized version saves radiologists time by sparing them from writing that summary manually. The findings section of a radiology report is generally under 500 words, so an automated summary does not have to contend with the challenges of long-form documentation (i.e. summarizing thousands of words from the whole medical record). That said, these commercial applications still have to address the other challenges of producing a good, factual summary tailored to an individual physician, so the technology is definitely impressive.
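For a sense of what transformer-based abstractive summarization looks like in code, here is a hedged sketch using a general-purpose Hugging Face model (facebook/bart-large-cnn). It is not a clinical model and not the system behind any radiology product; it simply shows a model generating a new summary sentence rather than copying one from the input.

```python
# Illustrative only: a general-news summarization model, not a clinically
# validated system. The findings text below is a made-up example.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

findings = (
    "The lungs are clear without focal consolidation, pleural effusion, or "
    "pneumothorax. The cardiomediastinal silhouette is within normal limits. "
    "No acute osseous abnormality is identified. Surgical clips are noted in "
    "the right upper quadrant, consistent with prior cholecystectomy."
)

# The model writes new sentences conditioned on the findings, which is what
# distinguishes abstractive from extractive summarization.
impression = summarizer(findings, max_length=30, min_length=5, do_sample=False)
print(impression[0]["summary_text"])
```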
In Abstractive Health’s current pilot with Weill Cornell Medical Center, we are building the first commercial abstractive summary of the full patient record in healthcare (hundreds of notes, not just the radiology report). Our summarization architecture is based on those same transformer models, with some significant modifications. One of the core research claims we are demonstrating is that our automated summary of the patient chart is a close equivalent to a physician-written summary. Thus, our tool could be used as a supplement for physicians in patient admission, transfer, and discharge workflows.
