Accuracy of an AI-Generated Summary: A Deep Dive into ROUGE Scores
TLDR from ChatGPT: The article discusses the importance of using ROUGE scores to evaluate the accuracy of AI-generated summaries. While ROUGE scores provide insight into word overlap and similarity, they have limitations as they do not consider semantic coherence and factual accuracy, highlighting the need for human evaluation as a gold standard.
With so much information available, our ability to extract the essential content - the TLDR (too long; didn't read) - from text using generative AI is incredibly helpful. For instance, the TLDR at the top of this page was created with ChatGPT. Whether we're sifting through news articles, research papers, or electronic health records (EHRs), we are constantly facing information overload, so summarization tools that make our lives more time efficient are in hot demand right now. AI is stepping up to meet this demand by quickly generating summaries that condense the information while still preserving key concepts. But how do we know that we can trust these AI summaries? And are they as good as human-written ones?
To answer these questions, NLP scientists use a metric known as the ROUGE score. In this post, we delve into this metric, exploring both its significance in measuring the performance of AI summarization systems and its limitations.
What is ROUGE?
Introduced in 2004 at the University of Southern California, ROUGE (or Recall-Oriented Understudy for Gisting Evaluation) has since become a cornerstone for evaluating AI-generated summaries because it is an objective and flexible measurement. ROUGE is a family of related metrics, each of which calculates the similarity between an AI-generated summary and a reference summary (generally a human-written “gold standard”).
The ROUGE metrics are the following:
Example Reference Summary: “The puppy was so excited to see his owners he jumped up and down.”
Example AI-Generated Summary: “The puppy jumped up and down excitedly.”
- ROUGE-N: Measures the overlap of n-grams (sequences of n words). For example:
  ○ ROUGE-1: Compares the number of overlapping individual words. In the example above, those are “the”, “puppy”, “jumped”, “up”, “and”, “down” (“excited” and “excitedly” are different words, so they do not count as an exact match)
  ○ ROUGE-2: Compares the number of overlapping pairs of consecutive words, targeting the co-occurrence of word sequences. In the example above, that is “the puppy”, “jumped up”, “up and”, “and down”
- ROUGE-L: Measures the longest common subsequence (LCS). The LCS doesn’t require consecutive matches; rather, it reflects in-sequence matches that follow sentence-level word order. In the example above, the LCS is “the puppy jumped up and down”
- ROUGE-S: Measures structural organization by considering any pair of words in a sentence (skip-bigrams) that appear in their original order, but with gaps allowed between them. In the example above, some of the skip-bigrams that match between the two summaries are “the puppy”, “puppy jumped”, “jumped down”, and “up down” (see the sketch below)
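To make the skip-bigram idea concrete, here is a small Python sketch (written purely for illustration; it is not taken from any particular ROUGE library) that lists the skip-bigrams the two example summaries share:

```python
# Enumerate the skip-bigrams ROUGE-S considers: every ordered pair of words,
# with any gap allowed between them.
from itertools import combinations

def skip_bigrams(sentence):
    words = sentence.lower().replace(".", "").split()
    # combinations() keeps the original word order, which is what ROUGE-S requires
    return set(combinations(words, 2))

reference = "The puppy was so excited to see his owners he jumped up and down."
candidate = "The puppy jumped up and down excitedly."

matching = skip_bigrams(reference) & skip_bigrams(candidate)
print(sorted(matching))
# includes ('the', 'puppy'), ('puppy', 'jumped'), ('jumped', 'down'), ('up', 'down'), ...
```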
The three most common ROUGE metrics are ROUGE-1, ROUGE-2, and ROUGE-L.
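In practice, these scores are rarely computed by hand. The minimal sketch below assumes Google's open-source rouge-score package (installable with `pip install rouge-score`) and scores the puppy example; the precision, recall, and F1 values it prints are explained in the next section.

```python
# A minimal sketch assuming the open-source `rouge-score` package is available.
from rouge_score import rouge_scorer

reference = "The puppy was so excited to see his owners he jumped up and down."
candidate = "The puppy jumped up and down excitedly."

# use_stemmer=True applies Porter stemming to words before matching (optional)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f}  recall={s.recall:.2f}  f1={s.fmeasure:.2f}")
```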
Interpreting ROUGE scores
ROUGE scores serve as an indicator of similarity based on shared words, either in the form of n-grams or word sequences. The score ranges from 0 to 1, where a higher score indicates a greater similarity. This gives insight into how well the automated summary captures relevant information.
Recall, precision, and F1 scores can also be computed to evaluate the generated text. Precision assesses accuracy, reflecting the proportion of overlapping content out of everything the AI-generated summary contains - in other words, how much of what the model produced actually appears in the reference. Recall, on the other hand, quantifies the model’s ability to avoid omitting crucial details by calculating the proportion of the reference content that the generated summary captures. To strike a balance between precision and recall, the F1 score - the harmonic mean of the two - is commonly reported as a comprehensive evaluation metric.
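As a concrete illustration of these three quantities, here is a hand-worked ROUGE-1 computation for the puppy example (exact word matching, no stemming), written only to show the arithmetic rather than to mirror any library's internals:

```python
# Hand-computed ROUGE-1 precision, recall, and F1 for the puppy example.
from collections import Counter

reference = "the puppy was so excited to see his owners he jumped up and down".split()
candidate = "the puppy jumped up and down excitedly".split()

# Clipped overlap: each reference word can only be matched as many times as it appears
overlap = sum((Counter(reference) & Counter(candidate)).values())  # 6 shared words

precision = overlap / len(candidate)                 # 6/7  ~ 0.86: how much of the AI output is correct
recall = overlap / len(reference)                    # 6/14 ~ 0.43: how much of the reference is covered
f1 = 2 * precision * recall / (precision + recall)   # ~0.57: harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```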
By leveraging ROUGE scores, we gain deeper insight into the effectiveness of an automated summary; they help reveal what essential information is included in the summary and what is overlooked. The metrics also allow us to compare the effectiveness of different models and answer questions such as how much better GPT-4 is than GPT-3.


Limitations
Although the ROUGE score provides insight into the success of AI-generated summaries, it is an imperfect measure of a summary’s overall quality: ROUGE fails to take semantic meaning and factual accuracy into account. Here are some examples:
- Reference Summary 1: “The brown bunny slept while the energized squirrel scavenged for food.”
AI-Generated Summary 1: “The bunny slept while the squirrel scavenged for food.”
Semantically accurate. High ROUGE score. There is good word overlap and the summary is semantically accurate, so ROUGE did a good job.
- Reference Summary 2: “The brown bunny slept while the energized squirrel scavenged for food.”
AI-Generated Summary 2: “The brown squirrel slept while the energized bunny scavenged for food.”
Semantically inaccurate. High ROUGE score. There is high word overlap, but the summary is factually inaccurate, so ROUGE gives us a false impression.
- Reference Summary 3: “The brown bunny slept while the energized squirrel scavenged for food.”
AI-Generated Summary 3: “The rabbit rested as the lively squirrel searched for nourishment.”
Semantically accurate. Low ROUGE score. While there is low word overlap, the AI-generated summary is semantically accurate, meaning it captures the main ideas of the reference summary. Since ROUGE focuses on word similarity, the rephrasing and synonyms result in a low score. ROUGE undersells this summary, which is better than the metric leads us to believe.
This third example is a little different than the first two as it is more abstractive. To learn more about this, check out our blog post “Extractive vs Abstractive Summarization in Healthcare.”
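To see these failure modes in code, the sketch below (again assuming the rouge-score package) scores the three bunny/squirrel summaries against the same reference:

```python
# Score the three example summaries with the `rouge-score` package (an assumed dependency)
# to show how word overlap can reward a factually wrong summary and penalize a paraphrase.
from rouge_score import rouge_scorer

reference = "The brown bunny slept while the energized squirrel scavenged for food."
candidates = {
    "Summary 1 (accurate)": "The bunny slept while the squirrel scavenged for food.",
    "Summary 2 (facts swapped)": "The brown squirrel slept while the energized bunny scavenged for food.",
    "Summary 3 (paraphrase)": "The rabbit rested as the lively squirrel searched for nourishment.",
}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
for label, summary in candidates.items():
    scores = scorer.score(reference, summary)
    print(label, {name: round(s.fmeasure, 2) for name, s in scores.items()})

# Summaries 1 and 2 both score high, even though Summary 2 is factually wrong;
# Summary 3 scores low despite being a faithful paraphrase.
```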
As you can see, although the ROUGE score is insightful, it does not always capture every aspect of a summary’s quality. While ROUGE successfully measures word overlap, it does not account for factual accuracy, which can be misleading.
Conclusion
ROUGE scores are incredibly useful for measuring AI-generated summaries and comparing performance across transformer models. Their objectivity and flexibility make them a popular choice for gaining insight into the similarities between an AI-generated summary and a human-authored one.
ROUGE has limitations, though. For one, it misses aspects of a summary’s quality: it fails to account for semantic coherence and factual accuracy. While other algorithmic measurements attempt to overcome the limitations of ROUGE, the gold standard in the field of NLP is still human evaluation. Human evaluation involves rating summaries on criteria such as sentence quality, readability, semantic and factual accuracy, relevance, and other useful measures. While human evaluations are exceedingly helpful for understanding true readability, factuality, and contextual understanding, such an assessment can be time consuming and cost prohibitive to perform.