Why is summarizing so difficult in NLP?
Summarization is a difficult task for natural language processing (NLP) because it requires understanding the complete meaning of the source text. Long-document summarization is even harder due to computational complexity, limited access to training data, and challenges with evaluating performance. To get started with NLP summarization, I recommend finding a good pre-trained NLP model (bigger is generally better), fine-tuning the model on a small set of high-quality reference summaries, and removing extraneous content ahead of time to reduce computational cost and the potential for errors.
NLP Summarization is all about context
In school, teaching kids to summarize is one of the hardest lessons; there are so many factors that go into writing a good summary. For example, imagine a summary of a scientific paper about a certain disease. One reader might be a doctor looking for information about the success rate of surgeries, while another is a patient interested in the disease itself. Would a single universal summary be useful for both readers? Probably not. A good summary is designed for its audience. Context is what makes summarization so difficult for AI generation and natural language processing (NLP).
Summarization requires understanding the complete meaning of the source text, as well as the ability to identify and extract the important information from it. The meaning of a text can be highly nuanced and context-dependent, and determining what information is important requires a deep understanding of the topic at hand. Summarizing long documents is even more difficult. So, why is summarizing long documents using NLP so hard? Here are a few reasons:
- Computational Complexity: Modern machine learning models compute the relationships between every pair of words in the text they are given, so the cost grows quadratically as you scale the number of words you feed in. Hence, summarizing Moby Dick is orders of magnitude harder than creating a summary for this blog post. You could truncate what you feed into your model, but then you risk missing important content in the original text. (A common workaround is to chunk the document and summarize the pieces, as sketched after this list.)
- Access to Training Data: There is a lot more data out there for summaries of news articles, Wikipedia pages, and blog posts than there is for books (and long text in general). And while there are over 100 million books in the world, very few of them come paired with a reference summary. If you wanted to build an algorithm that could summarize a recently published 1,000-page novel into a book report for your high school English teacher, there just isn't much labeled data out there for the task. So if you want a good dataset, you'll need to start paying mTurk workers to write some summaries for you.
- User Perception: As you scale the length of your summaries, you also increase the chances your model will say something wrong. Errors compound: if each sentence is correct 95% of the time, a page-long summary of ten sentences comes out entirely correct only about 60% of the time (0.95^10 ≈ 0.60). As the length of your summaries increases, you need an incredible per-sentence accuracy to make sure everything produced is correct.
- Evaluating Performance: Machine learning metrics can measure how much your automated summaries overlap with your reference summaries (the standard one is called ROUGE; see the sketch after this list). But since there are a million ways to write a good summary, this measure is very unreliable for validating the quality of a long summary. You'll need to have humans (i.e. those mTurk workers again) rate your automated summaries to truly measure quality. And then, if you have a lot of money and resources and feel motivated to compete against Big Tech on this challenge, you can build a reinforcement learning layer on top of those human ratings, like OpenAI did for ChatGPT.
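If you do go the chunking route, the usual pattern is: split the document into pieces that fit the model, summarize each piece, then summarize the concatenated summaries. Here is a minimal sketch of that idea using the Hugging Face `transformers` pipeline; the model name, chunk size, and length limits are illustrative assumptions, and splitting on word counts is only a rough proxy for the model's real token limit.

```python
# A minimal chunk-then-summarize sketch; model, chunk size, and length
# limits are illustrative choices, not tuned values.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long_text(text: str, chunk_words: int = 500) -> str:
    """Split a long document into word-count chunks, summarize each chunk,
    then summarize the concatenated chunk summaries."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [summarizer(c, max_length=80, min_length=20)[0]["summary_text"]
                for c in chunks]
    # Second pass: condense the per-chunk summaries into one final summary.
    return summarizer(" ".join(partials), max_length=120)[0]["summary_text"]
```

The tradeoff is that the model never sees the whole document at once, so cross-chunk context (a plot twist that reframes earlier chapters, say) can still be lost.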
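And to make the ROUGE point concrete, here is a minimal sketch using Google's `rouge-score` package (`pip install rouge-score`); the two summaries are made-up examples. Both are perfectly reasonable descriptions of the same ending, yet they share few exact words, which is exactly why n-gram overlap is a weak judge of long summaries.

```python
# Comparing two reasonable summaries of the same ending with ROUGE.
from rouge_score import rouge_scorer

reference = "The whale destroys the Pequod and only Ishmael survives."
candidate = "Moby Dick sinks the ship, and Ishmael is the sole survivor."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))
# Each metric reports precision, recall, and F-measure; despite the matching
# meaning, the scores here are low because the exact word overlap is small.
```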

How to get started with NLP summarization
Spend the time to find a good pre-trained NLP model. If your goal is to summarize medical journals, then a model that was trained on the task of summarization with PubMed data will generally do better than one trained on CNN articles (a minimal loading sketch follows this paragraph). That said, if a domain-specific model was not trained on the task of summarization, it will usually perform worse than a general model that was. Lastly, using the biggest language model possible (i.e. GPT-3) is almost always best, regardless of the pre-training data. ChatGPT has shown the NLP community that the most important variable in a great language model is the number of parameters, so it's almost always best to go with the biggest. Thousands of journal articles in the NLP domain offer solutions to factuality and hallucination, and OpenAI will most likely make many of those solutions irrelevant once they release GPT-4. A lot of problems just get solved with bigger models.
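As a concrete starting point, here is a minimal loading sketch with the Hugging Face `transformers` library; the `google/pegasus-pubmed` checkpoint (a PEGASUS model trained for PubMed summarization) and the input file name are assumptions for illustration, not a specific endorsement.

```python
# Loading a domain-specific summarizer; checkpoint and file are assumptions.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/pegasus-pubmed")

article = open("journal_article.txt").read()  # hypothetical input file
print(summarizer(article, max_length=256)[0]["summary_text"])
```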
Fine-tune your model on a small set of really good reference summaries. 10k human summaries that are very well written and don't introduce new concepts will go a long way compared to 100k summaries that have occasional grammatical and factual errors. For NLP models, it definitely is garbage in -> garbage out. If you have the resources, spend the time and money paying human annotators to create your reference summaries (a minimal fine-tuning sketch follows below).
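For a sense of what that fine-tuning step looks like, here is a minimal sketch with the Hugging Face `Seq2SeqTrainer`; the CSV file, column names, base checkpoint, and hyperparameters are all illustrative assumptions rather than a tuned recipe.

```python
# A minimal fine-tuning sketch; file name, columns, checkpoint, and
# hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Expects a CSV with "document" and "summary" columns (hypothetical file).
dataset = load_dataset("csv", data_files="reference_summaries.csv")["train"]

def tokenize(batch):
    inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="finetuned-summarizer",
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```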
Lastly, if you can solve part of the problem without NLP, then you should 100% do so before the article is fed into your NLP summarization model. For example, if you know that the acknowledgements section is not needed when summarizing a scientific journal article, then just use a rule-based approach to remove that content ahead of time (a small sketch follows below). While an NLP model can learn during training that this content is not useful, you save computational cost and reduce the potential for errors by removing content that you know is never needed.
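Here is one small example of that kind of rule-based cleanup in Python; the heading names and the regex that spots the next section are assumptions that depend entirely on how your articles are formatted.

```python
# Rule-based removal of a section before summarization; heading names and
# the next-heading pattern are assumptions about the document format.
import re

def strip_section(text: str, heading: str) -> str:
    """Remove `heading` and its body, up to the next title-case heading
    line (or the end of the document)."""
    pattern = rf"(?ms)^{heading}\s*$.*?(?=^[A-Z][A-Za-z ]+\s*$|\Z)"
    return re.sub(pattern, "", text)

article = open("journal_article.txt").read()  # hypothetical input file
for heading in ("Acknowledgements", "Acknowledgments"):
    article = strip_section(article, heading)
```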
Good luck, and start experimenting with NLP summarization. Let us know if you have any sage wisdom of your own!
