THE PROBLEM



Our project tackles a fundamental challenge in AI: teaching machines common sense. Specifically, we are exploring how Large Language Models (LLMs) deal with common sense and intuition. Common sense typically comes easily to humans. While humans naturally understand that "dropping a glass cup will break it," machines struggle with this type of intuitive knowledge.
To help with this, data scientists use knowledge graphs, which are structured networks of information that connect concepts to each other through their relationships. For example, a "dog" (concept 1) "is a" (relationship) "mammal" (concept 2).

But there's one “big” problem: these graphs are REALLY REALLY BIG.
Feeding all this information into LLMs can overwhelm them, leading to confusion and poor performance, much like how humans get overwhelmed by a flood of information. So, how can we compress these knowledge graphs to be smaller and still effective at teaching common sense to LLMs?
Our Solution: Transformers
Transformers (specifically Graph Transformers) may be the solution for compressing knowledge graphs. Graph Transformers are a type of deep learning model that extends the transformer architecture to work with graph-structured data. They combine self-attention mechanisms with graph learning to capture complex relationships between nodes in a graph. Although they require hefty computational power, transformers are especially good at handling long-range and complex relationships on large graphs.
Being able to feed commonsense into LLMs in a more effective manner through compressed knowledge graphs can lead to better virtual assistants, improved AI decision making, and more natural interactions with technology.
THE DATA
The graph data we are specifically working with are Common Sense Knowledge Graphs (CSKGs). CSKGs are a specialized type of knowledge graph designed to encode general world knowledge. They play a crucial role in various applications, including reasoning, decision-making, and natural language understanding. They can assist LLMs in generating commonsense explanations beyond what is explicitly mentioned in context. Compressing CSKGs can ensure that an LLM is fed concise knowledge without redundant or irrelevant information.
The data we used consisted of two CSKGs: ComVE and α-NLG.
ComVE
In ComVE, the goal is to generate explanations for why a nonsensical sentence does not make sense. Each sample comes with three reference output sentences, which are human-written explanations. The dataset has a training size of 20k and test and validation sizes of 1,000.
Example: "We use books to tell the time" would result in the LLM generating a response like "books are used to read, not to tell the time."
α-NLG
For α-NLG, the task is to generate a plausible explanation for what might have happened between a past and future observation, which is also known as abductive reasoning. Each sample in the dataset includes up to 5 reference outputs. This dataset has 50k training points, over 1,500 validation points, and over 3,500 test data points.
Example: The LLM is fed a past action: "Clouds appeared in the sky." And a future action: "The sidewalk was wet." These input prompts would then result in a response like: "It rained."
THE PROJECT
Methods
We aimed to help AI understand common sense better by making large collections of knowledge easier to handle. Below is a general outline of how our methodology works:

1. Finding Important Information
We started by picking out the key ideas from a sentence. For example, from "A person cannot walk across water because water is not solid," the important words are "person," "walk," "water," and "solid." We then looked for related ideas connected to these words in the knowledge graph.
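As a rough illustration of this step, here is a minimal sketch in Python. It assumes a toy knowledge graph stored as (head, relation, tail) triples and a simple word-matching rule. The triples, the stopword list, and the function names are all made up for this example and are not the project's actual pipeline.

```python
# Toy knowledge graph as (head, relation, tail) triples. All entries are
# illustrative assumptions, not taken from the project's real graph.
TOY_TRIPLES = [
    ("person", "capable_of", "walk"),
    ("water", "is_a", "liquid"),
    ("liquid", "not_has_property", "solid"),
    ("walk", "requires", "solid ground"),
    ("book", "used_for", "reading"),
]

# Hypothetical stopword list standing in for a real concept matcher.
STOPWORDS = {"a", "an", "the", "is", "not", "because", "cannot", "across"}


def extract_concepts(sentence, vocabulary):
    """Return the words of the sentence that are concepts in the graph."""
    tokens = [tok.strip(".,!?").lower() for tok in sentence.split()]
    seen, concepts = set(), []
    for tok in tokens:
        if tok and tok not in STOPWORDS and tok in vocabulary and tok not in seen:
            seen.add(tok)
            concepts.append(tok)
    return concepts


def related_triples(concepts, triples):
    """Keep every triple whose head or tail is one of the key concepts."""
    wanted = set(concepts)
    return [t for t in triples if t[0] in wanted or t[2] in wanted]


# Every head and tail in the toy graph counts as a known concept.
vocabulary = {h for h, _, _ in TOY_TRIPLES} | {t for _, _, t in TOY_TRIPLES}

sentence = "A person cannot walk across water because water is not solid."
concepts = extract_concepts(sentence, vocabulary)
subgraph = related_triples(concepts, TOY_TRIPLES)

print(concepts)
for triple in subgraph:
    print(triple)
```

The unrelated "book" triple is left out of the retrieved subgraph, which is the kind of filtering this step is meant to provide before compression.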
2. Compression: Choosing the Best Connections
We used a special model called a graph transformer to decide which connections were the most important. Think of it like a highlighter that picks out the most relevant ideas while ignoring less useful details. This helps the computer focus on what really matters for understanding the meaning.
About Transformers
Traditional NLP transformers were built to work on sequences, where every token can attend to every other token (in effect, a fully connected graph). However, graph data has a topology that is often large and complex and does not guarantee full connectivity. NLP transformer models use positional encodings so that each word has a unique representation that preserves information about its position and distance from other words. Graph transformers adapt this idea by using Laplacian eigenvectors as node positional features. This technique is an effective way to encode node positional information in complex graph data.
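To make the idea of Laplacian positional features more concrete, here is a minimal sketch using NumPy. It assumes an undirected graph given as a dense adjacency matrix and uses the symmetric normalized Laplacian. The function name and the toy path graph are assumptions for illustration, and this is not necessarily how our model computes its encodings.

```python
import numpy as np


def laplacian_positional_encoding(adjacency, k):
    """Return the k eigenvectors of the normalized Laplacian with the
    smallest non-trivial eigenvalues, one row of features per node."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=1)

    # D^{-1/2}, leaving isolated nodes (degree 0) at zero.
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    inv_sqrt[nonzero] = degree[nonzero] ** -0.5

    n = adjacency.shape[0]
    laplacian = np.eye(n) - inv_sqrt[:, None] * adjacency * inv_sqrt[None, :]

    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(laplacian)

    # Skip the first eigenvector (eigenvalue 0 on a connected graph), which
    # carries no positional information. Eigenvector signs are arbitrary.
    return eigenvectors[:, 1:k + 1]


# Toy example: a path graph 0 - 1 - 2 - 3.
path_graph = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
pe = laplacian_positional_encoding(path_graph, k=2)
print(pe.shape)
```

Each row of `pe` can then be added to or concatenated with the corresponding node's features, in the same spirit as positional encodings for words.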
Transformers excel on large graph data due to their ability to capture long-range dependencies and complex relationships between nodes, thanks to the self-attention mechanism that considers all node interactions simultaneously. This global receptive field allows them to effectively model intricate structures and dependencies that traditional Graph Neural Networks (GNNs) struggle with due to limited neighborhood aggregation. Additionally, transformers are highly parallelizable and scalable, making them well-suited for processing large graphs efficiently, especially when combined with sparse attention techniques to handle graph sparsity.
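The difference between a global receptive field and neighborhood-limited aggregation can be sketched with a single attention head in NumPy. This is an illustrative toy under stated assumptions: the random features, the weight matrices, and the `self_attention` helper are all made up for the example, and the masked variant is only a stand-in for local message passing, not an actual GNN layer.

```python
import numpy as np


def self_attention(node_features, w_q, w_k, w_v, mask=None):
    """Single-head scaled dot-product attention over graph nodes.
    If mask is given, node i may only attend to nodes j with mask[i, j] True."""
    q = node_features @ w_q
    k = node_features @ w_k
    v = node_features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    if mask is not None:
        # Disallowed pairs get -inf, so they receive zero weight after softmax.
        scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 nodes with 8 random features each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

# Path graph 0 - 1 - 2 - 3; self-loops keep every mask row non-empty.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
neighbor_mask = (adjacency + np.eye(4)) > 0

_, global_weights = self_attention(x, w_q, w_k, w_v)
_, local_weights = self_attention(x, w_q, w_k, w_v, mask=neighbor_mask)

# Node 0 and node 3 are three hops apart in the path graph.
print(global_weights[0, 3], local_weights[0, 3])
```

With global attention, node 0 places some weight on node 3 in a single layer. With the neighbor mask, that weight is exactly zero, so information from node 3 would need several layers to reach node 0.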
3. Turning Connections into Text
Instead of just feeding the computer a list of words, we turned the connections into short text explanations. This gave the computer more context and helped it understand the relationships better.
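A minimal sketch of this idea is shown below. It assumes that each connection is a (head, relation, tail) triple and that each relation has a hand-written sentence template. The templates, the "Knowledge:" prefix, and the example triples are assumptions for illustration and may differ from the exact input format used in our project.

```python
# Hypothetical templates mapping a relation name to a short phrase.
RELATION_TEMPLATES = {
    "is_a": "{head} is a {tail}",
    "used_for": "{head} is used for {tail}",
    "capable_of": "{head} can {tail}",
}


def verbalize(triple):
    """Turn one (head, relation, tail) triple into a short text statement."""
    head, relation, tail = triple
    # Fall back to the relation name with underscores replaced by spaces.
    fallback = "{head} " + relation.replace("_", " ") + " {tail}"
    template = RELATION_TEMPLATES.get(relation, fallback)
    return template.format(head=head, tail=tail)


def build_model_input(sentence, triples):
    """Append the verbalized triples to the original sentence."""
    facts = ". ".join(verbalize(t) for t in triples)
    return f"{sentence} Knowledge: {facts}." if facts else sentence


triples = [
    ("book", "used_for", "reading"),
    ("clock", "used_for", "telling time"),
]
model_input = build_model_input("We use books to tell the time.", triples)
print(model_input)
# We use books to tell the time. Knowledge: book is used for reading. clock is used for telling time.
```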
4. Training the LLM
We trained a BART-based LLM using two tasks:
Explaining Nonsense: We gave the model strange sentences and asked it to explain why they didn't make sense.
Guessing What Happened: We showed it two events and asked it to guess what happened in between (abductive reasoning).
5. Measuring Success
We checked how well our model did by measuring three things:
Variety across multiple LLM generated sentences.
Did the LLM come up with different explanations each time?
This is also known as pairwise diversity. The metric used to measure this is known as Self-BLEU, which evaluates how similar a sentence is to the other generated sentences (from the same input prompt) based on n-gram overlap.
Variety within a single LLM generated sentence.
Is there a variety of words within the generated sentences?
This is also known as corpus diversity, and is measured with Entropy-k and Distinct-k. Entropy-k evaluates the evenness of the empirical k-gram distribution within the generated sentences. Distinct-k takes the number of unique k-grams in the generated sentences and divides it by the total number of generated tokens; this prevents a preference towards longer sentences in the LLM's output.
Quality of LLM generated vs reference sentences.
Were the explanations accurate and reasonable?
This was measured by BLEU and ROUGE, which look at precision and recall when comparing the LLM generated answers with the human answers. BLEU is the precision of n-grams in the generated output against the reference text. ROUGE is the recall of n-grams in the generated output against the reference text.
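The metrics above can be sketched in a few lines of plain Python. This is a simplified illustration that assumes whitespace tokenization, and the function names are our own for this example. The full BLEU score also combines several n-gram orders and applies a brevity penalty, and ROUGE-L is based on the longest common subsequence rather than fixed-length n-grams, so the last function only shows the basic precision and recall idea.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_k(sentences, k):
    """Unique k-grams divided by the total number of generated tokens."""
    all_ngrams = [g for s in sentences for g in ngrams(s.split(), k)]
    total_tokens = sum(len(s.split()) for s in sentences)
    return len(set(all_ngrams)) / total_tokens if total_tokens else 0.0


def entropy_k(sentences, k):
    """Shannon entropy (in nats) of the empirical k-gram distribution."""
    counts = Counter(g for s in sentences for g in ngrams(s.split(), k))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def ngram_precision_recall(candidate, reference, n):
    """Clipped n-gram overlap as precision (BLEU-style) and recall (ROUGE-N-style)."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


generated = ["it rained", "it rained heavily"]
print(distinct_k(generated, 1))   # 3 unique words over 5 tokens
print(entropy_k(generated, 1))
print(ngram_precision_recall("it rained heavily", "it rained", 1))
```

In the last call, two of the three generated words appear in the reference (precision 2/3), and both reference words are covered by the generated sentence (recall 1.0).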
We trained and tested 4 types of BART-based LLM architectures, some with compression and some without, to see how a transformer-based compression architecture would perform in comparison. These 4 models were: an LLM with no knowledge graph, an LLM with an uncompressed knowledge graph, an LLM with an RGCN-compressed knowledge graph, and our LLM with a transformer-compressed knowledge graph. Our RGCN compression model is based on the research performed by Hwang et al. (2023). Further details about the transformer model we created are discussed in our report.
Results
The table below shows how the 4 tested models performed on the various datasets. An arrow pointing up indicates that a higher number (between 0 and 100) means better performance. An arrow pointing down indicates the opposite.

Our experiments produced a transformer compression model that could perform relatively better than a simpler RGCN compression model. However, when looking at models with the same training compute time, the transformer model's gains are mostly observed in some diversity metrics and in our recall metric (ROUGE-L).
Notably, the transformer model exhibits decreased accuracy on the ComVE dataset compared to the α-NLG dataset, likely due to the innate structure of a transformer-based model. Transformers specialize in capturing longer-range relationships between concepts, which may not be necessary for a simpler commonsense reasoning task. However, for an abductive reasoning task, which requires the LLM to infer what happens between two events, such relationships may be more beneficial.
We also included a sneak peek into how the transformer compression model may behave with more time dedicated to training and tuning. The better results in "+20k steps" indicate the potential for better performance in our transformer model with increased training time. There is also some ambiguity in the results between the baseline models (no KG and uncompressed KG) and the compressed KG models, which should be addressed in further research. Unfortunately, due to time constraints, further comparison and research will have to be deferred to another time.
Conclusion
The varying performance of our model at this stage of our research suggests that the performance of transformer compression may be tied to the task and to limitations within the dataset (size, existence of long-range dependencies, etc.). Transformers often perform at their peak when a dataset is larger and has long-range dependencies (interactions) available for them to capture and learn from.
However, our project shows that there may be benefits to employing a transformer-based architecture for the compression of knowledge graphs. Being able to compress a knowledge graph in a tailored way will allow an LLM to perform at a higher level in terms of interacting with humans and understanding user intent.
Ultimately, the goal of effectively compressing commonsense knowledge graphs is not just to teach AI to “know” more, but to help it “understand” more intelligently and intuitively.
Effectively incorporating commonsense knowledge in LLMs can lead to smarter virtual assistants that better understand user intent, as well as improved AI decision making for more accurate and logical responses. It can also allow machines to interact in a way that is more natural to human thinking and logic, specifically when it comes to the nuances of human intuition.
THE TEAM
Quentin Callahan
Data Science Major
Quentin is passionate about machine learning, spicy food, reading, and making code run fast. He is interested in how graph theory can be used to make existing algorithms more efficient.
Esther Cho
Data Science Major & Math Minor
Esther enjoys exploring the practical applications of machine learning, especially the mathematical aspects like graph theory. In her free time, she enjoys playing volleyball and flag football, working on puzzles, and hiking new trails.
Penny King
Data Science Major
Penny is curious about how graph-based algorithms can enhance data science applications. She also enjoys reading and exploring tea cultures from around the world.
We would like to give a special thanks to our mentors at HDSI:
Yusu Wang & Gal Mishne