There's no doubt that Knowledge Graphs and Retrieval-Augmented Generation (RAG) are becoming key players in today's data management and artificial intelligence landscape.
RAG frameworks are particularly powerful for generating text grounded in large datasets. However, a common challenge with RAG models is their tendency to lose context, leading to outputs that lack coherence or relevance. While RAG can surface relevant information, it sometimes fails to connect that information to related context or details found elsewhere in the source.
Consider this scenario: you have a paragraph that says, "Eagerworks is a company that was founded in 2015." Further down, there is another paragraph that says, "The company is located at Juan Paullier 1218." If you ask, "Give me information about Eagerworks," using only a RAG approach might return just the first paragraph, as it more closely matches the term "Eagerworks." Consequently, you might get the response: "It is a company that was founded in 2015."
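To see the failure mode concretely, here is a minimal, dependency-free sketch in Python. Plain word overlap stands in for embedding similarity (a deliberate simplification; a real RAG stack would use a vector store), but the ranking behavior is the same:

paragraphs = [
    "Eagerworks is a company that was founded in 2015.",
    "The company is located at Juan Paullier 1218.",
]

def score(query: str, text: str) -> int:
    # Count how many query words also appear in the paragraph.
    query_words = set(query.lower().split())
    text_words = set(text.lower().replace(".", "").split())
    return len(query_words & text_words)

query = "Give me information about Eagerworks"
best = max(paragraphs, key=lambda p: score(query, p))
print(best)  # -> "Eagerworks is a company that was founded in 2015."
# The address paragraph never mentions "Eagerworks", so it scores zero
# and the location is silently dropped from the answer.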
This is where Knowledge Graphs come into play. By organizing and structuring all the information into interconnected entities and relationships, Knowledge Graphs enhance an AI system's ability to maintain and utilize context more effectively. In this case, the Knowledge Graph understands that "the company" refers to Eagerworks (mentioned in the first paragraph). Therefore, when generating a response, it can aggregate all relevant information and answer: "Eagerworks is a company that was founded in 2015 and is located at Juan Paullier 1218."
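A minimal sketch of what that buys you, with the graph reduced to (subject, relation, object) triples and "the company" already resolved to the canonical entity at indexing time:

# Toy knowledge graph: both facts attach to the same canonical entity.
triples = [
    ("Eagerworks", "founded_in", "2015"),
    ("Eagerworks", "located_at", "Juan Paullier 1218"),
]

def facts_about(entity: str) -> list[tuple[str, str]]:
    # Collect every (relation, object) pair attached to an entity.
    return [(rel, obj) for subj, rel, obj in triples if subj == entity]

print(facts_about("Eagerworks"))
# -> [('founded_in', '2015'), ('located_at', 'Juan Paullier 1218')]
# With both facts retrieved together, the generator can produce the
# complete answer quoted above.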
In this blog post, we will explore how Knowledge Graphs and RAG models work, how they complement each other, and how you can start using them in your projects today with GraphRAG.
Retrieval-Augmented Generation (RAG) is a model architecture that combines the strengths of retrieval-based and generation-based approaches in Natural Language Processing. By blending those two systems, it helps enhance the quality and relevance of generated responses.
(Note: to dive deeper into RAG, check our blog post about it)
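In code, the architecture reduces to two stages: retrieve the most relevant passages, then hand them to a language model as context. Here is a runnable toy version, where word overlap stands in for an embedding model and a prompt string stands in for the actual LLM call:

corpus = [
    "Eagerworks is a company that was founded in 2015.",
    "The company is located at Juan Paullier 1218.",
    "Montevideo is the capital of Uruguay.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    # Stage 1: rank documents by (toy) similarity to the query.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:top_k]

def generate(query: str, context: list[str]) -> str:
    # Stage 2: a real system would send this prompt to an LLM.
    return f"Context: {' '.join(context)}\nQuestion: {query}"

print(generate("Who founded Eagerworks?", retrieve("Who founded Eagerworks?", corpus)))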
Imagine a Knowledge Graph as a structured representation of information that models relationships between entities (like people, places, and concepts) in a network-like format. Entities are represented as nodes and their connections as edges; together, they capture how the entities interact, making it easier to understand complex interdependencies within the data.
This diagram represents the nodes (entities) and edges (relationships) of a Knowledge Graph, illustrating how the data is structured and connected.
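If you want to play with this structure directly, a graph library makes the node-and-edge model tangible. A small sketch using networkx (one option among many graph libraries):

import networkx as nx

# Nodes are entities; each edge stores its relationship as an attribute.
G = nx.DiGraph()
G.add_edge("Eagerworks", "2015", relation="founded in")
G.add_edge("Eagerworks", "Juan Paullier 1218", relation="located at")

# Read off an entity's facts by walking its outgoing edges.
for _, obj, data in G.out_edges("Eagerworks", data=True):
    print(f"Eagerworks --{data['relation']}--> {obj}")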
Knowledge Graphs lay down a structured, semantic foundation for data, while RAG models leverage that structure to produce precise, contextually rich responses. Together, they enable more intelligent and dynamic interactions with data, setting the stage for more responsive AI and data-driven applications.
Here’s how to enhance a RAG architecture with knowledge graphs using Microsoft’s GraphRAG library. GraphRAG is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using LLMs and then query it via a RAG architecture with some modifications.
# First, install the library
pip install graphrag
# Create a folder to store our RAG + knowledge graph (GraphRAG) pipeline:
mkdir -p ./ragtest/input
# Get an example document
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
# Initialize the directory
python -m graphrag.index --init --root ./ragtest
This will create two files: .env and settings.yaml in the ./ragtest directory. .env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. This is the API key for the OpenAI API or Azure OpenAI endpoint. You can replace this with your own API key.
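For reference, the generated file contains just that one line; swap in your own key and keep the file out of version control:

# ./ragtest/.env
GRAPHRAG_API_KEY=<API_KEY>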
settings.yaml contains the settings for the pipeline; modify this file to change how the pipeline runs. See the GraphRAG documentation for the full list of configuration options.
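As an illustration, the llm block at the top of the generated file looked roughly like this at the time of writing (exact keys and the default model can differ between graphrag releases, so treat this as a sketch rather than a reference):

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: gpt-4-turbo-preview
  model_supports_json: true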
The library has two main components: one for indexing the data and the other for making queries. Indexing pipelines are configurable and are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. The standard pipeline uses an LLM to extract entities and relationships from the raw text, detects communities of closely related entities, and generates summaries of those communities that are later used to answer queries.
To index our data, run the following:
python -m graphrag.index --root ./ragtest
Depending on the size of your input data, this process will take some time to run. Once the pipeline is complete, you should see a new folder called ./ragtest/output/<timestamp>/artifacts with a series of parquet files.
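The artifacts are ordinary parquet files, so you can inspect the extracted graph directly with pandas. A quick sketch, noting that the artifact file names (such as create_final_entities.parquet) can vary between graphrag versions:

import pandas as pd

# Point this at your actual run: <timestamp> is whatever folder the
# pipeline created under ./ragtest/output/.
artifacts = "./ragtest/output/<timestamp>/artifacts"

# Tables of entities and relationships the LLM extracted from the book.
entities = pd.read_parquet(f"{artifacts}/create_final_entities.parquet")
relationships = pd.read_parquet(f"{artifacts}/create_final_relationships.parquet")

print(entities.head())
print(relationships.head())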
Once you have indexed the data, you can start making queries. GraphRAG supports two querying modes: global and local. Global search uses the community summaries to answer holistic questions about the corpus as a whole, while local search fans out from a specific entity to its neighbors and associated text. For example, a question about the book's overall themes calls for a global query:
python -m graphrag.query \
--root ./ragtest \
--method global \
"What are the top themes in this story?"
A question about a particular character and their relationships calls for a local query:
python -m graphrag.query \
--root ./ragtest \
--method local \
"Who is Scrooge, and what are his main relationships?"
If you want to know more about what's new in the AI world, visit our blog and stay up to date with the latest advancements in the field.