Ever since Google unveiled their new language model called Data Gemma, I’ve been itching to dive in and see what makes it tick. As someone who’s constantly tinkering with LLMs and grappling with the quirks of Retrieval Augmented Generation (RAG), I was particularly intrigued by their approach to tackling hallucinations in language models. After poring over their research paper, I decided to get my hands dirty and explore how Data Gemma leverages the open-source Data Commons knowledge graph to improve data retrieval. In this post, I’ll walk you through my journey of setting up a RAG pipeline with Data Gemma, testing its capabilities, and seeing how it stacks up against other models out there.
Understanding the Problem Space
LLMs are getting impressively sophisticated—they can summarize text, brainstorm creative ideas, even crank out code. But let’s be real: sometimes they confidently spout inaccuracies—a phenomenon we lovingly call hallucination. Google’s research aims to tackle this head-on by addressing three major challenges:
- Teaching the LLM when to fetch data from external sources versus relying on its own knowledge.
- Helping the LLM decide which external sources to query.
- Guiding the LLM to generate queries that fetch the data needed to answer the original question.
Typically, we tackle these problems with Tool Use + Retrieval Augmented Generation. Here’s the playbook:
- Tool Use: The LLM is trained—either through fine-tuning or In-Context Learning—to decide which API to call, when to call it, and what arguments to pass.
- RAG: Once the data is fetched, it’s augmented into the instruction, and the LLM generates an answer.
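To make that playbook concrete, here is a minimal sketch of the loop in Python. Everything in it is illustrative: the `get_statistic` tool, the `fetch_statistic` stub, and the `llm` callable are placeholders of my own, not part of any particular framework.

```python
import json

def decide_tool(llm, question: str) -> dict:
    """Tool Use: the LLM decides which API to call and with what arguments."""
    prompt = (
        "You may call one tool: get_statistic(metric, place).\n"
        'Reply with a JSON object like {"tool": "get_statistic", '
        '"args": {"metric": "...", "place": "..."}}.\n'
        f"Question: {question}"
    )
    return json.loads(llm(prompt))

def fetch_statistic(metric: str, place: str) -> str:
    """Placeholder for the real external data source."""
    return f"{metric} in {place}: <value fetched from an external API>"

def answer(llm, question: str) -> str:
    call = decide_tool(llm, question)
    evidence = fetch_statistic(**call["args"])
    # RAG: augment the instruction with the fetched data, then generate.
    return llm(f"Context: {evidence}\n\nAnswer the question: {question}")
```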
Introducing Data Commons
To streamline the process of fetching data, Google introduced an open-source knowledge graph called Data Commons. According to Google, Data Commons brings two innovations to the table:
- A Unified Knowledge Graph: A massive collection of publicly available datasets.
- Natural Language API: An API that accepts natural language queries to interact with the knowledge graph—no LLMs required.
Their research argues that relying on the LLM to choose between multiple APIs and determine the right arguments is too error-prone at scale. Replacing that with a single knowledge graph and a natural language API significantly reduces the chances of hallucinations during query inference.
Exploring Retrieval Interleaved Generation (RIG)
Traditional RAG systems retrieve relevant information before generating a response. RIG flips the script by interleaving retrieval and generation. Essentially, the model identifies when it needs statistical data during response generation and fetches it in real-time from the knowledge graph. This approach aims to minimize hallucinations by grounding responses in verified data.
Google released two versions of Data Gemma:
- RIG: An LLM fine-tuned to produce answers to statistical questions along with natural language queries for Data Commons. The idea is that when the LLM generates a statistical value, it also generates a natural language query describing the statistic, which can then be used to query the Data Commons KG using its NL API.
- RAG: An LLM fine-tuned to produce a list of Data Commons natural language queries relevant to the user query. Instead of asking the LLM to generate statistics or predict complex queries, the model generates a list of natural language queries that expand the scope of the original user query and can be answered by a reliable external database.
Personally, the second approach—using the LLM to expand the user query—seemed more intriguing. The paper’s evaluation section notes that human evaluators also preferred the answers from the RAG pipeline over those from the RIG pipeline. So, I decided to build a RAG pipeline myself using Data Gemma and Data Commons to see what’s up.
Getting My Hands Dirty
Setting Up the Environment
Interestingly, Google has not published a 7B version of the model on HuggingFace (or at least I couldn’t find it), and the 27B version of the model is way too big for my machine. Luckily, I found several quantized versions of the model and decided to go with the most downloaded one: bartowski/datagemma-rag-27b-it-GGUF. I used the 2-bit quantized version of the model. With llama-cpp-python, hosting these models for inference is a breeze.
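For reference, this is roughly how I loaded the GGUF file with llama-cpp-python. The file name and the generation settings are my own choices (they depend on which quantization you download and what fits your machine), not anything prescribed by the model card.

```python
from llama_cpp import Llama

# Load the quantized GGUF file downloaded from bartowski/datagemma-rag-27b-it-GGUF.
# The filename below matches the 2-bit quantization I used; yours may differ.
llm = Llama(
    model_path="./datagemma-rag-27b-it-Q2_K.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload as many layers as possible to the GPU / Metal
)

question = "Has the use of renewables increased in the world?"
out = llm(question, max_tokens=256, temperature=0.1)
print(out["choices"][0]["text"])
```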
Testing the Model
With the model up and running, I wanted to see how well it performed. I used the example query:
“Has the use of renewables increased in the world?”
Data Gemma effectively broke down this question into specific statistical queries:
- What is the carbon emission in world?
- How has carbon emission changed over time in world?
- What is the renewable energy consumption in world?
- How has renewable energy consumption changed over time in world?
To be clear, I didn’t give any special instructions—the model is trained to generate these queries. A few things immediately jumped out:
- The model mapped “renewables” to related concepts like “carbon emission” and “renewable energy consumption,” and it captured the temporal aspect of the question.
- It retained the place name “world” in all generated queries.
- It generated queries in a consistent format, making them easy to process.
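Because of that consistent, line-oriented format, turning the model output into a list of queries takes only a few lines. The splitting helper below is my own trivial utility, not something shipped with the model.

```python
def extract_queries(model_output: str) -> list[str]:
    """Split Data Gemma's output into one Data Commons query per line."""
    return [line.strip(" -*") for line in model_output.splitlines() if line.strip()]

# Using the Llama instance loaded earlier:
raw = llm("Has the use of renewables increased in the world?",
          max_tokens=256, temperature=0.1)["choices"][0]["text"]
print(extract_queries(raw))
# e.g. ['What is the carbon emission in world?',
#       'How has carbon emission changed over time in world?', ...]
```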
Integration with Data Commons
I wrote a simple client to call the Data Commons NL API using its Python library. You’ll need a Data Commons API key to use it.
The RAG pipeline then simply takes the list of generated queries and calls the Data Commons NL API for each of them. The API returns a structured response containing the numerical value, the unit, the underlying data source, and so on. This information can then be passed to the next step of the pipeline for answer generation. I wrote a small utility to convert the API response into natural language.
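Here is the shape of that step as a sketch. The `call_data_commons_nl` function stands in for the real NL API client (I’m not reproducing the actual endpoint or response schema here, and the field names are placeholders), and `to_sentence` is my own small formatting utility.

```python
def call_data_commons_nl(query: str) -> dict:
    """Placeholder for the Data Commons NL API call.

    In my pipeline this wraps the Data Commons Python client (with an API
    key); here it just returns a dummy response with the same general shape:
    a value, a unit, and a source.
    """
    return {"value": 0.0, "unit": "", "source": "datacommons.org"}

def to_sentence(query: str, resp: dict) -> str:
    """Turn a structured response into a natural-language fact."""
    return (f"{query} -> {resp['value']} {resp.get('unit', '')} "
            f"(source: {resp.get('source', 'Data Commons')})")

def build_context(queries: list[str]) -> str:
    """Query Data Commons for every generated sub-query and join the facts."""
    return "\n".join(to_sentence(q, call_data_commons_nl(q)) for q in queries)

# `queries` is the list produced by extract_queries above; the resulting
# context string is handed to the answer-generation step of the pipeline.
context = build_context(queries)
```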
Why This Approach Is Intriguing
One of the key challenges with naive Retrieval Augmented Generation (RAG) is its heavy reliance on the user’s initial query to find relevant documents. Even with semantic search intended to bridge gaps, it often falls short, especially when dealing with broad or ambiguous queries. Other techniques like HyDE or Pseudo-Relevance Feedback exist, but they are usually too specialized for semantic search scenarios.
What makes the proposed solution stand out is its ability to break down a user query into multiple focused and relevant sub-queries. This query-expansion approach enhances retrieval by covering more ground and retrieving more pertinent information. The fact that a 2-bit quantized version of a 27B model running on my Mac could achieve this is just icing on the cake.
This pattern isn’t limited to statistical queries; it can be applied to a wide range of domains. Imagine an AI agent that helps you plan a trip. If you ask, “Help me plan a trip to Brazil,” the agent could decompose this into sub-queries like:
- What are the best times to visit Brazil?
- Which cities in Brazil are must-see destinations?
- What are the top attractions in Brazil?
While Data Gemma isn’t currently designed for this use case, it exemplifies a general pattern that could significantly enhance AI interactions by making them more comprehensive and context-aware.
In contrast, most naive RAG solutions either rely solely on semantic search or depend on the LLM to determine API arguments for tool use. For information retrieval, this approach often struggles due to the overwhelming number of variables and relationships involved. The Natural Language API offered by Data Commons demonstrates how tool use for information retrieval can be greatly simplified. Importantly, this NL API doesn’t rely on an LLM to generate the final query executed on the knowledge graph; instead, it uses predefined translation logic.
As highlighted in Google’s research paper:
“Given a query, we first break it down into the following components: one or more statistical variables or topics (like ‘unemployment rate,’ ‘demographics,’ etc.); one or more places (like ‘California’); and a finite set of attributes (like ‘ranking,’ ‘comparison,’ ‘change rate,’ etc.). The variables and places are further mapped to corresponding IDs in Data Commons. For each of the components, we apply different Natural Language Processing (NLP) approaches that we have been independently iterating on. For statistical variables or topics, we use an embeddings-based semantic search index; for places, we use a string-based named entity recognition implementation; for attribute detection, we use a set of regex-based heuristics.”
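To get a feel for what that translation logic might look like, here is a deliberately toy sketch of the three-part decomposition the paper describes. None of this is Data Commons’ actual code: the variable index, place list, and regexes below are made-up stand-ins, and the real system uses embedding-based semantic search and proper named entity recognition rather than substring matching.

```python
import re

# Toy stand-ins for the three detectors described in the paper.
VARIABLE_INDEX = {"unemployment rate": "UnemploymentRate",
                  "renewable energy consumption": "RenewableEnergyConsumption"}
KNOWN_PLACES = {"california": "geoId/06", "world": "Earth"}
ATTRIBUTE_PATTERNS = {"change rate": r"\b(changed?|over time|trend)\b",
                      "ranking": r"\b(rank|top|highest|lowest)\b",
                      "comparison": r"\b(compare[d]?|versus|vs\.?)\b"}

def decompose(query: str) -> dict:
    q = query.lower()
    # Statistical variables: the real system uses an embeddings-based
    # semantic search index; here we fake it with substring matching.
    variables = [dcid for name, dcid in VARIABLE_INDEX.items() if name in q]
    # Places: the real system uses string-based named entity recognition.
    places = [dcid for name, dcid in KNOWN_PLACES.items() if name in q]
    # Attributes: regex-based heuristics, as in the paper.
    attributes = [a for a, pat in ATTRIBUTE_PATTERNS.items() if re.search(pat, q)]
    return {"variables": variables, "places": places, "attributes": attributes}

print(decompose("How has renewable energy consumption changed over time in world?"))
# {'variables': ['RenewableEnergyConsumption'], 'places': ['Earth'],
#  'attributes': ['change rate']}
```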
Comparing with Other Models
Curious about how Data Gemma stacks up against other models, I tested Claude Sonnet 3.5 by prompting it to produce similar queries for the given user question. The prompt I used is in the Appendix; I found it in the Data Commons code repo. Surprisingly, Claude was also able to generate relevant queries and interact with the Data Commons API effectively. This suggests that a fine-tuned model like Data Gemma isn’t strictly necessary; with proper prompt engineering, other LLMs can achieve similar results.
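For anyone who wants to reproduce that comparison, the call itself looked roughly like the following. The instruction text shown here is a loose paraphrase of the query-expansion idea, not the exact prompt from the Data Commons repo (see the Appendix for that).

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Loose paraphrase of the query-expansion instruction, not the exact prompt.
system = ("Given a question, produce a list of short statistical questions "
          "that Data Commons can answer, one per line. Keep the place names "
          "from the original question.")

resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    system=system,
    messages=[{"role": "user",
               "content": "Has the use of renewables increased in the world?"}],
)
print(resp.content[0].text)
```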
However, it’s worth noting that I used a 2-bit quantized version of the Data Gemma 27B model running on my Mac. In contrast, Claude Sonnet 3.5 is a much larger model.
Conclusion
In the grand scheme of things, I think Data Gemma pushes the envelope by simplifying how LLMs interact with external data sources through natural language APIs. It offers a fresh take on reducing hallucinations and improving the factual accuracy of AI-generated content. Whether this pattern becomes the new standard or not, exploring it has been a valuable exercise in understanding the evolving landscape of AI and how we can make these systems more reliable and effective.