
Episode 22

Enhancing GenAI with Knowledge Graphs: A Deep Dive with Kirk Marple

Kirk Marple


Description

In this episode we talk to Kirk Marple about the power of Knowledge Graphs when combined with GenAI models. Kirk explains the growing relevance of knowledge graphs in the AI era, their practical applications, their integration with LLMs, and the future potential of Graph RAG.

A veteran of Microsoft and General Motors, Kirk has spent the last 30 years in software development and data leadership roles. He also successfully exited the first startup he founded, RadiantGrid, which was acquired by Wohler Technologies. Now, as the technical founder and CEO of Graphlit, Kirk and his team are streamlining the development of vertical AI apps with their end-to-end, cloud-based offering that ingests unstructured data and leverages retrieval augmented generation to improve accuracy, domain specificity, adaptability, and context understanding, all while expediting development.

Show Notes
Resources

Episode Summary –

  • Introduction to Knowledge Graphs:
    • Knowledge graphs extract relationships between entities like people, places, and things, facilitating efficient information retrieval.
    • They represent intricate interactions and interrelationships, enabling users to “walk the graph” and uncover deeper insights.
  • Importance in the AI Era:
    • Knowledge graphs enhance data retrieval and filtering, crucial for feeding accurate data into large language models (LLMs) and multimodal models.
    • They provide an additional axis for information retrieval, complementing vector search.
  • Industry Use Cases:
    • Commonly used in customer data platforms and CRM models to map relationships within and between companies.
    • Knowledge graphs can convert complex datasets into structured, easily queryable formats.
  • Challenges and Limitations:
    • Familiarity with graph databases and the ETL process for graph data integration is still developing.
    • Graph structures are less common and more complex than traditional relational models.
  • Integrating Knowledge Graphs with LLMs:
    • Knowledge graphs enrich data integration and semantic understanding, adding context to text retrieved by LLMs.
    • They can help reduce hallucinations in LLMs by grounding responses with more accurate and comprehensive context.
  • Graph RAG (Retrieval Augmented Generation):
    • Combines knowledge graphs with RAG to provide additional context for LLM-generated responses.
    • Allows retrieval of data not directly cited in the text, enhancing the breadth of information available for queries.
  • Scalability and Efficiency:
    • Effective graph database architectures can handle large-scale graph data efficiently.
    • Graph RAG requires a robust ingestion pipeline and careful management of data freshness and retrieval processes.
  • Future Developments:
    • Growing interest and implementation of knowledge graphs and Graph RAG in various industries.
    • Potential for new tools and standardization efforts to make these technologies more accessible and effective.
  • Graphlit: Simplifying Knowledge Graphs:
    • The platform focuses on simplifying the creation and use of knowledge graphs for developers.
    • Provides APIs for easy integration, supporting domain-specific vertical AI applications.
    • Offers a unified pipeline for data ingestion, extraction, and knowledge graph construction.
  • Open Source and Community Contributions:
    • Recommendations for libraries and projects in the knowledge graph space.
    • Notable contributors and projects include data extraction libraries and AI agent initiatives.

Kirk Marple LinkedIn Profile: Kirk Marple | LinkedIn

Graphlit – https://www.graphlit.com/

The Data Scientist – Media – Data Science Talent

Transcript

Speaker Key:

DD: Damien Deighan

PD: Dr Philipp Diesinger

KM: Kirk Marple

00:00:00

[intro music]

 

DD: This is the Data Science Conversations Podcast with Damien Deighan and Dr Philipp Diesinger. We feature cutting edge data science and AI research from the world’s leading academic minds and industry practitioners, so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, the data science recruitment experts. Welcome to the Data Science Conversations Podcast. My name is Damien Deighan, and I’m here with my co-host, Philipp Diesinger. How are you, Philipp?

 

PD: Good. Thank you. Hi, guys.

 

DD: Today we’re talking to Kirk Marple about the latest developments in combining Knowledge Graphs with LLMs to get better results from generative AI initiatives. It’s a new exciting area for us to explore. Kirk, how are you doing?

 

KM: Oh, very good. Yeah, glad to be here. Thank you. 

 

DD: Great, and it’s a pleasure to have you. So by way of introduction, Kirk is based in Seattle and has spent the last 30 years in software development and data leadership roles. He did his computer science degree at the University of Pennsylvania, and his Master’s at the University of British Columbia. His early career was spent at large corporates, including four years at Microsoft, and he also worked at General Motors. He then moved into the world of smaller businesses, and he successfully exited from his first startup, RadiantGrid, which was acquired by Wohler Technologies. Now he is the technical founder and CEO of Graphlit, where he and his team are streamlining the development of vertical AI apps with their end-to-end, cloud-based offering that ingests unstructured data and leverages retrieval augmented generation to improve accuracy, domain specificity, adaptability, and context understanding, all whilst expediting development. So if we start at the top, Kirk, can you just explain Knowledge Graphs, what they are and their historic role in the technology industry?

 

KM: Yeah, for sure. It’s really about extracting relationships between bits of knowledge. Classically, we talk about people, places and things. So that could be a company, with where they’re located, what their revenue is, the number of employees, and we would call that metadata on that entity. And that company may have relationships: say, Microsoft’s in Seattle, so there’s an edge that you create in the Knowledge Graph between those entities. And the Knowledge Graph is really just that blown up. It’s all the different interactions and interrelationships between those little bits of knowledge, and the value becomes information retrieval. It’s a great way to represent the knowledge in a way that you can then retrieve it and, quote, walk the graph, get from one place to the other, and really be able to learn more from the knowledge that’s embedded in them.
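Kirk’s description of entities, metadata, and named edges can be sketched in a few lines. This is an illustrative toy, not Graphlit’s actual data model; the node IDs, fields, and relation names are invented.

```python
# Entities ("people, places and things") carry metadata; named edges connect
# them. "Walking the graph" is just following edges from a starting node.

nodes = {
    "microsoft": {"label": "Organization", "name": "Microsoft", "employees": 221000},
    "seattle":   {"label": "Place", "name": "Seattle"},
    "kirk":      {"label": "Person", "name": "Kirk Marple"},
}

# Edges are (source, relation, target) triples.
edges = [
    ("microsoft", "located_in", "seattle"),
    ("kirk", "worked_at", "microsoft"),
]

def neighbors(node_id):
    """Walk the graph one hop: follow every edge touching node_id."""
    out = []
    for src, rel, dst in edges:
        if src == node_id:
            out.append((rel, dst))
        elif dst == node_id:
            out.append((rel, src))
    return out

# Starting from Kirk we reach Microsoft, and from Microsoft both Seattle and Kirk.
print(neighbors("kirk"))       # [('worked_at', 'microsoft')]
print(neighbors("microsoft"))  # [('located_in', 'seattle'), ('worked_at', 'kirk')]
```

The metadata dict on each node is what Kirk later calls “the meat” that you pull back once an edge leads you to an entity.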

 

00:03:13

 

DD: Okay, great. What sparked your interest in Knowledge Graphs? Can you take us back to there, please? 

 

KM: As you said, I’d been working in the broadcast video area for about 10 years and we dealt a lot with the metadata for, like, audio tracks. We were ingesting all of the record label information, so they had, like, who played on an album or song, the producers and all that. And you can think of that as essentially a knowledge graph, if you break it down, where say a producer produced multiple songs, or somebody was a session player on different songs or different albums. So I started to think about all this metadata we were capturing and how it’s all interrelated. And so I guess about seven years ago, I started thinking about building an app for live music and the venues people were playing at and kind of all the interrelationships, you could kind of see like, who was on the common bill with somebody else and things like that. And honestly, it was a side project and that got me started thinking about Knowledge Graphs, and how you can apply it to different forms of data, and then it became more of a business thing after that.

 

DD: Okay. Why do you think data scientists and data engineers who are working in industry should be paying attention to Knowledge Graphs right now, in the current AI era?

 

KM: I think people are familiar with RAG, and the R in RAG is really retrieval, or information retrieval. It’s an area we’ve been working in: first you have to get the data, and then it becomes a search problem or a filtering problem via metadata. And Knowledge Graphs, to us, really give another axis of how can I retrieve data to feed into large language models and large multimodal models. So I see it as something that has been a bit of a sidecar in very specialised parts of the industry, but now it gives you another view on the data. And honestly, I think vector search has been a big thing over the last couple of years. It had been around for a while before that. But graphs are another facet of information retrieval that complements it as well.

 

PD: Yeah, I think that’s interesting, because both large language models, neural networks, as well as the knowledge graph itself are fundamentally graph-based technologies, right? So the features we’re using now are somehow retrieved from this huge complexity that comes from graphs and different nodes connecting, and so on. On the topic of Knowledge Graphs, maybe if we stay on that side a little bit longer, could you give some examples of industry use cases for Knowledge Graphs that aren’t so dependent on a specific sector?

 

KM: Yeah, I think commonly, if you think about the sort of customer data platform concept: you have customers, and you have the end users, and the different relationships there. There’s a pretty good implicit graph structure to that. You have people that are working at companies, and those companies exist in places. A lot of times these are relational models, like a Salesforce data model or things like that, but you can map that in a lot of ways to a graph structure. And so we look at Knowledge Graphs as almost like another index to the data. And I think there are a lot of datasets, like the CRM model, that map pretty easily to a graph model as well.

 

PD: Makes sense. Yeah. So maybe to get to the bottom of it: how do Knowledge Graphs differ from other commonly used data structures? Let’s say a JSON format, something that has an intrinsic hierarchy, versus a knowledge graph, which does not, but potentially has directional links. Can you talk a bit about that?

 

00:07:04

 

KM: Yeah, you could represent Knowledge Graphs in a JSON structure or some other data structure. They’re kind of like a linked list, in that you can follow the links, or like a DAG kind of workflow. The problem is you might get recursion, you might come back around to the same element. Like, I worked at Microsoft, there might be another person that worked at Microsoft, maybe Microsoft bought their company, and then that person links back to me. So you could get cycles in that graph, and I think that’s an area where it would be hard to represent, from a serialisation standpoint, in a JSON structure. But there are ways to work around that: you only walk the graph so far, or you collapse links together.
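The cycle problem Kirk describes, and the “only walk the graph so far” workaround, look roughly like this. The graph is a made-up example with a deliberate cycle; tracking visited nodes plus a depth bound keeps the walk from looping forever.

```python
# A naive recursive walk over adjacency lists can loop forever when the graph
# has a cycle (kirk -> microsoft -> acme -> kirk). A visited set and a bounded
# breadth-first walk work around it.

graph = {
    "kirk": ["microsoft"],
    "microsoft": ["seattle", "acme"],
    "acme": ["kirk"],          # cycle back to the starting node
    "seattle": [],
}

def walk(start, max_depth=3):
    """Breadth-first walk that only goes max_depth hops and skips cycles."""
    visited = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for nbr in graph.get(node, []):
                if nbr not in visited:
                    visited.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return visited

print(walk("kirk"))  # terminates despite the cycle, visiting all four nodes
```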

 

PD: You already started talking about the challenges, or maybe also the limitations of Knowledge Graphs. Are there others, other limitations that we should be aware of?

 

KM: I think some of it is just familiarity. There are a lot of graph database companies, the Neo4js and other ones that everybody’s aware of. I think you have to have something that can support the data structure, which maybe is difficult in the ETL process, to get data into a graph; maybe people aren’t as familiar with that. We had the same problem, where what we really thought about is how do you do ETL for unstructured data, and there wasn’t good tooling. And I think there’s also maybe not as good tooling for graph ETL, even though there are vendors doing it; it’s just not as familiar to folks. So I think that’s a limitation, and it’s just not as common a data structure for people. They’re familiar with the typical relational model, and graph just doesn’t fit their mindset as well.

 

PD: That makes sense, yeah. You already mentioned natural language data. How do Knowledge Graphs contribute to data integration, and maybe semantic understanding, of such datasets?

 

KM: The way we look at it is, say you have a couple of pages of text and you’re mentioning companies, you’re mentioning people, places, like I said, and you can use a keyword search or even a vector embedding to find a company, Adobe or Amazon or someone, in that text. In the RAG process, you want to pull back the relevant text and then provide that to the LLM. But what if there’s data about the entity that’s not in the text? That’s really where Knowledge Graphs shine: maybe it’s the current year’s revenue, or how many employees they have, things you can enrich around the entity that you can then pull back from anything you’re finding via the vector embedding. So to me, it’s kind of an enrichment step. You could just retrieve on the graph itself, but then it’s more of a global set of data; you’re probably not going to get anything relevant to a specific question necessarily. Maybe the question starts with a text vector search, and then you add another set of context that pulls data in from the graph. But there may be use cases where you just want to talk to the graph itself, if you have enough information in there. And when I say pull from the graph, it’s really the metadata around the nodes, more than the node itself, which might just be a word.

 

PD: So you mentioned already, like vectorisation, embedding of data and so on, which are very important techniques when it comes to LLMs. Are there any potential other benefits of integrating Knowledge Graphs with large language models?

 

KM: The context one is the one we’ve been working on and seen the big benefit of. The one that hasn’t been tapped as much, and I started looking at this five years ago, is actually training models on graphs and having a graph embedding. As a similarity search, I want to say, okay, I’m related to my company, I’m related to my location, and you take that subgraph, sort of my personal subgraph of information, and I want to find how that’s similar to a more global graph of data. And that may give me a set of data to retrieve. There definitely are models out there, but you can’t just get them from, like, OpenAI’s APIs today. They’re not off the shelf as a service yet.

00:10:59

 

I think that’s been a bit of a limiting factor, where the bar is still a little high for integration, compared to how common embedding models are, where you can just get an API key and start using them in five minutes. So that, to me, is an area where, if we continue down the path and it starts to be more accessible to application developers, I think we’ll start to see more benefit.

 

PD: Is it possible to go the other way around to start with embedding or vectorisation and then use that to build a knowledge graph based on it?

 

KM: From clustering, yeah. We haven’t done it directly, but building clusters based on similarity and then building the knowledge graph around that, definitely. I’ve seen a couple of projects that are looking at using embeddings for that similarity clustering. I think it’s a really interesting area, because you can look at the data from different directions: grouping things by similarity, maybe doing named entity extraction on those clusters, and finding another pattern there. I think there’s a tonne of possibility in that direction.

 

PD: Makes sense. So what kind of methods or techniques are you using to integrate Knowledge Graphs with large language models?

 

KM: Yeah, the way we look at it, there are two sides to it. There’s the data ingestion path: okay, how do you get data in? How do you extract entities? And how do you store it? We started on that first. That’s the first step: you’re pulling in data from Slack, from email, from documents, from podcasts, and you have to do named entity extraction, named entity recognition, on that. That was really where we started, building up a knowledge graph from the data, so we had a good data structure to pull from. And now we’ve moved into the RAG, sort of Graph RAG, concept of, okay, now I can ingest all this data, I can have all this representation in a graph, how can I use it?

 

And that’s where we’ve done some experiments and some projects starting with vector search: okay, I have similar text chunks that I found, and those text chunks have entities that we’ve extracted from them. Then you can essentially use that as a way of expanding your retrieval and say, okay, these entities were named in there, now let’s go pull in more content that also observed those same entities. That’s a way to use graphs for expanded retrieval. Another one I’ll just mention is using graphs and faceted queries, which is actually a great way to get data analytics. We have a demo on our website where you just ingest a website, use NER to build a graph, and then essentially get a histogram: here are all the topics, here are all the people, the places, the companies that were mentioned in the data. So you’re summarising the graph into a chart form. It’s actually a really easy way to get a different view on what’s inside your data.
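The two techniques just described, expanded retrieval via shared entities and faceted summaries, can be sketched with toy data. The chunk contents and entity names are invented, and the “vector search hits” are simply hard-coded IDs standing in for a real similarity search.

```python
from collections import Counter

# Each ingested chunk records which entities were extracted from it.
chunks = {
    1: {"text": "...", "entities": {"Microsoft", "Seattle"}},
    2: {"text": "...", "entities": {"Microsoft", "OpenAI"}},
    3: {"text": "...", "entities": {"Adobe"}},
}

def expand_retrieval(hit_ids):
    """Expanded retrieval: add every chunk that observed the same entities."""
    seen_entities = set()
    for cid in hit_ids:
        seen_entities |= chunks[cid]["entities"]
    return {cid for cid, c in chunks.items() if c["entities"] & seen_entities}

def entity_histogram():
    """Faceted view: summarise the graph as counts of observed entities."""
    counts = Counter()
    for c in chunks.values():
        counts.update(c["entities"])
    return counts

print(expand_retrieval({1}))   # chunk 2 joins via the shared Microsoft entity
print(entity_histogram())
```

The histogram is the “chart form” summary from the demo: a count per topic, person, place, or company without reading any of the text.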

 

00:15:04

PD: I was about to ask about that, actually. So, just reflecting a little bit on what you said before, my question is about how Knowledge Graphs can actually contribute to the interpretability and explainability of LLMs, which is still a big problem, right? Is there a way of doing that? Do you need to have the data structure already graph based, and then you add the LLM as another layer on top? Or can you also go the other way around and just exploit the Knowledge Graphs to understand what the LLM is actually doing?

 

KM: Yes, it’s interesting. One way is to look at citations. So the LLM responds, and it gives you back a list of citations of the sources it used. You can then visualise those citations in graph form and say, okay, in addition to the text that it found, here are the topics and entities that were essentially cited. You can look at, hey, do these citations have commonality? Are they similar? Are they different? So I think that’s actually a really interesting way. I was planning on building a demo app that does that: just from the citations, do a graph search, get that data, and be able to visualise it. I think it’s really useful that way.

 

PD: If we compare Knowledge Graphs, from a computational speed perspective to other data models, is there a significant slowdown or a different type of requirement of resources when you scale that up? Like what’s your experience there? 

 

KM: There’s a lot of difference; there are so many different graph database implementations, so it’s hard to generalise. But I think you can, and now there’s even starting to be standardisation around graph query languages. From what I’ve seen, you can make it as fast as you need, you can throw more resources at it, and there are graph database architectures that are better than others. If you look at a pure query we’re making to the graph database versus just a JSON document store, it’s definitely slower. It’s maybe still milliseconds, but maybe 3x slower to do a graph query than a simple JSON lookup, a SQL-type query. But what we’ve done is balance it, where the graph is really an index. It’s almost like a vector query, where from our vector queries we just get IDs back, and then we look up that data in a faster data store, like our JSON document store. We do the same thing with the graph. So the graph itself is really just the node and edge index, but the meat of the metadata, and all the things linked to it, are in a faster data store.
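A toy version of the “graph as index” layout Kirk describes: the graph stores only IDs, and the metadata is hydrated from a faster document store keyed by ID. The store names and record shapes are assumptions for illustration, not Graphlit’s actual architecture.

```python
# Fast key-value / JSON document store holding the entity metadata.
document_store = {
    "n1": {"name": "Microsoft", "revenue": "211B"},
    "n2": {"name": "Seattle"},
}

# The graph itself holds only node IDs and edge names: a pure index.
graph_index = [("n1", "located_in", "n2")]

def graph_query(entity_id):
    """Walk the index for IDs, then hydrate metadata from the document store."""
    related_ids = [dst for src, _, dst in graph_index if src == entity_id]
    return [document_store[i] for i in related_ids]

print(graph_query("n1"))  # [{'name': 'Seattle'}]
```

The same two-phase pattern (cheap ID lookup, then bulk hydration) is what makes the vector index and the graph index interchangeable front ends to one storage layer.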

 

PD: Makes a lot of sense. Yes. So in my professional life, I’ve not come across huge use cases, so far of Knowledge Graphs. So I remember some use cases of call shopping, you know, which can be represented pretty well with graphs and so on, where links represent, basically a coupling between different entities. So I remember that, but I wanted to ask you, is there any… or what’s the largest scale use case that you’re familiar with?

 

KM: There are some massive Knowledge Graph implementations that I’ve seen from telcos, and from relationship-based work around cybersecurity, like money fraud. That’s another area I’ve seen. They’re not very deep Knowledge Graphs. It’s sort of, hey, I call you and you call them and maybe there’s some money transferred. But there are a lot of entities, and I think that’s where a lot of the different algorithms you can write on graphs come into play. We’re almost using graphs in a different way, where we’re not walking the graph with PageRank, or running algorithms on it, per se. We’re using it more as connective tissue for the knowledge, so that from any point you can pull on data. Really, it’s more like an alternate index for us to represent that data.

 

The visualisation point you made was important, where I think a lot of the common use cases I’ve seen for graphs are more visualisation centric. There are a couple of great tools out there where you can dump in a lot of data, and you can walk the graph and click into them and follow the patterns. I’ve seen a lot of use cases that are more just exploratory, in a sense.

00:19:20

 

PD: But it seems like a visualisation use case would be very limited, because it’s fundamentally limited in the end by the user. So as a visualisation tool, like I can see that that makes sense. There’s lots of applications, where data is visualised in that sense of a graph that you can then explore, disease data, protein data, and so on. 

 

KM: Yeah. The other thing, I think, is the sort of time-based index of a graph that evolves over time. You’re inserting data that was generated, say, as a Slack message, and so you can pivot on the graph and say, okay, show me the relationships in the graph over time. And we’ve done some work with geospatial as well. That’s where you could start to say, oh, hey, show me the relationships within this geo boundary and between these time ranges. It gets super interesting.
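A minimal sketch of the time- and geo-pivoted queries described here, assuming each edge is stamped with a timestamp and a lat/lon when it is observed. The field names and data are invented.

```python
# Each relationship carries when it was observed (ts) and where (lat/lon),
# so the graph can be filtered by time range and bounding box.
edges = [
    {"src": "alice", "dst": "bob",   "ts": 100, "lat": 47.6, "lon": -122.3},
    {"src": "alice", "dst": "carol", "ts": 900, "lat": 40.7, "lon": -74.0},
]

def edges_within(t0, t1, lat_range, lon_range):
    """Show me the relationships within this geo boundary and time range."""
    return [e for e in edges
            if t0 <= e["ts"] <= t1
            and lat_range[0] <= e["lat"] <= lat_range[1]
            and lon_range[0] <= e["lon"] <= lon_range[1]]

# Relationships observed around Seattle in the first time window:
print(edges_within(0, 500, (47.0, 48.0), (-123.0, -122.0)))
```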

 

PD: So, Kirk, maybe one more question on the Knowledge Graph as kind of a data framework or useful tool. If there’s a data scientist, you know, sitting on a problem, what are kind of the signals that a knowledge graph could be the right tool to get ahead with it? 

 

KM: The thing that comes to mind is the interrelationships between the data. In a typical dataset, if it’s row based, maybe there aren’t as many interrelationships between the rows. And this is a way where there’s an implicit grouping, or classification really, and that’s a lot of what we see: you can bucket data into different classes of nodes. One thing I didn’t mention before: we’re based on schema.org and the JSON-LD classification structure, the taxonomy, and that’s where there already is standardisation around what is a person, what is an organisation, and around defining metadata. So you can look at it one way where you’re mapping data to an existing graph structure, or some people take a different approach where you’re inventing your own data structure and your own relationships. But to me, it’s really about that classification metaphor of, okay, this is a something, and you can assign that, and then that becomes a relationship to other data in the graph.

 

PD: So talking about the relationship intrinsically hidden in the data, for knowledge graph, what would that be? So, I’m assuming there would be some sort of node, links between the nodes, maybe labels, also on the links, different types of links? And maybe directional links? Is there anything else?

 

KM: Yeah, you’re totally right. So you’d typically have a node name and a node label. That could be a person with my name; then there’s a place, Seattle, where place is the label and Seattle is the name; and a directional link with an edge name, so ‘lives in’ could be the edge. And you definitely want to standardise some of that, the classification with the labels and with the edges, because you want to be able to ask, okay, where have I lived before? Maybe I have a ‘lives in’ or ‘lived in’ edge to 10 different places I’ve lived. So that’s a really interesting area, and an important point: you don’t want your knowledge graph to be kind of random. It has to have some structure to it, because you want to be able to query it on the other side.

 

That’s really where some of this plays in, because we’ve standardised a concept of observations: in a piece of content, we’re observing an entity, like we recognise a place, a person, a company. So we have a bit more of an abstraction of, hey, it’s an observed entity in content, and then you can basically flip it around and say, okay, in this piece of content, show me all my observations. Or, what other content observed the same entity? Once you have this structure, it makes it a lot easier to query your graph.

00:23:19
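The observations abstraction, and the way it can be “flipped around”, can be sketched as a simple set of (content, entity) pairs. The content IDs and entity names are made up for illustration.

```python
# Each record says: this piece of content observed this entity.
observations = [
    ("podcast-ep22", "Microsoft"),
    ("podcast-ep22", "Seattle"),
    ("blog-post-7", "Microsoft"),
]

def entities_in(content_id):
    """In this piece of content, show me all my observations."""
    return {e for c, e in observations if c == content_id}

def contents_observing(entity):
    """Flip it around: what other content observed the same entity?"""
    return {c for c, e in observations if e == entity}

print(entities_in("podcast-ep22"))      # {'Microsoft', 'Seattle'}
print(contents_observing("Microsoft"))  # {'podcast-ep22', 'blog-post-7'}
```

The same records answer both questions, which is what makes the abstraction convenient to query from either side.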

 

DD: I would like to take a brief moment to tell you about our quarterly industry magazine called The Data Scientist, and how you can get a complimentary subscription. My co-host on the podcast, Philipp Diesinger is a regular contributor and the magazine is packed full of features from some of the industry’s leading data and AI practitioners. We have articles spanning deep technical topics from across the data science and machine learning spectrum. Plus, there’s careers advice and industry case studies from many of the world’s leading companies. So go to datasciencetalent.co.uk/media to get your complimentary magazine subscription. And now we head back to the conversation. 

 

DD: Kirk, could you explain the concept of graph retrieval augmented generation, or Graph RAG as it’s known, and its importance?

 

KM: It’s something that’s come up over maybe the last year or so. It’s actually an area that was really core to how we were thinking about our implementation of RAG and using our Knowledge Graph; Microsoft just released a paper on it several months ago. The concept is really leveraging a knowledge graph as extra context for the generation side, which is really the prompt you provide to the LLM. Let me contrast it: typically you’re using a vector search to find bits of text that are relevant to the question or to the prompt, and with Graph RAG you can augment the information you get back with the relationships of that text to the graph. So, like we were talking about in the extraction, if there’s Microsoft and OpenAI and Seattle in the cited text, you can expand your footprint of information. And this is where you kind of have to guide it a bit.

 

Because you don’t want to pull in everything, like all the last 10 years of Microsoft revenue or something like that. So you have to guide the graph retrieval, to say, okay, I’m asking a question, it seems financially related, go import the revenue of the last several years for the entities we’ve identified in the query. And that’s where you can then use the graph as a secondary search and retrieval mechanism. It’s still early days. There have been some prototypes out there, some papers, but I think it’s still an evolving area.

 

PD: What would be the differences between like a baseline traditional RAG system and the Graph RAG, what is the improvement?

 

KM: Graph RAG, the way we see it: with typical RAG, you’re starting with the text you’ve ingested. So text is extracted from documents, transcripts, audio files, and you’re chopping that up into chunks, and that’s what’s provided to the LLM. In contrast, with Graph RAG you can actually get data that wasn’t in the cited text. It could be data that was extracted from other documents, maybe it found the revenue of a company in a different document, or it pulled it from an API service somewhere else, to enrich the Knowledge Graph. So it’s really a way that it can see outside the domain of the cited sources and provide more context to answer the question. That’s really how we see Graph RAG.
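This contrast with baseline RAG can be sketched as a context-assembly step: baseline RAG would send only the cited chunks to the LLM, while Graph RAG also appends entity facts held in the graph, facts that never appeared in the cited text. The stored facts and prompt format here are invented for illustration.

```python
# Enrichment stored on the graph, e.g. extracted from another document
# or pulled from an external API.
entity_facts = {
    "Microsoft": {"revenue_2023": "211B USD", "employees": 221000},
}

def build_context(cited_chunks, observed_entities):
    """Assemble an LLM prompt context: cited text plus graph-side facts."""
    lines = ["Cited text:"] + cited_chunks + ["", "Entity facts from the graph:"]
    for ent in observed_entities:
        for key, val in entity_facts.get(ent, {}).items():
            lines.append(f"- {ent} {key}: {val}")
    return "\n".join(lines)

context = build_context(
    ["Microsoft announced a new Seattle campus."],
    ["Microsoft"],
)
print(context)  # the prompt now carries revenue/employee facts the text lacked
```

Dropping the `entity_facts` section reduces this to plain RAG, which is the whole difference in a nutshell.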

 

PD: What are the key components or modules within the Graph RAG framework?

 

00:27:36

KM: A lot of it is really ingestion. You have to create the Knowledge Graph, which is probably the vast majority of the work: you have to do named entity extraction, or you have to use LLMs to identify the entities, the nodes in the Knowledge Graph, and have a really rich ETL pipeline for creating it. So I’d say more than the majority of the work is just in having a pipeline that can deal with changes in data. Like, you edit a document, you remove a reference to a company, how do you remove that from your knowledge graph? Those kinds of things. Then the other side of it is really during the retrieval stage of the RAG workflow: having a way that you know, okay, I’m not just going to look at the cited text, I’m going to widen my vision and start walking the graph to pull in extra data.

 

The thing we’ve seen is that, at least right now, it has to be guided. I haven’t seen a way to make it dynamic, like, hey, I’m going to use Graph RAG if I see this scenario, and normal RAG if not. Because there’s extra work that has to be done during retrieval, and you also had to create the Knowledge Graph in the first place. But something we’re looking at is whether we can make it dynamic, so it’ll pull from the graph as needed without being guided.

 

PD: How does Graph RAG handle scalability or efficiency concerns with large-scale graph data?

 

KM: I think a lot of it is in the architecture. In ours, as I said, it’s more of an index, so we’re not storing the text, or a lot of data, in the graph database itself. So the walking, the graph queries, end up being pretty fast, and then we’re able to pull in the metadata from a faster storage layer. For scalability, when you’re going to have millions or billions of entities, you need a graph database that can handle that, and I would say most, if not all, of the existing graph database solutions can handle that scale. So that’s typically not a problem. The queries are usually fast; they’ve been tuned enough. That’s why I think a lot of the hard work, at least what we’ve seen, is on the creation side: just making sure you can actually get the right data, keep it fresh, and that kind of thing.

 

PD: You already talked a little bit about the differences between Graph RAG and traditional RAG. Can you talk a bit more about how Graph RAG addresses the typical challenges of graph representation learning, like node embeddings or graph classification?

 

KM: What I would love to get to, and hopefully we’ll get to soon, is the ability, once you have the graph created, to create embeddings of the relationships. So you have a subgraph of information, say, by page: I’ve observed these companies, these people, these places, and their relationships on a page of text. Then you’d be able to run higher-level algorithms, like graph embeddings, and say, hey, what is similar to this page of text with its graph relationships? Because today, you can only run a vector embedding on the raw text. I think having a graph embedding that takes both, or a multi-vector embedding, could be really interesting. And I could see that evolving over the next year or so as this becomes more commonplace.
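One simple way such a “multi-vector” page embedding could be assembled, sketched with tiny hand-made vectors rather than real model outputs (the fusion scheme here, plain concatenation with a mean over entity vectors, is just one of several plausible choices; weighted sums or learned fusion would work too):

```python
# Sketch: combine a text embedding with the mean of the embeddings of the
# entities observed on the same page, yielding one "multi-vector" per page.

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def page_embedding(text_vec, entity_vecs):
    # Concatenation keeps both signals separately visible to the index;
    # a real system might instead learn a joint embedding over the subgraph.
    return text_vec + mean_vector(entity_vecs)

text_vec = [0.1, 0.9]                     # embedding of the raw page text
entity_vecs = [[1.0, 0.0], [0.0, 1.0]]    # embeddings of entities on the page

print(page_embedding(text_vec, entity_vecs))  # [0.1, 0.9, 0.5, 0.5]
```

Similarity search over these combined vectors would then surface pages that are close both in wording and in who or what they talk about.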

 

PD: Sounds quite interesting. What are some current limitations or potential drawbacks of Graph RAG?

 

KM: I would say the biggest is just that it does take more work. You could go back and backfill the graph from the text, but if you’ve ingested, say, 1,000 documents and you want to create the graph, that may be a little more trouble. We make you opt in, so you can set up a workflow and say, hey, I want to build an entity graph from this data. There’s a little bit of extra cost: you’re doing some analysis on that text, you’re running LLMs, you’re using up tokens. I would say that’s probably the biggest

 

00:31:28

 

limitation: just managing cost and scale. Because if you have a tonne of data, maybe you want to pick and choose the data that you want to create a Knowledge Graph from. But there are other solutions. If you’re not assuming a cloud-hosted model, like the OpenAI API or something like that, you can build a lot of this in your own data centre with local models and keep the cost down that way.

 

PD: It makes a lot of sense. Yeah. We already talked about use cases of Knowledge Graphs before. Could you share some examples of real-world applications of Graph RAG where it has been successfully used to tackle a particular problem, maybe?

 

KM: Yeah, this is one where I’ll probably have to say I don’t know, because it’s still more of an exploration phase right now; it’s so new, and we’re still learning what the capabilities are. We’re actually looking for more of those use cases to understand how this applies, and we’ve done our own prototyping and testing. But I think it’s still early days, honestly, for this area. I’m hoping over the next six months to a year we’ll start to get more in-production Graph RAG use cases that we can get some more feedback on.

 

PD: Makes sense, yes. What are the problems that the development community behind Graph RAG is currently still working on?

 

KM: Yeah, I think the biggest thing I’ve seen is that there’s not really any standardisation on what Graph RAG means. I think everybody has their own interpretation of it. There was a Microsoft paper that put a sort of line in the sand about it, and a lot of what they described we were already doing; we just hadn’t talked about it in that way or put a word to it. But I think we’ll start to see a bit more coalescing. Some people don’t agree with this, but I think RAG has standardised a lot. Within the RAG pattern, per se, there are still knobs you can turn, like whether you use re-ranking or not, those kinds of things. But the concept of RAG, I think, has stabilised. We’re not there yet with Graph RAG, and so I think we’ll have to go through a wave of trying things, seeing what works and what doesn’t, before it settles into a pattern.

 

PD: Kirk, are there any other topics that you would like to talk more about?

 

KM: The other area is just exploring the diversity of knowledge in multimodal data. We’ve talked a lot about text and creating entities, but there’s also a lot you can do with visual entity detection. We’re talking about Seattle, so, say, seeing the Space Needle in a picture can create a node in a graph. I just want to make sure we don’t get locked into the text world. This all applies to multimodal data as well: obviously audio transcriptions, but there could be things seen in a video or image that become nodes and are available to Graph RAG the same way as something in the spoken or written text.

 

DD: So Kirk, a lot has been said about how Knowledge Graphs can help reduce a great deal of the problem of hallucinations with LLMs. Could you perhaps talk about that for a second?

 

KM: Yeah, I think the grounding concept of providing sources for the LLM to pull from is really at the heart of creating an accurate RAG pattern. And I think the ability to have graphs and pull in more context feeds into providing that extra accuracy. It’s somewhat speculative and only prototyped at this point; I wouldn’t say we have great numbers on exactly the difference. But what we’ve seen is that you can wind down the amount of hallucinations by proper grounding, by giving good content sources, even by some prompt engineering, to really have the LLM focus on the content you’re giving it, not what

 

00:35:32

 

it’s been trained on. And so we haven’t struggled as much with hallucinations, if you have good retrieval. So I think Graph RAG really feeds into that: okay, I’m providing more context, the model has more data to chew on that isn’t in its training set, and that should minimise the amount of hallucinations because you’re giving it a wider set of context.
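The prompt-engineering side of grounding can be as simple as wrapping the retrieved sources in explicit instructions. This is an illustrative template only; the exact wording is an example, not a prescribed or vendor-specific prompt:

```python
# Illustrative grounding prompt: instruct the model to answer only from
# the supplied sources and to cite them, rather than relying on training data.
def build_grounded_prompt(question, sources):
    context = "\n\n".join(
        f"[Source {i + 1}] {text}" for i, text in enumerate(sources)
    )
    return (
        "Answer using ONLY the sources below. If the answer is not in the "
        "sources, say you don't know. Cite sources as [Source N].\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "Where is the Space Needle?",
    [
        "The Space Needle is a landmark in Seattle.",
        "Graph RAG widens retrieval by walking entity relationships.",
    ],
)
print(prompt)
```

With Graph RAG, the `sources` list simply gets richer: it holds not just the vector-search hits but also the extra chunks pulled in by walking the graph.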

 

DD: Looking to the future, what are the developments or advancements that you see coming with Knowledge Graphs, or Graph RAG that you’re most excited about?

 

KM: Yeah, I think bringing Graph RAG to market and building Knowledge Graphs. We’ve really seen a swell of people talking about it on Twitter and Reddit, more papers coming out. So you can see the momentum around awareness and implementation, though a lot of what’s out there is still in the demo phase. We’ve been really focused on the ingestion path and constructing the Knowledge Graph, and we’ll be releasing something more formal as a Graph RAG feature very shortly. What we’re seeing in the market is that as people’s awareness grows, we’ll start to see more people talking about it and implementing it. But it brings up a whole other question of evals: how do you even evaluate that this is better? Some companies have been doing RAG evals, but I don’t know how well those will apply to Graph RAG evals, and that can be kind of tricky. But that could open up a whole new market for products or companies to help with that area.

 

So I think that’s one area: evals aren’t going away, they’re only going to become more important. And then, really getting into the future of where the RAG pattern can be useful, we see two areas. One is repurposing the content you’ve ingested. Using the retrieval part of this to find interesting information in a large dataset might be good for marketing materials or technical reports: you give a rough outline of what you want to pull from or create, like a blog post or report, and use the graph to fill in the details. And that graph could be constructed from structured or unstructured data. I think that becomes really interesting, where we can use this pattern with a really rich dataset to create really high-quality content.

 

Then the other is the agent concept that’s starting to be talked about. I think there are some open source projects working on this. There’s also a lot of past work around actor models and distributed architectures that I think we don’t want to forget about; there’s a lot we can pull from. I think that’s a pattern that can apply, where the RAG concept is really just a set of functionality that gets called from an agent, and you can feed the output of one agent into the input of another. From our perspective, we see them as two different layers, where RAG is more analogous to a database query, with some input and some output, and then you have a sort of programming system on top of it for asynchronous agents and those kinds of things.

 

So we’ll see how it evolves. Other people have different perspectives where maybe they’re more integrated; I see it a little bit more as a workflow layer, kind of a graph workflow, and then RAG is this functionality that fits in underneath it.

 

DD: So perhaps, Kirk, you can tell us a little bit about Graphlit and your approach to simplifying this whole fascinating area, and the work that you’re doing there?

 

KM: For sure. It’s really interesting. I’ve been looking into this unstructured data platform space for several years, after having a personal interest in Knowledge Graphs, and started to see how the worlds collide. You want to get the data

 

00:39:44

ingested and extracted and prepared, and then creating the Knowledge Graph fits right into that. So we’ve been working on this for a few years now, and we now offer a platform for developers to build what we’re calling vertical AI apps. You bring your data for a domain-specific area like healthcare, FinTech, advertising, marketing, really anything, and we handle the full end-to-end pipeline.

 

We’re contrasting that against a lot of the open source projects out there today that are kind of DIY, where you have to pick a vector database vendor, pick a PDF extraction vendor, and cobble it together yourself. We’re really focused on simplicity for developers: it’s one API call to put data into us, one API call to search or to have a conversation with the data. We also offer things like publishing, where you use LLM prompts to repurpose a bunch of content, and it fits nicely because it’s all built on this retrieval model over the dataset generated by our ingest pipeline. So I think we’re taking more of a content-first approach: let’s get the content into the system, indexed, stored, and prepared in a way that we can use it, and then you can pull from it via search or RAG or anything you want to do later. But we really make sure it’s a really simple programming model.

 

We offer direct API access, but also ship a Python and a TypeScript SDK. We’ve just built a bunch of Streamlit sample apps. One of them that I thought was really interesting was extracting website topics: it uses the entity extraction we talked about, people, companies, topics, and then charts that. So it’s using a lot of the same core pieces for a more classic data visualisation model. One of the things I just haven’t gotten to, but we’re going to do in the next couple of weeks, is a graph representation and a graph viewer. Right now we can create the graph, and then in the sample app we’ll show a way that you could walk the graph and visualise the data you have. I think that’ll connect the dots for people to see: here’s the data I’ve been extracting and the relationships, and how can I see the relationships to other pieces of content?

 

Right now we have the open source SDKs, which are free to use for up to a gigabyte of ingested data, so you can start playing with our sample apps or start to code against it. And then we’re essentially usage-based: there’s a small platform fee to get started, and then you just pay by how much data you put in and use. Compared to what we’ve seen, for some companies we’re saving even the cost of a developer, where instead of having to build a whole pipeline from scratch, you just get an API to use. Our focus is to open this kind of technology and these AI models up to another set of developers who maybe just wouldn’t have the expertise to integrate them into their applications.

 

DD: In terms of the open source work that’s going on, is there anything there that you’d advise people to check out?

 

KM: Yeah, I’ve seen a lot of really good things. There’s the Instructor library for data extraction, which I think is really interesting in this space. There’s Yohei on Twitter, who does a lot with Knowledge Graphs; he’s been building a lot of different samples around extracting information into graphs and visualising them. He does a lot of great work. Then the last one I’ll mention is CrewAI, which I’ve been following; they’re doing AI agents and things like that, which I think is getting some good interest, and I like where they’re going. There are always great projects coming up; those are some of the ones I’ve been keeping my eye on.

 

DD: Unfortunately, that concludes today’s episode. Before we leave you, I just wanted to quickly mention our magazine, The Data Scientist. You can subscribe for free at datasciencetalent.co.uk/media, and we will be featuring this conversation in the September

00:44:07

 

issue of the magazine, as a transcribed and edited version. Kirk, thank you so much for joining us today. It was an absolute pleasure talking to you.

 

KM: Yes, thanks so much for the opportunity. We really got into some great areas to explore, and it was a great time being here.

 

DD: Thank you also to my co-host Philipp Diesinger, and of course to you for listening. Do check out our other episodes at datascienceconversations.com, and we look forward to having you with us on the next show.

 

00:44:40