
Episode 21

Using Open Source LLMs in LanguageTool for Grammatical Error Correction (GEC)

Bartmoss St Clair



At LanguageTool, Bartmoss St Clair (Head of AI) is pioneering the use of Large Language Models (LLMs) for grammatical error correction (GEC), moving away from the tool’s initial non-AI approach to create a system capable of catching and correcting errors across multiple languages.

LanguageTool supports over 30 languages, has several million users, and over 4 million installations of its browser add-on, benefiting from a diverse team of employees from around the world.

Show Notes

Episode Summary

  1. LanguageTool decided against using existing LLMs like GPT-3 or GPT-4, because developing their own models offered cost, speed, and accuracy benefits, focusing on a balance between performance, speed, and cost.

  2. The tool is designed to work with low latency for real-time applications, catering to a wide range of users including academics and businesses, with the aim of accurate grammar correction that is not intrusive.

  3. Bartmoss discussed the nuanced approach to grammar correction, acknowledging that language evolves and user preferences vary, necessitating a balance between strict grammatical rules and user acceptability.

  4. The company employs a mix of decoder and encoder-decoder models depending on the task, with a focus on contextual understanding and the challenge of maintaining the original meaning of text while correcting grammar.

  5. A hybrid system combining rule-based algorithms with machine learning provides nuanced grammar corrections and explanations for the corrections, enhancing user understanding and trust.

  6. LanguageTool is developing a generalized GEC system, incorporating legacy rules and machine learning for comprehensive error correction across various types of text.

  7. Training models involves a mix of user data, expert-annotated data, and synthetic data, aiming to reflect real user error patterns for effective correction.

  8. The company has built tools to benchmark GEC tasks, focusing on precision, recall, and user feedback to guide quality improvements.

  9. The introduction of LLMs has expanded LanguageTool’s capabilities, including rewriting and rephrasing, and improved error detection beyond simple grammatical rules.

  10. Despite the higher costs associated with LLMs and hosting infrastructure, the investment is seen as worthwhile for improving user experience and conversion rates for premium products.

  11. Bartmoss speculates on the future impact of LLMs on language evolution, noting their current influence and the importance of adapting to changes in language use over time.

  12. LanguageTool prioritizes privacy and data security, avoiding external APIs for grammatical error correction and developing their systems in-house with open-source models.

Bartmoss St. Clair: LinkedIn profile



Speaker Key:

DD Damien Deigh

PD Philipp Diesinger

BC Bartmoss St Clair


DD: This is the Data Science Conversations Podcast with Damien Deigh and Dr. Philipp Diesinger. We feature cutting-edge data science and AI research from the world’s leading academic minds and industry practitioners, so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, the data science recruitment experts [upbeat music]. Welcome to the Data Science Conversations Podcast. My name is Damien Deigh, and I’m here with my co-host once again, Philipp Diesinger.

PD: Hi guys.

DD: So, today we are talking to Bartmoss St Clair about industry use cases for large language models. And by way of intro, Bartmoss holds both maths and physics degrees from Heidelberg University. He’s been working as an AI researcher and engineer for many years at the likes of Harman International and Samsung. He has also been a guest researcher at the Alexander von Humboldt Institute in Berlin. Currently, Bartmoss is the head of AI at LanguageTool. They are a German software company that has a writing assistant for multilingual proofreading, grammar and style checking, and they do that in over 30 languages. Bartmoss has a deep understanding of the maths behind AI and has been developing products in the NLP space since 2012. He works as an advisor to boards and companies, large and small, and he has even found the time to develop an open-source community he created called Secret Sauce, where they focus mainly on voice assistants. We’re delighted to have him here. Welcome to the podcast, Bartmoss.

BC: Thank you, Damien. It’s a pleasure to be here. This is actually my first time on a podcast, so I’m actually extra pumped for this.

DD: Great. So, we normally start with your own story, Bartmoss. So, please do tell us how did you go from maths and physics at Heidelberg into data science and AI?

BC: I ask myself that all the time. Honestly, what happened was ever since I was a kid, I really wanted to be a physicist, and I did that for many years at university. I just found, honestly, that the academic life didn’t suit me, it just didn’t fit for me. And when I started questioning what I wanted to do next, an opportunity presented itself. A colleague of mine, his father is a professor and worked with AI way back then, in, I think, 2012 or 2013, somewhere in there.

And there was this very interesting project with natural language processing, dealing with automating content governance systems for banks. And purely because of nepotism, honestly, sometimes it’s not what you know, it’s who you know. I got involved in that and I started studying and learning about it as quickly as possible. And then I founded a company to build up the solution for, I think it was five or six different languages. And I mean, this was back in the stone ages of NLP. And I really discovered a great passion for this. And I just knew that’s what I wanted to do for the rest of my life. And sometimes you just get lucky like that, I guess.

DD: And obviously, you find yourself now at the cutting edge of actually using LLMs in the business world. We’re obviously in the very early stages of these use cases, but you have some solid ones to talk about. So, let’s start there. Can you give us an overview of what you’re doing at LanguageTool?

BC: So, at LanguageTool, one of the use cases we have, which of course is the primary use case for us, is grammatical error correction, or GEC for short, where obviously someone writes something like a sentence or a text, and then they want their grammar checked, and they want errors replaced with the correct grammar. That’s, of course, a very basic use case. LanguageTool itself has existed for about 20 years, but of course they didn’t use AI or machine learning back then. As head of AI, one of the things we wanted to do was create a general grammatical error correction system, able to catch all kinds of errors for all languages possible and correct them. Now, that’s a kind of interesting use case in my opinion. I mean, back when I was at university I actually taught English, and I always had an interest in language, and I corrected a lot of grammar back in the day.

So, it really fit with what I did. And how it really works is simply that a user first writes a text, and then of course our system needs to somehow be able to correct it. Now, how exactly does that work? Well, you’re going to want to use a model for that. And one very big question there is: do you use a very large model that exists already, something like GPT-3 or GPT-4, and just use prompting, or do you create your own models for this? And one thing we found is that if you create your own models to do this very specifically, it’s on one side cheaper, but on the other side it’s faster and it actually works better and scores better. And so, we’ve created our own models for doing exactly this kind of task. There are a lot of questions there when it comes to the business case, with latency: how accurate or correct do you want your system to be?

There are a lot of trade-offs with, of course, what kind of resources you have to run this in production. Of course, when you run for millions of users, you have to make sure to have a good trade-off between performance, speed, and price, right? There are also a lot of discussions about whether to use encoder-decoder models, something like [inaudible 00:05:53], T5, or other sequence-to-sequence based models, or just a decoder model. Decoder models have become very, very popular, and we’ve seen a lot of scaling behind decoder models, such as GPT-3, GPT-4, LLaMA, LLaMA 2, et cetera, with many more coming out, it seems, every week, and there are a lot of great tools for them. But sometimes the question is how big a model do you need for your purpose? And you have to benchmark that and test it to see how well it works.

PD: Can you dive a little bit deeper to give us a better understanding of the business case? Like who are the users, how does it work? Is it a real-time application? Is it offline, online? Like, how do we have to envision this grammar tool?



BC: This is a real-time application that needs to work with low latency as users are typing in a document, or on a website, or in any way in their browser. For example, we have an application for the desktop. There are many, many ways you can use LanguageTool. And of course, it needs to work as quickly as possible; the use cases are completely varied in this case. I mean, it can be academics writing, it could be for business. We have both B2B and B2C customers. So, really, we don’t have a one-size-fits-all for our customers. In the end, it really varies, but one thing is for sure: we need to find a good balance between grammar correction and annoying people. That’s something that’s really kind of funny with these systems, people just assume that the grammar’s either correct or not.

But there are cases where we’ve seen that many users don’t like a rule, and you look at the analytics and you think, well, technically it’s correct, but if enough people don’t like something, you have to debate whether you want to turn it off. And a good example of this is in English with “to whom” or “for whom”. A lot of times nowadays in English, people just say “who”. And grammar changes over time, and we have to be mindful of that. Another funny case there is that a lot of speakers of Slavic languages don’t use the definite and indefinite articles, “the” and “a”. You notice that they say they don’t want this rule, which a lot of times suggests an edit that puts in an indefinite or definite article. But in the end, it is correct. I hate to tell you guys, it’s correct to use those articles there. We’re pretty sure about that. And you have to really strike a balance with your users, and I think that’s something really, really important there.

PD: How many users do you have?

BC: We have several million users in many different languages, six primary languages. However, we support over 30 different languages worldwide. And we have employees from all over the world. It’s a very great and diverse place to work, that’s for sure.

DD: You mentioned decoders and encoder-decoders. What are the different scenarios where you might recommend one over the other?

BC: I mean, it really depends, of course. If you’re starting out by maybe prototyping, maybe you want to use a very large decoder, which is very popular nowadays, using a prompt so that it has some sort of emergent behavior, so that you can just do zero-shot or few-shot prompting; that’s quite good for that. But maybe you want to do a task like translation, for example. And generally, it’s found that sequence-to-sequence encoder-decoder models perform very well for that. And of course, you have to think of things like the context window. Do you want to handle large amounts of text, or do you just want to work on a sentence level? There are a lot of things to consider here.

DD: So, how does GEC work in practice then? And what are the challenges with what you’re doing?

BC: Well, for GEC in practice, of course you have to start with some really good data. And generally, you want to fine-tune a model with very good data where you have sentence pairs, or text pairs, between one that could possibly have mistakes in it and one that is completely golden. Once you have fine-tuned your model, of course you need to check to see how it’s working. Now, you would think it would be quite simple to check how well it’s performing, but you have to remember there can be multiple valid grammar corrections for a sentence. So, you have to be able to handle that. And I think that’s a very interesting challenge where it’s not just black and white.

Other challenges are, of course, things like hallucinations or extreme edits, if you want to call it that; you don’t really want the output to be changed too far. You want it to keep the same meaning as the original sentence, and maybe you don’t want certain words changed, you just want the grammar fixed. And there’s always a risk with these models that they will change much more. And there are a lot of different tools to solve these types of problems, everything from checking edit distances with Levenshtein distance to checking similarity with cosine similarity. And there are a lot of different approaches there. There are also some interesting things that I’ve read about with tagging edits.
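The over-edit guard Bartmoss describes can be sketched in a few lines. This is purely an illustration of the idea, not LanguageTool’s code; the 0.3 threshold is an arbitrary assumption.

```python
# Sketch: reject a model's correction if it strays too far from the input,
# measured by Levenshtein edit distance normalised by input length.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def accept_correction(source: str, corrected: str, max_ratio: float = 0.3) -> bool:
    """Accept only if the relative edit distance stays below max_ratio."""
    if not source:
        return corrected == source
    return levenshtein(source, corrected) / len(source) <= max_ratio

print(accept_correction("She go to school.", "She goes to school."))      # → True
print(accept_correction("She go to school.", "Education matters a lot."))  # → False
```

A small grammatical fix passes the gate, while a wholesale rewrite is rejected even if it is fluent, which is exactly the "extreme edit" failure mode described above.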

There’s something called ERRANT, which is very popular, especially academically, for tagging sentence edits; that can also reduce this issue of over-editing, of over-changing a sentence. And also, sometimes there’s a question of how much you can change a sentence before you shouldn’t change it anymore. The great thing about LanguageTool, like I said, is that it’s existed for 20 years, and so there are a lot of standards and practices in there that developed through building a rule-based system, which we could inherit into doing this with artificial intelligence.

PD: The rule-based system that you mentioned from the past, that system is still being used at the moment? So, you have a hybrid system, generally AI on top of the rule base?

BC: Absolutely. When it comes to really basic things that can be formulated into rules, it’s very cheap and very accurate, and it works. There’s no reason to fix that. A lot of times it’s the more complex, contextual grammar issues, where you can’t just create a simple rule because there are so many exceptions, that machine learning is ideal for. And of course, running machine learning models with inference in production can be more expensive than just writing a simple rule with, let’s say, regex, or Python, or whatever you would want to use.

PD: So, Bartmoss, you explained that there are basically rule-based systems and you also use AI, with the rule-based systems based on, for example, regular expressions. So, how do they work together with the AI systems?

BC: Well, one thing that is very important for us is that we don’t just correct the grammar, but we explain to the user why, the rules behind it. And every time we have a match, we also have to explain the reasoning behind it. And this means that every match has a unique rule ID. So, for certain rule-based systems, you have an ID for those, and you have an ID for all of the types of matches that can occur in the machine learning aspects, and you can then prioritize those rules. For example, let’s say that nine times out of ten there’s a rule that works with the rule-based system, but then it triggers for that one time where there’s a deeper context; you can prioritize the AI model over the rule-based system. And it works pretty well, actually. I mean, I’m honestly sometimes surprised. We don’t have endless correction loops we get stuck in, or anything like that. We handle that pretty well.
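The rule-ID prioritization described here can be made concrete with a small sketch. Everything in it, the `Match` fields, the rule IDs, the priority scheme, is a hypothetical illustration, not LanguageTool’s implementation.

```python
# Hypothetical sketch of merging rule-based and ML matches by priority:
# when two matches cover overlapping spans, only the higher-priority one
# (e.g. the contextual AI model) is shown to the user.

from dataclasses import dataclass

@dataclass
class Match:
    rule_id: str   # unique ID used to explain the correction to the user
    start: int     # character span in the text
    end: int
    priority: int  # higher wins when spans overlap

def merge_matches(matches: list[Match]) -> list[Match]:
    """Keep at most one match per overlapping region, preferring priority."""
    kept: list[Match] = []
    for m in sorted(matches, key=lambda m: -m.priority):
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)

rule_match = Match("COMMA_BEFORE_CONJUNCTION", 10, 18, priority=1)
ai_match   = Match("AI_GEC_CONTEXTUAL",        12, 20, priority=2)
print([m.rule_id for m in merge_matches([rule_match, ai_match])])
# → ['AI_GEC_CONTEXTUAL']  (the AI match wins on the overlapping span)
```

Non-overlapping matches from both systems survive side by side; only conflicts are resolved by priority, which mirrors the "rule works nine times out of ten, AI takes the deeper-context case" behaviour described above.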



PD: So, Bartmoss, you mentioned that you were trying to develop a generalized system, generalized GEC. How far away are you from that, or how do people need to think about your AI system? Is it like specialized agents that all specialize in one task that they solve and the rule-based system then uses them depending on the need? Or is it really like one generalized AI system that basically answers some of the questions?

BC: Like I said, we do keep our legacy rules where they function very well and always work. I mean, there are certain cases where something’s always going to be true, like capitalizing at the beginning of a sentence or punctuation at the end, and so on and so forth. Very basic things. But for general grammatical error correction, this is something that we’ve worked on for a while and actually do have running in production. It’s quite an interesting system. As I said, we have these many, many layers in production, from rule-based systems, to specific machine-learning-based solutions, each for one specific type of correction. We call that system Hydro Leo. And we also use what we call GEC, the general grammatical error correction, which can solve everything. Oh, well, not everything. There are certain things you can’t teach a model sometimes and you have to handle in post-processing, but I don’t know, I hope I’m not giving away too many secrets here [laughter].



PD: How much data do you need to train those models? Like what’s the role of the data? What kind of data do you use? Is it all natural data? Is it synthetic data?

BC: In the end, it is a mixture. And I mean, we do have opt-in user data that we collect from our website. We also have golden data that we’ve annotated internally ourselves and from our language experts. We’d be lost without them. And so, we have people for every language. We have several people who actually go through and review and annotate data for us, which is very, very, very helpful. We do also generate data, absolutely, synthetic data. And generally, when it comes to these models, you have to ask yourself do you want to train these things in stages or do you want to train them all in one go? Is it better to use synthetic data for a part of it or should you just use fewer points of data and focus on quality?

These are a lot of questions that we have to answer constantly. And I don’t know how much detail I can go into there, but it’s quite an interesting mixture of tasks and methods that we use, with generation, with user data, with data that we internally create and annotate. One thing that’s very important, that you have to watch out for, is of course the distribution of the data. It really should be how it is for the users, and we want to keep that distribution as close to how the users actually use this as possible. And so, you don’t want to get too far away with your data distribution, especially with errors. I mean, there are errors that occur more often than others, spoiler alert: for example, commas. There are a lot of errors with commas. And so, you see those way more often than certain other types of errors. And you want to make sure that the distribution stays roughly the same, otherwise you won’t have as good a performing model.
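The point about synthetic data mirroring the real error distribution can be illustrated with a toy generator. This is an assumed approach for a single error type (dropped commas), not LanguageTool’s pipeline; the rate parameter stands in for the frequency observed in real user data.

```python
# Illustrative sketch: build synthetic (noisy, gold) training pairs by
# injecting one error type, comma deletion, at a rate chosen to mirror
# how often users actually make that error.

import random

def corrupt(sentence: str, comma_drop_rate: float = 0.5, seed=None) -> str:
    """Randomly delete commas to create the noisy side of a training pair."""
    rng = random.Random(seed)
    return "".join(ch for ch in sentence
                   if not (ch == "," and rng.random() < comma_drop_rate))

gold = "However, the model, unlike the rules, sees context."
noisy = corrupt(gold, comma_drop_rate=1.0, seed=0)
print((noisy, gold))
# with drop rate 1.0, every comma is removed from the noisy side
```

A fuller version would maintain one such corruption per error category, with rates estimated from opt-in user data, so the synthetic mix stays close to the real distribution.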

DD: Yeah. It takes me back to the very first podcast we did, which was with Professor Philipp Koehn from Johns Hopkins University, and it was on machine translation. And he talked about the very subjective nature of language. So, given that, how do you effectively benchmark the GEC and its tasks?

BC: That’s something we’ve had to build a lot of tools for from scratch, because that just doesn’t really exist out there. And as I said, you can get something marked wrong that is actually a correct correction, because there is more than one way to correct a sentence, right? Or a text. And in those cases, the most obvious method, which is kind of tiring sometimes, is just to collect every possible variant of a correct version of a sentence. But that’s a monumental task, and it’s not always the best way to do it. There are a lot of other ways you can do this; obviously, you need to ask yourself what’s more important, precision or recall? When it comes to the users, we do have some really good analytics there.
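The "more than one correct version" problem can be sketched as multi-reference matching: a system output counts as correct if it matches any of the collected gold variants. This is a deliberately minimal illustration; real GEC benchmarks align individual edits rather than comparing whole strings.

```python
# Minimal sketch of multi-reference scoring for GEC: a hypothesis is
# correct if it matches ANY gold reference after light normalisation.

def normalise(text: str) -> str:
    """Collapse whitespace so trivial spacing differences don't count."""
    return " ".join(text.split()).strip()

def is_correct(hypothesis: str, references: list[str]) -> bool:
    hyp = normalise(hypothesis)
    return any(hyp == normalise(ref) for ref in references)

refs = ["We do not know who it is.", "We don't know who it is."]
print(is_correct("We don't  know who it is.", refs))  # → True (matches 2nd reference)
print(is_correct("We know who it is.", refs))         # → False
```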


I mean, we can’t see what people are writing or anything like that. We are very data-privacy focused. Of course, if you opt in to have your text used for machine learning purposes by us, then obviously we can, but generally, we can’t read what people are writing; that’s not saved anywhere. What we can see is whether a user applied our suggestion, said no to it, or just ignored it. And we guide ourselves a lot by that. I think, honestly, there is no 100% silver bullet, it’s just a barrage of methodologies there to ensure that we are giving our users the best possible quality corrections that they can trust, whether they’re writing a thesis or a legal document. And we take that very, very seriously. And I think because we’ve been doing this for 20 years, that gives our brand a very big strength, because we have ironed out a lot of those issues.

PD: And how do you measure the performance of your system?

BC: Like I said, there are a lot of different measures we use for performance: obviously, the typical F1 scores and things like that when you’re training or fine-tuning a model, generally on your evaluation dataset, that is something you look at. And of course, that’s not 100% foolproof, because the model can offer suggestions that are correct but that you might not have in your references. But it’s a good rule of thumb. Also, as I said, the user analytics. When you put something online you can see very quickly if the users really hate it. We do partial rollouts, we do A/B tests, obviously, the standard practices that you see within companies when it comes to rolling out models; we like to release regularly, start with partial rollouts, and just see what our users say to it.
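The F1 scoring mentioned here can be shown in a few lines, treating each correction as an edit tuple. Again an illustration only; production GEC evaluation (e.g. with ERRANT) first aligns and classifies edits before counting them.

```python
# Sketch: precision, recall and F1 over sets of edits, where an edit is a
# (start, end, replacement) tuple extracted from a sentence pair.

def f1_score(system_edits: set, gold_edits: set) -> tuple:
    """Return (precision, recall, F1) for system edits against gold edits."""
    tp = len(system_edits & gold_edits)              # edits both agree on
    precision = tp / len(system_edits) if system_edits else 0.0
    recall = tp / len(gold_edits) if gold_edits else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {(4, 6, "goes"), (17, 17, ",")}           # two required corrections
system = {(4, 6, "goes"), (30, 34, "their")}     # one hit, one spurious edit
print(f1_score(system, gold))  # → (0.5, 0.5, 0.5)
```

Precision punishes the spurious edit, recall punishes the missed comma; the choice of which matters more is exactly the precision-versus-recall question raised above.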

That’s a very important metric for us, the one from the users. We also do manual reviews, obviously; that’s also very time-consuming. But something we find very important is taking subsets of data and having professional language experts review them. And that’s one point that I think is quite important: these aren’t just the usual data annotation folks, these people are very specialized in language, because a regular person who does data annotation is not really qualified to know exactly what’s 100% grammatically correct in a language. Being a native speaker isn’t enough; otherwise we wouldn’t have a business case, because people wouldn’t make mistakes, right? So, you have to find very, very highly qualified people to review the data, who really have a very deep understanding of language.

DD: Can you talk about the platform before LLMs and how it has improved the user experience and the performance now that you have LLMs?

BC: There are two main use cases for us with LLMs. One, of course, is general grammatical error correction, obviously, and we also use them for rewriting or rephrasing. In these two areas, well, first off with paraphrasing, you can’t really do it without these kinds of models. And it was a huge use case for us, and it created a lot of stickiness when we first released it; we saw it was a really big hit, and that completely changed things, because it’s something you couldn’t really do before with previous systems. And of course, with grammatical error correction, you can really catch things that you couldn’t catch before. Before such models, the best we could do was train much, much smaller models on a very specific error, and each would just handle that specific error, something like a missing comma.

And you would have to think of all those cases. And that’s very similar to rule-based, where you have to literally think of each and every type of error possible. Whereas when you’re training a model, as long as you can correct the output sentence correctly, it can catch a lot of things that are much deeper, that you would never think about, very deep contextual things. And you can change things with style, you can improve fluency. There are so many use cases for language, and we can bring that into all kinds of applications. I mean, I myself use it, of course, every day, which is really nice. I get a free premium account, so I’m using it in my Slack, I’m using it in my emails, and I find myself, even though I taught English for several years, thinking, oh, I missed a comma there. And that’s quite helpful. So, [laughter].

PD: Regarding that switch to using LLMs more, did that have an impact on the cost side of the consideration?

BC: Yeah, absolutely. There was a cost trade-off there, because we host all of our own infrastructure for this. With our GPU servers, we found that it’s kind of a no-brainer. You get higher retention when you find more errors and improve people’s spelling and grammar. And when you do this, you get better numbers, you get better conversion rates for premium. So, it’s a worthy investment for us. Of course, you really have to consider these things with costs, and bringing the cost down is something that we actively do. And we’ve worked a lot on compressing models, accelerating models, and that’s its own little niche area. These things aren’t plug-and-play, as I assumed they were; I thought we could just plug it in. And you end up pulling your hair out trying to get some framework for compression working, something that’s new, and you think it should just work, and you end up building all of your own stuff and even fixing upstream bugs, and oh my goodness, I could talk a whole podcast just about that, I think [laughter].



PD: I can imagine that. A little bit of a philosophical question: you mentioned at the beginning that language is something that’s constantly changing, but now we are entering an era where LLMs and generative AI have an impact on our language, right? It’s correcting us, and it’s writing lots of articles and texts and so on that are machine-generated. Do you think that at some point in the future, in 10 years or so, LLMs will start having an impact on shaping language?

BC: I think to some degree they already do. I mean, we can, a lot of times, detect whether a longer text is generated. And I’ve noticed a lot, by playing around with different LLMs, that they like to inject certain words all the time. And I don’t know if that’s for watermarking purposes or what it actually is, because a lot of times they’re not very transparent about these things. But I’ve noticed that, and I think as we rely on them more, we will definitely see a greater impact on language. But it’s always important to note that these things are just predicting the next token based on a huge amount of data from people. And so, as long as you’re training or fine-tuning new models, that will change with the times also.

We might have to add in new vocabulary, possibly, but otherwise, I think the thing with language is that language is very fluid and it changes over time, and we have to change with it. And I think there’s nothing wrong with that. And maybe one day we will remove the indefinite articles in language. Who knows? I don’t know. I mean, how much longer before, in German, there might not be a formal “you”, or in other languages? All of these things are constantly changing, and there are so many English words invading other languages as things become way more global. And I think that will continue, and as long as we’re able to communicate with each other effectively and beautifully, I don’t think there’s anything wrong with that.

PD: And you mentioned a point that I found quite interesting. So, you are trying to detect whether text or some data has been generated by a model or whether it was written by humans. I’m assuming you need to do that to ensure input data quality. How do you approach this?

BC: We don’t specifically offer this as a product, but of course this is something that we do play with and check out. And I mean, the easiest way to do it is that you can take an LLM and figure out what the probability is of the next word in a sequence. And if it seems too probable that the LLM would predict these words in this sequence, then it might be generated, long story short, right?
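The detection idea described here, scoring how predictable a text is under a language model, can be made concrete with a toy stand-in. A real system would use an LLM’s token log-probabilities; the character-bigram model below exists only to keep the sketch self-contained and runnable, and the "generated if too probable" decision rule is a simplification.

```python
# Toy illustration: texts that a language model finds unusually predictable
# (high average log-probability) are candidates for being machine-generated.

import math
from collections import Counter

def train_bigram(corpus: str):
    """Count character bigrams and unigrams from a reference corpus."""
    pairs = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    return pairs, unigrams

def avg_logprob(text: str, model, alpha: float = 1.0, vocab: int = 128) -> float:
    """Mean log P(next char | current char) with add-alpha smoothing."""
    pairs, unigrams = model
    logps = [math.log((pairs[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
             for a, b in zip(text, text[1:])]
    return sum(logps) / len(logps)

model = train_bigram("the cat sat on the mat. " * 50)
predictable = avg_logprob("the cat sat on the mat.", model)
surprising = avg_logprob("zq xv jk wq pf.", model)
print(predictable > surprising)  # → True: in-distribution text scores as more probable
```

Swapping the bigram model for a real LLM turns `avg_logprob` into the per-token perplexity check Bartmoss sketches: a suspiciously high score suggests the text sits exactly where the model itself would have gone.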

PD: So, when you find a text that seems to have been generated, do you then take it out of your training data, or how does that work?


BC: Well, for us that’s not really a consideration. I mean, whether it’s generated or not is really immaterial for us. This is just something out of interest. We do a lot of different research, and we have a lot of researchers and people who have a really heavy background in research and open source at LanguageTool when it comes to AI. And so, we do a lot of different types of experiments, and that’s some of it. And we do have partner companies we work with for a lot of these things. And we’re part of a larger group company, so these are a lot of things that we consider, but for us, it’s just more out of interest. What people are doing with the language, and whether it’s generated or not, is really immaterial to our business case. In the end, even the greatest, latest LLMs can make grammar mistakes. And I’ve seen that again and again. And so, they’re not foolproof themselves. Our job is just to make sure that we get it correct for you.

DD: And I can’t remember if you’re allowed to say this or not, but you are using open-source LLMs at LanguageTool, you’re not combining it with GPT via the API?

BC: No. We don’t use external APIs, especially for grammatical error correction. For one, this is a privacy concern, because a lot of our users are very privacy focused and our data generally stays within the EU. And like I said, we also don’t save just any data from our users; it’s only when it’s opted in from the website. We don’t want to pass that externally anywhere, because we have a high respect for the data privacy of our users, even towards third parties. And so, we keep that all in-house, the processing, the inference that is. And so, that’s something very important to us. Everything we’re doing is completely built internally. And we use a variety of different types of models, and we’re always experimenting with new ones.

DD: How are you dealing with the hallucination problem, when a correction is made and the model generates an output text?

BC: You do see these things as problems occasionally, and like I said, there are a lot of different ways to solve this. You can measure everything from the edit distance, like Levenshtein distance, to things like cosine similarity, to make sure you’re not changing the meaning. You can look at the types of edits by doing some sort of edit typing. You can use a variety of methods. And we do use a variety of methods to ensure that these models don’t just completely change an output for our users. And of course, that’s something we have to filter in the data that we get also; you have to be careful. Maybe another word would be better, but then that’s style. And we want to offer users the choice: if they want to have the style changed, then they can have that. But if they just want grammar, then we’ll just give them grammar. And we like to offer users a multitude of different ways of correcting, and we don’t want to mix it all together in one big bag of here’s what you get and we’re going to change your whole sentence. Unless that’s what users want; we can offer that too.
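The cosine-similarity check mentioned here can be sketched with a simple bag-of-words version. A production system would more likely compare sentence embeddings; this only shows the gating logic, and the 0.6 threshold is an arbitrary assumption.

```python
# Sketch of a meaning-preservation gate: accept a correction only if its
# bag-of-words cosine similarity to the source stays above a threshold.

import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def keeps_meaning(source: str, corrected: str, threshold: float = 0.6) -> bool:
    return cosine(source, corrected) >= threshold

print(keeps_meaning("he go to the market", "he goes to the market"))  # → True
print(keeps_meaning("he go to the market", "the economy is growing"))  # → False
```

Together with an edit-distance cap, this kind of gate separates "fix the grammar" from "rewrite the sentence", so style changes stay opt-in as described above.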

DD: I would like to take a brief moment to tell you about our quarterly industry magazine called The Data Scientist and how you can get a complimentary subscription. My co-host on the podcast, Philipp Diesinger is a regular contributor, and the magazine is packed full of features from some of the industry’s leading data and AI practitioners. We have articles spanning deep technical topics from across the data science and machine learning spectrum. Plus, there’s careers advice and industry case studies from many of the world’s leading companies. So, go to to get your complimentary magazine subscription. And now we head back to the conversation. Switching up to the second use case, Bartmoss, can you give us the overview of how you’ve been using chatbots for learning simulations with Hey LLaMA?

BC: Hey LLaMA uses LLMs as chatbots to help creators turn training material into AI simulations across a broad and diverse range of use cases. We’re talking everything from language learning, sales, leadership, education, et cetera. And this is something very interesting, embracing LLMs to actually help people learn. And I mean, that’s something I’m passionate about. And of course, the co-founders Dan and [inaudible 00:31:05] are absolutely passionate about it, ’cause I think they’ve been doing this now for over two years. And this is quite an interesting area, something that’s really evolving with LLMs, something you really couldn’t do before. And of course, at the onset it wasn’t really focused on chatbots. It was interesting because they were creating a lot of different AI-based games and other things to enable learning. But then they found that the chatbot feature was just the most sticky one.

And when you’re doing startups, you never know what’s going to be successful. It’s always a bit of a surprise to see what people actually like in the real world, right? And that’s how it became more focused on chatbots, based on user feedback, and it just kind of exploded. They’re relying on LLMs, using either prompting or fine tuning, to create simulations for users to be able to learn. And I mean, I’ve used it, for example, for job interviews, right? I wanted to practice for job interviews and things like that, and I started practicing and trying that out. That was an interesting use case. And I think it might’ve helped me get my job at LanguageTool, because I did simulate with that, and that was fun back in the early days when we were still trying to figure out those features.

It can simulate multiple personas. So, it’s not just like you’re talking to the AI, but it can simulate multiple personas and different types of situations. Of course, this is something you really have to watch out for at the same time, because you can be dealing with factual information here and LLMs can make mistakes, such as the reversal curse, which is something, Damien, you’ve brought up before. There are cases where LLMs are just not as intelligent as we think they are, and I think that’s something you have to watch out for.

DD: Can you just quickly explain the reversal curse?

BC: It’s, when you would have some sort of, I guess, easy logical conclusion that you could get. Maybe something where if A, then B, and maybe if B, then A. So, it goes in, let’s say both directions. But then it’s misunderstood in one direction, perhaps, and you say, well, it goes, it should work both ways. And then the model might not understand it. And so, you might say, A is B and B is A, and then you’ll say, well, what is B? And then it will say C or something like that. I don’t know how I, maybe that’s not a very good example. I don’t know. Damien, do you have maybe a better example for us or?

DD: Yeah. I think the example used in a recent paper is they ask the LLM who is Tom Cruise’s mother, and it answers perfectly, but then it can’t tell you who his mother’s famous son is. So, that would be, I think, an example of that. So, you mentioned prompting and fine tuning, so maybe talk about when that’s the best method versus retrieval augmented generation, RAG, and that whole interplay and how people should decide there.

BC: Yeah. This is a really big question, and I mean, of course you can do all of the above, obviously, but generally, this is something where you have to test. With retrieval augmented generation, that is anywhere you would inject some sort of information into a prompt with a query, to give some sort of context or information to the LLM. Now, you can do this on a really big level where you would use some sort of a vector store or vector database, where you have huge amounts of data and articles and you have to match with the query to actually extract out information and then prompt that further. That’s something that’s very popular nowadays, especially with chatbots. And you can do that. I would say the biggest use case there is something like open question answering systems.

You can use an API or you can use a vector store, something like that, and try to match the user’s query and then prompt that right back in. Hey LLaMA is doing that to an extent; they’re not actually using vector stores currently, because that’s not necessary. And that’s another thing with business: you don’t do things just because they’re cool, right? You have to stay business focused. And they are injecting information such as what is the scenario, what is the user’s name, things like that, into the prompt that they build for the model. And of course, generally when it comes to fine tuning, you have to understand that facts can change, things can change. And when you’re fine tuning a model, if that information can change, that might not be the best course of action.

You might not want to do that. But then again, maybe you want to fine tune generally on how the prompt is displayed or how it comes out. But you still want to give it slots for facts. And this is very similar, actually, it seems like history’s repeating itself. This reminds me of NLG engines back in the old days with slot filling and things like that. Before LLMs, this is kind of how you would handle things: you would write out text and then you would have little slots in it that you would dynamically fill programmatically. And it’s quite similar to that. And I think the revolution of course is in the LLMs themselves, but the techniques that we’re using are still quite similar to slot filling in the natural language generation engines of years ago.
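
The slot-filling pattern described here can be sketched as follows (the slot names, template text, and scenario are invented for illustration; this is not Hey LLaMA’s actual prompt):

```python
# A fixed template with named slots, filled dynamically at request time --
# the same pattern old NLG engines used, now applied to building LLM prompts.
PROMPT_TEMPLATE = (
    "You are role-playing as {persona} in a {scenario} simulation.\n"
    "The learner's name is {user_name}. Stay in character and respond "
    "to their last message:\n{user_message}"
)

def build_prompt(persona: str, scenario: str,
                 user_name: str, user_message: str) -> str:
    """Fill the template's slots; the result is what gets sent to the LLM."""
    return PROMPT_TEMPLATE.format(persona=persona, scenario=scenario,
                                  user_name=user_name, user_message=user_message)

prompt = build_prompt(
    persona="a skeptical hiring manager",
    scenario="job interview",
    user_name="Alex",
    user_message="Tell me about a project you are proud of.",
)
print(prompt)
```

Because the facts live in the slots rather than in the model weights, they can change per request without any retraining, which is exactly the trade-off Bartmoss raises.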

That’s, I think, what it really boils down to: what is your use case? Can you just fine tune a model? Like, obviously, with the previous business case we talked about, grammatical error correction, you can create a model or fine tune a model, or you can even use something new and splashy like LoRA, if you’re so inclined, if you want to train such a big model and you can’t fit it there. But when it comes to factual things, and especially things that can change over time, you definitely want to inject that information from some API or database into your prompt.
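
A minimal retrieval-augmented generation loop along the lines described above might look like this (a hedged sketch: the keyword retriever stands in for a real vector store, the documents are made up, and `call_llm` is a placeholder for whatever model API is used):

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    A production system would use embeddings in a vector store instead."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Inject the retrieved context into the prompt, so facts stay
    out of the model weights and can change without retraining."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "LanguageTool supports over 30 languages.",
    "The reversal curse describes LLMs failing to invert learned relations.",
]
prompt = build_rag_prompt("How many languages does LanguageTool support?", docs)
print(prompt)
# The prompt, with the matching document injected, would then be sent to
# the model, e.g. answer = call_llm(prompt)  -- call_llm is a placeholder.
```

Swapping the toy retriever for an embedding search changes the `retrieve` function only; the prompt-injection step stays the same.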

PD: I mean, you’ve talked a couple of times about LLMs and that they don’t come off the shelf. There are, of course, many different ways of adjusting an LLM and making it perform better. You mentioned training it, fine tuning, but then of course you can also do a lot of prompt engineering, you can build agents and tools and so on. What is your practical experience in that space? What has the most impact, and what does your workflow look like? What do the steps of the process look like to make an LLM perform on a specific problem that you have?

BC: In so many ways, it’s becoming easier and easier because there are so many tools available that just didn’t exist before. And I found myself writing a lot of code and building something from scratch just to find out that there was an amazing open-source tool or something out there that I didn’t know about. And then someone sent me the repo and I kicked myself going, why didn’t I just use that? I think one of the more popular ones out there, for example, when it comes to these kinds of things is LangChain. That’s something that a lot of people are using nowadays. That’s just one of them; there’s a lot of other tools available out there. And I think, in the end, the improvements of the LLMs themselves, whether open source, closed source, API based, et cetera, that’s great.


But in the end, I think it’s the tools that make a difference. And we didn’t see a lot of great tools back in the old days with natural language generation, obviously. There was some stuff from universities and you kind of had to build it all yourself. And nowadays these tools are readily available and it makes it so easy. You don’t really need to go down to a deep level and do everything by hand anymore. And I’ve seen people picking up and coding their own chatbots with low-code, no-code solutions. And I think that’s where a lot of the big boost comes from, really, in making it easy for people who maybe aren’t AI researchers of many years to be able to build something and just push the button and get it working.

DD: Can you give us an overview of the downfalls of relying on the external APIs and the closed source models?

BC: Absolutely. I mean, when it comes to using an external API, it’s quite easy to start off and prototype, and I think that’s very good. But there are questions, of course, of data privacy, which is a big one there. What data was used on these models, how were they trained, things like bias. And you don’t really have as much control when you’re relying on them. And some of the biggest issues I see as an advisor to companies are the latency and the cost. And I think, especially costs, this is something that’s greatly underestimated by companies. I see this time and time again when people hit me up for micro consulting or general advisory work: they don’t really understand the cost involved, especially if you actually have a feature that’s a hit, and you can see that there.

And at the same time, there are a lot of user complaints on latency, of this is taking too long to get a response. And you do see that quite often. A very good example of that would be, actually, going back to LanguageTool, we created a system for rewriting, or rephrasing, or paraphrasing. There are a lot of terms for that, but in the end, it’s taking a style such as formal and you take a sentence and you make it more formal, for example. That’s a concrete example of that. And we started off using OpenAI for that just to prototype it and see how it would work, if users liked it. Because like I said, you never know what people want out there.

And that’s why I stay out of the product department, I’ll stick to AI. I think AI is easier than product in the end, because you never know what people want. And it turns out that people really loved this rephrasing, rewriting feature, so much so that we saw huge, huge growth in users and the usage. And we celebrated on one side, but on the other side we were thinking, oh, the costs are going up. And sooner or later that hit a red line, and it hit the red line a lot quicker than we expected it would. And that’s when the CEO calls you up and says, well, what are we going to do about this? You’re the head of AI, do something. And I go, okay, I’m on it. And we realized we needed to make our own in-house models.

In-house is the term we use to try to differentiate from using external APIs for LLMs. And this is something that we had to, of course, develop very quickly. And we actually really found that when you get the right data, you get the right model, and you get the right service for hosting this with GPUs, the cost is tremendously less, the latency is a lot less. And when you’re doing really big numbers every month, it really cuts it down to almost nothing in comparison. But the craziest thing is, when we benchmarked this using A/B tests, we found that users preferred our models over the very large LLM models that we were using with external APIs. And we actually used several external APIs for this testing.

And yeah, it seems that a lot of times if you take models and make your own models or fine tune models, they can have a smaller number of parameters, but at the same time, they can perform better because they’re very domain specific for a task. And I think that’s where most of my bread and butter comes from, from that experience. And that’s also what we are doing with Hey LLaMA. And I advise a lot of companies on this. I hope I’m not giving all my secrets away and people will still give me money for this. But in the end, it’s really about getting people off that addiction to external APIs, which can be very costly and have a lot of latency that users complain about, to creating very, very specific models for the use case that they have. It’s just like the NASA slogan of the nineties: faster, better, cheaper.

And that’s what we’re really focused on there. And that’s exactly what we’re doing as an AI strategy with Hey LLaMA. And that’s why it creates profitability for companies, because one thing I hear a lot with companies is they want to grow their user base and they don’t care about profitability, and they end up spending so much money. But I think you can also be profitable at the same time if you cut down those costs. And so, I think that’s a big advantage to a lot of these open-source models, that you can run them yourself for business use cases. Not all of them, you have to watch out for those licenses, but you can run them yourselves and you can save money. Now, of course, at the same time, a lot of these closed source ones, like I said, they’re so much easier to use, and I highly recommend them for rapid prototyping and seeing what would be sticky with your users. I think that’s a great use case. But in the end, it’s a question of time and money.

DD: Sorry. You were originally using GPT-3 then?

BC: Yes, yes. We were using GPT-3 originally when that came out back then. And that’s what we were using originally for this service. And of course, we also had to create a special agreement with our users for this feature to opt in, because we were passing that data to a third party. And that was a consideration for us. And I mean, in the end, we also used a lot of other resources available there. We also used a company called Aleph Alpha, and we worked with them. And when GPT-4 came out, we tried using that too, for the same use case. We did fine tuning with GPT-3, because a lot of people seem to overlook that you can also fine tune those models and you don’t just need to do prompting with them. And a lot of different types of tests, a lot of different things we tried out there. But in the end, we found that our in-house models actually work better, both for users and also in our internal tests that were double-blind with our language experts.

DD: And would you say that the cost differential from OpenAI’s models to your own model is what, 10 to 1? 5 to 1?

BC: Oh, gosh. It’s massive. It’s massive. I mean, honestly with our users the way they were using this feature so heavily and how many more users kept signing onto it, it would’ve just been a matter of time before the company wouldn’t have been able to pay for that anymore, I think, honestly. And I would say, let’s say something like 10 to 1 maybe.

PD: And how much, you mentioned already that reducing the parameter space is key. How about reducing the amount of training data? Is that something that you guys tried out, like shifting to higher quality data but less training data? Yeah?

BC: Absolutely. You do have to have a balance there. And I mean, it’s similar to what we talked about with our first business case of general grammatical error correction. It’s always about the quality of the data. But with rephrasing it’s a bit trickier, because what is more formal, what is less formal, how do you define the formality of a sentence? How do you define whether a sentence is simpler, or any sort of style? That can be quite subjective. And it’s hard to get agreement between people about sentence formality. And so, in the end it’s really, really heavily based on user acceptance and A/B testing with a large swath of users. We do, of course, internally check these things out, but in the end, yeah. It’s quite hard to measure these things.

PD: Makes sense, huh?

DD: So, Bartmoss, before we wrap up, do you want to quickly give an overview of your open-source community?

BC: I would love to. This is a big passion for me. And I mean, we all do this in our free time, this isn’t like a work thing for us. At Secret Sauce AI, we’ve been doing that for, gosh, I guess a couple of years now. And we’re mainly focused on voice assistants, but any type of AI, and we really focus on trying to help open-source developers understand and use NLP and create tools for them to better use NLP in their own open-source projects. We are a conglomerate of many different projects. There’s of course, Mycroft, which I think is the biggest one. We have a lot of people from the Mycroft community. I’d actually love to give a special shout out to JarbasAI. He’s kind of the heart of that community.

There’s OpenVoiceOS, which is connected to that. There’s Leon AI, Sapphire, Athena, Lily, a lot of different little projects out there. And we all kind of come together and just hack it out whenever we have time. And unfortunately, we don’t always have so much time, but we squeeze in our time here and there to work on our different open-source projects. And if anyone is ever interested in doing a collaboration, you can reach out to me. We always welcome that, whether it’s ASR, NLG, NLU, TTS, even computer vision, generation of images, things like that. We work in all kinds of areas and we love to teach people and build tools to support the open-source community and further that.

DD: Awesome. And we will put the links for that into the show notes. So, sadly, that concludes today’s episode. Just before we leave you, I want to quickly mention our magazine, The Data Scientist. We will feature a version of this conversation as an article in a forthcoming issue of the magazine. The magazine is packed full of articles detailing what some of the world’s most successful companies are doing in relation to data and AI, and you can subscribe for free at So, Bartmoss, thank you so much for joining us today. It was an absolute pleasure and it was an amazing conversation.

BC: The pleasure’s all on this side, Damien and Philipp. Thank you so much for inviting me and having me here today.

DD: And thank you. Philipp, thanks to you.

PD: Thank you.

DD: Do check out our other episodes at, and we look forward to having you with us on the next show [upbeat music].