Field Notes

AI LQA Reality Check

Stephanie Episode 3

AI LQA sounds like the shortcut every localization team wants, but the real story is more nuanced and a lot more useful. We sit down with Erik Vogt to define AI-driven linguistic quality assurance and quality estimation in plain terms, then follow the practical question that matters: how do you use AI to review more content without blowing up cost, time, or trust? If you manage translation quality in a TMS or CAT tool environment, this conversation gives you a grounded map of what works today and what still breaks.

We dig into the most common AI LQA use cases: scoring segments so you can skip “likely good” content, isolating the worst segments so reviewers spend time where risk is highest, and using QE as an early go or no-go signal. Erik explains why the human baseline is messy too, including the reality of reviewer disagreement under MQM-style frameworks, and why AI’s consistency can still speed up human review even when it cannot match human judgement end to end. We also talk about the impressive results teams sometimes see when the AI has the right glossary and guidance, and why false positives can quietly erase those gains.

From there, we get tactical: why turnkey solutions often disappoint, how to break QA into narrow sequential checks, and how prompt engineering and tuning can improve reliability across languages. We close with what to expect next, including faster throughput, more transparency around AI compute costs, and better comparative data on foundation models for localization quality workflows. If you’re evaluating AI LQA tools, subscribe, share this with your localization team, and leave a review with the biggest question you still have about automated translation QA.

Why AI LQA Matters Now

Stephanie Harris-Yee

Hi, I'm here with Erik Vogt to talk about a subject that's top of mind for many in the localization industry, and that is AI LQA.

Defining AI LQA And Quality Estimation

Stephanie Harris-Yee

Now, Erik, for those who are just exploring this topic, can you give a brief explanation of what we mean when we say AI LQA?

Erik Vogt

Sure. I think there's a broad group of categories here. There's quality estimation, there are tools that identify where quality issues might occur, and there are lots of different ways of approaching this. Under the hood, a lot of internal capabilities are involved. For example, when you use a tool like Grammarly while typing in English, it is essentially doing QA for you. There are also lots of built-in products that help with where the commas go and whether your spelling is right. Things like that are built into the authoring environment, just like they're built into the translation environment. So some of this is implicit, and a lot of the TMS and CAT tools out there are designed to incorporate some of these QA capabilities.

But I think we're talking about something else, which is: can you use it after that process is done? Is there a tool that can take another look and find where there might be some issues? This can be especially helpful when doing updates on large projects where you don't really have the budget to review everything. You want a tool that will find issues more efficiently for you.

So there are QE capabilities, which really ask: how likely is this to be right? The tool attaches a score or a category to the content, like red flag or not red flag. This category of products, and there are several very strong ones on the market, has a few particular use cases. One is to get an overall assessment: how is this overall, and what is our general guess as to whether this is a go or no-go the way it is? Another applies if you have a limited budget: can you skip reviewing the segments that are likely good?

Some of the case studies out there show you could eliminate 30 to 40% of the content from review, maybe more, if it scores above a certain threshold. That cuts that much out of your human labor, and that can be a very powerful motivator. On the flip side, you could also say, no, just show me the bad ones, because I have a finite capacity and I want to focus my human attention on what is most likely the worst content. So each of these ways of thinking about the value proposition either focuses on cost reduction, treating high-scoring content as safe, or focuses on the best use of human attention on what is worst. Now, here's the interesting part.
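As a rough illustration of the two triage modes Erik describes (skip the likely-good segments, or surface only the worst ones), here is a minimal Python sketch. The `Segment` class, the 0.0 to 1.0 score scale, the 0.9 threshold, and the sample scores are all assumptions made for the example; they do not come from any particular QE product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seg_id: str
    qe_score: float  # hypothetical scale: 0.0 (likely bad) to 1.0 (likely good)


def triage(segments, skip_threshold=0.9, review_budget=None):
    """Split segments into those skipped and those sent to human review.

    Segments scoring at or above skip_threshold are treated as likely good.
    The rest are sorted worst-first; review_budget optionally caps how many
    of them a reviewer will see.
    """
    skip = [s for s in segments if s.qe_score >= skip_threshold]
    review = sorted(
        (s for s in segments if s.qe_score < skip_threshold),
        key=lambda s: s.qe_score,
    )
    if review_budget is not None:
        review = review[:review_budget]
    return skip, review


# Made-up scores, for illustration only.
segments = [
    Segment("s1", 0.97),
    Segment("s2", 0.42),
    Segment("s3", 0.91),
    Segment("s4", 0.75),
    Segment("s5", 0.60),
]
skip, review = triage(segments, skip_threshold=0.9, review_budget=2)
print([s.seg_id for s in skip])    # ['s1', 's3']
print([s.seg_id for s in review])  # ['s2', 's5']
```

Where the threshold sits is the real decision: a lower threshold skips more content and saves more labor, at the risk of letting more issues through unreviewed.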

The Human Baseline And Variability

Erik Vogt

If you have a capability that can detect what's wrong, why not just have it fix everything? Just have it do it all together. Therein lies where things start getting a little interesting, because as we think about the capabilities of a QA system, what's your baseline? What are you comparing against? Generally, you're comparing to a human. So let's look at human variability. If you send LQA to different reviewers using the exact same instructions and rules, we see that there's generally about a 0.6 correlation between their outputs using an MQM methodology. That's a number that I've heard mentioned from multiple sources, and I think it's probably true-ish.

Where is the difference? It is often in what the reviewer finds. I've done tests myself where I've tried to find the errors. You can give people a test where you know how many errors there are, have them do a standard LQA, and really good reviewers will find most of them, but different reviewers will miss different ones. In other cases, they might classify them a little differently and put them in different categories: is it terminology or is it accuracy? Or there's a difference of opinion about severity. So there's a certain preferential dimension and a certain thoroughness dimension.

AI has a couple of advantages: it's pretty consistent, it doesn't get tired, and it's gonna find what it's gonna find pretty fast. So it can be a powerful tool to help accelerate the human LQA process, which is a cool way to think about it. Now, AI is not as good as the human. When you take an automated AI LQA and correlate it with what you know the human standard is, the typical correlation is about 0.4.

So it's good, but it's gonna facilitate, not replace, humans. We can have a debate about whether it's gonna get to 0.6 next year or ever, but that's where things are right now. Just to finish that thought, the quality target here is often to get up to the equivalent of a human or better, and that's often a function of the amount of data available to the model you're doing that LQA with. Sometimes that needs to be equal to or greater than the amount of data that the MT or other tools had to start with. So it's a lot more metadata, it's a lot more training. It's the same story we've worked on for years, trying to figure out how to make MT the best. The LQA model needs a lot more data to be better than the MT or the humans that it's trying to judge. So I think some tools out there take the approach of making the tool available in the LQA environment to help facilitate the human review. Then you end up with the classic "human plus AI is better than the alternative." But as I mentioned, there are other use cases in there to consider as well. So I hope that gives you a little bit of an idea of the big picture for the whole QE and LQA capabilities.
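To make the 0.6 and 0.4 figures concrete, the sketch below shows one way an agreement number of this kind could be computed: a Pearson correlation between two reviewers' per-segment penalty totals. The transcript does not say which correlation statistic the cited figures use, so Pearson is an assumption here, and the penalty values are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented per-segment penalty points from two reviewers of the same content.
reviewer_a = [2, 0, 5, 1, 3, 0]
reviewer_b = [1, 0, 4, 2, 3, 1]
print(round(pearson(reviewer_a, reviewer_b), 2))
```

The same function can compare an automated pass against a human reference, which is the comparison behind the 0.4 figure Erik mentions.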

How Accurate AI LQA Really Is

Stephanie Harris-Yee

Yeah. So maybe we already covered this, but what is the state of the research like now? How good is it? Is this something that people who want to test it out can do and expect a reasonably good result? Or are there still vast fluctuations in how things are coming out in the end?

Erik Vogt

Some of the tests we've done have been incredibly encouraging. In some of them we were catching 90 to 99% of the issues, which is very, very positive, especially for a very good starting point. The AI is capable of doing some really cool things. Again, this is provided it has the correct glossary and the correct guidance. So it's theoretically possible. That having been said, there are other things to think about, like the level of false positives: it might find all the issues, but how many false positives are there, and what's the effective effort of clearing all the flags that the AI thinks could be issues but aren't? There are other issues, such as variation between languages. I think East Asian languages pretty reliably have a harder time meeting the same standards than the Western European languages, which are linguistically more similar and also have a ton more data, so it tends to be more effective there. We run pilots that take about a week to do all the setup and conduct an effective review. Other systems may be able to do pilots in different amounts of time, depending on what their approach is. Some of them are ready to test in a generic version right out of the box. I think TAUS provides a model with estimates that's pretty much ready to go. And there are ways of improving it with the appropriate training and instructions.
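The false-positive concern can be put into numbers with a short sketch. Assuming a pilot where the true issues are already known, it computes how many were caught (recall), how many flags were real (precision), and the time spent clearing every flag. The segment IDs and the 1.5 minutes per flag are invented for illustration.

```python
def flag_review_stats(true_issues, flagged, minutes_per_flag=1.0):
    """Summarise an AI LQA pass against a known set of true issues.

    true_issues and flagged are collections of segment IDs.
    minutes_per_flag is an assumed average time to confirm or dismiss a flag.
    """
    true_issues = set(true_issues)
    flagged = set(flagged)
    true_positives = len(true_issues & flagged)
    false_positives = len(flagged - true_issues)
    recall = true_positives / len(true_issues) if true_issues else 1.0
    precision = true_positives / len(flagged) if flagged else 1.0
    return {
        "recall": recall,
        "precision": precision,
        "false_positives": false_positives,
        "minutes_to_clear": len(flagged) * minutes_per_flag,
    }


# Invented example: the AI catches all 4 real issues but raises 12 flags.
stats = flag_review_stats(
    true_issues=[3, 8, 15, 21],
    flagged=[2, 3, 5, 8, 9, 11, 15, 17, 21, 30, 41, 44],
    minutes_per_flag=1.5,
)
print(stats["recall"])               # 1.0
print(round(stats["precision"], 2))  # 0.33
print(stats["false_positives"])      # 8
print(stats["minutes_to_clear"])     # 18.0
```

In this made-up case every issue is found, yet two thirds of the reviewer's clearing time goes to flags that were never problems, which is how a high catch rate can still lose its savings.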

False Positives, Languages, And Tuning

Stephanie Harris-Yee

So it sounds like there are still some limitations right now on what you would even recommend someone use this for. Are there any limitations that we haven't covered that you would warn people against? I know we talked about how the language will make a big difference, if it's Japanese or English, things like that. But are there any other outlying factors that people should be aware of?

Erik Vogt

One thing we recognize is that models do have things that make them make mistakes. We think about it as hallucinating, or just errors. Having worked with this a lot, it is hard not to personify these models a little bit, because I think of them as kind of like children in some ways. I find it fascinating that I'll give the same instructions to the AI multiple times, and it will come up with different, even fundamentally different, ways of responding. It's almost like it's experimenting to see what's going to make me happy.

So what's fundamentally critical is to break down the problem into chunks. People use the word agentic, but we also think about it in terms of tasking AI to do narrow, specific things one at a time. You put those in sequence, and then you're gonna get to a better result. It's not a nebulous brain like humans have; you can't unpack our brains and have individual subprocesses work on a problem. But here we can break down a problem and work on one part of it at a time. So you might focus on glossary first, then you might look at style, then you might look at something else. We can string together any number of different tasks or agents to produce the desired outcome.

The second thing I'll say is that tuning is critical, and tuning manifests in different ways, from the input instructions to the prompts themselves. We have a kind of dynamic prompt mindset: we automatically have the AI create the prompts that it uses, based on the content that it's evaluating. But I think it's really critical to see how that works and then revise and cycle through some iterations to make sure that you're getting the most from that output. I guess to answer your question much more succinctly: be careful of turnkey solutions that work out of the box with no effort involved. It's just gonna do its thing. It's probably gonna be less than what you could get if you provided more appropriate support and fine-tuning to ensure that you're getting the best possible outcome at this exact time.
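The "narrow checks in sequence" idea can be sketched as a small pipeline. In the minimal Python example below, each check looks at one thing only and the checks run in order, glossary first and then style, as Erik suggests. The checks are plain rule-based functions standing in for whatever a real system would use (for example, one model prompt per check); the function names, the glossary entry, and the sample sentences are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class Finding(NamedTuple):
    check: str
    message: str


# A check takes (source, target) text and returns zero or more findings.
Check = Callable[[str, str], List[Finding]]


def make_glossary_check(glossary: Dict[str, str]) -> Check:
    """Build a check that flags glossary terms missing from the target."""
    def check(source: str, target: str) -> List[Finding]:
        findings = []
        for src_term, tgt_term in glossary.items():
            if src_term.lower() in source.lower() and tgt_term.lower() not in target.lower():
                findings.append(
                    Finding("glossary", f"expected '{tgt_term}' for '{src_term}'")
                )
        return findings
    return check


def punctuation_check(source: str, target: str) -> List[Finding]:
    """A stand-in style check: source and target should agree on '?'."""
    if source.rstrip().endswith("?") != target.rstrip().endswith("?"):
        return [Finding("style", "question mark mismatch between source and target")]
    return []


def run_checks(source: str, target: str, checks: List[Check]) -> List[Finding]:
    """Run each narrow check in order and collect every finding."""
    findings: List[Finding] = []
    for check in checks:
        findings.extend(check(source, target))
    return findings


# Hypothetical glossary and segment pair, for illustration only.
checks = [make_glossary_check({"dashboard": "tableau de bord"}), punctuation_check]
findings = run_checks("Open the dashboard?", "Ouvrez le panneau.", checks)
for f in findings:
    print(f.check, "-", f.message)
```

Keeping each check narrow makes it easier to see which step produced a finding, and to tune or replace that step without touching the others.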

Stephanie Harris-Yee

Okay.

What Changes Next In AI LQA

Stephanie Harris-Yee

So what do you see, or what do you think we will see, happening in the near future with this technology?

Erik Vogt

It's gonna keep getting incrementally better. I think the operators will get better at deploying these capabilities in more efficient and effective ways. Efficiency is one thing that factors in more at scale. Think about the compute load for each of these tasks: how efficiently is the prompt conveying the instructions you're trying to get across? There is a cost associated with it, there's time associated with it, and I think we're seeing faster throughput. Another thing I'd expect is that tech and AI costs may start showing up in proposals as we start thinking about the cost of these. It's still relatively minor compared to the human costs, so often it's just loaded in there and included in the weighted word. But when you start moving to orders of magnitude higher scale, it starts to become more of an issue. That's one.

The second big trend, I'd say, is that the foundation models that most of our capabilities are built on top of are themselves evolving. I'm loving the increased interest in quantifying the relative value of each of these. For example, Intento does their state of AI report, which compares, at a very high level, the relative strengths of different machine translation models. It's very general, just to give you an idea. I imagine we'll start seeing more comparative, generic sorts of analysis out there that show us the relative strengths of the different foundation models. But really, these are just entry points. They're just the beginning of the conversation. Any given customer's target is a unique situation, and it requires very careful consideration of the prompt optimization, the optimization of compute, and the optimization of which number of steps yields the best possible outcome.

So these are all iterative and in some ways a little trial and error, but in general, you're getting to the right answer faster. That's the trend that we're seeing. Yeah.
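Erik's point that compute cost grows with scale and with the number of steps can be shown with back-of-the-envelope arithmetic. The Python sketch below is a deliberately crude estimate: it ignores output tokens, prompt overhead, and retries, and every number in it (tokens per word, number of steps, price per 1,000 tokens) is a placeholder assumption rather than a real price.

```python
def estimate_ai_cost(words, tokens_per_word, steps, price_per_1k_tokens):
    """Rough AI review cost: every step re-reads the full content once.

    All four inputs are assumptions supplied by the caller; real token
    counts and prices depend on the language, the model, and the vendor.
    """
    total_tokens = words * tokens_per_word * steps
    return total_tokens / 1000 * price_per_1k_tokens


# Placeholder values: 100,000 words, 1.5 tokens per word, 3 sequential checks,
# and 0.01 currency units per 1,000 tokens.
cost = estimate_ai_cost(
    words=100_000, tokens_per_word=1.5, steps=3, price_per_1k_tokens=0.01
)
print(round(cost, 2))  # 4.5
```

The estimate scales linearly with both volume and step count, which is why a cost that looks negligible in a pilot can become a visible line item at orders of magnitude higher volume.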

Stephanie Harris-Yee

Okay. Well, great.

No Magic Bullet Yet

Stephanie Harris-Yee

Well, thank you, Erik. I think this has given a really good overview of where we are right at this moment and what we should be keeping an eye on as things develop into the future. So thank you so much. Any last things you want to add?

Erik Vogt

This is a really exciting time to be in this industry. It's really cool. I think we're seeing more and more interest in this ourselves. Like I said, some of the pilots, some of the experiments, are startlingly good. At other times, there are surprising mistakes. So it's an evolving story here, but it's a fun time to be exploring this as we learn a new language around this different approach to how we use AI to get the best possible outcome. There's no guaranteed magic bullet yet, but anybody who isn't using these tools is leaving a considerable amount of potential on the table.

Stephanie Harris-Yee

Yeah, yeah. All right. Thank you so much.

Speaker 1

Sure. Thank you.