Field Notes

The Governance Problem

Stephanie Episode 9

AI translation has never looked better on the surface, yet plenty of teams still can’t make it work reliably in production. We dig into the uncomfortable reason: large language models are probabilistic systems, so the failure modes shift from obvious “bad machine translation” to believable, fluent mistakes that can quietly change meaning, introduce the wrong product definition, or slip in biased or hallucinated details. That’s where governance becomes the difference between a clever demo and a scalable localization program.

We walk through three layers of AI localization governance we can actually use: model selection (choosing the right model for the right domain, balancing quality, latency, and cost), model grounding (feeding the model authoritative terminology, product knowledge, regulatory context, and trusted sources via approaches like RAG, terminology databases, and knowledge graphs), and risk-based workflow governance (tiering content so high-risk text gets the right human oversight while low-risk content doesn’t get over-reviewed).

We also get practical about orchestration: when humans should intervene, which subject matter experts you’re paying for, what “failure” looks like in your metrics, and how to build feedback loops, exception handling, and rework paths that reduce redundant QA cycles. If your localization team is feeling margin pressure, this conversation connects governance to business value and shows how smarter KPIs change by content risk. Subscribe, share this with your localization or AI ops team, and leave a review with the governance question you’re wrestling with right now.

Why AI Still Breaks Workflows

Stephanie Harris-Yee

Hello, I'm Stephanie, and I'm here again with Erik to talk about the latest and greatest in the localization and AI space. Today we're going to be talking about something called governance. That's a big word, and we're talking about it specifically in the AI localization field. But let's set some groundwork before we jump in. Erik, AI translation has improved dramatically with large language models, which now often outperform traditional machine translation. They can also be customized in ways that are really incredible, I think. But you have been saying that companies are still struggling to operationalize AI in localization. So if the models are getting better, what's actually going wrong here?

Erik Vogt

There's plenty of discussion about how the models are improving, and we have good data to measure that improvement. I'm not going to go into edit distance reports or other quantification of how these systems are improving. It is clear that they are improving, but the base models themselves still have limitations. LLMs give a lot more control over things like tone and audience adaptation. There's more terminology control, and there are domain-specific prompts, but they're still probabilistic systems and they're still going to have a certain error rate. They're not deterministic: just because they did it once doesn't mean they'll always do it. There can be semantic drift, where the way the model interprets meaning changes over time. And famously there are things like bias, hallucinations, incorrect product definitions, and rewriting things in new and unexpected ways. None of this really surprises anybody; we've been talking about it a lot. But what we're seeing is the question of how the system reacts to changes in this workflow. The types of errors are changing into believable mistakes, things that read as perfectly grammatically correct sentences. So how do we govern that? How do we govern our human review process, our measurement ecosystem, and our risk profiles around the use of these models? In general, even though the models are getting better, I think we're underinvesting in attention to their governance. That's my starting thesis here.

Stephanie Harris-Yee

Okay.

What Governance Means In Practice

Stephanie Harris-Yee

So when you say governance, what does that mean in the basic sense? Are we just talking about workflow policies, or are we talking about some more technical things? What's the definition of governance in this case?

Erik Vogt

It definitely has nothing to do with parliament, nothing to do with democracy, and nothing to do with current events. But governance is actually a complex and interesting field. There are a lot of different layers to it, but I'll talk about three of them right now. One

Model Selection And Tradeoffs

Erik Vogt

of them is model selection. Different models behave differently: there's a trade-off between latency and quality, and they perform differently depending on the domain. So there's an element of simply deciding which model you use for which application. Those decisions are control points. Second is model

Grounding With RAG And Terminology

Erik Vogt

grounding. I'm using this term to cover a wide range of things you can do to help the model perform better. It's not just prompt engineering; it also means supplying the model with the correct terminology authority, product knowledge, and regulatory context. These are all things that a lot of LLM products can build into their ecosystem: there's a place where you can put this material, and the model will perform better. So think RAG, think terminology databases, think knowledge graphs, think product documentation. All of these can be used to inform a stronger LLM output, sometimes astonishingly good, if those root resources are in fact valid. So, note to self, those need to be the correct resources. They have to be coherent and they have to be relevant. Governance over which resources you supply to which projects becomes an important part of this. The model interacts with enterprise systems, retrieves authoritative data, and then verifies that information before generating text. That's the way it works. So the best AI systems are not standalone models. Most people just send it to ChatGPT and get it back again, or whatever. That's not the way it ideally should work, especially with high-risk corporate content. These systems are connected, and governance is the framework by which you put all this together.
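Editor's note: one small piece of the grounding idea above, checking output against a terminology authority, can be sketched in a few lines. This is a minimal illustration and not a method described in the episode. The `TermEntry` type, the `missing_terms` function, and the sample glossary are all hypothetical, and the plain substring match is an assumption that ignores inflection and word boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    source: str  # source-language term
    target: str  # approved target-language term


def missing_terms(source_text: str, translation: str,
                  glossary: list[TermEntry]) -> list[TermEntry]:
    """Return glossary entries whose source term occurs in the source text
    but whose approved target term does not occur in the translation.

    Assumption: case-insensitive substring matching is good enough here.
    """
    src = source_text.casefold()
    tgt = translation.casefold()
    return [
        entry for entry in glossary
        if entry.source.casefold() in src and entry.target.casefold() not in tgt
    ]


# Hypothetical English-to-French glossary.
glossary = [
    TermEntry("dashboard", "tableau de bord"),
    TermEntry("workspace", "espace de travail"),
]

issues = missing_terms(
    "Open the dashboard in your workspace.",
    "Ouvrez le tableau de bord dans votre zone de travail.",
    glossary,
)
for entry in issues:
    print(f"approved term missing: {entry.source!r} -> {entry.target!r}")
```

The sample translation is fluent but uses "zone de travail" where the glossary requires "espace de travail", which is the kind of believable mistake a check like this is meant to flag for review.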

Risk Based Review And Content Tiering

Erik Vogt

And that gets to my third point, which is risk-based workflow governance. Not every piece of content needs to be reviewed the same way, but you need to keep track of which content went through which of these tasks. When you have, say, TMs that you created with NMT or with a lightweight review, you don't want to leverage those TMs into high-risk content. So you keep track of your different tiers. I think the smartest deployments out there are taking content tiering into consideration, with each tier tagged with a certain risk profile. Then you balance the cost and benefit of selective human oversight for each of those. Again, governance is all about keeping track of all this and controlling which content goes through which track and who's talking about it. There are a whole bunch of different aspects to this, but those are really the main ones. So what you're really trying to do in terms of creating value, and we're going to get to that in a second, is this: you don't want to over-review low-risk content and you don't want to under-review high-risk content.
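Editor's note: the rule about not leveraging lightly reviewed TMs into high-risk content can be expressed as a simple provenance check. The sketch below is illustrative only. The three tier names, the `REVIEW_POLICY` wording, and the `reviewed_for` field are assumptions made for the example, not details given in the conversation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


# Hypothetical review depth per tier.
REVIEW_POLICY = {
    Risk.LOW: "automated checks only",
    Risk.MODERATE: "sampled human review",
    Risk.HIGH: "full review by a subject matter expert",
}


@dataclass(frozen=True)
class TMEntry:
    source: str
    target: str
    reviewed_for: Risk  # highest tier this entry's review process was designed for


def can_leverage(entry: TMEntry, content_risk: Risk) -> bool:
    """A TM entry may only be reused in content at or below the tier it was reviewed for."""
    return entry.reviewed_for >= content_risk


def usable_matches(tm: list[TMEntry], content_risk: Risk) -> list[TMEntry]:
    """Filter a translation memory down to entries that are safe for this tier."""
    return [entry for entry in tm if can_leverage(entry, content_risk)]


tm = [
    TMEntry("Take one tablet daily.", "Prendre un comprimé par jour.", Risk.HIGH),
    TMEntry("Take one tablet daily.", "Prenez une tablette par jour.", Risk.LOW),
]

print(REVIEW_POLICY[Risk.HIGH])
print([entry.target for entry in usable_matches(tm, Risk.HIGH)])
```

Storing the review tier on each TM entry is what makes the rule enforceable: the filter can run automatically before any match is offered to a high-risk job.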

Stephanie Harris-Yee

So, yeah, let's talk about value then. I guess you can look at value in a couple of different ways. What are we talking about in this case? And how is that value coming out specifically in this orchestration process, versus, say, AI and LLMs? We're using those because they're cheaper and they're faster, so that's one sense of value. But how is this other side of value coming out here?

Orchestration And Quality Signals

Erik Vogt

So let's first look at what orchestration actually is. Which model do you use? What information are you asking it to retrieve? When should humans intervene? And to some extent also, which humans should intervene? That is relevant to our conversation about subject matter expertise. Whose experience are you paying for to review this? What are you tasking them with? What control do you give them? That's also part of it. Then there are the quality signals: how are you going to measure what a failure is and get an idea of how much it costs? And there are the learning feedback mechanisms: how do you improve models? How do you improve workflows? How do you identify when something's gone off, and how do you fix it? These are some of the areas that an orchestration conversation should cover. How are you controlling all these different aspects? Exception management, failure, rework workflows, change management, all of that. So the coordination is really critical, right? Why? What is the business objective? What is the value that this

Business Value And KPIs By Risk

Erik Vogt

creates? I think one is that we're using human attention wisely. If we have the right people looking at the right content and using their time effectively, it means not wasting it on non-value-adding tasks whenever possible. Second, you really want to reduce the redundant QA cycles. A lot of these are very fragmented, but some are very formal. And QA cycles cost time and money and hold up releases. It's a pain. So how do you detect, fix, and prevent from recurring the types of errors that are showstoppers? There's also the centralization of visibility across workflows. Imagine, let's say, we have three different workflows: high risk, moderate risk, and low risk. Each of these workflows has its own tracks, its own characteristics, and its own control mechanisms, as all these orchestration elements would tell us. So you want to know how you are doing against different criteria for each of these tracks. How many critical errors did you have in the high-risk content? Your number one KPI there might be the number of critical fails in your high-risk content. Whereas your low-risk content might be optimized around total cost per word or even total product cost, because now you can generate content just like, poof, in a few prompts you've got an entire article. So how well are you delivering to that business value? Word rate may be irrelevant. Or maybe it is just word rate: you're trying to get from two cents a word down to one and a half cents a word through prompt optimization, good token management, and things like that. In general, this is hopefully giving an idea that there are fundamentally different ways to think about these, and that if you do have a risk-based model for how you route work, then you will come up with different metrics and different orchestration parameters for each of these models.
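Editor's note: the idea of steering each risk tier by a different KPI can be made concrete with a small aggregation. This is a sketch under stated assumptions, not a reporting scheme from the episode. The `Job` record, the tier labels, the `PRIMARY_KPI` mapping, and all sample figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    tier: str             # "high", "moderate", or "low" (hypothetical labels)
    words: int
    cost: float           # total cost of the job, in any one currency
    critical_errors: int  # critical fails found in review


def kpis_by_tier(jobs: list[Job]) -> dict[str, dict[str, float]]:
    """Aggregate jobs per tier and report critical errors and cost per word."""
    totals: dict[str, dict[str, float]] = {}
    for job in jobs:
        t = totals.setdefault(job.tier, {"words": 0, "cost": 0.0, "critical_errors": 0})
        t["words"] += job.words
        t["cost"] += job.cost
        t["critical_errors"] += job.critical_errors
    return {
        tier: {
            "critical_errors": t["critical_errors"],
            # Guard against a tier that has no words recorded yet.
            "cost_per_word": t["cost"] / t["words"] if t["words"] else 0.0,
        }
        for tier, t in totals.items()
    }


# Which number each tier is steered by (an illustrative choice).
PRIMARY_KPI = {
    "high": "critical_errors",
    "moderate": "critical_errors",
    "low": "cost_per_word",
}

jobs = [
    Job("high", 2_000, 400.0, 1),
    Job("high", 3_000, 600.0, 0),
    Job("low", 10_000, 150.0, 4),
]

report = kpis_by_tier(jobs)
for tier, metrics in report.items():
    kpi = PRIMARY_KPI[tier]
    print(f"{tier}: {kpi} = {metrics[kpi]}")
```

Both numbers are computed for every tier, but only the primary one is reported per track, which mirrors the point that high-risk and low-risk content are judged by different measures.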

Enforceable Processes And Cost Reality

Erik Vogt

And then I guess the last part to keep in mind is that governance is all about creating enforceable processes. If you don't want any terminology failures, you need control mechanisms to prevent them and to review content. Again, remember that we're dealing with a probabilistic system: without 100% human review, we can't be 100% sure that a model hasn't introduced some surprising interpretation of a common word with more than one meaning. And in some cases, that 100% human review might even be more expensive than just having a human write it in the first place. So it's very important to keep in mind the essential purpose of the content that you're creating and how you are orchestrating all the different systems that support the delivery of value, which is the sum of all those different characteristics: the right quality, the right cost, the right characteristics of what you're

Margin Compression And Faster Orchestration

Erik Vogt

looking for. Yeah, so I think, just thinking here, this is also a big issue for our industry, right? We're dealing with cost and margin compression as an industry. So the complexity of the orchestration, and dealing with that complexity, which I've talked about before as well, is becoming more and more important to solve, because we don't have massive amounts of generic word count to buffer the cost of optimization. Let's say optimization costs $100 and it's a $1,000 project. Then getting your workflows set up, all that stuff, is 10% of your total project cost, 10% of the total job cost. Whatever it is, yeah. But let's say you cut the translation price in half. Now your orchestration overhead is still $100, but it's now 20% of your total cost. Or maybe I'm not doing my math right, but basically you could double or triple the share of the job that goes to the operational overhead of that orchestration. So one of the things that we all have to figure out is how to orchestrate faster and more efficiently. We need tools to be able to optimize these workflows as quickly as possible and to customize the parameters by which we measure success for each of those types of workflows.
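Editor's note: the overhead arithmetic in this passage can be worked through directly. The sketch assumes the orchestration overhead is a fixed $100 that is included in the project total, and that "cutting the price in half" takes the total from $1,000 to $500. Both are readings of the spoken example rather than figures confirmed elsewhere, and `overhead_share` is a hypothetical helper.

```python
def overhead_share(overhead: float, total_price: float) -> float:
    """Fraction of the total project price consumed by fixed orchestration overhead."""
    if total_price <= 0:
        raise ValueError("total_price must be positive")
    return overhead / total_price


OVERHEAD = 100.0  # assumed fixed cost of setting up and tuning the workflow

before = overhead_share(OVERHEAD, 1_000.0)  # original project price
after = overhead_share(OVERHEAD, 500.0)     # project price cut in half

print(f"before: {before:.0%}, after: {after:.0%}")
```

Because the overhead stays fixed while the price falls, its share rises in inverse proportion to the price: halving the price doubles the share, and cutting it to a third would triple it.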

Closing Takeaways

Stephanie Harris-Yee

Okay. Well, thank you, Erik. I think this is good. It gives people a list, maybe, of things to check, so they can ask, "Am I looking at this aspect of governance, and this one, and this one?" as they go down the list of trying to get their AI systems up and running. Because I know that's a huge pain point for a lot of folks right now. The LLM seems great, but now there's this whole second piece. So thank you for coming in and sharing your insights.

Erik Vogt

Always a pleasure, Steph. Thanks very much and have a great rest of your day.