title: "Building A Generative AI Platform"
url: https://huyenchip.com/2024/07/25/genai-platform.html
date: "2024-09-08 14:18:48"
Building A Generative AI Platform
After studying how companies deploy generative AI applications, I noticed many similarities in their platforms. This post outlines the common components of a generative AI platform, what they do, and how they are implemented. I try my best to keep the architecture general, but certain applications might deviate. This is what the overall architecture looks like.
This is a pretty complex system. This post will start from the simplest architecture and progressively add more components. In its simplest form, your application receives a query and sends it to the model. The model generates a response, which is returned to the user. There are no guardrails, no augmented context, and no optimization. The Model API box refers to both third-party APIs (e.g., OpenAI, Google, Anthropic) and self-hosted APIs.
From this, you can add more components as needs arise. The order discussed in this post is common, though you don’t need to follow the exact same order. A component can be skipped if your system works well without it. Evaluation is necessary at every step of the development process.
- Enhance context input into a model by giving the model access to external data sources and tools for information gathering.
- Put in guardrails to protect your system and your users.
- Add model router and gateway to support complex pipelines and add more security.
- Optimize for latency and costs with cache.
- Add complex logic and write actions to maximize your system’s capabilities.
Observability, which allows you to gain visibility into your system for monitoring and debugging, and orchestration, which involves chaining all the components together, are two essential components of the platform. We will discuss them at the end of this post.
» What this post is not «
This post focuses on the overall architecture for deploying AI applications. It discusses what components are needed and considerations when building these components. It’s not about how to build AI applications and, therefore, does NOT discuss model evaluation, application evaluation, prompt engineering, finetuning, data annotation guidelines, or chunking strategies for RAGs. All these topics are covered in my upcoming book AI Engineering.
Table of contents
Step 1. Enhance Context
….RAGs
….RAGs with tabular data
….Agentic RAGs
….Query rewriting
Step 2. Put in Guardrails
….Input guardrails
……..Leaking private information to external APIs
……..Model jailbreaking
….Output guardrails
……..Output quality measurement
……..Failure management
….Guardrail tradeoffs
Step 3. Add Model Router and Gateway
….Router
….Gateway
Step 4. Reduce Latency with Cache
….Prompt cache
….Exact cache
….Semantic cache
Step 5. Add complex logic and write actions
….Complex logic
….Write actions
Observability
….Metrics
….Logs
….Traces
AI Pipeline Orchestration
Conclusion
References and Acknowledgments
Step 1. Enhance Context
The initial expansion of a platform usually involves adding mechanisms to allow the system to augment each query with the necessary information. Gathering the relevant information is called context construction.
Many queries require context to answer. The more relevant information there is in the context, the less the model has to rely on its internal knowledge, which can be unreliable due to its training data and training methodology. Studies have shown that having access to relevant information in the context can help the model generate more detailed responses while reducing hallucinations (Lewis et al., 2020).
For example, given the query “Will Acme’s fancy-printer-A300 print 100pps?”, the model will be able to respond better if it’s given the specifications of fancy-printer-A300. (Thanks Chetan Tekur for the example.)
Context construction for foundation models is equivalent to feature engineering for classical ML models. They serve the same purpose: giving the model the necessary information to process an input.
In-context learning, learning from the context, is a form of continual learning. It enables a model to incorporate new information continually to make decisions, preventing it from becoming outdated. For example, a model trained on last week’s data won’t be able to answer questions about this week unless the new information is included in its context. By updating a model’s context with the latest information, e.g. fancy-printer-A300’s latest specifications, the model remains up-to-date and can respond to queries beyond its cut-off date.
RAGs
The most well-known pattern for context construction is RAG, Retrieval-Augmented Generation. RAG consists of two components: a generator (e.g. a language model) and a retriever, which retrieves relevant information from external sources.
Retrieval isn’t unique to RAGs. It’s the backbone of search engines, recommender systems, log analytics, etc. Many retrieval algorithms developed for traditional retrieval systems can be used for RAGs.
External memory sources typically contain unstructured data, such as memos, contracts, news updates, etc. They can be collectively called documents. A document can be 10 tokens or 1 million tokens. Naively retrieving whole documents can cause your context to be arbitrarily long. RAG typically requires documents to be split into manageable chunks, which can be determined from the model’s maximum context length and your application’s latency requirements. To learn more about chunking and the optimal chunk size, see Pinecone, Langchain, Llamaindex, and Greg Kamradt’s tutorials.
Once data from external memory sources has been loaded and chunked, retrieval is performed using two main approaches.
- Term-based retrieval
This can be as simple as keyword search. For example, given the query “transformer”, fetch all documents containing this keyword. More sophisticated algorithms include BM25 (which leverages TF-IDF) and Elasticsearch (which leverages inverted index).
Term-based retrieval is usually used for text data, but it also works for images and videos that have text metadata such as titles, tags, captions, comments, etc.
- Embedding-based retrieval (also known as vector search)
You convert chunks of data into embedding vectors using an embedding model such as BERT, sentence-transformers, and proprietary embedding models provided by OpenAI or Google. Given a query, the data whose vectors are closest to the query embedding, as determined by the vector search algorithm, is retrieved.
Vector search is usually framed as nearest-neighbor search, using approximate nearest neighbor (ANN) algorithms such as FAISS (Facebook AI Similarity Search), Google’s ScaNN, Spotify’s ANNOY, and hnswlib (Hierarchical Navigable Small World).
The ANN-benchmarks website compares different ANN algorithms on multiple datasets using four main metrics, taking into account the tradeoffs between indexing and querying.
- Recall: the fraction of the nearest neighbors found by the algorithm.
- Query per second (QPS): the number of queries the algorithm can handle per second. This is crucial for high-traffic applications.
- Build time: the time required to build the index. This metric is especially important if you need to frequently update your index (e.g. because your data changes).
- Index size: the size of the index created by the algorithm, which is crucial for assessing its scalability and storage requirements.

This works not just with text documents, but also with images, videos, audio, and code. Many teams even try to summarize SQL tables and dataframes and then use these summaries to generate embeddings for retrieval.
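To make the embedding-based approach concrete, here is a minimal sketch of indexing and querying document chunks with sentence-transformers and FAISS. The model name and the example chunks are illustrative placeholders; a production system would add batching, persistence, and metadata filtering.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed the document chunks once and build an index over them.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
chunks = [
    "fancy-printer-A300 prints up to 100 pages per second.",
    "fancy-printer-A300 supports A4 and A3 paper sizes.",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(chunk_vectors.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(chunk_vectors, dtype="float32"))

# At query time, embed the query and fetch the closest chunks as context.
query_vector = embedder.encode(["Will fancy-printer-A300 print 100pps?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vector, dtype="float32"), 2)
retrieved_context = [chunks[i] for i in ids[0]]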
Term-based retrieval is much faster and cheaper than embedding-based retrieval. It can work well out of the box, making it an attractive option to start. Both BM25 and Elasticsearch are widely used in the industry and serve as formidable baselines for more complex retrieval systems. Embedding-based retrieval, while computationally expensive, can be significantly improved over time to outperform term-based retrieval.
A production retrieval system typically combines several approaches. Combining term-based retrieval and embedding-based retrieval is called hybrid search.
One common pattern is sequential. First, a cheap, less precise retriever, such as a term-based system, fetches candidates. Then, a more precise but more expensive mechanism, such as k-nearest neighbors, finds the best of these candidates. The second step is also called reranking.
For example, given the term “transformer”, you can fetch all documents that contain the word transformer, regardless of whether they are about the electric device, the neural architecture, or the movie. Then you use vector search to find among these documents those that are actually related to your transformer query.
Context reranking differs from traditional search reranking in that the exact position of items is less critical. In search, the rank (e.g., first or fifth) is crucial. In context reranking, the order of documents still matters because it affects how well a model can process them. Models might better understand documents at the beginning and end of the context, as suggested by the paper Lost in the middle (Liu et al., 2023). However, as long as a document is included, the impact of its order is less significant compared to in search ranking.
Another pattern is ensemble. Remember that a retriever works by ranking documents by their relevance scores to the query. You use multiple retrievers to fetch candidates at the same time, then combine these different rankings together to generate a final ranking.
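To illustrate the ensemble pattern, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge rankings from multiple retrievers. The two input rankings are made-up document IDs standing in for the outputs of a term-based retriever and an embedding-based retriever.

from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of document IDs per retriever.
    # Each document scores 1 / (k + rank) per ranking; a higher fused score is better.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

term_based = ["doc3", "doc1", "doc7"]       # e.g. from BM25
embedding_based = ["doc1", "doc9", "doc3"]  # e.g. from vector search
final_ranking = reciprocal_rank_fusion([term_based, embedding_based])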
RAGs with tabular data
External data sources can also be structured, such as dataframes or SQL tables. Retrieving data from an SQL table is significantly different from retrieving data from unstructured documents. Given a query, the system works as follows.
- Text-to-SQL: Based on the user query and the table schemas, determine what SQL query is needed.
- SQL execution: Execute the SQL query.
- Generation: Generate a response based on the SQL result and the original user query.
For the text-to-SQL step, if there are many available tables whose schemas can’t all fit into the model context, you might need an intermediate step to predict what tables to use for each query. Text-to-SQL can be done by the same model used to generate the final response or one of many specialized text-to-SQL models.
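Here is a minimal sketch of these three steps, assuming a hypothetical call_model(prompt) helper that sends a prompt to whatever generation model you use, and a SQLite database for simplicity. A real system would validate the generated SQL before executing it.

import sqlite3

def answer_with_sql(user_query, table_schemas, call_model, db_path="analytics.db"):
    # 1. Text-to-SQL: ask a model to translate the user query into SQL.
    sql_prompt = (
        f"Given these table schemas:\n{table_schemas}\n"
        f"Write a SQL query that answers: {user_query}\n"
        "Return only the SQL."
    )
    sql_query = call_model(sql_prompt)

    # 2. SQL execution: run the generated query against the database.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql_query).fetchall()

    # 3. Generation: answer the original question using the SQL result.
    answer_prompt = (
        f"Question: {user_query}\nSQL result: {rows}\n"
        "Answer the question based on the SQL result."
    )
    return call_model(answer_prompt)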
Agentic RAGs
An important source of data is the Internet. A web search tool like Google or Bing API can give the model access to a rich, up-to-date resource to gather relevant information for each query. For example, given the query “Who won Oscar this year?”, the system searches for information about the latest Oscar and uses this information to generate the final response to the user.
Term-based retrieval, embedding-based retrieval, SQL execution, and web search are actions that a model can take to augment its context. You can think of each action as a function the model can call. A workflow that can incorporate external actions is also called agentic. The architecture then looks like this.
» Action vs. tool «
A tool allows one or more actions. For example, a people search tool might allow two actions: search by name and search by email. However, the difference is minimal, so many people use action and tool interchangeably.
» Read-only actions vs. write actions «
Actions that retrieve information from external sources but don’t change their states are read-only actions. Giving a model write actions, e.g. updating the values in a table, enables the model to perform more tasks but also poses more risks, which will be discussed later.
Query rewriting
Often, a user query needs to be rewritten to increase the likelihood of fetching the right information. Consider the following conversation.
User: When was the last time John Doe bought something from us?
AI: John last bought a Fruity Fedora hat from us two weeks ago, on January 3, 2030.
User: How about Emily Doe?
The last question, “How about Emily Doe?”, is ambiguous. If you use this query verbatim to retrieve documents, you’ll likely get irrelevant results. You need to rewrite this query to reflect what the user is actually asking. The new query should make sense on its own. The last question should be rewritten to “When was the last time Emily Doe bought something from us?”
Query rewriting is typically done using other AI models, using a prompt similar to “Given the following conversation, rewrite the last user input to reflect what the user is actually asking.”
Query rewriting can get complicated, especially if you need to do identity resolution or incorporate other knowledge. If the user asks “How about his wife?”, you will first need to query your database to find out who his wife is. If you don’t have this information, the rewriting model should acknowledge that this query isn’t solvable instead of hallucinating a name, leading to a wrong answer.
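Here is a minimal sketch of query rewriting, again assuming a hypothetical call_model(prompt) helper. The prompt wording and the UNRESOLVABLE convention are illustrative, not a recommendation.

def rewrite_query(conversation, last_user_input, call_model):
    # Ask a model to make the last user input self-contained.
    prompt = (
        "Given the following conversation, rewrite the last user input "
        "to reflect what the user is actually asking. "
        "The rewritten query must make sense on its own. "
        "If the conversation doesn't contain enough information, reply UNRESOLVABLE.\n\n"
        f"Conversation:\n{conversation}\n\nLast user input: {last_user_input}"
    )
    rewritten = call_model(prompt)
    # Fall back to the original input if the model can't resolve the reference.
    return last_user_input if rewritten.strip() == "UNRESOLVABLE" else rewritten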
Step 2. Put in Guardrails
Guardrails help reduce AI risks and protect not just your users but also you, the developers. They should be placed whenever there is potential for failures. This post discusses two types of guardrails: input guardrails and output guardrails.
Input guardrails
Input guardrails are typically protection against two types of risks: leaking private information to external APIs, and executing bad prompts that compromise your system (model jailbreaking).
Leaking private information to external APIs
This risk is specific to using external model APIs when you need to send your data outside your organization. For example, an employee might copy the company’s secret or a user’s private information into a prompt and send it to wherever the model is hosted.
One of the most notable early incidents was when Samsung employees put Samsung’s proprietary information into ChatGPT, accidentally leaking the company’s secrets. It’s unclear how Samsung discovered this leak and how the leaked information was used against Samsung. However, the incident was serious enough for Samsung to ban ChatGPT in May 2023.
There’s no airtight way to eliminate potential leaks when using third-party APIs. However, you can mitigate them with guardrails. You can use one of the many available tools that automatically detect sensitive data. What sensitive data to detect is specified by you. Common sensitive data classes are:
- Personal information (ID numbers, phone numbers, bank accounts).
- Human faces.
- Specific keywords and phrases associated with the company’s intellectual properties or privileged information.
Many sensitive data detection tools use AI to identify potentially sensitive information, such as determining if a string resembles a valid home address. If a query is found to contain sensitive information, you have two options: block the entire query or remove the sensitive information from it. For instance, you can mask a user’s phone number with the placeholder [PHONE NUMBER]. If the generated response contains this placeholder, use a PII reversible dictionary that maps this placeholder to the original information so that you can unmask it, as shown below.
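Here is a minimal sketch of masking with a reversible placeholder dictionary. It handles only one toy phone-number pattern; in practice you’d rely on a dedicated PII detection tool for the detection step.

import re

PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")  # toy US-style pattern

def mask_phone_numbers(text):
    # Replace each phone number with a placeholder and remember the mapping.
    reverse_map = {}
    def _mask(match):
        placeholder = f"[PHONE NUMBER {len(reverse_map) + 1}]"
        reverse_map[placeholder] = match.group(0)
        return placeholder
    return PHONE_PATTERN.sub(_mask, text), reverse_map

def unmask(text, reverse_map):
    # Restore the original values if the response echoes a placeholder.
    for placeholder, original in reverse_map.items():
        text = text.replace(placeholder, original)
    return text

masked_query, reverse_map = mask_phone_numbers("Call me at 415-555-0132 about my order.")
# ... send masked_query to the external model API ...
final_response = unmask("Sure, we'll call you at [PHONE NUMBER 1].", reverse_map)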
Model jailbreaking
It’s become an online sport to try to jailbreak AI models, getting them to say or do bad things. While some might find it amusing to get ChatGPT to make controversial statements, it’s much less fun if your customer support chatbot, branded with your name and logo, does the same thing. This can be especially dangerous for AI systems that have access to tools. Imagine if a user finds a way to get your system to execute an SQL query that corrupts your data.
To combat this, you should first put guardrails on your system so that no harmful actions can be automatically executed. For example, no SQL queries that can insert, delete, or update data can be executed without human approval. The downside of this added security is that it can slow down your system.
To prevent your application from making outrageous statements it shouldn’t be making, you can define out-of-scope topics for your application. For example, if your application is a customer support chatbot, it shouldn’t answer political or social questions. A simple way to do so is to filter out inputs that contain predefined phrases typically associated with controversial topics, such as “immigration” or “antivax”. More sophisticated algorithms use AI to classify whether an input is about one of the pre-defined restricted topics.
If harmful prompts are rare in your system, you can use an anomaly detection algorithm to identify unusual prompts.
Output guardrails
AI models are probabilistic, making their outputs unreliable. You can put in guardrails to significantly improve your application’s reliability. Output guardrails have two main functionalities:
- Evaluate the quality of each generation.
- Specify the policy to deal with different failure modes.
Output quality measurement
To catch outputs that fail to meet your standards, you need to understand what failures look like. Here are examples of failure modes and how to catch them.
- Empty responses.
- Malformatted responses that don’t follow the expected output format. For example, if the application expects JSON and the generated response has a missing closing bracket. There are validators for certain formats, such as regex, JSON, and Python code validators. There are also tools for constrained sampling such as guidance, outlines, and instructor.
- Toxic responses, such as those that are racist or sexist. These responses can be caught using one of many toxicity detection tools.
- Factually inconsistent responses hallucinated by the model. Hallucination detection is an active area of research with solutions such as SelfCheckGPT (Manakul et al., 2023) and SAFE, Search Engine Factuality Evaluator (Wei et al., 2024). You can mitigate hallucinations by providing models with sufficient context and prompting techniques such as chain-of-thought. Hallucination detection and mitigation are discussed further in my upcoming book AI Engineering.
- Responses that contain sensitive information. This can happen in two scenarios.
  - Your model was trained on sensitive data and regurgitates it back.
  - Your system retrieves sensitive information from your internal database to enrich its context, and then it passes this sensitive information on to the response.
  This failure mode can be prevented by not training your model on sensitive data and not allowing it to retrieve sensitive data in the first place. Sensitive data in outputs can be detected using the same tools used for input guardrails.
- Brand-risk responses, such as responses that mischaracterize your company or your competitors. An example is when Grok, a model trained by X, generated a response suggesting that Grok was trained by OpenAI, causing the Internet to suspect X of stealing OpenAI’s data. This failure mode can be mitigated with keyword monitoring. Once you’ve identified outputs concerning your brands and competitors, you can either block these outputs, pass them onto human reviewers, or use other models to detect the sentiment of these outputs to ensure that only the right sentiments are returned.
- Generally bad responses. For example, if you ask the model to write an essay and that essay is just bad, or if you ask the model for a low-calorie cake recipe and the generated recipe contains an excessive amount of sugar. It’s become a popular practice to use AI judges to evaluate the quality of models’ responses. These AI judges can be general-purpose models (think ChatGPT, Claude) or specialized scorers trained to output a concrete score for a response given a query.
Failure management
AI models are probabilistic, which means that if you try a query again, you might get a different response. Many failures can be mitigated using a basic retry logic. For example, if the response is empty, try again X times or until you get a non-empty response. Similarly, if the response is malformatted, try again until the model generates a correctly formatted response.
This retry policy, however, can incur extra latency and cost. One retry means 2x the number of API calls. If the retry is carried out after failure, the latency experienced by the user will double. To reduce latency, you can make calls in parallel. For example, for each query, instead of waiting for the first query to fail before retrying, you send this query to the model twice at the same time, get back two responses, and pick the better one. This increases the number of redundant API calls but keeps latency manageable.
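As an illustration, here is a minimal sketch of sequential retry with a simple validity check (non-empty, well-formed JSON), assuming the same hypothetical call_model(prompt) helper. The parallel variant would issue the calls through a thread pool instead of a loop.

import json

def call_with_retries(prompt, call_model, max_attempts=3):
    # Retry until the model returns a non-empty, well-formed JSON response.
    for _ in range(max_attempts):
        response = call_model(prompt)
        if not response or not response.strip():
            continue  # empty response: try again
        try:
            return json.loads(response)  # malformed JSON raises and triggers another attempt
        except json.JSONDecodeError:
            continue
    return None  # let the caller fall back, e.g. to a human operator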
It’s also common to fall back on humans to handle tricky queries. For example, you can transfer a query to human operators if it contains specific key phrases. Some teams use a specialized model, potentially trained in-house, to decide when to transfer a conversation to humans. One team, for instance, transfers a conversation to human operators when their sentiment analysis model detects that the user is getting angry. Another team transfers a conversation after a certain number of turns to prevent users from getting stuck in an infinite loop.
Guardrail tradeoffs
Reliability vs. latency tradeoff: While acknowledging the importance of guardrails, some teams told me that latency is more important. They decided not to implement guardrails because they can significantly increase their application’s latency. However, these teams are in the minority. Most teams find that the increased risks are more costly than the added latency.
Output guardrails might not work well in the stream completion mode. By default, the whole response is generated before shown to the user, which can take a long time. In the stream completion mode, new tokens are streamed to the user as they are generated, reducing the time the user has to wait to see the response. The downside is that it’s hard to evaluate partial responses, so unsafe responses might be streamed to users before the system guardrails can determine that they should be blocked.
Self-hosted vs. third-party API tradeoff: Self-hosting your models means that you don’t have to send your data to a third party, reducing the need for input guardrails. However, it also means that you must implement all the necessary guardrails yourself, rather than relying on the guardrails provided by third-party services.
Our platform now looks like this. Guardrails can be independent tools or parts of model gateways, as discussed later. Scorers, if used, are grouped under model APIs since scorers are typically AI models, too. Models used for scoring are typically smaller and faster than models used for generation.
Step 3. Add Model Router and Gateway
As applications grow in complexity and involve more models, two types of tools emerged to help you work with multiple models: routers and gateways.
Router
An application can use different models to respond to different types of queries. Having different solutions for different queries has several benefits. First, this allows you to have specialized solutions, such as one model specialized in technical troubleshooting and another specialized in subscriptions. Specialized models can potentially perform better than a general-purpose model. Second, this can help you save costs. Instead of routing all queries to an expensive model, you can route simpler queries to cheaper models.
A router typically consists of an intent classifier that predicts what the user is trying to do. Based on the predicted intent, the query is routed to the appropriate solution. For example, for a customer support chatbot, if the intent is:
- To reset a password –> route this user to the page about password resetting.
- To correct a billing mistake –> route this user to a human operator.
- To troubleshoot a technical issue –> route this query to a model finetuned for troubleshooting.
An intent classifier can also help your system avoid out-of-scope conversations. For example, you can have an intent classifier that predicts whether a query is out of the scope. If the query is deemed inappropriate (e.g. if the user asks who you would vote for in the upcoming election), the chatbot can politely decline to engage using one of the stock responses (“As a chatbot, I don’t have the ability to vote. If you have questions about our products, I’d be happy to help.”) without wasting an API call.
If your system has access to multiple actions, a router can involve a next-action predictor to help the system decide what action to take next. One valid action is to ask for clarification if the query is ambiguous. For example, in response to the query “Freezing,” the system might ask, “Do you want to freeze your account or are you talking about the weather?” or simply say, “I’m sorry. Can you elaborate?”
Intent classifiers and next-action predictors can be general-purpose models or specialized classification models. Specialized classification models are typically much smaller and faster than general-purpose models, allowing your system to use multiple of them without incurring significant extra latency and cost.
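Here is a minimal sketch of intent-based routing. The classify_intent and call_troubleshooting_model arguments stand in for whatever classifier and specialized model you use; the intent labels and route targets mirror the example above and are made up.

def route_query(query, classify_intent, call_troubleshooting_model):
    # classify_intent is assumed to return one label from a fixed set of intents.
    intent = classify_intent(query)
    if intent == "RESET_PASSWORD":
        return {"action": "redirect", "target": "/help/password-reset"}
    if intent == "BILLING_MISTAKE":
        return {"action": "handoff", "target": "human_operator"}
    if intent == "TECHNICAL_ISSUE":
        return {"action": "generate", "response": call_troubleshooting_model(query)}
    if intent == "OUT_OF_SCOPE":
        return {"action": "respond",
                "response": "As a chatbot, I don't have the ability to vote. "
                            "If you have questions about our products, I'd be happy to help."}
    # Ambiguous query: ask for clarification instead of guessing.
    return {"action": "clarify", "response": "I'm sorry. Can you elaborate?"}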
When routing queries to models with varying context limits, the query’s context might need to be adjusted accordingly. Consider a query of 1,000 tokens that is slated for a model with a 4K context limit. The system then takes an action, e.g. web search, that brings back 8,000-token context. You can either truncate the query’s context to fit the originally intended model or route the query to a model with a larger context limit.
Gateway
A model gateway is an intermediate layer that allows your organization to interface with different models in a unified and secure manner. The most basic functionality of a model gateway is to enable developers to access different models – be it self-hosted models or models behind commercial APIs such as OpenAI or Google – the same way. A model gateway makes it easier to maintain your code. If a model API changes, you only need to update the model gateway instead of having to update all applications that use this model API.
In its simplest form, a model gateway is a unified wrapper that looks like the following code example. This example is to give you an idea of how a model gateway might be implemented. It’s not meant to be functional as it doesn’t contain any error checking or optimization.
import os

import google.generativeai as genai
import openai
from flask import Flask, request, jsonify

app = Flask(__name__)

def openai_model(input_data, model_name, max_tokens):
    # Call a model hosted behind the OpenAI API.
    openai.api_key = os.environ["OPENAI_API_KEY"]
    response = openai.Completion.create(
        engine=model_name,
        prompt=input_data,
        max_tokens=max_tokens
    )
    return {"response": response.choices[0].text.strip()}

def gemini_model(input_data, model_name, max_tokens):
    # Call a Gemini model through Google's generative AI SDK.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name=model_name)
    response = model.generate_content(
        input_data,
        generation_config=genai.GenerationConfig(max_output_tokens=max_tokens)
    )
    return {"response": response.text}

@app.route('/model', methods=['POST'])
def model_gateway():
    # One endpoint that hides which provider actually serves the request.
    data = request.get_json()
    model_type = data.get("model_type")
    model_name = data.get("model_name")
    input_data = data.get("input_data")
    max_tokens = data.get("max_tokens")
    if model_type == "openai":
        result = openai_model(input_data, model_name, max_tokens)
    elif model_type == "gemini":
        result = gemini_model(input_data, model_name, max_tokens)
    return jsonify(result)
A model gateway also provides access control and cost management. Instead of giving everyone who wants access to the OpenAI API your organizational tokens, which can be easily leaked, you only give people access to the model gateway, creating a centralized and controlled point of access. The gateway can also implement fine-grained access controls, specifying which user or application should have access to which model. Moreover, the gateway can monitor and limit the usage of API calls, preventing abuse and managing costs effectively.
A model gateway can also be used to implement fallback policies to overcome rate limits or API failures (the latter is unfortunately common). When the primary API is unavailable, the gateway can route requests to alternative models, retry after a short wait, or handle failures in other graceful manners. This ensures that your application can operate smoothly without interruptions.
Since requests and responses are already flowing through the gateway, it’s a good place to implement other functionalities such as load balancing, logging, and analytics. Some gateway services even provide caching and guardrails.
Given that gateways are relatively straightforward to implement, there are many off-the-shelf gateways. Examples include Portkey’s gateway, MLflow AI Gateway, WealthSimple’s llm-gateway, TrueFoundry, Kong, and Cloudflare.
With the added gateway and routers, our platform is getting more exciting. Like scoring, routing is also in the model gateway. Like models used for scoring, models used for routing are typically smaller than models used for generation.
Step 4. Reduce Latency with Cache
When I shared this post with my friend Eugene Yan, he said that cache is perhaps the most underrated component of an AI platform. Caching can significantly reduce your application’s latency and cost.
Cache techniques can also be used during training, but since this post is about deployment, I’ll focus on cache for inference. Some common inference caching techniques include prompt cache, exact cache, and semantic cache. Prompt cache is typically implemented by the inference APIs that you use. When evaluating an inference library, it’s helpful to understand what cache mechanism it supports.
KV cache for the attention mechanism is out of scope for this discussion.
Prompt cache
Many prompts in an application have overlapping text segments. For example, all queries can share the same system prompt. A prompt cache stores these overlapping segments for reuse, so you only need to process them once. A common overlapping text segment in different prompts is the system prompt. Without prompt cache, your model needs to process the system prompt with every query. With prompt cache, it only needs to process the system prompt once for the first query.
For applications with long system prompts, prompt cache can significantly reduce both latency and cost. If your system prompt is 1000 tokens and your application generates 1 million model API calls today, a prompt cache will save you from processing approximately 1 billion repetitive input tokens a day! However, this isn’t entirely free. Like KV cache, prompt cache size can be quite large and require significant engineering effort.
Prompt cache is also useful for queries that involve long documents. For example, if many of your user queries are related to the same long document (such as a book or a codebase), this long document can be cached for reuse across queries.
Since its introduction in November 2023 by Gim et al., prompt cache has already been incorporated into model APIs. Google announced that Gemini APIs will offer this functionality in June 2024 under the name context cache. Cached input tokens are given a 75% discount compared to regular input tokens, but you’ll have to pay extra for cache storage (as of writing, $1.00 / 1 million tokens per hour). Given the obvious benefits of prompt cache, I wouldn’t be surprised if it becomes as popular as KV cache.
While llama.cpp also has prompt cache, it seems to only cache whole prompts and work for queries in the same chat session. Its documentation is limited, but my guess from reading the code is that in a long conversation, it caches the previous messages and only processes the newest message.
Exact cache
If prompt cache and KV cache are unique to foundation models, exact cache is more general and straightforward. Your system stores processed items for reuse later when the exact items are requested. For example, if a user asks a model to summarize a product, the system checks the cache to see if a summary of this product is cached. If yes, fetch this summary. If not, summarize the product and cache the summary.
Exact cache is also used for embedding-based retrieval to avoid redundant vector search. If an incoming query is already in the vector search cache, fetch the cached search result. If not, perform a vector search for this query and cache the result.
Cache is especially appealing for queries that require multiple steps (e.g. chain-of-thought) and/or time-consuming actions (e.g. retrieval, SQL execution, or web search).
An exact cache can be implemented using in-memory storage for fast retrieval. However, since in-memory storage is limited, a cache can also be implemented using databases like PostgreSQL, Redis, or tiered storage to balance speed and storage capacity. Having an eviction policy is crucial to manage the cache size and maintain performance. Common eviction policies include Least Recently Used (LRU), Least Frequently Used (LFU), and First In, First Out (FIFO).
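Here is a minimal in-memory sketch of an exact cache with LRU eviction, keyed on a hash of the prompt. In production, the OrderedDict would typically be replaced by Redis or a database, as discussed above.

import hashlib
from collections import OrderedDict

class ExactCache:
    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self.store = OrderedDict()  # insertion order doubles as recency order

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        key = self._key(prompt)
        if key in self.store:
            self.store.move_to_end(key)  # mark as recently used
            return self.store[key]
        return None  # cache miss: the caller computes the result and then calls put()

    def put(self, prompt, response):
        key = self._key(prompt)
        self.store[key] = response
        self.store.move_to_end(key)
        if len(self.store) > self.max_size:
            self.store.popitem(last=False)  # evict the least recently used entry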
How long to cache a query depends on how likely this query is to be called again. User-specific queries such as “What’s the status of my recent order” are less likely to be reused by other users, and therefore, shouldn’t be cached. Similarly, it makes less sense to cache time-sensitive queries such as “How’s the weather?” Some teams train a small classifier to predict whether a query should be cached.
Semantic cache
Unlike exact cache, semantic cache doesn’t require the incoming query to be identical to any of the cached queries. Semantic cache allows the reuse of similar queries. Imagine one user asks “What’s the capital of Vietnam?” and the model generates the answer “Hanoi”. Later, another user asks “What’s the capital city of Vietnam?”, which is the same question but with the extra word “city”. The idea of semantic cache is that the system can reuse the answer “Hanoi” instead of computing the new query from scratch.
Semantic cache only works if you have a reliable way to determine if two queries are semantically similar. One common approach is embedding-based similarity, which works as follows:
- For each query, generate its embedding using an embedding model.
- Use vector search to find the cached embedding closest to the current query embedding. Let’s say this similarity score is X.
- If X is more than the similarity threshold you set, the cached query is considered the same as the current query, and the cached results are returned. If not, process this current query and cache it together with its embedding and results.
This approach requires a vector database to store the embeddings of cached queries.
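Here is a minimal sketch of this approach, reusing sentence-transformers and FAISS from the retrieval example. The similarity threshold is a made-up number that you would have to tune for your own traffic.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold=0.92, dim=384):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model, 384-dim
        self.index = faiss.IndexFlatIP(dim)  # cosine similarity on normalized embeddings
        self.responses = []
        self.threshold = threshold

    def _embed(self, query):
        vec = self.embedder.encode([query], normalize_embeddings=True)
        return np.asarray(vec, dtype="float32")

    def get(self, query):
        if self.index.ntotal == 0:
            return None
        scores, ids = self.index.search(self._embed(query), 1)
        if scores[0][0] >= self.threshold:
            return self.responses[ids[0][0]]  # similar enough: reuse the cached response
        return None

    def put(self, query, response):
        self.index.add(self._embed(query))
        self.responses.append(response)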
Compared to other caching techniques, semantic cache’s value is more dubious because many of its components are prone to failure. Its success relies on high-quality embeddings, functional vector search, and a trustworthy similarity metric. Setting the right similarity threshold can also be tricky and require a lot of trial and error. If the system mistakes the incoming query as being similar to another query, the returned response, fetched from the cache, will be incorrect.
In addition, semantic cache can be time-consuming and compute-intensive, as it involves a vector search. The speed and cost of this vector search depend on the size of your database of cached embeddings.
Semantic cache might still be worth it if the cache hit rate is high, meaning that a good portion of queries can be effectively answered by leveraging the cached results. However, before incorporating the complexities of semantic cache, make sure to evaluate the efficiency, cost, and performance risks associated with it.
With the added cache systems, the platform looks as follows. KV cache and prompt cache are typically implemented by model API providers, so they aren’t shown in this image. If I must visualize them, I’d put them in the Model API box. There’s a new arrow to add generated responses to the cache.
Step 5. Add complex logic and write actions
The applications we’ve discussed so far have fairly simple flows. The outputs generated by foundation models are mostly returned to users (unless they don’t pass the guardrails). However, an application flow can be more complex with loops and conditional branching. A model’s outputs can also be used to invoke write actions, such as composing an email or placing an order.
Complex logic
Outputs from a model can be conditionally passed onto another model or fed back to the same model as part of the input to the next step. This goes on until a model in the system decides that the task has been completed and that a final response should be returned to the user.
This can happen when you give your system the ability to plan and decide what to do next. As an example, consider the query “Plan a weekend itinerary for Paris.” The model might first generate a list of potential activities: visiting the Eiffel Tower, having lunch at a café, touring the Louvre, etc. Each of these activities can then be fed back into the model to generate more detailed plans. For instance, “visiting the Eiffel Tower” could prompt the model to generate sub-tasks like checking the opening hours, buying tickets, and finding nearby restaurants. This iterative process continues until a comprehensive and detailed itinerary is created.
Our infrastructure now has an arrow pointing the generated response back to context construction, which in turn feeds back to models in the model gateway.
Write actions
Actions used for context construction are read-only actions. They allow a model to read from its data sources to gather context. But a system can also write actions, making changes to the data sources and the world. For example, if the model outputs: “send an email to X with the message Y”, the system will invoke the action send_email(recipient=X, message=Y).
Write actions make a system vastly more capable. They can enable you to automate the whole customer outreach workflow: researching potential customers, finding their contacts, drafting emails, sending first emails, reading responses, following up, extracting orders, updating your databases with new orders, etc.
However, the prospect of giving AI the ability to automatically alter our lives is frightening. Just as you shouldn’t give an intern the authority to delete your production database, you shouldn’t allow an unreliable AI to initiate bank transfers. Trust in the system’s capabilities and its security measures is crucial. You need to ensure that the system is protected from bad actors who might try to manipulate it into performing harmful actions.
AI systems are vulnerable to cyber attacks like other software systems, but they also have another weakness: prompt injection. Prompt injection happens when an attacker manipulates input prompts into a model to get it to express undesirable behaviors. You can think of prompt injection as social engineering done on AI instead of humans.
A scenario that many companies fear is that they give an AI system access to their internal databases, and attackers trick this system into revealing private information from these databases. If the system has write access to these databases, attackers can trick the system into corrupting the data.
Any organization that wants to leverage AI needs to take safety and security seriously. However, these risks don’t mean that AI systems should never be given the ability to act in the real world. AI systems can fail, but humans can fail too. If we can get people to trust a machine to take us up into space, I hope that one day, security measures will be sufficient for us to trust autonomous AI systems.
Observability
While I have placed observability in its own section, it should be integrated into the platform from the beginning rather than added later as an afterthought. Observability is crucial for projects of all sizes, and its importance grows with the complexity of the system.
This section provides the least information compared to the others. It’s impossible to cover all the nuances of observability in a blog post. Therefore, I will only give a brief overview of the three pillars of monitoring: logs, traces, and metrics. I won’t go into specifics or cover user feedback, drift detection, and debugging.
Metrics
When discussing monitoring, most people think of metrics. What metrics to track depends on what you want to track about your system, which is application-specific. However, in general, there are two types of metrics you want to track: model metrics and system metrics.
System metrics tell you the state of your overall system. Common metrics are throughput, memory usage, hardware utilization, and service availability/uptime. System metrics are common to all software engineering applications. In this post, I’ll focus on model metrics.
Model metrics assess your model’s performance, such as accuracy, toxicity, and hallucination rate. Different steps in an application pipeline also have their own metrics. For example, in a RAG application, the retrieval quality is often evaluated using context relevance and context precision. A vector database can be evaluated by how much storage it needs to index the data and how long it takes to query the data.
There are various ways a model’s output can fail. It’s crucial to identify these issues and develop metrics to monitor them. For example, you might want to track how often your model times out, returns empty responses or produces malformatted responses. If you’re worried about your model revealing sensitive information, find a way to track that too.
Length-related metrics such as query, context, and response length are helpful for understanding your model’s behaviors. Is one model more verbose than another? Are certain types of queries more likely to result in lengthy answers? They are especially useful for detecting changes in your application. If the average query length suddenly decreases, it could indicate an underlying issue that needs investigation.
Length-related metrics are also important for tracking latency and costs, as longer contexts and responses typically increase latency and incur higher costs.
Tracking latency is essential for understanding the user experience. Common latency metrics include:
- Time to First Token (TTFT): The time it takes for the first token to be generated.
- Time Between Tokens (TBT): The interval between each token generation.
- Tokens Per Second (TPS): The rate at which tokens are generated.
- Time Per Output Token (TPOT): The time it takes to generate each output token.
- Total Latency: The total time required to complete a response.
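As an illustration, here is a minimal sketch of computing TTFT, TPOT, and total latency around a streaming call. The stream_tokens(prompt) generator is a hypothetical helper that yields tokens as the model produces them.

import time

def measure_latency(prompt, stream_tokens):
    start = time.monotonic()
    first_token_time = None
    num_tokens = 0
    for _token in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.monotonic()
        num_tokens += 1
    end = time.monotonic()
    if first_token_time is None:  # the model returned no tokens
        first_token_time = end

    return {
        "ttft": first_token_time - start,                           # Time to First Token
        "tpot": (end - first_token_time) / max(num_tokens - 1, 1),  # Time Per Output Token
        "total_latency": end - start,                               # Total Latency
        "num_tokens": num_tokens,
    }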
You’ll also want to track costs. Cost-related metrics are the number of queries and the volume of input and output tokens. If you use an API with rate limits, tracking the number of requests per second is important to ensure you stay within your allocated limits and avoid potential service interruptions.
When calculating metrics, you can choose between spot checks and exhaustive checks. Spot checks involve sampling a subset of data to quickly identify issues, while exhaustive checks evaluate every request for a comprehensive performance view. The choice depends on your system’s requirements and available resources, with a combination of both providing a balanced monitoring strategy.
When computing metrics, ensure they can be broken down by relevant axes, such as users, releases, prompt/chain versions, prompt/chain types, and time. This granularity helps in understanding performance variations and identifying specific issues.
Logs
Since this blog post is getting long and I’ve written at length about logs in Designing Machine Learning Systems, I will be quick here. The philosophy for logging is simple: log everything. Log the system configurations. Log the query, the output, and the intermediate outputs. Log when a component starts, ends, when something crashes, etc. When recording a piece of log, make sure to give it tags and IDs that can help you know where in the system this log comes from.
Logging everything means that the amount of logs you have can grow very quickly. Many tools for automated log analysis and log anomaly detection are powered by AI.
While it’s impossible to manually process logs, it’s useful to manually inspect your production data daily to get a sense of how users are using your application. Shankar et al. (2024) found that the developers’ perceptions of what constitutes good and bad outputs change as they interact with more data, allowing them to both rewrite their prompts to increase the chance of good responses and update their evaluation pipeline to catch bad responses.
Traces
Trace refers to the detailed recording of a request’s execution path through various system components and services. In an AI application, tracing reveals the entire process from when a user sends a query to when the final response is returned, including the actions the system takes, the documents retrieved, and the final prompt sent to the model. It should also show how much time each step takes and its associated cost, if measurable. As an example, this is a visualization of a Langsmith trace.
Ideally, you should be able to trace each query’s transformation through the system step-by-step. If a query fails, you should be able to pinpoint the exact step where it went wrong: whether it was incorrectly processed, the retrieved context was irrelevant, or the model generated a wrong response.
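Here is a hand-rolled sketch of the idea, not any particular tracing tool's API: each request gets a trace that records named spans with their durations and optional metadata. The step names and metadata keys are made up.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Collects the named steps one request goes through, with durations and metadata."""
    request_id: str
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str, **metadata):
        start = time.monotonic()
        try:
            yield
        finally:
            self.spans.append(
                {"name": name, "duration_s": time.monotonic() - start, **metadata}
            )

# Example: trace the main steps of a single request.
trace = Trace(request_id="req-001")
with trace.span("rewrite_query"):
    time.sleep(0.01)   # stand-in for the actual work
with trace.span("retrieve", index="product_docs"):
    time.sleep(0.02)
with trace.span("generate", model="some-model"):
    time.sleep(0.03)
print(trace.spans)
```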
AI Pipeline Orchestration
An AI application can get fairly complex, consisting of multiple models, retrieving data from many databases, and having access to a wide range of tools. An orchestrator helps you specify how these different components are combined (chained) together to create an end-to-end application flow.
At a high level, an orchestrator works in two steps: components definition and chaining (also known as pipelining).
- Components Definition

You need to tell the orchestrator what components your system uses, such as models (including models for generation, routing, and scoring), databases from which your system can retrieve data, and actions that your system can take. Direct integration with model gateways can help simplify model onboarding, and some orchestrator tools want to be gateways. Many orchestrators also support integration with tools for evaluation and monitoring.

- Chaining (or pipelining)

You tell the orchestrator the sequence of steps your system takes from receiving the user query until completing the task. In short, chaining is just function composition. Here’s an example of what a pipeline looks like.

- Process the raw query.
- Retrieve the relevant data based on the processed query.
- The original query and the retrieved data are combined to create a prompt in the format expected by the model.
- The model generates a response based on the prompt.
- Evaluate the response.
- If the response is considered good, return it to the user. If not, route the query to a human operator.
The orchestrator is responsible for passing data between steps and can provide tooling that helps ensure that the output from the current step is in the format expected by the next step.
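A tool-agnostic sketch of the pipeline above as plain function composition; every component here is a stub standing in for a real implementation (retriever, model call, scorer, escalation path).

```python
def process_query(raw_query: str) -> str:
    # Stub: e.g., rewrite or normalize the query.
    return raw_query.strip()

def retrieve(processed_query: str) -> list[str]:
    # Stub: e.g., vector or keyword search against your databases.
    return ["a retrieved document relevant to: " + processed_query]

def build_prompt(raw_query: str, documents: list[str]) -> str:
    context = "\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {raw_query}"

def generate(prompt: str) -> str:
    # Stub: call your model API or self-hosted model here.
    return "a model-generated answer"

def is_good(response: str) -> bool:
    # Stub: a scorer or AI judge deciding whether the response is good enough.
    return len(response) > 0

def route_to_human(raw_query: str) -> str:
    # Stub: hand the query off to a human operator.
    return "Your question has been forwarded to a support agent."

def pipeline(raw_query: str) -> str:
    processed = process_query(raw_query)
    documents = retrieve(processed)
    prompt = build_prompt(raw_query, documents)
    response = generate(prompt)
    return response if is_good(response) else route_to_human(raw_query)

print(pipeline("  How do I reset my password?  "))
```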
When designing the pipeline for an application with strict latency requirements, try to do as much in parallel as possible. For example, if you have a routing component (deciding where to send a query) and a PII removal component, they can run at the same time.
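A sketch of this kind of parallelism with asyncio; both components are stubs, and the sleeps stand in for real model or service calls. Running them concurrently keeps latency close to the slower of the two instead of their sum.

```python
import asyncio

async def route(query: str) -> str:
    # Stub: decide which model or pipeline should handle the query.
    await asyncio.sleep(0.05)   # stand-in for a routing model call
    return "technical-support"

async def remove_pii(query: str) -> str:
    # Stub: mask emails, phone numbers, etc. before the query is sent onward.
    await asyncio.sleep(0.05)   # stand-in for a PII detection call
    return query.replace("jane@example.com", "[EMAIL]")

async def preprocess(query: str) -> tuple[str, str]:
    # Run both steps concurrently.
    destination, cleaned_query = await asyncio.gather(route(query), remove_pii(query))
    return destination, cleaned_query

print(asyncio.run(preprocess("My email is jane@example.com and my order is late.")))
```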
There are many AI orchestration tools, including LangChain, LlamaIndex, Flowise, Langflow, and Haystack. Each tool has its own APIs so I won’t show the actual code here.
While it’s tempting to jump straight to an orchestration tool when starting a project, start building your application without one first. Any external tool brings added complexity. An orchestrator can abstract away critical details of how your system works, making it hard to understand and debug your system.
As you advance to the later stages of your application development process, you might decide that an orchestrator can make your life easier. Here are three aspects to keep in mind when evaluating orchestrators.
- Integration and extensibility

Evaluate whether the orchestrator supports the components you’re already using or might adopt in the future. For example, if you want to use a Llama model, check if the orchestrator supports that. Given how many models, databases, and frameworks there are, it’s impossible for an orchestrator to support everything. Therefore, you’ll also need to consider an orchestrator’s extensibility. If it doesn’t support a specific component, how hard is it to change that?

- Support for complex pipelines

As your applications grow in complexity, you might need to manage intricate pipelines involving multiple steps and conditional logic. An orchestrator that supports advanced features like branching, parallel processing, and error handling will help you manage these complexities efficiently.

- Ease of use, performance, and scalability

Consider the user-friendliness of the orchestrator. Look for intuitive APIs, comprehensive documentation, and strong community support, as these can significantly reduce the learning curve for you and your team. Avoid orchestrators that initiate hidden API calls or introduce latency to your applications. Additionally, ensure that the orchestrator can scale effectively as the number of applications, developers, and traffic grows.
Conclusion
This post started with a basic architecture and then gradually added components to address the growing application complexities. Each addition brings its own set of benefits and challenges, requiring careful consideration and implementation.
While the separation of components is important to keep your system modular and maintainable, this separation is fluid. There are many overlaps between components. For example, a model gateway can share functionalities with guardrails. Cache can be implemented in different components, such as in vector search and inference services.
This post is much longer than I intended it to be, and yet there are many details I haven’t been able to explore further, especially around observability, context construction, complex logic, cache, and guardrails. I’ll dive deeper into all these components in my upcoming book AI Engineering.
This post also didn’t discuss how to serve models, assuming that most people will be using models provided by third-party APIs. AI Engineering will also have a chapter dedicated to inference and model optimization.
References and Acknowledgments
Special thanks to Luke Metz, Alex Li, Chetan Tekur, Kittipat “Bot” Kampa, Hien Luu, and Denys Linkov for feedback on the early versions of this post. Their insights greatly improved the content. Any remaining errors are my own.
I read many case studies shared by companies on how they adopted generative AI, and here are some of my favorites.
- Musings on Building a Generative AI Product (LinkedIn, 2024)
- How we built Text-to-SQL at Pinterest (Pinterest, 2024)
- From idea to reality: Elevating our customer support through generative AI (Vimeo, 2023)
- A deep dive into the world’s smartest email AI (Shortwave, 2023)
- LLM-powered data classification for data entities at scale (Grab, 2023)
- From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey (Uber, 2024)