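MarkItDown is a Python tool developed by Microsoft that converts a variety of file and office document formats to Markdown. It supports PDF, PowerPoint, Word, Excel, and other file types, and can use large language models to describe images.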
The MarkItDown library is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.).
It presently supports:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (iterates over contents and converts each file)
Installation
You can install markitdown using pip:
pip install markitdown
or, from a checkout of the source:
pip install -e .
Usage
The API is simple:
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
print(result.text_content)
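The same two calls extend naturally to a whole folder. The following is a minimal sketch, not part of the library: `convert_tree`, its `suffixes` default, and the `<name>.md` output naming are hypothetical choices made for illustration. It relies only on the `convert(...)` call and the `text_content` attribute shown above, and takes the converter as an argument so it can be tried without MarkItDown installed.

```python
from pathlib import Path


def convert_tree(converter, src_dir, out_dir,
                 suffixes=(".pdf", ".docx", ".pptx", ".xlsx")):
    """Convert matching files under src_dir, writing one .md file per input.

    `converter` is any object with a convert(path) method whose result has a
    text_content attribute, for example a MarkItDown instance.
    Returns the list of files written, in sorted input order.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    written = []
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        result = converter.convert(str(path))
        # Keep the relative folder layout and the original extension in the
        # name (report.pdf -> report.pdf.md) so outputs cannot collide.
        relative = path.relative_to(src_dir)
        target = out_dir / relative.with_name(relative.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written


# Usage (requires `pip install markitdown`; folder names are placeholders):
#     from markitdown import MarkItDown
#     convert_tree(MarkItDown(), "docs", "markdown_out")
```

Passing the converter in, rather than constructing it inside the helper, also makes it easy to reuse one configured MarkItDown object across many files.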
To use this as a command-line utility, install it and then run it like this:
markitdown path-to-file.pdf
This will output Markdown to standard output. You can save it like this:
markitdown path-to-file.pdf > document.md
You can pipe content to standard input by omitting the argument:
cat path-to-file.pdf | markitdown
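The pipe form can also be driven from a script. This is a rough sketch under stated assumptions: `markdown_from_stdin` is a hypothetical helper, the `markitdown` command is assumed to be on `PATH`, and its output is assumed to be UTF-8. It does the same thing as the `cat ... | markitdown` line above, feeding the file to standard input and collecting standard output.

```python
import subprocess


def markdown_from_stdin(path, command=("markitdown",)):
    """Pipe the file at `path` into `command` and return what it prints.

    `command` defaults to the markitdown CLI with no arguments, which reads
    the document from standard input.
    """
    with open(path, "rb") as source:
        completed = subprocess.run(
            list(command),
            stdin=source,          # equivalent to `cat path | markitdown`
            capture_output=True,
            check=True,            # raise CalledProcessError on a non-zero exit
        )
    # Assumption: the tool writes UTF-8 text to standard output.
    return completed.stdout.decode("utf-8")


# Usage (requires the markitdown CLI to be installed):
#     text = markdown_from_stdin("path-to-file.pdf")
```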
You can also configure markitdown to use Large Language Models to describe images. To do so, you must provide the llm_client and llm_model parameters to the MarkItDown object, according to your specific client.
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
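In practice it can be convenient to enable image descriptions only when credentials are available. The sketch below is one possible way to do that, not something the library provides: `markitdown_kwargs` is a hypothetical helper, and the check on the `OPENAI_API_KEY` environment variable assumes the OpenAI client is the one in use. Only the `llm_client` and `llm_model` parameter names come from the example above.

```python
import os


def markitdown_kwargs(client_factory, model="gpt-4o", env=None):
    """Build keyword arguments for MarkItDown.

    Returns an empty dict when no API key is set, so MarkItDown() is created
    without LLM support; otherwise returns llm_client and llm_model.
    `client_factory` is called with no arguments to create the client.
    """
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        return {}  # no key: plain conversion, no image descriptions
    return {"llm_client": client_factory(), "llm_model": model}


# Usage (requires `pip install markitdown openai`):
#     from markitdown import MarkItDown
#     from openai import OpenAI
#     md = MarkItDown(**markitdown_kwargs(OpenAI))
#     print(md.convert("example.jpg").text_content)
```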
You can also use the project as a Docker image:
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Running Tests
To run tests, install hatch using pip or another method described in the hatch documentation.
pip install hatch
hatch shell
hatch test
Running Pre-commit Checks
Please run the pre-commit checks before submitting a PR.
pre-commit run --all-files
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.