<div id="top"></div> <div align="center">


<p align="center"> 👋 join us on <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://github.com/InternLM/InternLM/assets/25839884/a6aad896-7232-4220-ac84-9e070c2633ce" target="_blank">WeChat</a> </p>

English | 简体中文

</div> <div align="center"> </div>

Install

Prerequisites: Python 3.10

Install Dependencies

Linux/macOS

apt-get/yum/brew install libreoffice

Windows

Install LibreOffice.
Append "install_dir\LibreOffice\program" to the PATH environment variable.
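To confirm that LibreOffice is reachable before installing Magic-Doc, a quick PATH check can help. The sketch below is not part of Magic-Doc; `find_libreoffice` is a hypothetical helper, and it assumes the launcher is named `soffice` (the executable in the LibreOffice `program` directory) or `libreoffice` (a wrapper that many Linux packages install). Which name Magic-Doc itself looks up is not stated here.

```python
import shutil
from typing import Optional


def find_libreoffice() -> Optional[str]:
    """Return the path of a LibreOffice launcher found on PATH, or None."""
    # Assumption: one of these two launcher names is present after installation.
    for name in ("soffice", "libreoffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    print(location or "LibreOffice was not found on PATH")
```

If the script reports that nothing was found on Windows, the `program` directory has probably not been added to PATH yet, or the terminal was opened before the change.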

Install Magic-Doc

pip install fairy-doc[cpu] # cpu version
or
pip install fairy-doc[gpu] # gpu version

Introduction

Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.

Example

from magic_doc.docconv import DocConverter, S3Config

# Local file: no S3 configuration is needed
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)

# Remote file stored in AWS S3: replace ${ak}, ${sk}, and ${endpoint}
# with your access key, secret key, and S3 endpoint
s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
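The example above converts one file at a time. For a folder of documents, a small wrapper along the following lines may be convenient. This is a sketch and not part of Magic-Doc: `convert_directory` and `SUPPORTED_SUFFIXES` are hypothetical names, and the conversion function is passed in as a parameter, so the only thing assumed about `DocConverter.convert` is the `(markdown_content, time_cost)` return value shown above.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

# File types listed in the introduction.
SUPPORTED_SUFFIXES = {".ppt", ".pptx", ".doc", ".docx", ".pdf"}


def convert_directory(
    src_dir: str,
    out_dir: str,
    convert: Callable[[str], Tuple[str, float]],
) -> Dict[str, str]:
    """Convert every supported file under src_dir and write Markdown to out_dir.

    Returns a mapping from source path to "ok" or "error: <message>".
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results: Dict[str, str] = {}
    for path in sorted(Path(src_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        try:
            markdown, _seconds = convert(str(path))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            results[str(path)] = f"error: {exc}"
            continue
        # Keep the original extension in the name so a.pdf and a.docx do not collide.
        (out / (path.name + ".md")).write_text(markdown, encoding="utf-8")
        results[str(path)] = "ok"
    return results


# Usage with Magic-Doc (requires fairy-doc to be installed):
#   from magic_doc.docconv import DocConverter
#   converter = DocConverter(s3_config=None)
#   convert_directory("docs", "out", lambda p: converter.convert(p, conv_timeout=300))
```

Passing the converter in as a callable keeps the wrapper independent of the library, which also makes it easy to try out with a stub function before running a long batch.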

Performance

ENV: AMD EPYC 7742 64-Core Processor, NVIDIA A100, Centos 7

| File Type | Speed (pages/s) |
| --- | --- |
| PDF (digital) | 347 |
| PDF (ocr) | 2.7 |
| PPT | 20 |
| PPTX | 149 |
| DOC | 600 |
| DOCX | 1482 |
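These figures can be used for a rough estimate of how long a batch will take. The sketch below simply divides page counts by the throughput numbers listed above; `PAGES_PER_SECOND` and `estimate_seconds` are hypothetical names, and the result only holds for hardware comparable to the benchmark environment.

```python
from typing import Dict

# Throughput in pages per second, copied from the performance figures above.
PAGES_PER_SECOND: Dict[str, float] = {
    "pdf_digital": 347.0,
    "pdf_ocr": 2.7,
    "ppt": 20.0,
    "pptx": 149.0,
    "doc": 600.0,
    "docx": 1482.0,
}


def estimate_seconds(page_counts: Dict[str, int]) -> float:
    """Estimate total conversion time in seconds for the given page counts."""
    return sum(pages / PAGES_PER_SECOND[kind] for kind, pages in page_counts.items())


if __name__ == "__main__":
    # Example workload: 10,000 scanned PDF pages plus 50,000 DOCX pages.
    workload = {"pdf_ocr": 10_000, "docx": 50_000}
    print(f"Estimated time: {estimate_seconds(workload) / 60:.1f} minutes")
```

OCR dominates mixed workloads: at 2.7 pages/s, 10,000 scanned pages take roughly an hour, while the same number of DOCX pages takes a few seconds.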

All Thanks To Our Contributors:


Acknowledgments

🖊️ Citation

@misc{2024magic-doc,
    title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
    author={Magic-Doc Contributors},
    howpublished = {\url{https://github.com/InternLM/magic-doc}},
    year={2024}
}

License

This project is released under the Apache 2.0 license.

<p align="right"><a href="#top">🔼 Back to top</a></p>