ExtractOfficeContent: 提取Office文件中文本、表格和图像网站首页 其他

ExtractOfficeContent: 提取Office文件中文本、表格和图像

Liekkas Kono 2024-10-03 00:01:02

简介ExtractOfficeContent: 提取Office文件中文本、表格和图像

引言

最近有空写了一下这个库，用来提取Office文件中的文本和图像内容，用作后续整理训练语料使用。
最新更新请移步：Github

Extract Office Content

Use

Installextract_office_content
```
$ pip install extract_office_content
```

Run by CLI.

Extract All office file’s content.

$ extract_office_content -h
usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path

positional arguments:
file_path

optional arguments:
-h, --help            show this help message and exit
-img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_office_content tests/test_files

Extract Word.

$ extract_word -h
usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path

positional arguments:
word_path

optional arguments:
-h, --help            show this help message and exit
-img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_word tests/test_files/word_example.docx

Extract PPT.

$ extract_ppt -h
usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path

positional arguments:
ppt_path

optional arguments:
-h, --help            show this help message and exit
-img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_ppt tests/test_files/ppt_example.pptx

Extract Excel.

$ extract_excel -h
usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
                    excel_path

positional arguments:
excel_path

optional arguments:
-h, --help            show this help message and exit
-f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
-o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

$ extract_excel tests/test_files/excel_example.xlsx

Run by python script.

Extract all.

from pathlib import Path

from extract_office_content import ExtractOfficeContent


extracter = ExtractOfficeContent()


file_list = list(Path('tests/test_files').iterdir())

for file_path in file_list:
    res = extracter(file_path)
    print(res)

Extract Word.

from extract_office_content import ExtractWord


word_extract = ExtractWord()

word_path = 'tests/test_files/word_example.docx'
text = word_extract(word_path, "outputs/word")
print(text)

Extract PPT.

from pathlib import Path

from extract_office_content import ExtractPPT

ppt_extracter = ExtractPPT()

ppt_path = 'tests/test_files/ppt_example.pptx'
save_dir = 'outputs'
save_img_dir = Path(save_dir) / Path(ppt_path).stem
res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))
print(res)

Extract Excel.

from extract_office_content import ExtractExcel

excel_extract = ExtractExcel()

excel_path = 'tests/test_files/excel_with_image.xlsx'
res  = excel_extract(excel_path, out_format='markdown', save_img_dir='1')
print(res)