Python: Deleting and Merging PDF Pages with the PyPDF2 Library

Technology 2022-07-14


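Overview

PyPDF2 is a third-party Python library for working with PDF files. It supports operations such as deleting, merging, cropping, and transforming pages. Its four main classes are:

The PdfFileReader Class
The PdfFileMerger Class
The PageObject Class
The PdfFileWriter Class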

Installation

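Open a command line and enter:

pip install PyPDF2

The class and method names used in this article (PdfFileReader, getPage(), and so on) belong to the older PyPDF2 API. PyPDF2 3.0 removed them in favor of PdfReader, PdfWriter, and PdfMerger, so the examples here need a release below 3.0, for example pip install "PyPDF2<3".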

I. The PdfFileReader Class

PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)

Parameters: stream – A File object or an object that supports the standard read and seek methods similar to a File object. Could also be a string representing a path to a PDF file.

strict (bool) – Determines whether user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to True.

warndest – Destination for logging warnings (defaults to sys.stderr).

overwriteWarnings (bool) – Determines whether to override Python’s warnings.py module with a custom implementation (defaults to True).

1. getNumPages()

Calculates the number of pages in this PDF file.

Returns: number of pages

Return type: int

Raises PdfReadError: if file is encrypted and restrictions prevent this action.

2. getPage(pageNumber)

Retrieves a page by number from this PDF file.

Parameters: pageNumber (int) – The page number to retrieve (pages begin at zero)

Returns: a PageObject instance.

Return type: PageObject
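The two methods above are usually combined in a loop. The following is a minimal sketch, assuming a PyPDF2 release below 3.0; the helper names iter_pages and describe_pdf are made up for illustration and are not part of PyPDF2.

```python
def iter_pages(reader):
    """Yield (zero_based_index, page) for every page of a PdfFileReader-like object."""
    for index in range(reader.getNumPages()):
        yield index, reader.getPage(index)  # page numbers begin at zero


def describe_pdf(path):
    """Print the index and type of every page in the PDF at path."""
    # Imported here so that iter_pages can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader  # older (pre-3.0) API, as used in this article

    # Keep the file open while pages are being read from it.
    with open(path, "rb") as stream:
        reader = PdfFileReader(stream, strict=False)
        for index, page in iter_pages(reader):
            print(index, type(page).__name__)
```

Calling describe_pdf("Test1.pdf") (a hypothetical file name) should print one line per page, each naming a PageObject.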

II. The PdfFileWriter Class

class PyPDF2.PdfFileWriter

This class supports writing PDF files out, given pages produced by another class (typically PdfFileReader).

1. addPage(page)

Adds a page to this PDF file. The page is usually acquired from a PdfFileReader instance.

Parameters: page (PageObject) – The page to add to the document. Should be an instance of PageObject

2. write(stream)

Writes the collection of pages added to this object out as a PDF file.

Parameters: stream – An object to write the file to. The object must support the write method and the tell method, similar to a file object.
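As a rough illustration of addPage() and write() working together, the sketch below copies selected pages from a reader into a writer. It assumes a PyPDF2 release below 3.0; copy_pages and extract_first_page are hypothetical helper names, not library functions.

```python
def copy_pages(reader, writer, indexes):
    """Add the pages at the given zero-based indexes from reader to writer."""
    for index in indexes:
        writer.addPage(reader.getPage(index))
    return writer


def extract_first_page(src_path, dst_path):
    """Write a new PDF that contains only the first page of src_path."""
    # Imported here so that copy_pages can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # older (pre-3.0) API

    # The source stays open until write() finishes, because the writer
    # still reads page data from it at that point.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        writer = copy_pages(PdfFileReader(src), PdfFileWriter(), [0])
        writer.write(dst)  # dst supports write() and tell()
```

Passing the indexes in a different order, for example [2, 0], would reorder the pages in the output.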

III. The PdfFileMerger Class

Initializes a PdfFileMerger object. PdfFileMerger merges multiple PDFs into a single PDF. It can concatenate, slice, insert, or any combination of the above.

See the functions merge() (or append()) and write() for usage information.

Parameters: strict (bool) – Determines whether user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to True.

Method 1: append()

append(fileobj, bookmark=None, pages=None, import_bookmarks=True)

Identical to the merge() method, but assumes you want to concatenate all pages onto the end of the file instead of specifying a position.

Parameters: fileobj – A File Object (for example, one created with Python's open()) or an object that supports the standard read and seek methods similar to a File Object. Could also be a string representing a path to a PDF file.

bookmark (str) – Optionally, you may specify a bookmark to be applied at the beginning of the included file by supplying the text of the bookmark.

pages – can be a Page Range or a (start, stop[, step]) tuple to merge only the specified range of pages from the source document into the output document.

import_bookmarks (bool) – You may prevent the source document’s bookmarks from being imported by specifying this as False.
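A short sketch of append() in use, assuming a PyPDF2 release below 3.0. The helpers append_all and concatenate_files, and the file names in the usage note, are hypothetical.

```python
def append_all(merger, sources):
    """Append each source to merger in order.

    sources is an iterable of (fileobj_or_path, pages) pairs, where pages is
    None for all pages or a (start, stop[, step]) tuple.
    """
    for fileobj, pages in sources:
        merger.append(fileobj=fileobj, pages=pages)
    return merger


def concatenate_files(out_path, sources):
    """Concatenate the given sources into a single PDF at out_path."""
    # Imported here so that append_all can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileMerger  # older (pre-3.0) API

    merger = append_all(PdfFileMerger(), sources)
    with open(out_path, "wb") as out:
        merger.write(out)
    merger.close()  # releases any files the merger opened from path strings
```

For instance, concatenate_files("combined.pdf", [("a.pdf", None), ("b.pdf", (0, 2))]) would append all of a.pdf followed by the first two pages of b.pdf.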

Method 2: merge()

merge(position, fileobj, bookmark=None, pages=None, import_bookmarks=True)

Merges the pages from the given file into the output file at the specified page number.

Parameters: position (int) – The page number to insert this file. File will be inserted after the given number. Positions count the boundaries between pages, with □ standing for a page: 0 □ 1 □ 2 □ 3.

fileobj – A File Object or an object that supports the standard read and seek methods similar to a File Object. Could also be a string representing a path to a PDF file.

bookmark (str) – Optionally, you may specify a bookmark to be applied at the beginning of the included file by supplying the text of the bookmark.

pages – can be a Page Range or a (start, stop[, step]) tuple to merge only the specified range of pages from the source document into the output document.

import_bookmarks (bool) – You may prevent the source document’s bookmarks from being imported by specifying this as False.

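Note: position and pages both refer to the boundary numbers between pages (marked in green in the original figure): 0 is before the first page, 1 is after it, and so on. A pages range covers the pages enclosed between its two boundary numbers, so pages=(1, 3) selects the second and third pages.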
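The boundary numbering behaves like Python slice indexes. The sketch below models it with plain lists instead of real PDFs. simulate_merge is a hypothetical helper, handles only tuple ranges (not Page Range objects), and is meant as an aid to intuition, not as PyPDF2's actual implementation.

```python
def simulate_merge(output_pages, position, source_pages, pages=None):
    """Model merge(position, fileobj, pages=...) on lists of page labels."""
    if pages is None:
        selected = list(source_pages)  # take every page
    else:
        # (start, stop[, step]) selects the same indexes as range() does
        selected = [source_pages[i] for i in range(*pages)]
    result = list(output_pages)
    result[position:position] = selected  # insert at the boundary
    return result


# Equivalent of append(): insert everything at the end of an empty output.
out = simulate_merge([], 0, ["A1", "A2", "A3"])
# Insert pages between boundaries 1 and 3 of the second file at boundary 0.
out = simulate_merge(out, 0, ["B1", "B2", "B3", "B4"], pages=(1, 3))
print(out)  # ['B2', 'B3', 'A1', 'A2', 'A3']
```

With position=0 the selected pages land before the first page of the existing output. With position=len(output_pages) they land at the end, which is what append() does.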

Method 3: write()

write(fileobj)

Writes all data that has been merged to the given output file.

Parameters: fileobj – Output file. Can be a filename or any kind of file-like object.

Example 1: Deleting pages

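    # PDF_delete.py
    from PyPDF2 import PdfFileReader, PdfFileWriter


    def PDF_delete(index):
        """Copy Test1.pdf without the 1-based page numbers listed in index."""
        output = PdfFileWriter()  # collects the pages to keep
        # The source file must stay open until output.write() has finished.
        with open("C:/Users/Yuanzheng/Desktop/Test1.pdf", "rb") as source:
            input1 = PdfFileReader(source)  # read the local PDF file
            pages = input1.getNumPages()  # number of pages in the document
            for i in range(pages):
                if i + 1 in index:
                    continue  # skip the pages to delete
                output.addPage(input1.getPage(i))  # add page i to the output
            with open("C:/Users/Yuanzheng/Desktop/Test-Output1.pdf", "wb") as outputStream:
                output.write(outputStream)  # save the edited document locally


    PDF_delete([2])  # delete the second page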
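The page-selection logic of the delete example can be checked on its own, without a PDF file. The sketch below isolates it in a hypothetical helper, pages_to_keep. Unlike the example above, which silently ignores page numbers that do not exist, this variant raises an error for them; that is a design choice of the sketch, not something the original code does.

```python
def pages_to_keep(total_pages, delete_numbers):
    """Return the zero-based indexes that remain after deleting 1-based page numbers."""
    doomed = set(delete_numbers)
    unknown = sorted(n for n in doomed if not 1 <= n <= total_pages)
    if unknown:
        raise ValueError("page numbers out of range: %s" % unknown)
    return [i for i in range(total_pages) if i + 1 not in doomed]


print(pages_to_keep(5, [2]))  # [0, 2, 3, 4]
```

The returned indexes can be passed straight to getPage(), since getPage() counts pages from zero.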

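Example 2: Merging

    # PDF_merger.py
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    # Keep both inputs open until the merged result has been written.
    with open("C:/Users/Yuanzheng/Desktop/Test1.pdf", "rb") as input1, \
            open("C:/Users/Yuanzheng/Desktop/Test2.pdf", "rb") as input2, \
            open("C:/Users/Yuanzheng/Desktop/PyPDF-Output2.pdf", "wb") as output:
        merger.append(fileobj=input1)  # all pages of Test1.pdf
        # Insert the second and third pages of Test2.pdf before the first page.
        merger.merge(position=0, fileobj=input2, pages=(1, 3))
        merger.write(output)
    merger.close()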

Reference: https://pythonhosted.org/PyPDF2/PageObject.html
