Documentation for fitz.open

Latest recommended article published on 2024-05-04 23:30:16

morninglop

Tags: python, programming language

About the fitz package
Once PyMuPDF is installed, fitz can be imported directly.

FILE
d:\anaconda\lib\site-packages\fitz\__init__.py

Help on Document in module fitz.fitz object:

class Document(builtins.object)

| Document(filename=None, stream=None, filetype=None, rect=None, width=0, height=0, fontsize=11)
|
| Methods defined here:
|
| __contains__(self, loc) -> bool
|
| __del__(self)
|
| __delitem__(self, i: Any) -> None
|
| __enter__(self)
|
| __exit__(self, *args)
|
| __getitem__(self, i: int = 0) -> 'Page'
|
| __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0, height=0, fontsize=11)
| Creates a document. Use ‘open’ as a synonym.
|
| Notes:
| Basic usages:
| open() - new PDF document (called without arguments, it creates an empty PDF).
| open(filename) - string, pathlib.Path, or file object.
| open(filename, filetype=type) - overwrite filename extension.
| open(type, buffer) - type: extension, buffer: bytes object.
| open(stream=buffer, filetype=type) - keyword version of previous.
| Parameters rect, width, height, fontsize: layout reflowable
| document on open (e.g. EPUB). Ignored if n/a.
|
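The usages listed in the Notes can be tried without any file on disk. The following is a minimal sketch, assuming PyMuPDF is installed; the helper name `open_variants` is made up for illustration, and the import guard only lets the snippet be read without PyMuPDF.

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def open_variants():
    """Exercise two open() forms; returns (len, page_count, is_pdf)."""
    new_doc = fitz.open()        # no arguments: a new, empty PDF
    new_doc.new_page()           # a PDF needs at least one page to be written
    buffer = new_doc.tobytes()   # serialize to bytes, no file involved
    new_doc.close()

    # keyword form from the Notes: open(stream=buffer, filetype=type)
    with fitz.open(stream=buffer, filetype="pdf") as from_memory:
        return len(from_memory), from_memory.page_count, from_memory.is_pdf


if fitz is not None:
    print(open_variants())
```

The `with` statement works because Document defines `__enter__` and `__exit__`, so the document is closed when the block ends.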
| __len__(self) -> int
| Returns the number of pages.

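import fitz  # provided by the PyMuPDF package

pdf_document = r"E:\newJupterGit\trans\input_file\PracticalExplainable.pdf"
doc = fitz.open(pdf_document)
len(doc)  # 356 for this file; equivalent to doc.__len__() and doc.page_count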

| __repr__(self) -> str
| Return repr(self).

pdf_document = r"E:\newJupterGit\trans\input_file\PracticalExplainable.pdf"
doc = fitz.open(pdf_document)
repr(doc)  # equivalent to doc.__repr__()
#"Document('E:\\newJupterGit\\trans\\input_file\\PracticalExplainable.pdf')"

| embfile_add(self, name: str, buffer: ByteString, filename: Optional[str] = None, ufilename: Optional[str] = None, desc: Optional[str] = None) -> None
| Add an item to the EmbeddedFiles array.
| Args:
| name: name of the new item, must not already exist.
| buffer: (binary data) the file content.
| filename: (str) the file name, default: the name
| ufilename: (unicode) the file name, default: filename
| desc: (str) the description.
|
| embfile_count(self) -> int
| Get number of EmbeddedFiles.
|
| embfile_del(self, item: Union[int, str])
| Delete an entry from EmbeddedFiles.
|
| Notes:
| The argument must be name or index of an EmbeddedFiles item.
| Physical deletion of data will happen on save to a new
| file with appropriate garbage option.
| Args:
| item: name or number of item.
| Returns:
| None
|
| embfile_get(self, item: Union[int, str]) -> bytes
| Get the content of an item in the EmbeddedFiles array.
|
| Args:
| item: number or name of item.
| Returns:
| (bytes) The file content.
|
| embfile_info(self, item: Union[int, str]) -> dict
| Get information of an item in the EmbeddedFiles array.
|
| Args:
| item: number or name of item.
| Returns:
| Information dictionary.
|
| embfile_names(self) -> list
| Get list of names of EmbeddedFiles.
|
| embfile_upd(self, item: Union[int, str], buffer: Optional[ByteString] = None, filename: Optional[str] = None, ufilename: Optional[str] = None, desc: Optional[str] = None) -> None
| Change an item of the EmbeddedFiles array.
|
| Notes:
| Only provided parameters are changed. If all are omitted,
| the method is a no-op.
| Args:
| item: number or name of item.
| buffer: (binary data) the new file content.
| filename: (str) the new file name.
| ufilename: (unicode) the new file name.
| desc: (str) the new description.
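The embfile_* methods above form a small add / query / update / delete cycle. A minimal sketch of that cycle follows, assuming PyMuPDF is installed; the attachment name `notes.txt` and the helper `embedded_file_roundtrip` are hypothetical. The buffer is passed positionally because the keyword name is an assumption that may differ between PyMuPDF versions.

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def embedded_file_roundtrip():
    """Add, inspect, update and delete one embedded file in a new PDF."""
    doc = fitz.open()
    doc.new_page()

    # name must not exist yet; the content is plain bytes
    doc.embfile_add("notes.txt", b"first version",
                    filename="notes.txt", desc="demo attachment")
    names = doc.embfile_names()
    count = doc.embfile_count()

    # only the provided parameters change; here just the content
    doc.embfile_upd("notes.txt", b"second version")
    content = doc.embfile_get("notes.txt")

    doc.embfile_del("notes.txt")  # item may be given by name or by index
    remaining = doc.embfile_count()
    doc.close()
    return names, count, content, remaining


if fitz is not None:
    print(embedded_file_roundtrip())
```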
|
| extract_font(self, xref=0, info_only=0, named=None)
| Get a font by xref. Returns a tuple or dictionary.
|
| extract_image(self, xref)
| Get image by xref. Returns a dictionary.
|
| ez_save(self, filename, garbage=3, clean=False, deflate=True, deflate_images=True, deflate_fonts=True, incremental=False, ascii=False, expand=False, linear=False, pretty=False, encryption=1, permissions=4095, owner_pw=None, user_pw=None, no_new_id=True)
| Save PDF using some different defaults
|
| find_bookmark(self, bm)
| Find new location after layouting a document.
|
| fullcopy_page(self, pno, to=-1)
| Make a full page duplicate.
|
| get_char_widths(doc: fitz.fitz.Document, xref: int, limit: int = 256, idx: int = 0, fontdict: Optional[dict] = None) -> list
| Get list of glyph information of a font.
|
| Notes:
| Must be provided by its XREF number. If we already dealt with the
| font, it will be recorded in doc.FontInfos. Otherwise we insert an
| entry there.
| Finally we return the glyphs for the font. This is a list of
| (glyph, width) where glyph is an integer controlling the char
| appearance, and width is a float controlling the char’s spacing:
| width * fontsize is the actual space.
| For ‘simple’ fonts, glyph == ord(char) will usually be true.
| Exceptions are ‘Symbol’ and ‘ZapfDingbats’. We are providing data for these directly here.
|
| get_layer(self, config=-1)
| Content of ON, OFF, RBGroups of an OC layer.
|
| get_layers(self)
| Show optional OC layers.
|
| get_new_xref(self)
| Make a new xref.
|
| get_oc(doc: fitz.fitz.Document, xref: int) -> int
| Return optional content object xref for an image or form xobject.
|
| Args:
| xref: (int) xref number of an image or form xobject.
|
| get_ocgs(self)
| Show existing optional content groups.
|
| get_ocmd(doc: fitz.fitz.Document, xref: int) -> dict
| Return the definition of an OCMD (optional content membership dictionary).
|
| Recognizes PDF dict keys /OCGs (PDF array of OCGs), /P (policy string) and
| /VE (visibility expression, PDF array). Via string manipulation, this
| info is converted to a Python dictionary with keys “xref”, “ocgs”, “policy”
| and “ve” - ready to recycle as input for ‘set_ocmd()’.
|
| get_outline_xrefs(self)
| Get list of outline xref numbers.
|
| get_page_fonts(self, pno: int, full: bool = False) -> list
| Retrieve a list of fonts used on a page.
|
| get_page_images(self, pno: int, full: bool = False) -> list
| Retrieve a list of images used on a page.
|
| get_page_labels(doc)
| Return page label definitions in PDF document.
|
| Args:
| doc: PDF document (resp. ‘self’).
| Returns:
| A list of dictionaries with the following format:
| {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int}.
|
| get_page_numbers(doc, label, only_one=False)
| Return a list of page numbers with the given label.
|
| Args:
| doc: PDF document object (resp. ‘self’).
| label: (str) label.
| only_one: (bool) stop searching after first hit.
| Returns:
| List of page numbers having this label.
|
| get_page_pixmap(doc: fitz.fitz.Document, pno: int, *, matrix: 'matrix_like' = IdentityMatrix(1.0, 0.0, 0.0, 1.0, 0.0, 0.0), dpi=None, colorspace: fitz.fitz.Colorspace = Colorspace(CS_RGB) - DeviceRGB, clip: 'rect_like' = None, alpha: bool = False, annots: bool = True) -> fitz.fitz.Pixmap
| Create pixmap of document page by page number.
|
| Notes:
| Convenience function calling page.get_pixmap.
| Args:
| pno: (int) page number
| matrix: Matrix for transformation (default: Identity).
| colorspace: (str,Colorspace) rgb, cmyk, gray - case ignored, default csRGB.
| clip: (irect-like) restrict rendering to this area.
| alpha: (bool) include alpha channel
| annots: (bool) also render annotations
|
| get_page_text(doc: fitz.fitz.Document, pno: int, option: str = 'text', clip: 'rect_like' = None, flags: Optional[int] = None, textpage: fitz.fitz.TextPage = None, sort: bool = False) -> Any
| Extract a document page’s text by page number.
|
| Notes:
| Convenience function calling page.get_text().
| Args:
| pno: page number
| option: (str) text, words, blocks, html, dict, json, rawdict, xhtml or xml.
| Returns:
| output from page.TextPage().
|
| get_page_xobjects(self, pno: int) -> list
| Retrieve a list of XObjects used on a page.
|
| get_sigflags(self)
| Get the /SigFlags value.
|
| get_toc(doc: fitz.fitz.Document, simple: bool = True) -> list
| Create a table of contents.
|
| Args:
| simple: a bool to control output. Returns a list, where each entry consists of outline level, title, page number and link destination (if simple = False). For details see PyMuPDF’s documentation.
|
| get_xml_metadata(self)
| Get document XML metadata.
|
| has_annots(doc: fitz.fitz.Document) -> bool
| Check whether there are annotations on any page.
|
| has_links(doc: fitz.fitz.Document) -> bool
| Check whether there are links on any page.
|
| init_doc(self)
|
| insert_page(doc: fitz.fitz.Document, pno: int, text: Union[str, list, NoneType] = None, fontsize: float = 11, width: float = 595, height: float = 842, fontname: str = 'helv', fontfile: Optional[str] = None, color: Optional[Sequence] = (0,)) -> int
| Create a new PDF page and insert some text.
|
| Notes:
| Function combining Document.new_page() and Page.insert_text().
| For parameter details see these methods.
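insert_page pairs naturally with get_page_text documented earlier: one writes text onto a new page, the other reads it back. A minimal sketch, assuming PyMuPDF is installed (the helper name `write_and_read_text` is made up here):

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def write_and_read_text():
    """Create a page with one line of text and extract it again."""
    doc = fitz.open()
    doc.insert_page(-1, text="Hello from PyMuPDF", fontsize=11)  # -1: append

    text = doc.get_page_text(0)              # default option "text"
    words = doc.get_page_text(0, "words")    # one tuple per word
    doc.close()
    # in the "words" output, item 4 of each tuple is the word string
    return text, [w[4] for w in words]


if fitz is not None:
    print(write_and_read_text())
```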
|
| insert_pdf(self, docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=1, annots=1, show_progress=0, final=1, _gmap=None)
| Insert a page range from another PDF.
|
| Args:
| docsrc: PDF to copy from. Must be different object, but may be same file.
| from_page: (int) first source page to copy, 0-based, default 0.
| to_page: (int) last source page to copy, 0-based, default last page.
| start_at: (int) from_page will become this page number in target.
| rotate: (int) rotate copied pages, default -1 is no change.
| links: (int/bool) whether to also copy links.
| annots: (int/bool) whether to also copy annotations.
| show_progress: (int) progress message interval, 0 is no messages.
| final: (bool) indicates last insertion from this source PDF.
| _gmap: internal use only
|
| Copy sequence reversed if from_page > to_page.
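The page-range arguments of insert_pdf are easiest to see on two small in-memory documents. The following sketch assumes PyMuPDF is installed; `merge_pages` and the page texts are illustrative only.

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def merge_pages():
    """Copy the first two pages of a source PDF to the end of a target PDF."""
    src = fitz.open()
    for i in range(3):
        src.insert_page(-1, text=f"source page {i}")

    dst = fitz.open()
    dst.insert_page(-1, text="target page")

    # from_page and to_page are 0-based and inclusive;
    # start_at is left at -1, so the copies are appended
    dst.insert_pdf(src, from_page=0, to_page=1)

    texts = [dst.get_page_text(i).strip() for i in range(len(dst))]
    src.close()
    dst.close()
    return texts


if fitz is not None:
    print(merge_pages())
```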
|
| journal_can_do(self)
| Show if undo and / or redo are possible.
|
| journal_enable(self)
| Activate document journalling.
|
| journal_is_enabled(self)
| Check if journalling is enabled.
|
| journal_load(self, filename)
| Load a journal from a file.
|
| journal_op_name(self, step)
| Show operation name for given step.
|
| journal_position(self)
| Show journalling state.
|
| journal_redo(self)
| Move forward in the journal.
|
| journal_save(self, filename)
| Save journal to a file.
|
| journal_start_op(self, name=None)
| Begin a journalling operation.
|
| journal_stop_op(self)
| End a journalling operation.
|
| journal_undo(self)
| Move backwards in the journal.
|
| layer_ui_configs(self)
| Show OC visibility status modifiable by user.
|
| layout(self, rect=None, width=0, height=0, fontsize=11)
| Re-layout a reflowable document.
|
| load_page(self, page_id)
| Load a page.
|
| ‘page_id’ is either a 0-based page number or a tuple (chapter, pno),
| with chapter number and page number within that chapter.
|
| location_from_page_number(self, pno)
| Convert pno to (chapter, page).
|
| make_bookmark(self, loc)
| Make a page pointer before layouting document.
|
| move_page(self, pno: int, to: int = -1)
| Move a page within a PDF document.
|
| Args:
| pno: source page number.
| to: put before this page, ‘-1’ means after last page.
|
| need_appearances(self, value=None)
| Get/set the NeedAppearances value.
|
| new_page(doc: fitz.fitz.Document, pno: int = -1, width: float = 595, height: float = 842) -> fitz.fitz.Page
| Create and return a new page object.
|
| Args:
| pno: (int) insert before this page. Default: after last page.
| width: (float) page width in points. Default: 595 (ISO A4 width).
| height: (float) page height in points. Default 842 (ISO A4 height).
| Returns:
| A Page object.
|
| next_location(self, page_id)
| Get (chapter, page) of next page.
|
| page_annot_xrefs(self, pno)
| Get list annotations of page number.
|
| page_cropbox(self, pno)
| Get CropBox of page number (without loading page).
|
| page_number_from_location(self, page_id)
| Convert (chapter, pno) to page number.
|
| page_xref(self, pno)
| Get xref of page number.
|
| pages(self, start: Optional[int] = None, stop: Optional[int] = None, step: Optional[int] = None)
| Return a generator iterator over a page range.
|
| Arguments have the same meaning as for the range() built-in.
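Since pages() takes range()-style arguments, stepping through every other page looks like this. A minimal sketch, assuming PyMuPDF is installed; the helper name `every_other_page` is hypothetical.

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def every_other_page():
    """Iterate pages 0, 2, 4 of a five-page document."""
    doc = fitz.open()
    for _ in range(5):
        doc.new_page()

    # same semantics as range(0, 5, 2)
    numbers = [page.number for page in doc.pages(0, 5, 2)]
    last = doc[-1].number  # negative indices count from the end
    doc.close()
    return numbers, last


if fitz is not None:
    print(every_other_page())
```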
|
| pdf_catalog(self)
| Get xref of PDF catalog.
|
| pdf_trailer(self, compressed: bool = False, ascii: bool = False) -> str
| Get PDF trailer as a string.
|
| prev_location(self, page_id)
| Get (chapter, page) of previous page.
|
| reload_page(self, page: 'struct Page *') -> 'struct Page *'
| Make a fresh copy of a page.
|
| resolve_link(self, uri=None, chapters=0)
| Calculate internal link destination.
|
| Args:
| uri: (str) some Link.uri
| chapters: (bool) whether to use (chapter, page) format
| Returns:
| (page_id, x, y) where x, y are point coordinates on the page.
| page_id is either page number (if chapters=0), or (chapter, pno).
|
| save(self, filename, garbage=0, clean=0, deflate=0, deflate_images=0, deflate_fonts=0, incremental=0, ascii=0, expand=0, linear=0, no_new_id=0, appearance=0, pretty=0, encryption=1, permissions=4095, owner_pw=None, user_pw=None)
| Save PDF to file, pathlib.Path or file pointer.
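The many save() options default to "off"; a common combination is garbage collection plus stream compression. The sketch below writes to a temporary directory and reopens the result. It assumes PyMuPDF is installed, and the file name `out.pdf` and helper `save_and_reopen` are placeholders.

```python
import os
import tempfile

try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def save_and_reopen():
    """Save a one-page PDF with cleanup options, then read it back."""
    doc = fitz.open()
    doc.new_page()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.pdf")
        # garbage=3: drop unused objects and merge duplicates;
        # deflate=True: compress uncompressed streams
        doc.save(path, garbage=3, deflate=True)
        doc.close()
        with fitz.open(path) as reopened:
            return reopened.page_count


if fitz is not None:
    print(save_and_reopen())
```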
|
| saveIncr(self)
| Save PDF incrementally
|
| save_snapshot(self, filename)
| Save a file snapshot suitable for journalling.
|
| scrub(doc: fitz.fitz.Document, attached_files: bool = True, clean_pages: bool = True, embedded_files: bool = True, hidden_text: bool = True, javascript: bool = True, metadata: bool = True, redactions: bool = True, redact_images: int = 0, remove_links: bool = True, reset_fields: bool = True, reset_responses: bool = True, thumbnails: bool = True, xml_metadata: bool = True) -> None
| Remove potentially sensitive data from a PDF. Similar to the Adobe
| Acrobat 'sanitize' function.
|
| search_page_for(doc: fitz.fitz.Document, pno: int, text: str, quads: bool = False, clip: 'rect_like' = None, flags: int = 83, textpage: fitz.fitz.TextPage = None) -> list
| Search for a string on a page.
|
| Args:
| pno: page number
| text: string to be searched for
| clip: restrict search to this rectangle
| quads: (bool) return quads instead of rectangles
| flags: bit switches, default: join hyphened words
| textpage: reuse a prepared textpage
| Returns:
| a list of rectangles or quads, each containing an occurrence.
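A short sketch of search_page_for on a generated page, assuming PyMuPDF is installed (the helper name `count_hits` and the sample text are made up for illustration):

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def count_hits():
    """Search page 0 for a present and an absent string."""
    doc = fitz.open()
    doc.insert_page(-1, text="Hello from PyMuPDF")

    found = doc.search_page_for(0, "PyMuPDF")      # list of rectangles
    missing = doc.search_page_for(0, "not there")  # empty list when no hit
    doc.close()
    return len(found), len(missing)


if fitz is not None:
    print(count_hits())
```

Each returned rectangle surrounds one occurrence, so it can be passed on to highlighting or cropping code.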
|
| select(self, pyliste)
| Build sub-pdf with page numbers in the list.
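select() keeps only the listed pages, in the order given, so it can delete and reorder in one call. A minimal sketch, assuming PyMuPDF is installed; `keep_and_reorder` is a hypothetical helper.

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def keep_and_reorder():
    """Reduce a four-page PDF to pages 3 and 0, in that order."""
    doc = fitz.open()
    for i in range(4):
        doc.insert_page(-1, text=f"page {i}")

    doc.select([3, 0])  # 0-based page numbers; the document is changed in place

    texts = [doc.get_page_text(i).strip() for i in range(len(doc))]
    doc.close()
    return texts


if fitz is not None:
    print(keep_and_reorder())
```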
|
| set_language(self, language=None)
|
| set_layer(self, config, basestate=None, on=None, off=None, rbgroups=None)
| Set the PDF keys /ON, /OFF, /RBGroups of an OC layer.
|
| set_layer_ui_config(self, number, action=0)
| Set / unset OC intent configuration.
|
| set_metadata(doc: fitz.fitz.Document, m: dict) -> None
| Update the PDF /Info object.
|
| Args:
| m: a dictionary like doc.metadata.
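For set_metadata, the dictionary uses the same keys as doc.metadata. The sketch below sets two of them and reads them back; it assumes PyMuPDF is installed, and the values and the helper name `tag_document` are placeholders.

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def tag_document():
    """Write title and author into the /Info object and read them back."""
    doc = fitz.open()
    doc.new_page()
    doc.set_metadata({"title": "Demo", "author": "morninglop"})
    result = (doc.metadata["title"], doc.metadata["author"])
    doc.close()
    return result


if fitz is not None:
    print(tag_document())
```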
|
| set_oc(doc: fitz.fitz.Document, xref: int, oc: int) -> None
| Attach optional content object to image or form xobject.
|
| Args:
| xref: (int) xref number of an image or form xobject
| oc: (int) xref number of an OCG or OCMD
|
| set_ocmd(doc: fitz.fitz.Document, xref: int = 0, ocgs: Optional[list] = None, policy: Optional[str] = None, ve: Optional[list] = None) -> int
| Create or update an OCMD object in a PDF document.
|
| Args:
| xref: (int) 0 for creating a new object, otherwise update existing one.
| ocgs: (list) OCG xref numbers, which shall be subject to ‘policy’.
| policy: one of ‘AllOn’, ‘AllOff’, ‘AnyOn’, ‘AnyOff’ (any casing).
| ve: (list) visibility expression. Use instead of ‘ocgs’ with ‘policy’.
|
| Returns:
| Xref of the created or updated OCMD.
|
| set_page_labels(doc, labels)
| Add / replace page label definitions in PDF document.
|
| Args:
| doc: PDF document (resp. ‘self’).
| labels: list of label dictionaries like:
| {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int},
| as returned by get_page_labels().
|
| set_toc(doc: fitz.fitz.Document, toc: list, collapse: int = 1) -> int
| Create new outline tree (table of contents, TOC).
|
| Args:
| toc: (list, tuple) each entry must contain level, title, page and
| optionally top margin on the page. None or ‘()’ remove the TOC.
| collapse: (int) collapses entries beyond this level. Zero or None
| shows all entries unfolded.
| Returns:
| the number of inserted items, or the number of removed items respectively.
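The entry format for set_toc is [level, title, page] with 1-based page numbers, and get_toc returns the same shape. A minimal sketch under the assumption that PyMuPDF is installed; the titles and the helper name `build_toc` are illustrative.

```python
try:
    import fitz  # provided by the PyMuPDF package
except ImportError:
    fitz = None  # PyMuPDF missing: the sketch below is skipped


def build_toc():
    """Attach a three-entry outline to a three-page PDF and read it back."""
    doc = fitz.open()
    for _ in range(3):
        doc.new_page()

    # the first level must be 1 and levels may grow by at most one per step
    toc = [[1, "Intro", 1], [2, "Details", 2], [1, "End", 3]]
    inserted = doc.set_toc(toc)
    read_back = doc.get_toc()  # simple=True: level, title, page
    doc.close()
    return inserted, read_back


if fitz is not None:
    print(build_toc())
```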
|
| set_toc_item(doc: fitz.fitz.Document, idx: int, dest_dict: Optional[dict] = None, kind: Optional[int] = None, pno: Optional[int] = None, uri: Optional[str] = None, title: Optional[str] = None, to: 'point_like' = None, filename: Optional[str] = None, zoom: float = 0) -> None
| Update TOC item by index.
|
| It allows changing the item’s title and link destination.
|
| Args:
| idx: (int) desired index of the TOC list, as created by get_toc.
| dest_dict: (dict) destination dictionary as created by get_toc(False).
| Outrules all other parameters. If None, the remaining parameters
| are used to make a dest dictionary.
| kind: (int) kind of link (LINK_GOTO, etc.). If None, then only the
| title will be updated. If LINK_NONE, the TOC item will be deleted.
| pno: (int) page number (1-based like in get_toc). Required if LINK_GOTO.
| uri: (str) the URL, required if LINK_URI.
| title: (str) the new title. No change if None.
| to: (point-like) destination on the target page. If omitted, (72, 36)
| will be used as target coordinates.
| filename: (str) destination filename, required for LINK_GOTOR and
| LINK_LAUNCH.
| name: (str) a destination name for LINK_NAMED.
| zoom: (float) a zoom factor for the target location (LINK_GOTO).
|
| set_xml_metadata(self, metadata)
| Store XML document level metadata.
|
| subset_fonts(doc: fitz.fitz.Document, verbose: bool = False) -> None
| Build font subsets of a PDF. Requires package ‘fontTools’.
|
| Eligible fonts are potentially replaced by smaller versions. Page text is
| NOT rewritten and thus should retain properties like being hidden or
| controlled by optional content.
|
| switch_layer(self, config, as_default=0)
| Activate an OC layer.
|
| tobytes = write(self, garbage=False, clean=False, deflate=False, deflate_images=False, deflate_fonts=False, incremental=False, ascii=False, expand=False, linear=False, no_new_id=False, appearance=False, pretty=False, encryption=1, permissions=4095, owner_pw=None, user_pw=None)
|
| update_object(self, xref, text, page=None)
| Replace object definition source.
|
| update_stream(self, xref=0, stream=None, new=1, compress=1)
| Replace xref stream part.
|
| write(self, garbage=False, clean=False, deflate=False, deflate_images=False, deflate_fonts=False, incremental=False, ascii=False, expand=False, linear=False, no_new_id=False, appearance=False, pretty=False, encryption=1, permissions=4095, owner_pw=None, user_pw=None)
|
| xref_copy(doc: fitz.fitz.Document, source: int, target: int, *, keep: list = None) -> None
| Copy a PDF dictionary object to another one given their xref numbers.
|
| Args:
| doc: PDF document object
| source: source xref number
| target: target xref number, the xref must already exist
| keep: an optional list of 1st level keys in target that should not be
| removed before copying.
| Notes:
| This works similar to the copy() method of dictionaries in Python. The
| source may be a stream object.
|
| xref_get_key(self, xref, key)
| Get PDF dict key value of object at ‘xref’.
|
| xref_get_keys(self, xref)
| Get the keys of PDF dict object at ‘xref’. Use -1 for the PDF trailer.
|
| xref_is_font(self, xref)
| Check if xref is a font object.
|
| xref_is_image(self, xref)
| Check if xref is an image object.
|
| xref_is_stream(self, xref=0)
| Check if xref is a stream object.
|
| xref_is_xobject(self, xref)
| Check if xref is a form xobject.
|
| xref_length(self)
| Get length of xref table.
|
| xref_object(self, xref, compressed=0, ascii=0)
| Get xref object source as a string.
|
| xref_set_key(self, xref, key, value)
| Set the value of a PDF dictionary key.
|
| xref_stream(self, xref)
| Get decompressed xref stream.
|
| xref_stream_raw(self, xref)
| Get xref stream without decompression.
|
| xref_xml_metadata(self)
| Get xref of document XML metadata.
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __swig_destroy__ = delete_Document(...)
|
| ----------------------------------------------------------------------
| Readonly properties defined here:
|
| FormFonts
| Get list of field font resource names.
|
| chapter_count
| Number of chapters.
|
| has_old_style_xrefs
| Check if xref table is old style.
|
| has_xref_streams
| Check if xref table is a stream.
|
| is_dirty
| True if PDF has unsaved changes.
|
| is_form_pdf
| Either False or PDF field count.
|
| is_pdf
| Check for PDF.
|
| is_reflowable
| Check if document is layoutable.
|
| is_repaired
| Check whether PDF was repaired.
|
| language
| Document language.
|
| last_location
| Id (chapter, page) of last page.
|
| needs_pass
| Indicate password required.
|
| outline
|
| page_count
| Number of pages.
|
| permissions
| Document permissions.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| thisown
| The membership flag