【pandoc API python】

gskyi

Last modified 2023-08-28 17:24:57


First published 2023-08-28 17:23:33

Article link: https://blog.csdn.net/a731637163/article/details/132543369


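Using the pandoc API
Pandoc can be used as a Haskell library, to write your own conversion tools or to power a web application. This document explains how to use the pandoc API.

Detailed API documentation at the level of individual functions and types is available at https://hackage.haskell.org/package/pandoc.

Pandoc's architecture
Pandoc is structured as a set of readers, which translate various input formats into an abstract syntax tree (the Pandoc AST) representing a structured document, and a set of writers, which render this AST into various output formats. Pictorially: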

[input format] ==reader==> [Pandoc AST] ==writer==> [output format]

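This architecture allows pandoc to perform M × N conversions with M readers and N writers.

The Pandoc AST is defined in the pandoc-types package. You should start by looking at the Haddock documentation for Text.Pandoc.Definition. As you'll see, a Pandoc is composed of some metadata and a list of Blocks. There are various kinds of Block, including Para (paragraph), Header (section heading), and BlockQuote. Some of the Blocks (like BlockQuote) contain lists of Blocks, while others (like Para) contain lists of Inlines, and still others (like CodeBlock) contain plain text or nothing. Inlines are the basic elements of paragraphs. The distinction between Block and Inline is built into the type system, so, for example, it is impossible to represent a link (Inline) whose link text is a block quote (Block). This expressive limitation is mostly a help rather than a hindrance, since many of the formats pandoc supports have similar limitations.

The best way to explore the pandoc AST is to use pandoc -t native, which will display the AST corresponding to some Markdown input: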

% echo -e "1. *foo*\n2. bar" | pandoc -t native
[OrderedList (1,Decimal,Period)
 [[Plain [Emph [Str "foo"]]]
 ,[Plain [Str "bar"]]]]
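
Because the native format is just Haskell syntax for AST values, the same structure can be written by hand. The following is a minimal sketch, assuming only the pandoc-types package is available; the names listBlock and document are illustrative and not part of the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition

-- The ordered list from the native output above, written by hand.
-- OverloadedStrings lets the string literals stand for Text values.
listBlock :: Block
listBlock =
  OrderedList (1, Decimal, Period)
    [ [Plain [Emph [Str "foo"]]]
    , [Plain [Str "bar"]]
    ]

-- A whole document: empty metadata plus a list of blocks.
document :: Pandoc
document = Pandoc nullMeta [listBlock]

main :: IO ()
main = print document
```

Printing the value shows the derived Show form, which is close to (though not formatted exactly like) what pandoc -t native emits.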

A simple example
Here is a simple example of the use of a pandoc reader and writer to perform a conversion:

import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
  rst <- handleError result
  TIO.putStrLn rst

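Some notes:

The first part constructs a conversion pipeline: the input string is passed to readMarkdown, and the resulting Pandoc AST (doc) is then rendered by writeRST. The conversion pipeline is "run" by runIO; more on that in the next section.

result has the type Either PandocError Text. We could pattern-match on this manually, but in this case it is simpler to use the handleError function from Text.Pandoc.Error. It exits with an appropriate error code and message if the value is a Left, and returns the Text if the value is a Right.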
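
For comparison, the manual pattern match mentioned in the second note might look roughly like this. It is a sketch rather than the canonical approach: the error message wording is made up, and it relies only on PandocError having a Show instance.

```haskell
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Exit (exitFailure)
import System.IO (hPutStrLn, stderr)

main :: IO ()
main = do
  result <- runIO $ do
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
  -- Handle both outcomes explicitly instead of calling handleError.
  case result of
    Left err -> do
      hPutStrLn stderr ("conversion failed: " ++ show err)
      exitFailure
    Right rst -> TIO.putStrLn rst
```

This form is useful when the calling program wants to recover from a failed conversion instead of exiting.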

The PandocMonad class
Let's look at the types of readMarkdown and writeRST:

readMarkdown :: (PandocMonad m, ToSources a)
             => ReaderOptions
             -> a
             -> m Pandoc
writeRST     :: PandocMonad m
             => WriterOptions
             -> Pandoc
             -> m Text

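The PandocMonad m => part is a typeclass constraint. It says that readMarkdown and writeRST define computations that can be used in any instance of the PandocMonad type class. PandocMonad is defined in the module Text.Pandoc.Class.

Two instances of PandocMonad are provided: PandocIO and PandocPure. The difference is that computations run in PandocIO are allowed to do IO (for example, read a file), while computations in PandocPure are free of any side effects. PandocPure is useful for sandboxed environments, when you want to prevent users from doing anything malicious. To run the conversion in PandocIO, use runIO (as above). To run it in PandocPure, use runPure.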
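
As a rough illustration of the PandocPure side, the earlier pipeline can be run without IO by swapping runIO for runPure. The function name convertPure and the choice of writeHtml5String as the writer are assumptions made for this sketch.

```haskell
import Text.Pandoc
import qualified Data.Text as T

-- A pure conversion: no IO in the type, so it cannot touch the filesystem or network.
convertPure :: T.Text -> Either PandocError T.Text
convertPure input = runPure $ do
  doc <- readMarkdown def input
  writeHtml5String def doc

main :: IO ()
main = case convertPure (T.pack "*hello*") of
  Left err   -> putStrLn ("error: " ++ show err)
  Right html -> putStrLn (T.unpack html)
```

Because the result is an ordinary Either value, the function can be called from pure code and tested without any setup.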

As you can see from the Haddocks, Text.Pandoc.Class exports many auxiliary functions that can be used in any instance of PandocMonad. For example:

-- | Get the verbosity level.
getVerbosity :: PandocMonad m => m Verbosity

-- | Set the verbosity level.
setVerbosity :: PandocMonad m => Verbosity -> m ()

-- Get the accumulated log messages (in temporal order).
getLog :: PandocMonad m => m [LogMessage]
getLog = reverse <$> getsCommonState stLog

-- | Log a message using 'logOutput'.  Note that 'logOutput' is
-- called only if the verbosity level exceeds the level of the
-- message, but the message is added to the list of log messages
-- that will be retrieved by 'getLog' regardless of its verbosity level.
report :: PandocMonad m => LogMessage -> m ()

-- | Fetch an image or other item from the local filesystem or the net.
-- Returns raw content and maybe mime type.
fetchItem :: PandocMonad m
          => Text
          -> m (B.ByteString, Maybe MimeType)

-- Set the resource path searched by 'fetchItem'.
setResourcePath :: PandocMonad m => [FilePath] -> m ()

If we wanted more verbose informational messages during the conversion we defined in the previous section, we could do this:

  result <- runIO $ do
    setVerbosity INFO
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc

Note that PandocIO is an instance of MonadIO, so you can use liftIO to perform arbitrary IO operations inside a pandoc conversion chain.
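
A small sketch of liftIO inside the pipeline, assuming the same conversion as before; the progress messages are purely illustrative.

```haskell
import Control.Monad.IO.Class (liftIO)
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    -- Ordinary IO actions are lifted into PandocIO with liftIO.
    liftIO (putStrLn "reading input...")
    doc <- readMarkdown def (T.pack "[testing](url)")
    liftIO (putStrLn "writing output...")
    writeRST def doc
  rst <- handleError result
  TIO.putStrLn rst
```

This only works under runIO; the same code would not typecheck in PandocPure, which has no MonadIO instance.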

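readMarkdown is polymorphic in its second argument, which can be any type that is an instance of the ToSources typeclass. You can use Text, as in the example above. But you can also use [(FilePath, Text)] if the input comes from multiple files and you want to track source positions accurately.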
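
The multi-file form might be used as in the following sketch. The file names and contents are invented for illustration and are given inline rather than read from disk.

```haskell
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  -- Each pair is a (hypothetical) file name and its contents.
  let sources :: [(FilePath, T.Text)]
      sources = [ ("intro.md", T.pack "# Intro\n\nFirst file.\n")
                , ("body.md",  T.pack "# Body\n\nSecond file.\n") ]
  result <- runIO $ do
    doc <- readMarkdown def sources
    writeRST def doc
  handleError result >>= TIO.putStrLn
```

The explicit type annotation on sources keeps the ToSources instance unambiguous.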

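Options
The first argument of each reader or writer is for options controlling the behavior of the reader or writer: ReaderOptions for readers and WriterOptions for writers. These are defined in Text.Pandoc.Options. It is a good idea to study these options to see what can be adjusted.

def (from Data.Default) denotes a default value for each kind of option. (You can also use defaultWriterOptions and defaultReaderOptions.) Generally you'll want to use the defaults and modify them only when needed, for example: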

writeRST def{ writerReferenceLinks = True }

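Some particularly important options to know about:

writerTemplate: By default, this is Nothing, which means that a document fragment will be produced. If you want a full document, you need to specify Just template, where template is a Template Text from Text.Pandoc.Templates containing the template's contents (not the path).

readerExtensions and writerExtensions: These specify the extensions to be used in parsing and rendering. Extensions are defined in Text.Pandoc.Extensions.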
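
The two options can be combined as in the sketch below. It assumes a reasonably recent pandoc in which compileDefaultTemplate is exported from Text.Pandoc.Templates and in which the default ReaderOptions enable no extensions; the sample input is invented.

```haskell
import Text.Pandoc
import Text.Pandoc.Templates (compileDefaultTemplate)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    -- Turn on pandoc's full Markdown extension set for parsing.
    let readerOpts = def { readerExtensions = pandocExtensions }
    doc <- readMarkdown readerOpts
             (T.pack "---\ntitle: Demo\n---\n\nSome ~~struck~~ text.\n")
    -- Load and compile the default HTML5 template to get a full document.
    tpl <- compileDefaultTemplate (T.pack "html5")
    writeHtml5String def { writerTemplate = Just tpl } doc
  handleError result >>= TIO.putStrLn
```

Leaving writerTemplate at Nothing in the last step would produce only the body fragment.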

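Builder
Sometimes it's useful to construct a Pandoc document programmatically. To make this easier we provide the module Text.Pandoc.Builder in pandoc-types.

Because concatenating lists is slow, we use special types Inlines and Blocks that wrap a Sequence of Inline and Block elements. These are instances of the Monoid typeclass and can easily be concatenated: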

import Text.Pandoc.Builder
import qualified Data.Text as T

mydoc :: Pandoc
mydoc = doc $ header 1 (text (T.pack "Hello!"))
           <> para (emph (text (T.pack "hello world")) <> text (T.pack "."))

main :: IO ()
main = print mydoc

If you use the OverloadedStrings pragma, you can simplify this further:

mydoc = doc $ header 1 "Hello!"
           <> para (emph "hello world" <> ".")
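
The same style extends to other builder functions. The following sketch adds a title, a bullet list, and a link; the content and the name todo are made up, and the URL is a placeholder.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder

-- setTitle fills in metadata; bulletList takes one Blocks value per item.
todo :: Pandoc
todo = setTitle "Shopping" $ doc $
     header 2 "Groceries"
  <> bulletList
       [ plain "milk"
       , plain ("bread from " <> link "https://example.com" "" "the bakery")
       ]

main :: IO ()
main = print todo
```

Here link takes the target URL, a title (empty in this case), and the link text as Inlines.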

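Here's a more realistic example. Suppose your boss says: write me a letter in Word listing all the filling stations in Chicago that take the Voyager card. You find some JSON data in this format (fuel.json):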

[ {
  "state" : "IL",
  "city" : "Chicago",
  "fuel_type_code" : "CNG",
  "zip" : "60607",
  "station_name" : "Clean Energy - Yellow Cab",
  "cards_accepted" : "A D M V Voyager Wright_Exp CleanEnergy",
  "street_address" : "540 W Grenshaw"
}, ...

And then use aeson and pandoc to parse the JSON and create the Word document:

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder
import Text.Pandoc
import Data.Monoid ((<>), mempty, mconcat)
import Data.Aeson
import Control.Applicative
import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.List (intersperse)

data Station = Station{
    address        :: T.Text
  , name           :: T.Text
  , cardsAccepted  :: [T.Text]
  } deriving Show

instance FromJSON Station where
    parseJSON (Object v) = Station <$>
       v .: "street_address" <*>
       v .: "station_name" <*>
       (T.words <$> (v .:? "cards_accepted" .!= ""))
    parseJSON _          = mzero

createLetter :: [Station] -> Pandoc
createLetter stations = doc $
    para "Dear Boss:" <>
    para "Here are the CNG stations that accept Voyager cards:" <>
    simpleTable [plain "Station", plain "Address", plain "Cards accepted"]
           (map stationToRow stations) <>
    para "Your loyal servant," <>
    plain (image "JohnHancock.png" "" mempty)
  where
    stationToRow station =
      [ plain (text $ name station)
      , plain (text $ address station)
      , plain (mconcat $ intersperse linebreak
                       $ map text $ cardsAccepted station)
      ]

main :: IO ()
main = do
  json <- BL.readFile "fuel.json"
  let letter = case decode json of
                    Just stations -> createLetter [s | s <- stations,
                                        "Voyager" `elem` cardsAccepted s]
                    Nothing       -> error "Could not decode JSON"
  docx <- runIO (writeDocx def letter) >>= handleError
  BL.writeFile "letter.docx" docx
  putStrLn "Created letter.docx"

Voila! You've written the letter without using Word and without looking at the data.

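Data files
Pandoc has a number of data files, which can be found in the data/ subdirectory of the repository. These are installed with pandoc (or, if pandoc was compiled with the embed_data_files flag, they are embedded in the binary). You can retrieve data files using readDataFile from Text.Pandoc.Class. readDataFile will first look for the file in the "user data directory" (setUserDataDir, getUserDataDir), and if it is not found there, it will return the default installed with the system. To force the use of the default, call setUserDataDir Nothing.

Metadata files
Pandoc can add metadata to documents, as described in the User's Guide. Similar to data files, metadata YAML files can be retrieved using readMetadataFile from Text.Pandoc.Class. readMetadataFile will first look for the file in the working directory, and if it is not found there, it will look for it in the metadata subdirectory of the user data directory (setUserDataDir, getUserDataDir).

Templates
Pandoc has its own template system, described in the User's Guide. To retrieve the default template for a system, use getDefaultTemplate from Text.Pandoc.Templates. Note that this looks first in the templates subdirectory of the user data directory, allowing users to override the system defaults. If you want to disable this behavior, use setUserDataDir Nothing.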
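
A sketch of fetching a default template while ignoring user overrides. It assumes the pandoc data files are installed or embedded, and it prints only the first 300 characters (an arbitrary cutoff) to keep the output short.

```haskell
import Text.Pandoc
import Text.Pandoc.Templates (getDefaultTemplate)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    -- Skip the user data directory so only the system default is used.
    setUserDataDir Nothing
    getDefaultTemplate (T.pack "html5")
  tpl <- handleError result
  TIO.putStrLn (T.take 300 tpl)
```

Dropping the setUserDataDir line restores the normal lookup order, in which a user's own templates take precedence.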

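To render a template, use renderTemplate', which takes two arguments: a template (Text) and a context (any instance of ToJSON). If you want to create a context from the metadata part of a Pandoc document, use metaToJSON' from Text.Pandoc.Writers.Shared. If you also want to incorporate values from variables, use metaToJSON instead, and make sure writerVariables is set in WriterOptions. These names match older pandoc releases; the template API changed in later versions, so check the Haddocks for the version you use.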

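Handling errors and warnings
runIO and runPure return an Either PandocError a. All errors raised in running a PandocMonad computation will be trapped and returned as a Left value, so they can be handled by the calling program. To see the constructors of PandocError, see the documentation for Text.Pandoc.Error.

To raise a PandocError from inside a PandocMonad computation, use throwError.

In addition to errors, which stop execution of the conversion pipeline, one can generate informational messages. Use report from Text.Pandoc.Class to issue a LogMessage. For a list of constructors for LogMessage, see Text.Pandoc.Logging. Note that each type of log message is associated with a verbosity level. The verbosity level (setVerbosity/getVerbosity) determines whether the report will be printed to stderr (when running in PandocIO), but regardless of verbosity level, all reported messages are stored internally and may be retrieved using getLog.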
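
The sketch below shows throwError and getLog together. It assumes the mtl package for Control.Monad.Except and uses the PandocSomeError constructor for a custom error; the function name convert and the "empty input" rule are invented for illustration.

```haskell
import Control.Monad.Except (throwError)
import Text.Pandoc
import qualified Data.Text as T

-- Reject blank input with a PandocError; otherwise convert and
-- return the accumulated log messages alongside the output.
convert :: T.Text -> Either PandocError (T.Text, [LogMessage])
convert input = runPure $
  if T.null (T.strip input)
    then throwError (PandocSomeError (T.pack "empty input"))
    else do
      doc  <- readMarkdown def input
      out  <- writeRST def doc
      msgs <- getLog
      return (out, msgs)

main :: IO ()
main = do
  -- A normal conversion: print the output and how many messages were logged.
  case convert (T.pack "hello *world*") of
    Left err          -> putStrLn ("error: " ++ show err)
    Right (out, msgs) -> do
      putStrLn (T.unpack out)
      putStrLn (show (length msgs) ++ " log message(s)")
  -- Blank input: the error surfaces as a Left value.
  case convert (T.pack "   ") of
    Left err -> putStrLn ("error: " ++ show err)
    Right _  -> putStrLn "unexpected success"
```

The error raised inside the computation comes back as a Left from runPure, exactly like errors raised by pandoc itself.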

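Walking the AST
It is often useful to walk the Pandoc AST either to extract information (e.g., what are all the URLs linked to in this document? do all the code samples compile?) or to transform a document (e.g., increase the level of every section header, remove emphasis, or replace specially marked code blocks with images). To make this easier and more efficient, pandoc-types includes a module Text.Pandoc.Walk.

Here's the essential documentation: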

class Walkable a b where
  -- | @walk f x@ walks the structure @x@ (bottom up) and replaces every
  -- occurrence of an @a@ with the result of applying @f@ to it.
  walk  :: (a -> a) -> b -> b
  walk f = runIdentity . walkM (return . f)
  -- | A monadic version of 'walk'.
  walkM :: (Monad m, Functor m) => (a -> m a) -> b -> m b
  -- | @query f x@ walks the structure @x@ (bottom up) and applies @f@
  -- to every @a@, appending the results.
  query :: Monoid c => (a -> c) -> b -> c

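Walkable instances are defined for most combinations of Pandoc types. For example, the Walkable Inline Block instance allows you to take a function Inline -> Inline and apply it over every inline in a Block. And Walkable [Inline] Pandoc allows you to take a function [Inline] -> [Inline] and apply it over every maximal list of Inlines in a Pandoc.

Here's a simple example of a function that promotes the levels of headers: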

promoteHeaderLevels :: Pandoc -> Pandoc
promoteHeaderLevels = walk promote
  where promote :: Block -> Block
        promote (Header lev attr ils) = Header (lev + 1) attr ils
        promote x = x
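
The Inline and [Inline] instances can be used in the same way. The following sketch, which needs only pandoc-types, removes emphasis with a list-level function and upper-cases text with an element-level one; the function names and the sample document are made up.

```haskell
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)
import qualified Data.Text as T

-- Uses Walkable [Inline] Pandoc: an Emph is replaced by its contents,
-- which requires a list-to-list function.
removeEmph :: Pandoc -> Pandoc
removeEmph = walk (concatMap unEmph)
  where unEmph :: Inline -> [Inline]
        unEmph (Emph xs) = xs
        unEmph x         = [x]

-- Uses Walkable Inline Pandoc: each Str is rewritten in place.
shout :: Pandoc -> Pandoc
shout = walk upper
  where upper :: Inline -> Inline
        upper (Str s) = Str (T.toUpper s)
        upper x       = x

main :: IO ()
main = print (shout (removeEmph sample))
  where
    sample = Pandoc nullMeta
      [Para [Emph [Str (T.pack "hello")], Space, Str (T.pack "world")]]
```

On the sample, the emphasis wrapper disappears and both words come out upper-cased inside a single Para.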

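walkM is a monadic version of walk; it can be used, for example, when you need your transformations to perform IO operations, use PandocMonad operations, or update internal state. Here's an example using the State monad to add unique identifiers to each code block: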

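import Control.Monad.State (State, evalState, get, put)
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walkM)
import qualified Data.Text as T

addCodeIdentifiers :: Pandoc -> Pandoc
addCodeIdentifiers doc = evalState (walkM addCodeId doc) 1
  where addCodeId :: Block -> State Int Block
        addCodeId (CodeBlock (_,classes,kvs) code) = do
          curId <- get
          put (curId + 1)
          -- identifiers are Text in current pandoc-types, so convert the counter
          return $ CodeBlock (T.pack (show curId),classes,kvs) code
        addCodeId x = return x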

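query is used to collect information from the AST. Its argument is a query function that produces a result in some monoidal type (e.g. a list). The results are concatenated together. Here's an example that returns a list of the URLs linked to in a document: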

listURLs :: Pandoc -> [Text]
listURLs = query urls
  where urls (Link _ _ (src, _)) = [src]
        urls _                   = []
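
Since any Monoid works as the result type, query is not limited to lists. The sketch below counts Str elements with Sum and collects the classes of code blocks; both function names and the sample document are illustrative.

```haskell
import Data.Monoid (Sum(..))
import Text.Pandoc.Definition
import Text.Pandoc.Walk (query)
import qualified Data.Text as T

-- Count Str elements by summing 1 for each one found.
countStrs :: Pandoc -> Int
countStrs = getSum . query count
  where count :: Inline -> Sum Int
        count (Str _) = Sum 1
        count _       = Sum 0

-- Collect the classes attached to every code block.
codeClasses :: Pandoc -> [T.Text]
codeClasses = query classes
  where classes :: Block -> [T.Text]
        classes (CodeBlock (_, cls, _) _) = cls
        classes _                         = []

main :: IO ()
main = do
  let sample = Pandoc nullMeta
        [ Para [Str (T.pack "one"), Space, Str (T.pack "two")]
        , CodeBlock (T.empty, [T.pack "haskell"], []) (T.pack "main = pure ()")
        ]
  print (countStrs sample)
  print (codeClasses sample)
```

The type annotations on the local functions select which Walkable instance is used.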

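Creating a front-end
All of the functionality of the command-line program pandoc has been abstracted out in convertWithOpts in the module Text.Pandoc.App. Creating a GUI front-end for pandoc is thus just a matter of populating the Opts structure and calling this function.

Notes on using pandoc in web applications
Pandoc's parsers can exhibit pathological behavior on some inputs. So it is always a good idea to wrap uses of pandoc in a timeout function (e.g. System.Timeout.timeout from base) to prevent DoS attacks.

If pandoc generates HTML from untrusted user input, it is always a good idea to filter the generated HTML through a sanitizer (such as xss-sanitize) to avoid security problems.

Using runPure rather than runIO will ensure that pandoc's functions perform no IO operations (e.g. writing files). If some resources need to be made available, a "fake environment" is provided inside the state available to runPure (see PureState and its associated functions in Text.Pandoc.Class). It is also possible to write a custom instance of PandocMonad that, for example, makes wiki resources available as files in the fake environment, while isolating pandoc from the rest of the system.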
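
The timeout and runPure advice can be combined roughly as follows. This is a sketch under stated assumptions: the two-second limit and the name convertSafely are arbitrary, and sanitizing the resulting HTML is left out.

```haskell
import Control.Exception (evaluate)
import System.Timeout (timeout)
import Text.Pandoc
import qualified Data.Text as T

-- Convert untrusted Markdown to HTML with no IO inside pandoc and
-- a wall-clock limit. Nothing means the limit was exceeded.
convertSafely :: T.Text -> IO (Maybe (Either PandocError T.Text))
convertSafely input = timeout 2000000 $ do   -- limit in microseconds
  let result = runPure (readMarkdown def input >>= writeHtml5String def)
  -- Force the result inside the timeout; otherwise laziness could
  -- defer the expensive work until after the timeout has returned.
  case result of
    Left err   -> return (Left err)
    Right html -> evaluate (T.length html) >> return (Right html)

main :: IO ()
main = do
  outcome <- convertSafely (T.pack "Some *untrusted* input")
  case outcome of
    Nothing           -> putStrLn "conversion timed out"
    Just (Left err)   -> putStrLn ("error: " ++ show err)
    Just (Right html) -> putStrLn (T.unpack html)
```

Forcing the output before leaving the timeout block is the important design point, since a lazily returned value would escape the time limit.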
