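PDF files can hold embedded files, images, and other data. If you need to pull every image out of a PDF, or you have hundreds of PDFs or more to process, LEADTOOLS offers the three approaches below.
Extracting images embedded in a PDF file is straightforward with LEADTOOLS. The C#, Java, and PowerShell samples below each extract the images from a PDF file.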
Extracting images embedded in a PDF with C#
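using System.IO;
using System.Linq;
using Leadtools;
using Leadtools.Codecs;
using Leadtools.Pdf;

/// <summary>
/// Extracts every image embedded in a PDF document and saves each one as a TIFF file.
/// </summary>
/// <param name="pdfPath">Path to the source PDF file.</param>
private static void ExtractImagesFromPdf(string pdfPath)
{
    // Write to an "images" subfolder next to the PDF, creating it if needed.
    var destinationPath = Path.Combine(Path.GetDirectoryName(pdfPath), "images");
    Directory.CreateDirectory(destinationPath);
    var documentName = Path.GetFileNameWithoutExtension(pdfPath);

    using var pdfDocument = new PDFDocument(pdfPath);
    // Parse the objects on every page (page 1 through -1, i.e. the last page).
    pdfDocument.ParsePages(PDFParsePagesOptions.Objects, 1, -1);

    using var codecs = new RasterCodecs();
    foreach (var page in pdfDocument.Pages)
    {
        var embeddedImages = page.Objects.Where(o => o.ObjectType == PDFObjectType.Image);
        foreach (var imgObj in embeddedImages)
        {
            var fileName = $"{documentName}~page-{page.PageNumber}~{imgObj.ImageObjectNumber}.tif";
            var destinationFilePath = Path.Combine(destinationPath, fileName);
            using var image = pdfDocument.DecodeImage(imgObj.ImageObjectNumber);
            // Each image gets its own file, so overwrite instead of appending duplicate pages on re-runs.
            codecs.Save(image, destinationFilePath, RasterImageFormat.TifLzw, image.BitsPerPixel, 1, 1, -1, CodecsSavePageMode.Overwrite);
        }
    }
}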
Extracting images embedded in a PDF with Java
import java.util.List;

import leadtools.*;
import leadtools.codecs.*;
import leadtools.pdf.*;

/**
 * Extracts the images embedded in a PDF file and saves them as TIFF files
 * in an "images" subdirectory,
 * e.g. images from a PDF in "c:\\temp\\" are written to "c:\\temp\\images\\".
 *
 * getOutputFolder, getFileName and getBaseName are helper methods that are
 * not part of this listing.
 *
 * @param pdfPath path to the source PDF file
 */
private static void extractImagesFromPdf(String pdfPath) {
    final String destinationFolder = getOutputFolder(pdfPath);
    final String documentName = getBaseName(getFileName(pdfPath));
    final PDFDocument pdfDocument = new PDFDocument(pdfPath);
    // Parse the objects on every page (page 1 through -1, i.e. the last page).
    pdfDocument.parsePages(PDFParsePagesOptions.OBJECTS.getValue(), 1, -1);
    final RasterCodecs codecs = new RasterCodecs();
    try {
        final List<PDFDocumentPage> pages = pdfDocument.getPages();
        for (PDFDocumentPage page : pages) {
            final int pageNumber = page.getPageNumber();
            for (final PDFObject object : page.getObjects()) {
                // Skip text, rectangles and other non-image objects.
                if (object.getObjectType() != PDFObjectType.IMAGE) {
                    continue;
                }
                final String imageObjectNumber = object.getImageObjectNumber();
                final String destinationFilePath = destinationFolder + documentName
                        + "~page-" + pageNumber + "~" + imageObjectNumber + ".tif";
                final RasterImage image = pdfDocument.decodeImage(imageObjectNumber);
                try {
                    codecs.save(image, destinationFilePath, RasterImageFormat.TIFLZW,
                            image.getBitsPerPixel(), 1, 1, -1, CodecsSavePageMode.OVERWRITE);
                } finally {
                    image.dispose();
                }
            }
        }
    } finally {
        codecs.dispose();
    }
}
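The Java sample calls getOutputFolder, getFileName, and getBaseName but does not show them. The sketch below is one guess at what they might look like, using only java.nio.file. The class name PdfPathHelpers and the exact behavior (creating the folder, returning a path with a trailing separator) are assumptions inferred from how the sample uses the return values; none of this is LEADTOOLS code.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helpers; the names match the calls in the sample, the bodies are assumptions.
public final class PdfPathHelpers {
    private PdfPathHelpers() {}

    /** Returns the last element of a path, e.g. "a.pdf" for ".../temp/a.pdf". */
    public static String getFileName(String path) {
        Path fileName = Paths.get(path).getFileName();
        return fileName == null ? "" : fileName.toString();
    }

    /** Strips the final extension only, e.g. "report.v2.pdf" becomes "report.v2". */
    public static String getBaseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    /**
     * Returns the "images" folder next to the PDF, with a trailing separator
     * so that a file name can be concatenated directly. The folder is created
     * if it does not exist yet.
     */
    public static String getOutputFolder(String pdfPath) {
        Path parent = Paths.get(pdfPath).toAbsolutePath().getParent();
        if (parent == null) {
            throw new IllegalArgumentException("Not a file path: " + pdfPath);
        }
        Path folder = parent.resolve("images");
        try {
            Files.createDirectories(folder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return folder.toString() + File.separator;
    }
}
```

The trailing separator matters because the sample builds the destination path by plain string concatenation rather than by resolving path elements.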
Extracting images embedded in a PDF with PowerShell
function Export-LtImagesFromPdf {
    <#
    .SYNOPSIS
        Exports images embedded in a PDF file
    .DESCRIPTION
        Exports each image embedded in a PDF file to its own TIFF file
    .PARAMETER PdfPath
        File path to the PDF file that has embedded images to be exported
    .PARAMETER Path
        Folder path to export the embedded images
    .EXAMPLE
        Export-LtImagesFromPdf -PdfPath "c:\temp\a.pdf" -Path "c:\temp\images\"
    .INPUTS
        String
    .OUTPUTS
        None
    .NOTES
        Author: LEAD Technologies, Inc.
        Website: https://www.leadtools.com
        Twitter: @leadtools
    #>
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [string]$PdfPath,
        [Parameter(Mandatory)]
        [string]$Path
    )
    if (-not (Test-Path -Path $PdfPath -PathType Leaf)) {
        Write-Error "File does not exist: $PdfPath"
        return
    }
    if (-not (Test-Path -Path $Path -PathType Container)) {
        # Discard the DirectoryInfo object so the function emits no output.
        New-Item -Path $Path -ItemType Directory | Out-Null
    }
    $baseFileName = (Get-Item $PdfPath).BaseName
    $pdfDocument = New-Object -TypeName Leadtools.Pdf.PDFDocument -ArgumentList $PdfPath
    try {
        # Parse the objects on every page (page 1 through -1, i.e. the last page).
        $pdfDocument.ParsePages([Leadtools.Pdf.PDFParsePagesOptions]::Objects, 1, -1)
        foreach ($page in $pdfDocument.Pages) {
            foreach ($object in $page.Objects) {
                if ($object.ObjectType -eq [Leadtools.Pdf.PDFObjectType]::Image) {
                    $imageObjectNumber = $object.ImageObjectNumber
                    $pageNumber = $page.PageNumber
                    $image = $pdfDocument.DecodeImage($imageObjectNumber)
                    try {
                        $outputFilePath = Join-Path -Path $Path -ChildPath ($baseFileName + "~page#-" + $pageNumber + "~" + $imageObjectNumber + ".tif")
                        # Export-LTImage is not defined in this snippet; it is assumed to be a
                        # separate helper that saves a RasterImage to disk.
                        Export-LTImage -RasterImage $image -Path $outputFilePath -Format ([Leadtools.RasterImageFormat]::Tif)
                    }
                    finally {
                        $image.Dispose()
                    }
                }
            }
        }
    }
    finally {
        $pdfDocument.Dispose()
    }
}
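For the batch case mentioned at the start (hundreds of PDFs or more), any of the single-file functions can be driven from a directory walk. The Java sketch below shows one way to do that with the standard library only. PdfBatch, findPdfFiles, and processAll are hypothetical names, and the extractor callback merely stands in for a function such as extractImagesFromPdf. Whether a failing PDF surfaces as an unchecked exception is also an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical batch driver; it knows nothing about LEADTOOLS itself.
public final class PdfBatch {
    private PdfBatch() {}

    /** Recursively collects regular files under root whose name ends in ".pdf" (any case), sorted by path. */
    public static List<Path> findPdfFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /**
     * Runs the extractor on every PDF under root and returns the number of failures.
     * Only unchecked exceptions are caught, so one bad file does not stop the batch.
     */
    public static int processAll(Path root, Consumer<String> extractor) throws IOException {
        int failures = 0;
        for (Path pdf : findPdfFiles(root)) {
            try {
                extractor.accept(pdf.toString());
            } catch (RuntimeException e) {
                failures++;
                System.err.println("Failed to process " + pdf + ": " + e.getMessage());
            }
        }
        return failures;
    }
}
```

From inside the class that defines extractImagesFromPdf, a call could pass a method reference to it as the extractor argument, with the folder to scan as root. Because the driver continues after a failure and only reports a count, it suits long unattended runs.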
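With the LEADTOOLS toolkit, there is very little you cannot do with PDF files. The complete LEADTOOLS SDK is available for download from the LEADTOOLS website (https://www.leadtools.com).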