How to convert PDF to HTML in MVC, and how to convert HTML to PDF using iTextSharp

Latest recommended article published 2023-07-05 13:26:14

weixin_39638708


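First, HTML and PDF are unrelated, although they were created around the same time. HTML is intended to convey higher-level information such as paragraphs and tables. There are ways to control how it is rendered, but ultimately it is up to the browser to draw these higher-level concepts. PDF is intended to convey documents, and those documents must "look" the same wherever they are rendered.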

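In an HTML document you might have a paragraph that is 100% wide. Depending on the width of your monitor it might take 2 lines or 10 lines, when you print it it might take 7 lines, and on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so it must always render exactly the same regardless of screen size.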
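The reflow behaviour described above can be illustrated with a small sketch. It is not how any browser actually lays out text: it assumes a greedy word wrap and uses a character count as a stand-in for pixel width, and the names `ReflowSketch` and `CountLines` are made up for this illustration.

```csharp
using System;

public static class ReflowSketch
{
    // Greedy word wrap: returns how many lines `text` needs when each line
    // holds at most `maxCharsPerLine` characters (a stand-in for pixel width).
    // A word longer than the limit simply occupies a line of its own.
    public static int CountLines(string text, int maxCharsPerLine)
    {
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int lines = 0;
        int used = 0; // characters already placed on the current line

        foreach (var word in words)
        {
            if (lines > 0 && used + 1 + word.Length <= maxCharsPerLine)
            {
                // The word (plus a separating space) still fits on this line.
                used += 1 + word.Length;
            }
            else
            {
                // Start a new line with this word.
                lines++;
                used = word.Length;
            }
        }
        return lines;
    }
}
```

With the sentence "the quick brown fox jumps over the lazy dog", `CountLines` gives 1 line at a limit of 43 characters, 3 lines at 20, and 5 lines at 10. The same content produces a different layout on every "device", which is exactly what a PDF is designed to rule out.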

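Because of the "musts" above, PDF does not support abstract things like "tables" or "paragraphs". PDF supports three basic things: text, lines/shapes, and images. (There are other things, such as annotations and movies, but I am trying to keep it simple here.) In a PDF you do not say "here is a paragraph, browser, do your thing!" Instead you say, "draw this text at this exact X,Y location using this exact font, and don't worry, I have already calculated the width of the text so I know it will all fit on this line." You also do not say "here is a table". Instead you say, "draw this text at this exact location, then draw a rectangle at this other exact location that I calculated earlier, so I know it will appear to surround the text."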
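As a rough sketch of what "draw this text at this exact X,Y location" looks like in practice, the following uses iTextSharp 5's low-level `PdfContentByte` API. The class and method names, the coordinates, the padding, and the choice of Helvetica are assumptions made for this illustration, and the box is only approximate because descenders are not accounted for.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class AbsoluteDrawingSketch
{
    // Draws `text` at a fixed position and strokes a rectangle around it.
    // Every coordinate is computed by us; the PDF stores no "paragraph" or "table".
    public static byte[] DrawBoxedText(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var doc = new Document(PageSize.A4))
            {
                var writer = PdfWriter.GetInstance(doc, ms);
                doc.Open();

                var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
                const float fontSize = 12f;
                const float x = 72f;   // points from the left edge (illustrative value)
                const float y = 720f;  // points from the bottom edge (illustrative value)

                // Measure the text ourselves, before drawing anything.
                float textWidth = font.GetWidthPoint(text, fontSize);

                PdfContentByte cb = writer.DirectContent;
                cb.BeginText();
                cb.SetFontAndSize(font, fontSize);
                cb.SetTextMatrix(x, y);   // baseline start of the text
                cb.ShowText(text);
                cb.EndText();

                // Rough box around the text, sized from the measured width.
                const float padding = 4f;
                cb.Rectangle(x - padding, y - padding, textWidth + 2 * padding, fontSize + 2 * padding);
                cb.Stroke();
            } // disposing the Document closes it and finishes the PDF

            // ToArray still works after the writer has closed the stream.
            return ms.ToArray();
        }
    }
}
```

Higher-level objects such as `Paragraph` or `PdfPTable` exist in iTextSharp, but they are conveniences that end up doing this kind of measuring and positioning for you.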

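Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.NET, MVC, Razor, Struts, Spring, and so on are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, and the like, which are all framework-specific abstractions. It is your responsibility to get the HTML out of your framework of choice; iText will not help you with that. If you get an exception saying "The document has no pages", or you think "iText isn't parsing my HTML", it is almost certain that you do not actually have HTML; you only think you do.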
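One common way to get the HTML out of the framework is to render a Razor view to a string before handing it to iText. The sketch below assumes classic ASP.NET MVC (the `System.Web.Mvc` assembly) and a running request; `HtmlSource`, `RenderViewToString`, and `RequireHtml` are hypothetical helper names, not part of MVC or iTextSharp.

```csharp
using System;
using System.IO;
using System.Web.Mvc;

public static class HtmlSource
{
    // Renders the named view with the given model and returns the markup as a string.
    // Must be called from within an MVC request, since it needs a ControllerContext.
    public static string RenderViewToString(ControllerContext context, string viewName, object model)
    {
        context.Controller.ViewData.Model = model;
        using (var sw = new StringWriter())
        {
            ViewEngineResult result = ViewEngines.Engines.FindView(context, viewName, null);
            if (result.View == null)
                throw new InvalidOperationException("View not found: " + viewName);

            var viewContext = new ViewContext(context, result.View,
                context.Controller.ViewData, context.Controller.TempData, sw);
            result.View.Render(viewContext, sw);
            result.ViewEngine.ReleaseView(context, result.View);
            return sw.ToString();
        }
    }

    // Cheap sanity check before parsing: fail early with a clear message instead of
    // discovering later that nothing was added to the PDF document.
    public static string RequireHtml(string html)
    {
        if (string.IsNullOrWhiteSpace(html) || html.IndexOf('<') < 0)
            throw new ArgumentException("No HTML markup to convert.", "html");
        return html;
    }
}
```

Logging or inspecting the string returned by `RenderViewToString` before parsing it is usually the quickest way to find out whether the problem lies in the view or in the PDF step.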

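Third, the built-in class that has been around for years is HTMLWorker; however, it has been replaced by XMLWorker (Java / .NET). No work is being done on HTMLWorker: it does not support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you do not see an HTML attribute or a CSS property and value in the referenced file, it is probably not supported by HTMLWorker. XMLWorker can sometimes be more complicated, but those complications also make it more extensible.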

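Below is C# code that shows how to parse HTML tags into iText abstractions, which are automatically added to the document you are working on. C# and Java are very similar, so it should be relatively easy to convert. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, class="headline" is ignored, but everything else should work. Example #2 is the same as the first, except that it uses XMLWorker instead. Example #3 also parses the simple CSS example.

//Requires: using System; using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;

//Boilerplate iTextSharp setup here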

//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            //NOTE: the tags in this string were lost in extraction; the markup below is a
            //reconstruction that matches the surviving text and the class="headline" mentioned above
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             **************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             **************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             **************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }

            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);
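In an MVC application, rather than writing to disk, the bytes would more typically be returned from a controller action. The following is a minimal sketch assuming classic ASP.NET MVC (`System.Web.Mvc`) and iTextSharp 5; `PdfController`, `Download`, and `BuildPdfBytes` are hypothetical names, and `BuildPdfBytes` stands in for whichever of the parsing examples you actually use.

```csharp
using System.IO;
using System.Web.Mvc;
using iTextSharp.text;
using iTextSharp.text.pdf;

public class PdfController : Controller
{
    // Returns the generated PDF as a file download.
    public ActionResult Download()
    {
        byte[] bytes = BuildPdfBytes();
        // Controller.File wraps the bytes in a FileContentResult with the given
        // content type and suggested download name.
        return File(bytes, "application/pdf", "test.pdf");
    }

    // Placeholder PDF generation: a single paragraph instead of parsed HTML.
    private static byte[] BuildPdfBytes()
    {
        using (var ms = new MemoryStream())
        {
            using (var doc = new Document())
            {
                PdfWriter.GetInstance(doc, ms);
                doc.Open();
                doc.Add(new Paragraph("Hello from iTextSharp"));
            } // disposing the Document closes it and completes the PDF
            return ms.ToArray();
        }
    }
}
```

Returning a `FileContentResult` keeps the action testable and lets MVC set the response headers, which is usually preferable to calling `Response.BinaryWrite()` directly.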

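There is good news for anyone who needs HTML-to-PDF conversion. As this answer shows, the W3C standard css-break-3 (CSS Fragmentation) should solve this problem. It is a Candidate Recommendation and is planned to become a final Recommendation this year, after testing.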

As a less standard alternative, there are solutions with plugins for C#, as shown on print-css.rocks.
