Parsing XML in PHP: how do you parse and process HTML/XML in PHP?

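How can one parse HTML/XML and extract information from it?

Solution:

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd-party libs, and give me all the control I need over the markup.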

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C’s Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

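DOM is capable of parsing and modifying real-world (broken) HTML, and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you ever need to change your programming language, chances are you will already know how to use that language's DOM API.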
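As a minimal sketch of the DOM approach (the markup and variable names are made up for illustration), the following loads deliberately broken HTML through libxml's HTML parser via `DOMDocument::loadHTML()` and pulls out the links with an XPath query:

```php
<?php
// Deliberately broken markup: unclosed <li> and <p>, no <html> or <body>.
$html = '<ul><li><a href="/a">First</a><li><a href="/b">Second</a></ul><p>Unclosed paragraph';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html); // uses libxml's HTML parser, which tolerates tag soup

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the repaired tree with XPath: every <a> that carries an href.
$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}

print_r($links);
```

For strict XML input, `loadXML()` would be used in place of `loadHTML()`; the XPath part stays the same.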

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

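XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML parser module, so using XMLReader to parse broken HTML is likely to be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.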
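A small sketch of the pull-parsing style, assuming a made-up `<catalog>` document: the loop advances the cursor node by node and only reacts to the elements it cares about, so the whole tree never has to be held in memory.

```php
<?php
// Hypothetical sample document, used only for illustration.
$xml = '<catalog>'
     . '<book id="b1"><title>DOM Basics</title></book>'
     . '<book id="b2"><title>Streaming XML</title></book>'
     . '</catalog>';

$reader = new XMLReader();
$reader->XML($xml); // for a file on disk, $reader->open($path) would be used instead

$titles = [];
$currentId = null;

// read() moves the cursor forward one node at a time.
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue; // skip text nodes, end tags, whitespace, ...
    }
    if ($reader->name === 'book') {
        $currentId = $reader->getAttribute('id');
    } elseif ($reader->name === 'title') {
        // readString() returns the text content of the current element.
        $titles[$currentId] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```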

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

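The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but it is more difficult to work with than the pull parser implemented by XMLReader.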
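To illustrate why the push style is more work, here is a rough sketch using the `xml_*` functions on a made-up document. The parser calls the handlers, so the state (are we inside a `<title>`?) has to be tracked by hand:

```php
<?php
// Hypothetical sample document, used only for illustration.
$xml = '<catalog>'
     . '<book><title>DOM Basics</title></book>'
     . '<book><title>Streaming XML</title></book>'
     . '</catalog>';

$titles  = [];
$inTitle = false;
$buffer  = '';

$parser = xml_parser_create('UTF-8');

// With the default case folding, element names arrive in upper case.
xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$inTitle, &$buffer) {
        if ($name === 'TITLE') {
            $inTitle = true;
            $buffer  = '';
        }
    },
    function ($parser, string $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'TITLE') {
            $titles[] = $buffer;
            $inTitle  = false;
        }
    }
);

// Character data may arrive in several chunks, so it is accumulated.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```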

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.
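A short sketch of both sides of that trade-off, using made-up sample data: well-formed XML maps onto properties and array access, while tag soup makes `simplexml_load_string()` return `false`.

```php
<?php
// Keep libxml errors internal so the failed parse below emits no warnings.
libxml_use_internal_errors(true);

// Hypothetical well-formed document, used only for illustration.
$xml = '<catalog>'
     . '<book id="b1"><title>DOM Basics</title><price>19.90</price></book>'
     . '<book id="b2"><title>Streaming XML</title><price>24.50</price></book>'
     . '</catalog>';

$catalog = simplexml_load_string($xml);
if ($catalog === false) {
    throw new RuntimeException('Document is not well-formed XML');
}

// Child elements are properties, attributes use array syntax.
$summary = [];
foreach ($catalog->book as $book) {
    $summary[(string) $book['id']] = (string) $book->title;
}

// XPath is available as well.
$cheap      = $catalog->xpath('//book[price < 20]/title');
$cheapTitle = (string) $cheap[0];

// Broken markup (unclosed <li>) is rejected outright.
$broken = simplexml_load_string('<ul><li>one<li>two</ul>');
libxml_clear_errors();

print_r($summary);
var_dump($cheapTitle, $broken);
```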

3rd-Party Libraries (libxml-based)

If you prefer to use a 3rd-party lib, I'd suggest one that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

`Wa72\HtmlPageDom` is a PHP library for easy manipulation of HTML documents using DOM. It requires the DomCrawler component from Symfony for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple “xml to object/array” mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

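3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.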

An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!

Require PHP 5+.

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.

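I generally do not recommend this parser. The codebase is horrible, and the parser itself is rather slow and memory hungry. Not all jQuery selectors (such as child selectors) are possible. Any of the libxml-based libraries should outperform this easily.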

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrap html, whether it’s valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

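Again, I would not recommend this parser. It is rather slow, with high CPU usage. There is also no function to free the memory of created DOM objects. These problems get worse with nested loops in particular. The documentation itself is inaccurate and misspelled, and there have been no responses to fixes since April 14.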

A universal tokenizer and HTML/XML/RSS DOM Parser

Ability to manipulate elements and their attributes

Supports invalid HTML and UTF8

Can perform advanced CSS3-like queries on elements (like jQuery — namespaces supported)

A HTML beautifier (like HTML Tidy)

Minify CSS and Javascript

Sort attributes, change character case, correct indentation, etc.

Extensible

Parsing documents using callbacks based on current character/token

Operations separated in smaller functions for easy overriding

Fast and Easy

Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, such as

A Python and PHP implementations of a HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

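We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.

Web Services

If you don't feel like programming PHP, you can also use Web services. In general, I have found very little utility in these, but that's just me and my use cases.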

ScraperWiki’s external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.

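Most of the snippets you will find on the web for matching markup are brittle. In most cases they only work for one very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere or adding or changing attributes in a tag, can make the RegEx fail when it is not properly written. You should know what you are doing before using RegEx on HTML.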
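To make that brittleness concrete, here is a small sketch comparing a typical copy-and-paste pattern with a DOM-based lookup. The pattern, the sample markup and the helper `hrefsViaDom()` are all made up for illustration:

```php
<?php
// A typical "works on my sample" pattern for extracting a link target.
$pattern = '/<a href="([^"]*)">/';

// Four renderings of the same link that a browser treats identically.
$variants = [
    '<a href="/docs">Docs</a>',
    '<a  href="/docs">Docs</a>',             // extra whitespace
    '<a class="nav" href="/docs">Docs</a>',  // added attribute
    "<a href='/docs'>Docs</a>",              // single quotes
];

// Hypothetical helper: collect href values through the DOM extension.
function hrefsViaDom(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

$regexHits = 0;
$domHits   = 0;
foreach ($variants as $html) {
    $regexHits += preg_match($pattern, $html);                // 1 on match, 0 otherwise
    $domHits   += (hrefsViaDom($html) === ['/docs']) ? 1 : 0;
}

printf("regex matched %d of %d, DOM matched %d of %d\n",
    $regexHits, count($variants), $domHits, count($variants));
```

The pattern could of course be extended to cover these cases, but each new variation needs another manual fix, whereas the parser already handles them.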

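HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job at this.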

Books

If you want to spend some money, have a look at

I am not affiliated with PHP Architect or the authors.

Tags: php, xml, parsing, xml-parsing, html-parsing

Source: https://codeday.me/bug/20190910/1802427.html
