Parsing HTML in C#

There are two options: C# HTML parser and HTML Agility Pack.

http://developer.51cto.com/art/200909/149097.htm


(1) .NET HTML parser (C#): a fully open source .NET module. The entire source is only about 5K lines, and it runs well under Mono.

Source code:

Download: HTMLparser.zip v3.1.4 (08/08/08) (625 KB)
Old versions (avoid using them as they are no longer supported):
HTMLparser.zip v2.0.0 (04/12/06) (150 KB)
HTMLparser.zip v1.0.1 (04/02/06) (80 KB)


Copied from http://www.majestic12.co.uk/projects/html_parser.php

Free .NET HTML parser (C#) is an open source high-performance .NET C# module that was created to parse HTML for links, indexing and other purposes. Full source code (~5k lines) is available under BSD license (this means you can use it in your commercial applications). This cross-platform code is verified to run very well under Mono. The parser is 100% self-contained managed code that does not depend on any external DLLs apart from core .NET libraries. We use this parser to process well over 3 TB of HTML every day.

I created this module for use in the Distributed Search Engine, which required processing terabytes of HTML on a daily basis, and naturally it had to be done very fast. Thus, the focus of this project was high performance. I've spent countless hours making sure it's fast, and you will be able to benchmark it on your own hardware. For reference, Majestic-12's homepage snapshot (20 KB) is parsed in 0.47 ms by v3.0 (v1.0 took under 2 ms) on an Athlon X2 3800 (2 GHz) PC, using a single core and dual-channel DDR 400.

The current version is about 4 (!) times faster than the one released last year. It also supports non-English text via encodings (see Main.cs for details) as well as Unicode characters set via entities, and it should be more suitable for XML parsing.

There are NUnit tests that cover approximately 71% of the code, and 91% of the key TagParser.cs, which deals with tag parsing. You can help by adding to the existing tests; it is best to use TestDriven.NET, as it makes it easy to run the tests and see how much of the code they cover.

I would be very interested to know how this module compares to others, so if you have done some testing, please email me the results. It would also be nice to get a few NUnit test cases for automated testing, as it is very easy to break the parser in a subtle way that won't be immediately apparent.

Finally, if you manage to squeeze more speed out of it, it would be nice if you shared the changes with me. This would help you too: I am certainly going to try to make it faster than it is now, so if you share your changes you won't have to merge my changes into yours.


(2) HTML Agility Pack

http://msdn.microsoft.com/zh-tw/evalcenter/ee787055.aspx

Get the Html Agility Pack from CodePlex: http://htmlagilitypack.codeplex.com/
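As a rough illustration of what using the Html Agility Pack looks like, here is a minimal sketch that pulls the link targets out of a page. It assumes the HtmlAgilityPack assembly is referenced in the project; the sample HTML string and the XPath expression are made up for this example and are not from the original post.

```csharp
using System;
using HtmlAgilityPack;

class LinkExtractor
{
    static void Main()
    {
        // Hypothetical input; in practice this would be a downloaded page.
        string html = "<html><body>"
                    + "<a href=\"http://example.com/a\">A</a>"
                    + "<a href=\"http://example.com/b\">B</a>"
                    + "</body></html>";

        // Parse the markup into a DOM-like tree.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <a> element that carries an href attribute.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // The second argument is the default returned if the attribute is missing.
            Console.WriteLine(link.GetAttributeValue("href", string.Empty));
        }
    }
}
```

The null check matters because SelectNodes does not return an empty collection when the XPath query matches nothing.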


