C#: Scraping HTML

Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML.

Example

First, for my demonstration I will scrape HTML links from Wikipedia.org. This is permitted by Wikipedia's free content licensing, and this demonstration is fair use. Here we see code that downloads the English Wikipedia main page.

Note: The program opens a connection to Wikipedia.org and downloads the content at the specified URL. Part 2 uses the LinkFinder class to loop over each link and its text.

Program that scrapes HTML: C#

using System.Diagnostics;
using System.Net;

class Program
{
    static void Main()
    {
        // Scrape links from wikipedia.org

        // 1.
        // Download the page at http://en.wikipedia.org/wiki/Main_Page.
        // WebClient is disposable, so it is wrapped in a using block.
        string s;
        using (WebClient w = new WebClient())
        {
            s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");
        }

        // 2.
        // Print each link and its text.
        foreach (LinkItem i in LinkFinder.Find(s))
        {
            Debug.WriteLine(i);
        }
    }
}

Regular expressions

Here I show a simple class that receives the HTML string and then extracts all the links and their text into structs. It is fairly fast, but I offer some optimization tips further down. It would be better to make LinkItem a class rather than a struct and offer methods that act on its contents.

Program that scrapes with Regex: C#

using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
	return Href + "\n\t" + Text;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
	List<LinkItem> list = new List<LinkItem>();

	// 1.
	// Find all matches in file.
	MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
	    RegexOptions.Singleline);

	// 2.
	// Loop over each match.
	foreach (Match m in m1)
	{
	    string value = m.Groups[1].Value;
	    LinkItem i = new LinkItem();

	    // 3.
	    // Get href attribute.
	    Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
		RegexOptions.Singleline);
	    if (m2.Success)
	    {
		i.Href = m2.Groups[1].Value;
	    }

	    // 4.
	    // Remove inner tags from text.
	    string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
		RegexOptions.Singleline);
	    i.Text = t;

	    list.Add(i);
	}
	return list;
    }
}
Steps

This example first finds all hyperlink tags. We store all the complete A tags in a MatchCollection. These are objects that store the complete HTML strings.
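One caveat with the first step is worth a small sketch. The pattern `<a.*?>` also matches tags whose names merely begin with "a", such as `<abbr>`. The variant below, named `Strict` here purely for illustration, is not from the article; it adds `\b` so the tag name must end right after the "a". It assumes attribute values contain no ">" character.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorPatternDemo
{
    // Pattern from the article: "<a" also matches the start of "<abbr".
    public const string Loose = @"<a.*?>.*?</a>";

    // Hypothetical stricter variant: \b requires the tag name to end after "a".
    public const string Strict = @"<a\b[^>]*>.*?</a>";

    // Returns the first match of the pattern in the HTML, or null if none.
    public static string FirstMatch(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string html = "<abbr>HTML</abbr> <a href=\"/wiki/Portal:Arts\">Arts</a>";

        // The loose pattern starts matching at <abbr> and runs to </a>.
        Console.WriteLine(FirstMatch(html, Loose));

        // The strict pattern matches only the A tag.
        Console.WriteLine(FirstMatch(html, Strict));
    }
}
```

With the loose pattern the "link" begins at the `<abbr>` tag, so its stripped text would include the abbreviation as well as the anchor text.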

In step 2 it loops over all hyperlink tag strings. In the algorithm, the next part examines all the text of the A tags. This is necessary for reading the parts of the A tags. For each A tag, it reads in the HREF attribute. This attribute points to other web resources. This part is not failsafe (for example, it only recognizes double-quoted attribute values), but almost always works. Step 4 then removes any tags from the match, leaving only the anchor text.
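To illustrate one way the HREF step could be made a little more tolerant, here is a minimal sketch. It is not the article's code. The class name `HrefReader` and the method `GetHref` are hypothetical. The pattern accepts either quote style by capturing the opening quote and requiring the same character to close the value, and it ignores case and spaces around the equals sign. It still does not handle unquoted attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefReader
{
    // Group 1 is the opening quote (" or '), group 2 is the attribute value,
    // and \1 requires the closing quote to match the opening one.
    static readonly Regex Href = new Regex(
        @"href\s*=\s*([""'])(.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the href value found in the tag, or null if there is none.
    public static string GetHref(string tag)
    {
        Match m = Href.Match(tag);
        return m.Success ? m.Groups[2].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(GetHref("<a href=\"/wiki/Portal:Arts\">Arts</a>"));
        Console.WriteLine(GetHref("<A HREF = '/wiki/Portal:History'>History</A>"));
        Console.WriteLine(GetHref("<a name=\"top\">Top</a>") == null);
    }
}
```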

Finally, the method returns the List of LinkItem objects it has built up. This list can then be used in the foreach loop from the first C# example. The ToString method override above simply provides a standard way of printing the links.

Tests

My first two attempts at this code were incorrect and had unacceptable bugs, but the version shown here seems to work well. You need to use RegexOptions.Singleline. In .NET, the dot in a Regex matches all characters except a newline unless this option is specified, so matching links that span multiple lines requires RegexOptions.Singleline.

RegexOptions: MSDN

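Run the program on your website and it will print the matches with Debug.WriteLine, which writes to the debugger's output window in a debug build; swap in Console.WriteLine to print to the console. Here we see part of the results for the Wikipedia home page at the time of writing. The original HTML shows where the links were extracted. They are contained in LI tags.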

Note: The program successfully extracted both the anchor text and the HREF value.

Output

#column-one
    navigation
#searchInput
    search
/wiki/Wikipedia
    Wikipedia
/wiki/Free_content
    free
/wiki/Encyclopedia
    encyclopedia
/wiki/Wikipedia:Introduction
    anyone can edit
/wiki/Special:Statistics
    2,617,101
/wiki/English_language
    English
/wiki/Portal:Arts
    Arts
/wiki/Portal:Biography
    Biography
/wiki/Portal:Geography
    Geography
/wiki/Portal:History
    History
/wiki/Portal:Mathematics
    Mathematics
/wiki/Portal:Science
    Science
/wiki/Portal:Society
    Society
/wiki/Portal:Technology_and_applied_sciences
    Technology

Original website HTML

<ul>
<li><a href="/wiki/Portal:Arts" title="Portal:Arts">Arts</a></li>
<li><a href="/wiki/Portal:Biography" title="Portal:Biography">Biography</a></li>
<li><a href="/wiki/Portal:Geography" title="Portal:Geography">Geography</a></li>

</ul>

Singleline

Many C# developers forget to specify RegexOptions.Singleline when a pattern must match across multiple lines, so that newlines are treated as regular characters. MSDN states that Singleline "Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n)."
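The effect is easy to check with a small sketch. The class name `SinglelineDemo` and the sample HTML are made up for illustration; the pattern is the same kind the article uses for whole A tags.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // A link whose anchor text starts on a new line.
    public const string Html = "<a href=\"/wiki/Portal:Arts\">\nArts</a>";

    public const string Pattern = @"<a.*?>.*?</a>";

    // Without Singleline, the dot stops at "\n", so the match fails.
    public static bool MatchesDefault()
    {
        return Regex.IsMatch(Html, Pattern);
    }

    // With Singleline, the dot also matches "\n".
    public static bool MatchesSingleline()
    {
        return Regex.IsMatch(Html, Pattern, RegexOptions.Singleline);
    }

    static void Main()
    {
        Console.WriteLine(MatchesDefault());    // False
        Console.WriteLine(MatchesSingleline()); // True
    }
}
```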

Performance

You can improve the performance of the regular expressions by specifying RegexOptions.Compiled and by using instance Regex objects instead of the static methods I show. Normally, your Internet connection will be the bottleneck.
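As a rough sketch of that suggestion, the patterns can be stored in static readonly Regex fields so each one is constructed once and reused. The class name `CompiledLinkPatterns` and the method `FindHrefs` are hypothetical. Whether RegexOptions.Compiled pays off depends on how many pages are processed, since it makes construction slower in exchange for faster matching.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CompiledLinkPatterns
{
    // Built once and reused on every call.
    static readonly Regex Anchor = new Regex(@"<a.*?>.*?</a>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    static readonly Regex Href = new Regex(@"href=\""(.*?)\""",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Returns the href value of every A tag that has one.
    public static List<string> FindHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match tag in Anchor.Matches(html))
        {
            Match href = Href.Match(tag.Value);
            if (href.Success)
            {
                hrefs.Add(href.Groups[1].Value);
            }
        }
        return hrefs;
    }

    static void Main()
    {
        string html = "<li><a href=\"/wiki/Portal:Arts\">Arts</a></li>" +
                      "<li><a href=\"/wiki/Portal:History\">History</a></li>";
        foreach (string href in FindHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```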

Summary

We scraped HTML content from the Internet. The code is more flexible than some other approaches. Using three regular expressions, you can extract HTML links into objects with a fair degree of accuracy.

Note: I have tested this code on several sites where it is legal. It is a valuable tool for webmasters.
