[C#] 57. Concurrent Collections in .NET: ConcurrentBag

Latest recommended article published on 2024-08-19 20:21:23

White_Hacker


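This example is a straightforward, scalable crawler application. I understand the general idea, but in real-world use, how should timeouts and failed network connections be handled?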

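The Crawling type: identifies the URL of a page that needs to be crawled and which crawler found it.

// Namespaces needed by the snippets in this article
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class Crawling
{
    public string UrlToCrawl { get; set; }
    public string ProducerName { get; set; }
}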

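The GetRandomDelay task: simulates a random wait and is called with await.

static Task GetRandomDelay()
{
    // Seeded from the current millisecond, so calls made within the same
    // millisecond get the same delay; that is acceptable for a simulation.
    int delay = new Random(DateTime.Now.Millisecond).Next(150, 200);
    return Task.Delay(delay);
}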

CreateLinks: builds the simulated pages, mapping each page address (key) to the other addresses found on that page (value).

// The current page is the key; the links on that page are the value.
static Dictionary<string, string[]> _contentEmulation = new Dictionary<string, string[]>();

static void CreateLinks()
{
    _contentEmulation["http://microsoft.com/"] = new[] { "http://microsoft.com/a.html", "http://microsoft.com/b.html" };
    _contentEmulation["http://microsoft.com/a.html"] = new[] { "http://microsoft.com/c.html", "http://microsoft.com/d.html" };
    _contentEmulation["http://microsoft.com/b.html"] = new[] { "http://microsoft.com/e.html" };

    _contentEmulation["http://google.com/"] = new[] { "http://google.com/a.html", "http://google.com/b.html" };
    _contentEmulation["http://google.com/a.html"] = new[] { "http://google.com/c.html", "http://google.com/d.html" };
    _contentEmulation["http://google.com/b.html"] = new[] { "http://google.com/e.html", "http://google.com/f.html" };
    _contentEmulation["http://google.com/c.html"] = new[] { "http://google.com/h.html", "http://google.com/i.html" };

    _contentEmulation["http://facebook.com/"] = new[] { "http://facebook.com/a.html", "http://facebook.com/b.html" };
    _contentEmulation["http://facebook.com/a.html"] = new[] { "http://facebook.com/c.html", "http://facebook.com/d.html" };
    _contentEmulation["http://facebook.com/b.html"] = new[] { "http://facebook.com/e.html" };

    _contentEmulation["http://twitter.com/"] = new[] { "http://twitter.com/a.html", "http://twitter.com/b.html" };
    _contentEmulation["http://twitter.com/a.html"] = new[] { "http://twitter.com/c.html", "http://twitter.com/d.html" };
    _contentEmulation["http://twitter.com/b.html"] = new[] { "http://twitter.com/e.html" };
    _contentEmulation["http://twitter.com/c.html"] = new[] { "http://twitter.com/f.html", "http://twitter.com/g.html" };
    _contentEmulation["http://twitter.com/d.html"] = new[] { "http://twitter.com/h.html" };
    _contentEmulation["http://twitter.com/e.html"] = new[] { "http://twitter.com/i.html" };
}

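GetLinksFromContent: simulates extracting the other addresses that appear on a page (its content).

// Returns the URLs found on the page named by task.UrlToCrawl,
// or null when the page has no known links.
static async Task<IEnumerable<string>> GetLinksFromContent(Crawling task)
{
    await GetRandomDelay();
    string[] links;
    // A single TryGetValue lookup replaces ContainsKey followed by the indexer.
    if (_contentEmulation.TryGetValue(task.UrlToCrawl, out links)) return links;
    return null;
}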
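On the opening question about timeouts and connection failures: the simulated GetLinksFromContent never fails, so the article does not show this. One possible approach is sketched below, under stated assumptions. TryFetchLinksAsync, its fetch delegate, and the retry count are hypothetical names introduced for illustration only, and the sketch assumes a real fetch would use HttpClient (hence HttpRequestException) and honor a CancellationToken. It returns null on failure to match the "no links" convention used above.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class FetchWithTimeoutSketch
{
    // Hypothetical helper: runs a fetch delegate with a per-attempt timeout
    // and gives up after maxAttempts failed attempts.
    public static async Task<IEnumerable<string>> TryFetchLinksAsync(
        Func<CancellationToken, Task<IEnumerable<string>>> fetch,
        TimeSpan timeout,
        int maxAttempts)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            // The token is cancelled automatically once the timeout elapses.
            using (var cts = new CancellationTokenSource(timeout))
            {
                try
                {
                    return await fetch(cts.Token);
                }
                catch (OperationCanceledException)
                {
                    Console.WriteLine("Attempt {0} timed out", attempt);
                }
                catch (HttpRequestException ex)
                {
                    // Assumption: a real fetch based on HttpClient reports
                    // connection failures with this exception type.
                    Console.WriteLine("Attempt {0} failed: {1}", attempt, ex.Message);
                }
            }
        }
        return null; // treated by the caller like a page with no links
    }

    static void Main()
    {
        // Simulated slow page: takes 1 s, but only 50 ms are allowed per attempt.
        IEnumerable<string> links = TryFetchLinksAsync(
            async ct =>
            {
                await Task.Delay(1000, ct);
                return new[] { "http://example.com/a.html" };
            },
            TimeSpan.FromMilliseconds(50),
            2).GetAwaiter().GetResult();

        Console.WriteLine(links == null ? "gave up" : "got links");
    }
}
```

A crawler loop could call such a helper in place of the direct fetch and simply skip a URL when the result is null. Whether to re-queue the failed URL instead is a design choice this sketch leaves open.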

The Crawl task: takes Crawling instances out of the ConcurrentBag, uses the asynchronous GetLinksFromContent to find the other URLs on each instance's UrlToCrawl page, wraps those URLs in new Crawling instances, adds them to the bag, and then loops to take the next instance.

static async Task Crawl(ConcurrentBag<Crawling> bag, string crawlerName)
{
    Crawling task;
    while (bag.TryTake(out task)) // TryTake removes one Crawling object from the bag and returns it.
    {
        // Simulate fetching the page (key) and extracting the other pages it links to (value).
        IEnumerable<string> urls = await GetLinksFromContent(task);
        if (urls != null)
        {
            foreach (var url in urls)
            {
                Crawling t = new Crawling { UrlToCrawl = url, ProducerName = crawlerName };
                bag.Add(t); // add the newly created item to the bag
            }
        }
        Console.WriteLine("Indexing url {0} posted by {1} is completed by {2}!", task.UrlToCrawl, task.ProducerName, crawlerName);
    }
}
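One detail of the loop above is worth noting: TryTake returns false as soon as the bag is momentarily empty, so a crawler can exit while another crawler is still awaiting a page and about to add more links. The remaining crawlers still finish the work, but with fewer workers. The following is a minimal sketch of one way to keep workers alive until all work is done, not the article's method. The _pending counter, Enqueue, Worker, and the tiny Links table are hypothetical names and data introduced for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class PendingCountSketch
{
    // Hypothetical counter: items waiting in the bag plus items being processed.
    static int _pending;

    // Stand-in link table (read-only after construction); not the article's data.
    static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        { "root", new[] { "a", "b" } },
        { "a", new[] { "c" } }
    };

    static void Enqueue(ConcurrentBag<string> bag, string url)
    {
        Interlocked.Increment(ref _pending); // count first, then publish the item
        bag.Add(url);
    }

    static async Task Worker(ConcurrentBag<string> bag, ConcurrentBag<string> done)
    {
        // Keep going while any item is queued or still being processed.
        while (Volatile.Read(ref _pending) > 0)
        {
            string url;
            if (!bag.TryTake(out url))
            {
                await Task.Delay(10); // bag is empty for now; other workers may add more
                continue;
            }

            await Task.Delay(20); // simulated fetch

            string[] children;
            if (Links.TryGetValue(url, out children))
            {
                foreach (var child in children) Enqueue(bag, child);
            }

            done.Add(url);
            // Decrement only after the children are counted, so the counter
            // cannot reach zero while work remains.
            Interlocked.Decrement(ref _pending);
        }
    }

    public static async Task<int> RunAsync()
    {
        var bag = new ConcurrentBag<string>();
        var done = new ConcurrentBag<string>();
        Enqueue(bag, "root");
        await Task.WhenAll(Worker(bag, done), Worker(bag, done), Worker(bag, done));
        return done.Count;
    }

    static void Main()
    {
        // Four URLs are reachable in the stand-in table: root, a, b, c.
        Console.WriteLine(RunAsync().GetAwaiter().GetResult()); // prints 4
    }
}
```

The polling delay keeps the sketch short. A blocking or signalling collection would avoid the polling, at the cost of moving away from the plain ConcurrentBag loop shown in the article.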

RunProgram: creates four crawler tasks to process the items in the bag.

static async Task RunProgram()
{
    var bag = new ConcurrentBag<Crawling>();
    // A subset of the keys defined in CreateLinks(); think of them as entry points.
    string[] urls = new[] { "http://microsoft.com/", "http://google.com/", "http://facebook.com/", "http://twitter.com/" };

    var crawlers = new Task[4]; // four crawler tasks
    for (int i = 1; i <= 4; i++)
    {
        string crawlerName = "Crawler " + i.ToString();
        bag.Add(new Crawling { UrlToCrawl = urls[i - 1], ProducerName = "root" });
        // Each crawler runs Crawl against the same bag, which exercises ConcurrentBag concurrently.
        crawlers[i - 1] = Task.Run(() => Crawl(bag, crawlerName));
    }

    await Task.WhenAll(crawlers);
}

The main thread's entry point:

static void Main(string[] args)
{
    CreateLinks();
    Task t = RunProgram();
    t.Wait();
    Console.Read();
}

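A brief explanation of one run: google.com is posted by root, then taken from the bag and processed by Crawler 2. Crawler 2 then posts the two pages under google.com, google.com/a.html and google.com/b.html, into the bag. As it happens, both of those addresses are later indexed by Crawler 2 as well. google.com/e.html, however, is posted by Crawler 2 but processed by Crawler 1. So the items really are processed in parallel, and without conflicts.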