Automatically Grab Images From a Website With C#

Requirement definition


The application should be able to:

  1. Retrieve the HTML markup as text from a given URL through an HTTP request;
  2. Extract the URI of the image from the HTML markup;
  3. Save the image from its URI to the local disk.


First of all, let's set up a console application in Visual Studio 2012.



Get HTML markup


A bit of googling led me to the post "Get HTML code from a website C#". There are plenty of ways to do this, at different levels of abstraction, but for such a common task there are always convenient shortcuts. One answer there caught my eye, so I went to http://fizzlerex.codeplex.com/.


I downloaded the zip file and extracted the pack into:



Add references to the DLL files:



Add just these two DLLs; if you also add Fizzler.dll, you will get a compilation error later. Following the site's instructions, I arrived at this code:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end

class Program
{
	static void Main(string[] args)
	{
		string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
		HtmlWeb web = new HtmlWeb();
		HtmlDocument document = web.Load(httpURL);
		HtmlNode page = document.DocumentNode;
		
		var items = page.QuerySelectorAll("img");
		//...
		
		Console.ReadLine();
	}
}


It doesn't work: when execution reaches page.QuerySelectorAll(), it throws an exception, which is well documented here: http://stackoverflow.com/questions/29053667/fizzelerex-throws-an-exception-when-trying-to-web-scrape-a-website-in-c-sharp-20. Apparently this is not a mature solution.


However, we can at least still use another property, InnerHtml, to get the HTML markup:


class Program
{
	static void Main(string[] args)
	{
		string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
		HtmlWeb web = new HtmlWeb();
		HtmlDocument document = web.Load(httpURL);
		HtmlNode page = document.DocumentNode;
		
		string markup = page.InnerHtml;
		
		Console.ReadLine();
	}
}


Scrape HTML markup


For the scraping itself, we can no longer count on Fizzler. We have to fall back on the old-fashioned way: regular expressions.

But before that, let's take a close look at the site we are going to scrape:



Notice the URL: the only part that matters is the value of the query string parameter detail, so its pattern can be written as:

http://www.torlundvall.com/gallery.asp?start=1&detail=<number>
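
Since only the detail value changes, the page URLs for a range of images can be generated in a loop. Here is a minimal sketch of that idea. The helper name GalleryUrls.Build and the range 1 to 3 are assumptions made for illustration; the real range depends on how many paintings the gallery holds.

```csharp
using System;
using System.Collections.Generic;

static class GalleryUrls
{
    // URL prefix taken from the pattern above; only the detail number is appended.
    const string Prefix = "http://www.torlundvall.com/gallery.asp?start=1&detail=";

    // Hypothetical helper: returns one page URL per detail number in [first, last].
    public static List<string> Build(int first, int last)
    {
        var urls = new List<string>();
        for (int detail = first; detail <= last; detail++)
        {
            urls.Add(Prefix + detail);
        }
        return urls;
    }
}

class Program
{
    static void Main()
    {
        // The range 1..3 is an assumption for illustration only.
        foreach (string url in GalleryUrls.Build(1, 3))
        {
            Console.WriteLine(url);
        }
    }
}
```

Each printed line is a page URL that can be handed to HtmlWeb.Load.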


And its HTML is:

<html>
<head>
<title>Gallery</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>
	.maintext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:000000;}
	.soldtext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:#CC0000;}
	A {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: Underline; color:000000;}
</style>
<script LANGUAGE="JavaScript">
var popup_url = 'photo_popup.asp?photo=';
var windowvars;
var popupwindow = null;
var popupwindow_open = false;

function popup(img,imgw,imgh) {
	if (popupwindow_open) {
        closePopupwindow();
    }

	windowvars = 'menubar=0,scrollbars=0,toolbar=0,location=0,resizeable=0,width='+imgw+',height='+imgh+',top=100';
    popupwindow = window.open(popup_url+img,"PhotoPopUp",windowvars);
	popupwindow_open = true;
	if (window.focus) {
		popupwindow.focus();
	}
}

function closePopupwindow() {
    if (popupwindow != null) {
        if (popupwindow_open) {
            popupwindow_open = false;
            popupwindow.close();
        }
    }
}
</script>
</head>
<body bgcolor="#FFFFFF" leftmargin="0" marginwidth="0" marginheight="0">
<table id="Table_01" width="550" height="768" border="0" cellpadding="0" cellspacing="0" align="center">
	<tr>
		<td colspan="15">
			<img src="images/gallery_01.jpg" width="550" height="16" alt=""></td>
	</tr>
	<tr>
		<td rowspan="2" background="images/gallery_02_tall.jpg">
			<img src="images/gallery_02.jpg" width="17" height="572" alt=""></td>
		<td>
			<a href="news.html"><img src="images/gallery_03.jpg" width="56" height="64" title="news" border="0"></a></td>
		<td>
			<img src="images/gallery_04.jpg" width="24" height="64" alt=""></td>
		<td>
			<a href="gallery.asp"><img src="images/gallery_05.jpg" width="56" height="64" title="gallery" border="0"></a></td>
		<td>
			<img src="images/gallery_06.jpg" width="23" height="64" alt=""></td>
		<td>
			<a href="discography.html"><img src="images/gallery_07.jpg" width="54" height="64" title="discography" border="0"></a></td>
		<td>
			<img src="images/gallery_08.jpg" width="22" height="64" alt=""></td>
		<td>
			<a href="ps.asp"><img src="images/gallery_09.jpg" width="56" height="64" title="personal statement" border="0"></a></td>
		<td>
			<img src="images/gallery_10.jpg" width="21" height="64" alt=""></td>
		<td>
			<a href="cv.asp"><img src="images/gallery_11.jpg" width="55" height="64" title="curriculum vitae" border="0"></a></td>
		<td>
			<img src="images/gallery_12.jpg" width="18" height="64" alt=""></td>
		<td>
			<a href="photos.html"><img src="images/gallery_13.jpg" width="56" height="64" title="photos" border="0"></a></td>
		<td>
			<img src="images/gallery_14.jpg" width="20" height="64" alt=""></td>
		<td>
			<a href="links.html"><img src="images/gallery_15.jpg" width="56" height="64" title="links" border="0"></a></td>
		<td rowspan="2" background="images/gallery_16_tall.jpg">
			<img src="images/gallery_16.jpg" width="16" height="572" alt=""></td>
	</tr>
	<tr>
		<td colspan="13" background="images/gallery_bg_stretched.jpg" valign="top">
		<table cellpadding="0" cellspacing="0" border="0" align="center">
		
		<!--show detail-->
			<tr>
				<td height="5" width="30"></td>
				<td height="5" width="457"></td>
				<td height="5" width="30"></td>
			</tr>
			<tr>
				<td colspan="3" align="center" height="470" valign="middle"><a href="gallery.asp?start=1"><img src="images/paintings/tl-waiting_1.jpg" border="0"></a></td>
			</tr>
			<tr>
				<td align="left"><!--<a href="gallery.asp?detail=0"><font face="Arial, Verdana" size="2"><br>back</font></a>--></td>
				<td align="center"><a href="gallery.asp?start=1"><font face="Arial, Verdana" size="2"><br>return to gallery</font></a></td>
				<td align="right"><!--<a href="gallery.asp?detail=2"><font face="Arial, Verdana" size="2"><br>next</font></a>--></td>
			</tr>
			
		</table>
        
        </td>
	</tr>
	<!--<tr>
    	<td background="images/gallery_02.jpg"></td>
		<td colspan="13" background="images/gallery_flipped_bg.jpg" align="center">
        	<a href="javascript:popup('bio_gallery.jpg',533,404);" alt=""><img src="images/discography/details_button.jpg" border="0"></a></td>
        <td background="images/gallery_17.jpg"></td>
	</tr>-->
    <tr>
    	<td colspan="15">
        	<img src="images/gallery_18.jpg" width="550" height="180" alt=""></td>
    </tr>
</table>
</body>
</html>

The only information we need is the src attribute of the innermost <img> tag:

<img src="images/paintings/tl-waiting_1.jpg" border="0">


The name of the image file can contain lowercase letters, digits, hyphens, and underscores. Following the tutorial at http://www.dotnetperls.com/regex-match, I arrived at this:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end

// Regular Expression
using System.Text.RegularExpressions;
// end

class Program
{
	static void Main(string[] args)
	{
		string imgWebDir = "http://www.torlundvall.com/images/paintings/";
		string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
		// The dot before "jpg" is escaped so that it matches a literal dot only.
		string imgSrcPattern =  @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
		string imgExtensionName = ".jpg";
		// Range of detail numbers to visit. Placeholder values: adjust them to the gallery.
		int start = 1;
		int end = 10;
		
		HtmlWeb web = new HtmlWeb();
		for (int i = start; i <= end; i++)
		{
			HtmlDocument document = web.Load(httpURL + i.ToString());
			HtmlNode page = document.DocumentNode;

			string markup = page.InnerHtml;

			Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);

			if (match.Success)
			{
				// Group 1 holds the file name without its extension.
				string fileName = match.Groups[1].Value;
				string fullImgURL = imgWebDir + fileName + imgExtensionName;
				//...
			}
		}
		
		Console.ReadLine();
	}
}
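
To check the pattern without hitting the website, it can be run against a fragment of the markup shown earlier. This is only a sketch: the helper name ImageSrc.ExtractName is hypothetical, and the pattern is the one from the listing, written here with the dot before jpg escaped.

```csharp
using System;
using System.Text.RegularExpressions;

static class ImageSrc
{
    // Pattern from the listing, with the dot before "jpg" escaped to match a literal dot.
    const string Pattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";

    // Hypothetical helper: returns the file name without extension, or null if nothing matches.
    public static string ExtractName(string markup)
    {
        Match match = Regex.Match(markup, Pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }
}

class Program
{
    static void Main()
    {
        // Fragment copied from the gallery page markup.
        string painting = "<td><a href=\"gallery.asp?start=1\">" +
                          "<img src=\"images/paintings/tl-waiting_1.jpg\" border=\"0\"></a></td>";
        // A layout image that lives outside images/paintings/ and must not match.
        string layout = "<img src=\"images/gallery_01.jpg\" width=\"550\" height=\"16\" alt=\"\">";

        Console.WriteLine(ImageSrc.ExtractName(painting));                // tl-waiting_1
        Console.WriteLine(ImageSrc.ExtractName(layout) ?? "(no match)");  // (no match)
    }
}
```

Because the pattern includes the images/paintings/ directory, the layout images of the page are skipped and only the painting is captured.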


References:

Regex.Match Method (String)

Escape quotes in a C# Regex pattern


Save Image

Following the post "How to download image from url using c#", the image is saved with WebClient:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end

// Regular Expression
using System.Text.RegularExpressions;
// end

// WebClient
using System.Net;
// end

class Program
{
	static void Main(string[] args)
	{
		string imgWebDir = "http://www.torlundvall.com/images/paintings/";
		string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
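		// The dot before "jpg" is escaped so that it matches a literal dot only.
		string imgSrcPattern =  @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
		string imgExtensionName = ".jpg";
		string imgSaveDir = @"C:\torlundvall\";
		// Range of detail numbers to visit. Placeholder values: adjust them to the gallery.
		int start = 1;
		int end = 10;
		
		// DownloadFile fails if the target directory is missing, so create it first.
		System.IO.Directory.CreateDirectory(imgSaveDir);
		
		HtmlWeb web = new HtmlWeb();
		for (int i = start; i <= end; i++)
		{
			HtmlDocument document = web.Load(httpURL + i.ToString());
			HtmlNode page = document.DocumentNode;

			string markup = page.InnerHtml;

			Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);

			if (match.Success)
			{
				// Group 1 holds the file name without its extension.
				string fileName = match.Groups[1].Value;
				string fullImgURL = imgWebDir + fileName + imgExtensionName;
				using (WebClient webClient = new WebClient())
				{
					webClient.DownloadFile(fullImgURL, imgSaveDir + fileName + imgExtensionName);
				}
			}
		}
		
		Console.ReadLine();
	}
}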
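
The listing rebuilds the image URL from a fixed web directory plus the captured file name. As an alternative sketch, not taken from the original post, System.Uri can resolve the relative src value against the URL of the page it came from, and the local file name can be taken from the resulting path. The names ImageLocation, Resolve, and FileNameOf are hypothetical.

```csharp
using System;
using System.IO;

static class ImageLocation
{
    // Resolves a relative src value against the URL of the page that contains it.
    public static Uri Resolve(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }

    // Uses the last path segment of the image URL as the local file name.
    public static string FileNameOf(Uri imageUri)
    {
        return Path.GetFileName(imageUri.AbsolutePath);
    }
}

class Program
{
    static void Main()
    {
        string pageUrl = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
        string src = "images/paintings/tl-waiting_1.jpg";

        Uri imageUri = ImageLocation.Resolve(pageUrl, src);
        Console.WriteLine(imageUri.AbsoluteUri);
        Console.WriteLine(ImageLocation.FileNameOf(imageUri));
    }
}
```

The resolved URL can be passed to WebClient.DownloadFile, which would remove the need for the separate imgWebDir and imgExtensionName strings.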



References:

WebClient.DownloadFile Method (Uri, String)













