Requirement definition
The application should be able to:
- Retrieve the HTML markup as text from a given URL through an HTTP request;
- Extract the URI of the image from the HTML markup;
- Save the image from its URI to local disk.
First of all, let's set up a console application in VS 2012:
Get HTML markup
After some googling I found the post: Get HTML code from a website C#. There are plenty of ways to do this, at different levels of abstraction, but for such a common task there are always convenient shortcuts. One answer in that post caught my eye, so I went to http://fizzlerex.codeplex.com/.
I downloaded the zip file and extracted the pack into:
Add references to the DLL files:
Add only these two DLLs; if you also add Fizzler.dll, you will get a compilation error later. Following the site's instructions, I arrived at code that reads like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end
class Program
{
    static void Main(string[] args)
    {
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
        // Download the page and parse it into an HtmlDocument.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument document = web.Load(httpURL);
        HtmlNode page = document.DocumentNode;
        // Select every <img> element with a CSS selector (Fizzler extension method).
        var items = page.QuerySelectorAll("img");
        //...
        Console.ReadLine();
    }
}
It doesn't work: when execution reaches page.QuerySelectorAll(), it throws an exception, which is documented well here: http://stackoverflow.com/questions/29053667/fizzelerex-throws-an-exception-when-trying-to-web-scrape-a-website-in-c-sharp-20. Apparently this is not a mature solution.
However, we can still use the InnerHtml property of the document node to get the HTML markup:
class Program
{
    static void Main(string[] args)
    {
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
        HtmlWeb web = new HtmlWeb();
        HtmlDocument document = web.Load(httpURL);
        HtmlNode page = document.DocumentNode;
        // InnerHtml returns the markup of the whole document as a string.
        string markup = page.InnerHtml;
        Console.ReadLine();
    }
}
Scrape HTML markup
For this step we can no longer count on Fizzler, so we fall back on the old-fashioned way: regular expressions.
But before that, let's take a close look at the site we are going to scrape:
Notice the URL: the only part that matters is the value of the query string parameter detail, so the pattern can be written as:
http://www.torlundvall.com/gallery.asp?start=1&detail=<number>
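As a minimal sketch of how this URL pattern could be turned into code, the helper below builds the detail page URL for one number or for a range of numbers. The class name GalleryUrls and its method names are hypothetical and chosen only for illustration; the start=1 value is simply copied from the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class GalleryUrls
{
    // Fixed part of the detail page URL; only the trailing detail number varies.
    private const string DetailUrlPrefix =
        "http://www.torlundvall.com/gallery.asp?start=1&detail=";

    // Builds the URL of a single detail page.
    public static string BuildDetailUrl(int detail)
    {
        return DetailUrlPrefix + detail.ToString(CultureInfo.InvariantCulture);
    }

    // Lazily yields the URLs for detail numbers first..last (inclusive).
    public static IEnumerable<string> BuildDetailUrls(int first, int last)
    {
        for (int i = first; i <= last; i++)
        {
            yield return BuildDetailUrl(i);
        }
    }
}
```

Calling GalleryUrls.BuildDetailUrls(1, 3) would yield the URLs ending in detail=1, detail=2 and detail=3, which can then be passed to HtmlWeb.Load one by one.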
And its HTML is:
<html>
<head>
<title>Gallery</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>
.maintext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:000000;}
.soldtext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:#CC0000;}
A {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: Underline; color:000000;}
</style>
<script LANGUAGE="JavaScript">
var popup_url = 'photo_popup.asp?photo=';
var windowvars;
var popupwindow = null;
var popupwindow_open = false;
function popup(img,imgw,imgh) {
if (popupwindow_open) {
closePopupwindow();
}
windowvars = 'menubar=0,scrollbars=0,toolbar=0,location=0,resizeable=0,width='+imgw+',height='+imgh+',top=100';
popupwindow = window.open(popup_url+img,"PhotoPopUp",windowvars);
popupwindow_open = true;
if (window.focus) {
popupwindow.focus();
}
}
function closePopupwindow() {
if (popupwindow != null) {
if (popupwindow_open) {
popupwindow_open = false;
popupwindow.close();
}
}
}
</script>
</head>
<body bgcolor="#FFFFFF" leftmargin="0" marginwidth="0" marginheight="0">
<table id="Table_01" width="550" height="768" border="0" cellpadding="0" cellspacing="0" align="center">
<tr>
<td colspan="15">
<img src="images/gallery_01.jpg" width="550" height="16" alt=""></td>
</tr>
<tr>
<td rowspan="2" background="images/gallery_02_tall.jpg">
<img src="images/gallery_02.jpg" width="17" height="572" alt=""></td>
<td>
<a href="news.html"><img src="images/gallery_03.jpg" width="56" height="64" title="news" border="0"></a></td>
<td>
<img src="images/gallery_04.jpg" width="24" height="64" alt=""></td>
<td>
<a href="gallery.asp"><img src="images/gallery_05.jpg" width="56" height="64" title="gallery" border="0"></a></td>
<td>
<img src="images/gallery_06.jpg" width="23" height="64" alt=""></td>
<td>
<a href="discography.html"><img src="images/gallery_07.jpg" width="54" height="64" title="discography" border="0"></a></td>
<td>
<img src="images/gallery_08.jpg" width="22" height="64" alt=""></td>
<td>
<a href="ps.asp"><img src="images/gallery_09.jpg" width="56" height="64" title="personal statement" border="0"></a></td>
<td>
<img src="images/gallery_10.jpg" width="21" height="64" alt=""></td>
<td>
<a href="cv.asp"><img src="images/gallery_11.jpg" width="55" height="64" title="curriculum vitae" border="0"></a></td>
<td>
<img src="images/gallery_12.jpg" width="18" height="64" alt=""></td>
<td>
<a href="photos.html"><img src="images/gallery_13.jpg" width="56" height="64" title="photos" border="0"></a></td>
<td>
<img src="images/gallery_14.jpg" width="20" height="64" alt=""></td>
<td>
<a href="links.html"><img src="images/gallery_15.jpg" width="56" height="64" title="links" border="0"></a></td>
<td rowspan="2" background="images/gallery_16_tall.jpg">
<img src="images/gallery_16.jpg" width="16" height="572" alt=""></td>
</tr>
<tr>
<td colspan="13" background="images/gallery_bg_stretched.jpg" valign="top">
<table cellpadding="0" cellspacing="0" border="0" align="center">
<!--show detail-->
<tr>
<td height="5" width="30"></td>
<td height="5" width="457"></td>
<td height="5" width="30"></td>
</tr>
<tr>
<td colspan="3" align="center" height="470" valign="middle"><a href="gallery.asp?start=1"><img src="images/paintings/tl-waiting_1.jpg" border="0"></a></td>
</tr>
<tr>
<td align="left"><!--<a href="gallery.asp?detail=0"><font face="Arial, Verdana" size="2"><br>back</font></a>--></td>
<td align="center"><a href="gallery.asp?start=1"><font face="Arial, Verdana" size="2"><br>return to gallery</font></a></td>
<td align="right"><!--<a href="gallery.asp?detail=2"><font face="Arial, Verdana" size="2"><br>next</font></a>--></td>
</tr>
</table>
</td>
</tr>
<!--<tr>
<td background="images/gallery_02.jpg"></td>
<td colspan="13" background="images/gallery_flipped_bg.jpg" align="center">
<a href="javascript:popup('bio_gallery.jpg',533,404);" alt=""><img src="images/discography/details_button.jpg" border="0"></a></td>
<td background="images/gallery_17.jpg"></td>
</tr>-->
<tr>
<td colspan="15">
<img src="images/gallery_18.jpg" width="550" height="180" alt=""></td>
</tr>
</table>
</body>
</html>
The only information we need is the src attribute of the innermost <img> tag, the one inside the detail table:
<img src="images/paintings/tl-waiting_1.jpg" border="0">
The image file name may contain lowercase letters, digits, hyphens, and underscores. Following the tutorial at http://www.dotnetperls.com/regex-match, I arrived at this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end
// Regular Expression
using System.Text.RegularExpressions;
// end
class Program
{
    static void Main(string[] args)
    {
        string imgWebDir = "http://www.torlundvall.com/images/paintings/";
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
        // In a verbatim string a double quote is written as "".
        // The dot before "jpg" is escaped so that it matches a literal dot only.
        string imgSrcPattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
        string imgExtensionName = ".jpg";
        // Range of detail numbers to visit. These bounds are placeholder
        // values (an assumption); adjust them to the gallery.
        int start = 1;
        int end = 10;
        HtmlWeb web = new HtmlWeb();
        for (int i = start; i <= end; i++)
        {
            HtmlDocument document = web.Load(httpURL + i.ToString());
            HtmlNode page = document.DocumentNode;
            string markup = page.InnerHtml;
            Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);
            if (match.Success)
            {
                // Group 1 holds the file name without its extension.
                string fileName = match.Groups[1].Value;
                string fullImgURL = imgWebDir + fileName + imgExtensionName;
                //...
            }
        }
        Console.ReadLine();
    }
}
References:
Escape quotes in a C# Regex pattern
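The pattern can also be tried offline against a single tag, without loading the page. The following is only a sketch under a few assumptions: the class name ImageSrc and its method names are hypothetical, the pattern captures the whole relative src in group 1 instead of just the file name, and the absolute URL is obtained with the Uri class rather than by string concatenation.

```csharp
using System;
using System.Text.RegularExpressions;

static class ImageSrc
{
    // Group 1 captures the relative src of a painting image.
    // The dot before "jpg" is escaped so that it matches a literal dot.
    private static readonly Regex PaintingSrc = new Regex(
        @"<img\s+src=""(images/paintings/[A-Za-z0-9\-_]+\.jpg)""",
        RegexOptions.IgnoreCase);

    // Returns the relative src of the first painting image, or null if none is found.
    public static string ExtractRelativeSrc(string markup)
    {
        Match match = PaintingSrc.Match(markup);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Resolves a relative src against the URL of the page it was found on.
    public static string ResolveAgainstPage(string pageUrl, string relativeSrc)
    {
        return new Uri(new Uri(pageUrl), relativeSrc).AbsoluteUri;
    }
}
```

Resolving the src against the page URL avoids hard-coding the image directory, at the cost of one extra Uri allocation per page; for this site both approaches should produce the same address.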
Save Image
Following the post How to download image from url using c#, the image can be saved to disk with WebClient:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end
// Regular Expression
using System.Text.RegularExpressions;
// end
// WebClient
using System.Net;
// end
// Directory
using System.IO;
// end
class Program
{
    static void Main(string[] args)
    {
        string imgWebDir = "http://www.torlundvall.com/images/paintings/";
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
        // The dot before "jpg" is escaped so that it matches a literal dot only.
        string imgSrcPattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
        string imgExtensionName = ".jpg";
        string imgSaveDir = @"C:\torlundvall\";
        // Range of detail numbers to visit. These bounds are placeholder
        // values (an assumption); adjust them to the gallery.
        int start = 1;
        int end = 10;
        // DownloadFile fails if the target directory does not exist.
        Directory.CreateDirectory(imgSaveDir);
        HtmlWeb web = new HtmlWeb();
        for (int i = start; i <= end; i++)
        {
            HtmlDocument document = web.Load(httpURL + i.ToString());
            HtmlNode page = document.DocumentNode;
            string markup = page.InnerHtml;
            Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);
            if (match.Success)
            {
                string fileName = match.Groups[1].Value;
                string fullImgURL = imgWebDir + fileName + imgExtensionName;
                // Download the image and save it under the same file name.
                using (WebClient webClient = new WebClient())
                {
                    webClient.DownloadFile(fullImgURL, imgSaveDir + fileName + imgExtensionName);
                }
            }
        }
        Console.ReadLine();
    }
}
References:
WebClient.DownloadFile Method (Uri, String)
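A single failed download (for example a missing image) would stop the whole loop with an exception. As a hedged sketch of one way to handle that, the helper below wraps WebClient.DownloadFile in a try/catch and reports failure instead of throwing. The class name ImageSaver and its method names are hypothetical, and catching only WebException is an assumption about which failures are worth skipping.

```csharp
using System;
using System.IO;
using System.Net;

static class ImageSaver
{
    // Builds the local file path for an image file name given without extension.
    public static string BuildLocalPath(string saveDir, string fileName)
    {
        return Path.Combine(saveDir, fileName + ".jpg");
    }

    // Downloads one image; returns false instead of throwing when the request fails.
    public static bool TryDownload(string imageUrl, string localPath)
    {
        try
        {
            // Make sure the target directory exists before writing the file.
            string dir = Path.GetDirectoryName(localPath);
            if (!string.IsNullOrEmpty(dir))
            {
                Directory.CreateDirectory(dir);
            }
            using (WebClient webClient = new WebClient())
            {
                webClient.DownloadFile(imageUrl, localPath);
            }
            return true;
        }
        catch (WebException ex)
        {
            Console.WriteLine("Skipping " + imageUrl + ": " + ex.Message);
            return false;
        }
    }
}
```

Inside the loop, the DownloadFile call could then be replaced by ImageSaver.TryDownload(fullImgURL, ImageSaver.BuildLocalPath(imgSaveDir, fileName)), so that the remaining pages are still processed after a failure.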