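A few days ago I found a good English-learning website with plenty of spoken-English material (http://www.kekenet.com/kouyu/lening/). The catch is that every lesson has to be opened page by page before it can be read, which is far too tedious. So I opened the HTML source of one page and found this snippet: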
<EMBED src="http://y.kekenet.com/Sound/kouyu/kekenet6lening/101.mp3" width=300 height=56 type=audio/x-pn-realaudio-plugin controls="ControlPanel,StatusBar"> </EMBED>
Clearly, http://y.kekenet.com/Sound/kouyu/kekenet6lening/101.mp3 should be the address of the MP3 file, so I opened FlashGet, set up a batch download, and pulled down all of the links in one go.
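The batch download works because the file names follow a simple numeric pattern. As a minimal sketch of that idea, the snippet below generates the URL list from a number range. The class name, the method name `BuildMp3Urls`, and the range 101 to 103 are assumptions for illustration only; the real range of file numbers on the site is not stated here.

```csharp
using System;
using System.Collections.Generic;

static class Mp3UrlSketch
{
    // Builds one URL per file number, following the pattern seen in the <EMBED> tag.
    // The caller supplies the range; the actual range on the site is not known here.
    public static List<string> BuildMp3Urls(int first, int last)
    {
        const string baseUrl = "http://y.kekenet.com/Sound/kouyu/kekenet6lening/";
        var urls = new List<string>();
        for (int n = first; n <= last; n++)
        {
            urls.Add(string.Format("{0}{1}.mp3", baseUrl, n));
        }
        return urls;
    }

    static void Main()
    {
        // Hypothetical range, for demonstration only.
        foreach (string url in BuildMp3Urls(101, 103))
        {
            Console.WriteLine(url);
        }
    }
}
```

A list like this can be pasted into any download manager that accepts a batch of URLs.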
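Over the past few days I realized that without the English transcripts, listening is still not very satisfying. All of the transcripts are in the pages, but I was in no mood to open and save them one by one. This time I had no ready-made tool for the job, so I had to write one myself.
First, I create an HttpWebRequest and read the result through the HttpWebResponse, checking the status code to decide whether the response is usable.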
static string GetHTMLfromHTTP(string url)
{
    try
    {
        string requestURL = url;
        System.Net.HttpWebRequest httpRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(requestURL);
        using (System.Net.HttpWebResponse httpResponse = (System.Net.HttpWebResponse)httpRequest.GetResponse())
        {
            if (httpResponse.StatusCode == System.Net.HttpStatusCode.OK)
            {
                // The charset is hard-coded instead of being taken from the response.
                //string encoding = httpResponse.CharacterSet;
                string encoding = "GB2312";
                using (System.IO.Stream receiveStream = httpResponse.GetResponseStream())
                {
                    System.Text.Encoding encode = System.Text.Encoding.GetEncoding(encoding);
                    using (System.IO.StreamReader readStream = new System.IO.StreamReader(receiveStream, encode))
                    {
                        // Read the whole page into a string.
                        string sourceHTML = readStream.ReadToEnd();
                        return sourceHTML;
                    }
                }
            }
            else
            {
                return "";
            }
        }
    }
    catch (System.Exception)
    {
        // Any network or decoding failure yields an empty page.
        return "";
    }
}
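The listing above hard-codes GB2312 and leaves the `httpResponse.CharacterSet` line commented out. A minimal sketch of a middle ground is shown below: try the charset name the server reports, and fall back to a default when the name is empty or not recognized. The helper `ResolveEncoding` and its class are hypothetical, not part of the original program.

```csharp
using System;
using System.Text;

static class EncodingSketch
{
    // Returns the encoding named by 'charset', or 'fallback' when the name
    // is missing or not recognized by the runtime.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charset))
        {
            return fallback;
        }
        try
        {
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return fallback;
        }
    }

    static void Main()
    {
        Console.WriteLine(ResolveEncoding("utf-8", Encoding.ASCII).WebName);
        Console.WriteLine(ResolveEncoding("no-such-charset", Encoding.ASCII).WebName);
    }
}
```

In the download routine this could be called as `ResolveEncoding(httpResponse.CharacterSet, someDefault)`. The default to use is a judgment call for the site in question.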
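The next step is to filter the returned HTML and pick out the part I need, which calls for a regular expression. Looking at the pages, the content I want is always preceded by </script></SPAN> and followed by </SPAN>, so the corresponding regular expression is (?<=</script></SPAN>)[\W\w]*?(?=</SPAN>).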
static string FiterHTML(string source)
{
    try
    {
        string sourceHTML = source;
        string returnHTML = "";
        // The lookbehind and lookahead keep the delimiters out of the match;
        // [\W\w] matches any character, including line breaks.
        string regularExpressions = @"(?<=</script></SPAN>)[\W\w]*?(?=</SPAN>)";
        System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(regularExpressions, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        System.Text.RegularExpressions.Match m = r.Match(sourceHTML);
        while (m.Success)
        {
            // Wrap every matched fragment in its own paragraph.
            System.Text.RegularExpressions.CaptureCollection cc = m.Captures;
            foreach (System.Text.RegularExpressions.Capture c in cc)
            {
                returnHTML += string.Format(@"<P>{0}</P><BR/>", c.Value);
            }
            m = m.NextMatch();
        }
        return returnHTML;
    }
    catch (System.Exception)
    {
        return "";
    }
}
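To see what the lookbehind and lookahead actually capture, here is a small self-contained sketch that runs the same pattern against a made-up HTML string. The sample markup and the names `LookaroundSketch` and `ExtractBetween` are assumptions for illustration; the real pages are larger and were not inspected here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaroundSketch
{
    // Returns every fragment that sits between "</script></SPAN>" and the next "</SPAN>".
    public static List<string> ExtractBetween(string html)
    {
        var regex = new Regex(@"(?<=</script></SPAN>)[\W\w]*?(?=</SPAN>)", RegexOptions.IgnoreCase);
        var results = new List<string>();
        foreach (Match m in regex.Matches(html))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main()
    {
        // Made-up sample; only the delimiters mirror the pattern described in the text.
        string sample = "<SPAN><script>x()</script></SPAN>Hello, world.</SPAN><p>ignored</p>";
        foreach (string fragment in ExtractBetween(sample))
        {
            Console.WriteLine(fragment);
        }
    }
}
```

Because the quantifier is lazy (`*?`), the match stops at the first closing `</SPAN>` after the delimiter rather than at the last one in the page.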
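That gives a basic working model. What remains is to test it and then revise.
Testing showed that the links on this site are not ordered by content, and some of the links in the range hold material I am not interested in at the moment, so those have to be filtered out: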
static bool ValidataLink(int i)
{
    // Page numbers in these ranges belong to other content and are skipped.
    if (i >= 9189 && i <= 9192) return false;
    if (i >= 9228 && i <= 9415) return false;
    if (i >= 9597 && i <= 9618) return false;
    if (i >= 9669 && i <= 9730) return false;
    if (i >= 9762 && i <= 10053) return false;
    if (i >= 10057 && i <= 10118) return false;
    return true;
}
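The same filter can also be written as a table of excluded ranges, which makes it easier to adjust when more unwanted pages turn up during testing. This is only a sketch of an alternative layout; the names `LinkFilterSketch`, `ExcludedRanges`, and `IsWanted` are hypothetical, while the ranges are copied from the function above.

```csharp
using System;

static class LinkFilterSketch
{
    // Inclusive {first, last} page-number ranges to skip, copied from ValidataLink.
    static readonly int[][] ExcludedRanges =
    {
        new[] { 9189, 9192 },
        new[] { 9228, 9415 },
        new[] { 9597, 9618 },
        new[] { 9669, 9730 },
        new[] { 9762, 10053 },
        new[] { 10057, 10118 },
    };

    // Returns false when the page number falls inside any excluded range.
    public static bool IsWanted(int id)
    {
        foreach (int[] range in ExcludedRanges)
        {
            if (id >= range[0] && id <= range[1])
            {
                return false;
            }
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(IsWanted(9010));
        Console.WriteLine(IsWanted(9300));
    }
}
```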
The complete program:
using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApplication2
{
    class Program
    {
        // Returns false for page numbers that hold unrelated content.
        static bool ValidataLink(int i)
        {
            if (i >= 9189 && i <= 9192) return false;
            if (i >= 9228 && i <= 9415) return false;
            if (i >= 9597 && i <= 9618) return false;
            if (i >= 9669 && i <= 9730) return false;
            if (i >= 9762 && i <= 10053) return false;
            if (i >= 10057 && i <= 10118) return false;
            return true;
        }
        static void Main(string[] args)
        {
            // The output page is written as UTF-8, matching the meta tag.
            string content = @"<html><head><meta http-equiv=""Content-Type"" content=""text/html; charset=UTF-8"" /></head><body>";
            int start = 9010;
            int end = 10119;
            //int end = 9300;
            for (int i = start; i <= end; i++)
            {
                // Skip page numbers known to hold unrelated content.
                if (!ValidataLink(i)) continue;
                string http = String.Format("http://www.kekenet.com/kouyu/{0}.shtml", i);
                string currentContent = GetContentByHTTP(http);
                content += currentContent;
                Console.WriteLine("{0}:{1}", i, currentContent);
            }
            content += @"</body></html>";
            // Save everything into a single HTML file.
            using (System.IO.StreamWriter sw = System.IO.File.CreateText(@"d:\test.html"))
            {
                sw.Write(content);
            }
        }
        // Downloads one page and returns only the filtered fragments.
        static string GetContentByHTTP(string url)
        {
            string sourceHTML = GetHTMLfromHTTP(url);
            string returnHTML = FiterHTML(sourceHTML);
            return returnHTML;
        }
        // Downloads a page and decodes it as GB2312; returns "" on any failure.
        static string GetHTMLfromHTTP(string url)
        {
            try
            {
                string requestURL = url;
                System.Net.HttpWebRequest httpRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(requestURL);
                using (System.Net.HttpWebResponse httpResponse = (System.Net.HttpWebResponse)httpRequest.GetResponse())
                {
                    if (httpResponse.StatusCode == System.Net.HttpStatusCode.OK)
                    {
                        //string encoding = httpResponse.CharacterSet;
                        string encoding = "GB2312";
                        using (System.IO.Stream receiveStream = httpResponse.GetResponseStream())
                        {
                            Encoding encode = System.Text.Encoding.GetEncoding(encoding);
                            using (System.IO.StreamReader readStream = new System.IO.StreamReader(receiveStream, encode))
                            {
                                string sourceHTML = readStream.ReadToEnd();
                                return sourceHTML;
                            }
                        }
                    }
                    else
                    {
                        return "";
                    }
                }
            }
            catch (System.Exception)
            {
                return "";
            }
        }
        // Extracts the text between "</script></SPAN>" and the next "</SPAN>".
        static string FiterHTML(string source)
        {
            try
            {
                string sourceHTML = source;
                string returnHTML = "";
                // [\W\w] matches any character, including line breaks.
                string regularExpressions = @"(?<=</script></SPAN>)[\W\w]*?(?=</SPAN>)";
                System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(regularExpressions, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
                System.Text.RegularExpressions.Match m = r.Match(sourceHTML);
                while (m.Success)
                {
                    System.Text.RegularExpressions.CaptureCollection cc = m.Captures;
                    foreach (System.Text.RegularExpressions.Capture c in cc)
                    {
                        returnHTML += string.Format(@"<P>{0}</P><BR/>", c.Value);
                    }
                    m = m.NextMatch();
                }
                return returnHTML;
            }
            catch (System.Exception)
            {
                return "";
            }
        }
    }
}
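The program above builds the output page by repeated string concatenation across roughly a thousand iterations. A minimal sketch of the same assembly step using `StringBuilder` is shown below, on the assumption that the accumulated page can grow large. The names `PageBuilderSketch` and `BuildPage` are hypothetical, and the sample fragments are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PageBuilderSketch
{
    // Wraps the given HTML fragments in a minimal UTF-8 page.
    public static string BuildPage(IEnumerable<string> fragments)
    {
        var sb = new StringBuilder();
        sb.Append("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /></head><body>");
        foreach (string fragment in fragments)
        {
            sb.Append(fragment);
        }
        sb.Append("</body></html>");
        return sb.ToString();
    }

    static void Main()
    {
        // Made-up fragments standing in for the filtered page content.
        string page = BuildPage(new[] { "<P>one</P><BR/>", "<P>two</P><BR/>" });
        Console.WriteLine(page);
    }
}
```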