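I recently needed to scrape some web pages for analysis, so I brushed up on regular expressions.
I put together the rules of thumb below; they cover most of what you need when scraping page content.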
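- Remember ? + *: ? matches zero or one, * matches zero or more, + matches at least one
- ^ anchors the start, $ anchors the end
- Letters are [A-Za-z]; uppercase only is [A-Z], lowercase only is [a-z]
- For characters use \w (a letter, digit or underscore) or \S (any non-whitespace character); the uppercase forms \W, \D and \S negate \w, \d and \s
- For digits use \d or [0-9]
- For line breaks and other whitespace (spaces, tabs, newlines) use \s
- Repetition counts go in braces {}: {n,m} means at least n times and at most m times
- Alternatives are separated with |; to match a special character literally, put \ in front of it
- To run up to the end of an HTML tag: [^>]+ when something must come before the >, [^>]*? or [^>]* when there may be nothing before it
- If a pattern does not match, don't panic: check whether some part needs a *, + or ? after it
- . is the wildcard: it matches any character except a line break
- Catch-all patterns worth trying: (.*?), [\s\S]+, [\s\S]*?, [\w\W]+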
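A few of these rules are easier to trust once they are run. The sketch below is only an illustration in C# (the language used later in this post); the class name RuleDemo, the helper First and the test strings are made up for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RuleDemo
{
    // Text of the first match, or an empty string when nothing matches.
    public static string First(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // ? = zero or one, * = zero or more, + = one or more.
        Console.WriteLine(Regex.IsMatch("color", "^colou?r$")); // True
        Console.WriteLine(Regex.IsMatch("ac", "^ab*c$"));       // True
        Console.WriteLine(Regex.IsMatch("ac", "^ab+c$"));       // False

        // {n,m}: at least n and at most m repetitions.
        Console.WriteLine(Regex.IsMatch("2014-12-30", @"^\d{4}-\d{1,2}-\d{1,2}$")); // True

        // [^>]* runs up to the closing > of a tag, whatever attributes it has.
        Console.WriteLine(First("<td rowspan=\"2\" align=\"center\">x</td>", "<td[^>]*>"));
        // <td rowspan="2" align="center">

        // "." stops at a line break; [\s\S] crosses it.
        string multiLine = "<td>\n04\n</td>";
        Console.WriteLine(Regex.IsMatch(multiLine, "<td>.*?</td>"));       // False
        Console.WriteLine(Regex.IsMatch(multiLine, @"<td>[\s\S]*?</td>")); // True

        // Greedy .* takes as much as possible, lazy .*? as little as possible.
        string twoTags = "<b>1</b><b>2</b>";
        Console.WriteLine(First(twoTags, "<b>.*</b>"));  // <b>1</b><b>2</b>
        Console.WriteLine(First(twoTags, "<b>.*?</b>")); // <b>1</b>
    }
}
```

The last two pairs are the ones that matter most for scraping: HTML usually spans several lines, and a greedy .* will happily swallow everything up to the last closing tag on the line.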
Here are some examples that apply the rules above.
Take the following piece of HTML:
<table>
<tr>
<td rowspan="2" align="center"><span style="font-size:18px; color:#fb4202; font-weight:bold;">103000312</span> <br />
<br />
2014-12-30 <br />
(星期二)</td>
<td><ul class="history_ball">
<li class="ball_blue">
<p>大小順序:</p>
<a href='/lotto_ballview_daily539_04.html' class='history_ball_link'>04</a> <a href='/lotto_ballview_daily539_05.html' class='history_ball_link'>05</a> <a href='/lotto_ballview_daily539_21.html' class='history_ball_link'>21</a> <a href='/lotto_ballview_daily539_24.html' class='history_ball_link'>24</a> <a href='/lotto_ballview_daily539_26.html' class='history_ball_link'>26</a> </li>
<li>
<p>落球順序:</p>
<a href='/lotto_ballview_daily539_26.html' class='history_ball_link'>26</a> <a href='/lotto_ballview_daily539_21.html' class='history_ball_link'>21</a> <a href='/lotto_ballview_daily539_04.html' class='history_ball_link'>04</a> <a href='/lotto_ballview_daily539_24.html' class='history_ball_link'>24</a> <a href='/lotto_ballview_daily539_05.html' class='history_ball_link'>05</a> </li>
</ul></td>
</tr>
</table>
<div>
<div class="ball_yellow">04</div><div class="ball_red">10</div><div class="ball_blue">14</div><div class="ball_yellow">19</div><div class="ball_yellow">20</div>
</div>
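1) To get the td that holds the 大小順序 and 落球順序 lists: <td>[\s\S]+?</td>
2) To get the first td, whose opening tag has attributes before the closing > (hence [^>]+): <td[^>]+>[\s\S]+?</td>
3) To get the value inside <span style="font-size:18px; color:#fb4202; font-weight:bold;">103000312</span> when the style attribute may differ: <span style="(.*?)">(\d+)</span>. The lazy (.*?) stops at the first "> that is followed by digits; a greedy (.*) can run on to a later "> on the same line. [^"]* works here as well.
4) To get the date and the weekday from the first td, use (.*) for each: (.*) <br />\s+(.*)</td>
5) To get the div tags with class ball_red or ball_blue: <div class="(ball_blue|ball_red)">\d+</div>. To get every ball regardless of class, use <div class="(.*?)">\d+</div>
With the rules of thumb above, all of this is quite simple.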
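The five patterns can be tried against a trimmed copy of the sample HTML. This is a sketch, not a definitive implementation: the class and method names are made up for illustration, pattern 2 is written with an explicit > after [^>]+, pattern 3 uses [^"]* for the style value, and pattern 5 gets an extra (\d+) group so that the number itself is captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SampleExtraction
{
    // Trimmed copy of the sample HTML (only one <a> link kept).
    public const string Html = @"<table>
<tr>
<td rowspan=""2"" align=""center""><span style=""font-size:18px; color:#fb4202; font-weight:bold;"">103000312</span> <br />
<br />
2014-12-30 <br />
(星期二)</td>
<td><ul class=""history_ball"">
<li class=""ball_blue"">
<p>大小順序:</p>
<a href='/lotto_ballview_daily539_04.html' class='history_ball_link'>04</a> </li>
</ul></td>
</tr>
</table>
<div>
<div class=""ball_yellow"">04</div><div class=""ball_red"">10</div><div class=""ball_blue"">14</div><div class=""ball_yellow"">19</div><div class=""ball_yellow"">20</div>
</div>";

    // 1) First td whose opening tag is exactly <td>.
    public static string FirstPlainTd(string html) =>
        Regex.Match(html, @"<td>[\s\S]+?</td>").Value;

    // 2) First td whose opening tag carries attributes.
    public static string FirstAttributedTd(string html) =>
        Regex.Match(html, @"<td[^>]+>[\s\S]+?</td>").Value;

    // 3) Digits inside the styled span, whatever the style value is.
    public static string IssueNumber(string html) =>
        Regex.Match(html, @"<span style=""[^""]*"">(\d+)</span>").Groups[1].Value;

    // 4) Date and weekday, which sit on two consecutive lines.
    public static (string Date, string Weekday) DateAndWeekday(string html)
    {
        Match m = Regex.Match(html, @"(.*) <br />\s+(.*)</td>");
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    // 5) Only the red and blue balls, as "class:number".
    public static List<string> RedAndBlueBalls(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<div class=""(ball_blue|ball_red)"">(\d+)</div>"))
        {
            result.Add($"{m.Groups[1].Value}:{m.Groups[2].Value}");
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(FirstPlainTd(Html).Contains("大小順序"));       // True
        Console.WriteLine(FirstAttributedTd(Html).Contains("103000312")); // True
        Console.WriteLine(IssueNumber(Html));                             // 103000312
        var (date, weekday) = DateAndWeekday(Html);
        Console.WriteLine($"{date} {weekday}");                           // 2014-12-30 (星期二)
        Console.WriteLine(string.Join(", ", RedAndBlueBalls(Html)));      // ball_red:10, ball_blue:14
    }
}
```

Note that pattern 5 skips <li class="ball_blue"> because it only looks at div tags, and that a failed match simply yields empty group values rather than throwing.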
Next, an example of extracting lottery draw results with C# regular expressions.
Page content:
<table width="79%" height="138" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td colspan="2" align="center"><b>大樂透開獎號碼</b></td>
</tr>
<tr>
<td colspan="2"><center>
<font color="#164F7C">第111000077期</font>
</center></td>
</tr>
<tr>
<td colspan="2"><table width="121" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="10"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data1.gif" width="10" height="22" /></td>
<td width="100" bgcolor="#5480A5"><center>
<font color="#FFFFFF"><b>2022-08-23(二)</b></font>
</center></td>
<td width="11"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data2.gif" width="10" height="22" /></td>
</tr>
</table></td>
</tr>
<tr>
<td colspan="2"><font color="5B9401">開出號碼:</font></td>
</tr>
<tr>
<td colspan="2"><div class="no">47</div>
<div class="no">25</div>
<div class="no">24</div>
<div class="no">06</div>
<div class="no">49</div>
<div class="no">23</div></td>
</tr>
<tr>
<td width="30%"><font color="5B9401">特別號:</font></td>
<td width="70%"><div class="style1"><font color="#ca3000">05</font></div></td>
</tr>
</table></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_06.gif" width="228" height="19" /></td>
</tr>
<tr>
<td><a href="/biglotto"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_more.gif" width="228" height="32" border="0" /></a></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_07.gif" width="228" height="22" /></td>
</tr>
</table></td>
<td width="255"><table width="228" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="228"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_top7.gif" width="228" height="68" /></td>
</tr>
<tr>
<td height="141" background="//sg.cdn.lotto.auzonet.com/images/homepage/table_05.gif"><table width="79%" height="138" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td colspan="2" align="center"><b>威力彩開獎號碼</b></td>
</tr>
<tr>
<td colspan="2"><center>
<font color="#164F7C">第111000067期</font>
</center></td>
</tr>
<tr>
<td colspan="2"><table width="121" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="10"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data1.gif" width="10" height="22" /></td>
<td width="100" bgcolor="#5480A5"><center>
<font color="#FFFFFF"><b>2022-08-22(一)</b></font>
</center></td>
<td width="11"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data2.gif" width="10" height="22" /></td>
</tr>
</table></td>
</tr>
<tr>
<td colspan="2"><font color="5B9401">開出號碼:</font></td>
</tr>
<tr>
<td colspan="2"><div class="no">19</div>
<div class="no">30</div>
<div class="no">02</div>
<div class="no">22</div>
<div class="no">09</div>
<div class="no">04</div></td>
</tr>
<tr>
<td width="30%"><font color="5B9401">二區號:</font></td>
<td width="70%"><div class="style1"><font color="#CA3000">
02</font></div></td>
</tr>
</table></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_06.gif" width="228" height="19" /></td>
</tr>
<tr>
<td><a href="/power"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_more.gif" width="228" height="32" border="0" /></a></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_07.gif" width="228" height="22" /></td>
</tr>
</table></td>
<td width="255">
<table width="228" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="228"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_top3.gif" width="228" height="68" /></td>
</tr>
<tr>
<td height="141" background="//sg.cdn.lotto.auzonet.com/images/homepage/table_05.gif"><table width="79%" height="138" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td colspan="2" align="center"><b>今彩539開獎號碼</b></td>
</tr>
<tr>
<td width="100%"><center>
<font color="#164F7C">第111000202期</font>
</center></td>
</tr>
<tr>
<td><table width="121" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="10"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data1.gif" width="10" height="22" /></td>
<td width="100" bgcolor="#5480A5"><center>
<font color="#FFFFFF"><b>2022-08-24(三)</b></font>
</center></td>
<td width="11"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data2.gif" width="10" height="22" /></td>
</tr>
</table></td>
</tr>
<tr>
<td><font color="5B9401">開出號碼:</font></td>
</tr>
<tr>
<td><div class="no">26</div>
<div class="no">07</div>
<div class="no">33</div>
<div class="no">04</div>
<div class="no">34</div></td>
</tr>
<tr>
<td> </td>
</tr>
</table>
C# code:
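using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// content is assumed to hold the downloaded page HTML.
List<string> award = new List<string>();
string periodNumber = string.Empty;

// Cut out the section we need: from the 今彩539 title cell up to the next table_06.gif image.
var reg = Regex.Match(content, @"<td colspan=""2"" align=""center""><b>今彩539開獎號碼</b></td>[\s\S]*?table_06\.gif", RegexOptions.IgnoreCase);
if (reg.Success)
{
    string bodyContent = reg.Value;

    // Period number, e.g. 第111000202期, captured in the named group "period".
    reg = Regex.Match(bodyContent, @"<font color=""#164F7C"">第(?'period'\d+)期</font>", RegexOptions.IgnoreCase);
    if (reg.Success)
    {
        periodNumber = reg.Groups["period"].Value;
        // Keep the last three digits and prefix them with the current year.
        periodNumber = $"{DateTime.Now.Year}{periodNumber.Substring(periodNumber.Length - 3)}";
    }

    // Each drawn number sits in its own <div class="no">..</div>.
    foreach (Match item in Regex.Matches(bodyContent, @"<div class=""no"">(?'number'\d+)</div>"))
    {
        award.Add(item.Groups["number"].Value);
    }
}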
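The same two-step approach (cut out the block first, then pull values from it) could be wrapped in a small helper that works for any of the three games on the page. This is a minimal sketch under assumptions: LottoParser and Parse are hypothetical names, the helper returns the raw period digits without the year rewrite, and the sample string is a trimmed block that ends with a table_06.gif image, as the 大樂透 and 威力彩 blocks in the page content do.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LottoParser
{
    // Hypothetical helper: period digits and drawn numbers for the block titled `title`.
    public static (string Period, List<string> Numbers) Parse(string content, string title)
    {
        var numbers = new List<string>();
        string period = string.Empty;

        // Step 1: lazily match from the title cell to the first table_06.gif after it,
        // so the block never spills into the next game.
        string blockPattern = "<b>" + Regex.Escape(title) + @"</b></td>[\s\S]*?table_06\.gif";
        Match block = Regex.Match(content, blockPattern, RegexOptions.IgnoreCase);
        if (!block.Success)
        {
            return (period, numbers);
        }

        // Step 2: pull the period and the numbers out of that block only.
        Match p = Regex.Match(block.Value, @"第(?<period>\d+)期");
        if (p.Success)
        {
            period = p.Groups["period"].Value;
        }

        foreach (Match m in Regex.Matches(block.Value, @"<div class=""no"">(?<number>\d+)</div>"))
        {
            numbers.Add(m.Groups["number"].Value);
        }

        return (period, numbers);
    }

    public static void Main()
    {
        // Trimmed sample block; the real page has more markup in between.
        const string sample = @"<td colspan=""2"" align=""center""><b>今彩539開獎號碼</b></td>
<font color=""#164F7C"">第111000202期</font>
<td><div class=""no"">26</div>
<div class=""no"">07</div>
<div class=""no"">33</div>
<div class=""no"">04</div>
<div class=""no"">34</div></td>
<td><img src=""//sg.cdn.lotto.auzonet.com/images/homepage/table_06.gif"" width=""228"" height=""19"" /></td>";

        var (period, numbers) = Parse(sample, "今彩539開獎號碼");
        Console.WriteLine(period);                    // 111000202
        Console.WriteLine(string.Join(",", numbers)); // 26,07,33,04,34
    }
}
```

Passing "大樂透開獎號碼" or "威力彩開獎號碼" as the title would select the other blocks in the same way; the special number (特別號 / 二區號) is not handled here.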