That All-Purpose Regular Expression

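I recently needed to scrape some web page data for analysis, so I went back over regular expressions.
I boiled them down to the rules of thumb below, which are enough for most page scraping.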

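  1. Remember ?, + and *: ? matches zero or one, * matches zero or more, + matches one or more
  2. ^ anchors the start, $ anchors the end
  3. Match letters with [A-Za-z]: uppercase with [A-Z], lowercase with [a-z]
  4. For ordinary characters use \S (any non-whitespace character) or \w (letter, digit, or underscore); \W is the opposite of \w
  5. For digits use \d or [0-9]
  6. For line breaks use \s, which matches any whitespace (space, tab, newline)
  7. For a repeat count use braces {}: {n,m} means at least n times and at most m times
  8. For alternatives use |; to escape a special character put \ in front of it
  9. To run up to the end of an HTML tag: [^>]+ when the tag has content before >, [^>]*? or [^>]* when it may or may not
  10. If nothing matches, don't panic: try adding *, + or ? after the pattern
  11. . is the wildcard: it matches any character except a line break
  12. For a match-anything pattern try (.*?), [\s\S]+, [\s\S]*?, or [\w\W]+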
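
A few of these rules are easy to check in C#. The following is a minimal sketch, assuming .NET 6 or later with top-level statements; the sample strings are made up for illustration and are not from any real page.

```csharp
using System;
using System.Text.RegularExpressions;

// Rule 1: ?, * and + differ only in how many repetitions they allow.
bool optionalU = Regex.IsMatch("color", "^colou?r$"); // ? : zero or one 'u'
bool zeroB = Regex.IsMatch("ac", "^ab*c$");           // * : zero or more 'b'
bool needsB = Regex.IsMatch("ac", "^ab+c$");          // + : needs at least one 'b'
Console.WriteLine($"{optionalU} {zeroB} {needsB}");    // True True False

// Rules 11 and 12: '.' stops at a line break, [\s\S] does not.
string twoLines = "<p>first\nsecond</p>";
bool dotCrosses = Regex.IsMatch(twoLines, "<p>.*</p>");
bool classCrosses = Regex.IsMatch(twoLines, @"<p>[\s\S]*</p>");
Console.WriteLine($"{dotCrosses} {classCrosses}");     // False True

// Greedy (.*) runs to the last </b>; lazy (.*?) stops at the first one.
string tags = "<b>one</b><b>two</b>";
string greedy = Regex.Match(tags, "<b>(.*)</b>").Groups[1].Value;
string lazy = Regex.Match(tags, "<b>(.*?)</b>").Groups[1].Value;
Console.WriteLine(greedy);                             // one</b><b>two
Console.WriteLine(lazy);                               // one

// Rule 9: [^>]+ consumes the attributes up to the closing '>'.
string openTag = Regex.Match("<td rowspan=\"2\">x</td>", "<td[^>]+>").Value;
Console.WriteLine(openTag);                            // <td rowspan="2">
```

The greedy versus lazy difference is the one that bites most often when scraping: a greedy pattern still matches, it just captures far more than intended.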

Here is an example that follows the rules above.

Take this piece of HTML:

<table>
<tr>
<td rowspan="2" align="center"><span style="font-size:18px; color:#fb4202; font-weight:bold;">103000312</span> <br />
<br />
2014-12-30 <br />
(星期二)</td>
<td><ul class="history_ball">
<li class="ball_blue">
<p>大小順序:</p>
<a href='/lotto_ballview_daily539_04.html' class='history_ball_link'>04</a> <a href='/lotto_ballview_daily539_05.html' class='history_ball_link'>05</a> <a href='/lotto_ballview_daily539_21.html' class='history_ball_link'>21</a> <a href='/lotto_ballview_daily539_24.html' class='history_ball_link'>24</a> <a href='/lotto_ballview_daily539_26.html' class='history_ball_link'>26</a> </li>
<li>
<p>落球順序:</p>
<a href='/lotto_ballview_daily539_26.html' class='history_ball_link'>26</a> <a href='/lotto_ballview_daily539_21.html' class='history_ball_link'>21</a> <a href='/lotto_ballview_daily539_04.html' class='history_ball_link'>04</a> <a href='/lotto_ballview_daily539_24.html' class='history_ball_link'>24</a> <a href='/lotto_ballview_daily539_05.html' class='history_ball_link'>05</a> </li>
</ul></td>
</tr>
</table>
<div>
 <div class="ball_yellow">04</div><div class="ball_red">10</div><div class="ball_blue">14</div><div class="ball_yellow">19</div><div class="ball_yellow">20</div>
</div>

1) To get the td that holds 大小順序 (sorted order) and 落球順序 (draw order), which opens with a bare <td>: <td>[\s\S]+?</td>

2) To get the first td, whose opening tag has attributes before the closing > (hence [^>]+): <td[^>]+>[\s\S]+?</td>
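
To see how the two td patterns pick different cells, here is a minimal sketch, assuming .NET 6 or later with top-level statements. The `html` string is a hand-shortened copy of the sample above, and the second pattern is written with an explicit `>` after `[^>]+`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hand-shortened copy of the sample row: one td with attributes, one without.
string html =
    "<tr>\n" +
    "<td rowspan=\"2\" align=\"center\"><span style=\"font-size:18px;\">103000312</span> <br />\n" +
    "<br />\n" +
    "2014-12-30 <br />\n" +
    "(星期二)</td>\n" +
    "<td><ul class=\"history_ball\">\n" +
    "<li class=\"ball_blue\">\n" +
    "<p>大小順序:</p>\n" +
    "</li>\n" +
    "</ul></td>\n" +
    "</tr>";

// A literal <td> skips the first cell, because its tag continues with attributes.
string plainTd = Regex.Match(html, @"<td>[\s\S]+?</td>").Value;

// <td[^>]+> needs at least one character before '>', so it skips the bare <td>.
string attrTd = Regex.Match(html, @"<td[^>]+>[\s\S]+?</td>").Value;

Console.WriteLine(plainTd.Contains("大小順序"));   // True
Console.WriteLine(attrTd.Contains("2014-12-30")); // True
```

In both patterns the lazy `[\s\S]+?` stops at the first `</td>`, which keeps each match inside a single cell.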

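3) To get the value inside <span style="font-size:18px; color:#fb4202; font-weight:bold;">103000312</span> when the style attribute differs from one span to the next, use: <span style="(.*)">(\d+)</span>. The number lands in the second group. Because (.*) is greedy, [^"]* in its place is safer when one line holds several spans.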
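
The span pattern is worth testing on a line that contains more than one span. This is a minimal sketch, assuming .NET 6 or later with top-level statements; the two-span `line` is a made-up input, not taken from the sample page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: two spans with different styles on the same line.
string line =
    "<span style=\"color:red;\">101</span> <span style=\"color:blue;\">202</span>";

// Greedy (.*) swallows everything up to the last "> on the line.
MatchCollection greedyHits = Regex.Matches(line, @"<span style=""(.*)"">(\d+)</span>");

// [^""]* cannot cross the closing quote, so each span matches on its own.
MatchCollection safeHits = Regex.Matches(line, @"<span style=""[^""]*"">(\d+)</span>");

Console.WriteLine(greedyHits.Count);              // 1
Console.WriteLine(greedyHits[0].Groups[2].Value); // 202
Console.WriteLine(safeHits.Count);                // 2
Console.WriteLine(safeHits[0].Groups[1].Value);   // 101
```

On the sample page there is only one such span per line, so the greedy form happens to work there.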

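4) To get the date and the weekday from the first td, capture both with (.*): (.*) <br />\s+(.*)</td>. The first group is the date, the second the weekday.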
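
The date pattern works because `.` cannot cross a line break, so only the line that is followed by the `</td>` line can match. Here is a minimal sketch, assuming .NET 6 or later with top-level statements; `cell` is a hand-shortened copy of the first td, and the stricter second pattern is an illustrative alternative rather than something from the original.

```csharp
using System;
using System.Text.RegularExpressions;

// Hand-shortened copy of the first td from the sample.
string cell =
    "<td rowspan=\"2\" align=\"center\"><span style=\"font-size:18px;\">103000312</span> <br />\n" +
    "<br />\n" +
    "2014-12-30 <br />\n" +
    "(星期二)</td>";

// The catch-all version from the text.
Match loose = Regex.Match(cell, @"(.*) <br />\s+(.*)</td>");
Console.WriteLine(loose.Groups[1].Value);  // 2014-12-30
Console.WriteLine(loose.Groups[2].Value);  // (星期二)

// A stricter variant: spell out the date shape and drop the parentheses.
Match strict = Regex.Match(cell, @"(\d{4}-\d{2}-\d{2}) <br />\s+\((.+?)\)</td>");
Console.WriteLine(strict.Groups[1].Value); // 2014-12-30
Console.WriteLine(strict.Groups[2].Value); // 星期二
```

The stricter form fails loudly if the page layout changes, whereas the catch-all form may silently capture the wrong text.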

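5) To get the div tags whose class is ball_red or ball_blue: <div class="(ball_blue|ball_red)">\d+</div>. To get every div whatever its class, replace the alternation with a lazy (.*?): <div class="(.*?)">\d+</div>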
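
Since all five divs sit on one line, this case shows both the alternation and the catch-all in action. This is a minimal sketch, assuming .NET 6 or later with top-level statements; it adds a second group `(\d+)` so the number itself is captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The div row from the sample, all on one line.
string row =
    "<div class=\"ball_yellow\">04</div><div class=\"ball_red\">10</div>" +
    "<div class=\"ball_blue\">14</div><div class=\"ball_yellow\">19</div>" +
    "<div class=\"ball_yellow\">20</div>";

// Alternation: only the red and blue balls.
var picked = new List<string>();
foreach (Match m in Regex.Matches(row, @"<div class=""(ball_blue|ball_red)"">(\d+)</div>"))
    picked.Add($"{m.Groups[1].Value}={m.Groups[2].Value}");

// Lazy catch-all: every ball, whatever its class.
var all = new List<string>();
foreach (Match m in Regex.Matches(row, @"<div class=""(.*?)"">(\d+)</div>"))
    all.Add(m.Groups[2].Value);

Console.WriteLine(string.Join(", ", picked)); // ball_red=10, ball_blue=14
Console.WriteLine(string.Join(", ", all));    // 04, 10, 14, 19, 20
```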

Apply the rules above and it really is that simple.

Next, an example of extracting lottery draw results with C# regular expressions.

Page content:

<table width="79%" height="138" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td colspan="2" align="center"><b>大樂透開獎號碼</b></td>
</tr>
<tr>
<td colspan="2"><center>
<font color="#164F7C">第111000077期</font>
</center></td>
</tr>
<tr>
<td colspan="2"><table width="121" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="10"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data1.gif" width="10" height="22" /></td>
<td width="100" bgcolor="#5480A5"><center>
<font color="#FFFFFF"><b>2022-08-23(二)</b></font>
</center></td>
<td width="11"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data2.gif" width="10" height="22" /></td>
</tr>
</table></td>
</tr>
<tr>
<td colspan="2"><font color="5B9401">開出號碼:</font></td>
</tr>
<tr>
<td colspan="2"><div class="no">47</div>
<div class="no">25</div>
<div class="no">24</div>
<div class="no">06</div>
<div class="no">49</div>
<div class="no">23</div></td>
</tr>
<tr>
<td width="30%"><font color="5B9401">特別號:</font></td>
<td width="70%"><div class="style1"><font color="#ca3000">05</font></div></td>
</tr>
</table></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_06.gif" width="228" height="19" /></td>
</tr>
<tr>
<td><a href="/biglotto"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_more.gif" width="228" height="32" border="0" /></a></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_07.gif" width="228" height="22" /></td>
</tr>
</table></td>
<td width="255"><table width="228" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="228"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_top7.gif" width="228" height="68" /></td>
</tr>
<tr>
<td height="141" background="//sg.cdn.lotto.auzonet.com/images/homepage/table_05.gif"><table width="79%" height="138" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td colspan="2" align="center"><b>威力彩開獎號碼</b></td>
</tr>
<tr>
<td colspan="2"><center>
<font color="#164F7C">第111000067期</font>
</center></td>
</tr>
<tr>
<td colspan="2"><table width="121" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="10"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data1.gif" width="10" height="22" /></td>
<td width="100" bgcolor="#5480A5"><center>
<font color="#FFFFFF"><b>2022-08-22(一)</b></font>
</center></td>
<td width="11"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data2.gif" width="10" height="22" /></td>
</tr>
</table></td>
</tr>
<tr>
<td colspan="2"><font color="5B9401">開出號碼:</font></td>
</tr>
<tr>
<td colspan="2"><div class="no">19</div>
<div class="no">30</div>
<div class="no">02</div>
<div class="no">22</div>
<div class="no">09</div>
<div class="no">04</div></td>
</tr>
<tr>
<td width="30%"><font color="5B9401">二區號:</font></td>
<td width="70%"><div class="style1"><font color="#CA3000">
02</font></div></td>
</tr>
</table></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_06.gif" width="228" height="19" /></td>
</tr>
<tr>
<td><a href="/power"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_more.gif" width="228" height="32" border="0" /></a></td>
</tr>
<tr>
<td><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_07.gif" width="228" height="22" /></td>
</tr>
 </table></td>
<td width="255">

<table width="228" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="228"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/table_top3.gif" width="228" height="68" /></td>
</tr>
<tr>
<td height="141" background="//sg.cdn.lotto.auzonet.com/images/homepage/table_05.gif"><table width="79%" height="138" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td colspan="2" align="center"><b>今彩539開獎號碼</b></td>
</tr>
<tr>
<td width="100%"><center>
<font color="#164F7C">第111000202期</font>
</center></td>
</tr>
<tr>
<td><table width="121" border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td width="10"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data1.gif" width="10" height="22" /></td>
<td width="100" bgcolor="#5480A5"><center>
<font color="#FFFFFF"><b>2022-08-24(三)</b></font>
</center></td>
<td width="11"><img src="//sg.cdn.lotto.auzonet.com/images/homepage/data2.gif" width="10" height="22" /></td>
</tr>
</table></td>
</tr>
<tr>
<td><font color="5B9401">開出號碼:</font></td>
</tr>
<tr>
<td><div class="no">26</div>
<div class="no">07</div>
<div class="no">33</div>
<div class="no">04</div>
<div class="no">34</div></td>
</tr>
<tr>
<td> </td>
</tr>
</table>

C# code:

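              // Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
              // content holds the downloaded page HTML.
              List<string> award = new List<string>();
              // Declared here so the fragment is complete; drop this line if it is already a field.
              string periodNumber = string.Empty;

              // Cut out the section we need: from the 今彩539 title to the next table_06.gif.
              // The excerpt above stops before that image; the full page is assumed to contain it.
              var reg = Regex.Match(content, @"<td colspan=""2"" align=""center""><b>今彩539開獎號碼</b></td>[\s\S]*?table_06\.gif", RegexOptions.IgnoreCase);
              if (reg.Success)
              {
                  // Run the remaining patterns against this section only
                  string bodyContent = reg.Value;

                  // Period number, e.g. 111000202
                  reg = Regex.Match(bodyContent, @"<font color=""#164F7C"">第(?'period'\d+)期</font>", RegexOptions.IgnoreCase);
                  if (reg.Success)
                  {
                      periodNumber = reg.Groups["period"].Value;
                      // Rebuild it as the current year followed by the last three digits
                      periodNumber = $"{DateTime.Now.Year}{periodNumber.Substring(periodNumber.Length - 3)}";
                  }

                  // Collect every drawn number
                  foreach (Match item in Regex.Matches(bodyContent, @"<div class=""no"">(?'number'\d+)</div>"))
                  {
                      award.Add(item.Groups["number"].Value);
                  }
              }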
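
The same two-step approach (cut out one game's block, then run small patterns inside it) carries over to the other games on the page. The following is a minimal sketch, assuming .NET 6 or later with top-level statements. `ParseGame` is a hypothetical helper that is not part of the original code, and `content` is a hand-shortened stand-in for the page content above that keeps only the lines the patterns look at.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hand-shortened stand-in for the page: two game blocks, each ending at table_06.gif.
string content =
    "<td colspan=\"2\" align=\"center\"><b>大樂透開獎號碼</b></td>\n" +
    "<font color=\"#164F7C\">第111000077期</font>\n" +
    "<font color=\"#FFFFFF\"><b>2022-08-23(二)</b></font>\n" +
    "<div class=\"no\">47</div>\n<div class=\"no\">25</div>\n<div class=\"no\">24</div>\n" +
    "<div class=\"no\">06</div>\n<div class=\"no\">49</div>\n<div class=\"no\">23</div>\n" +
    "<img src=\"//sg.cdn.lotto.auzonet.com/images/homepage/table_06.gif\" />\n" +
    "<td colspan=\"2\" align=\"center\"><b>威力彩開獎號碼</b></td>\n" +
    "<font color=\"#164F7C\">第111000067期</font>\n" +
    "<font color=\"#FFFFFF\"><b>2022-08-22(一)</b></font>\n" +
    "<div class=\"no\">19</div>\n<div class=\"no\">30</div>\n" +
    "<img src=\"//sg.cdn.lotto.auzonet.com/images/homepage/table_06.gif\" />\n";

// Hypothetical helper: returns the period, date and numbers for one game title.
(bool Found, string Period, string Date, List<string> Numbers) ParseGame(string html, string title)
{
    var numbers = new List<string>();

    // Step 1: lazily cut from the game's title to the first table_06.gif after it.
    Match block = Regex.Match(html, "<b>" + Regex.Escape(title) + @"</b>[\s\S]*?table_06\.gif");
    if (!block.Success) return (false, "", "", numbers);

    // Step 2: run the small patterns against that block only.
    string body = block.Value;
    string period = Regex.Match(body, @"第(?<period>\d+)期").Groups["period"].Value;
    string date = Regex.Match(body, @"\d{4}-\d{2}-\d{2}").Value;
    foreach (Match m in Regex.Matches(body, @"<div class=""no"">(?<number>\d+)</div>"))
        numbers.Add(m.Groups["number"].Value);

    return (true, period, date, numbers);
}

var lotto = ParseGame(content, "大樂透開獎號碼");
var power = ParseGame(content, "威力彩開獎號碼");
string lottoNumbers = string.Join(",", lotto.Numbers);
string powerNumbers = string.Join(",", power.Numbers);
Console.WriteLine($"{lotto.Period} {lotto.Date} {lottoNumbers}"); // 111000077 2022-08-23 47,25,24,06,49,23
Console.WriteLine($"{power.Period} {power.Date} {powerNumbers}"); // 111000067 2022-08-22 19,30
```

Scoping first matters because every game uses the same `<div class="no">` markup. Without the lazy cut at `table_06.gif`, the numbers of the later games would end up in the first game's list.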
