web crawling

General web crawling:    source URL --- server --- URL --- database --- server --- ... --- read --- got info --- achieve goal

Focused web crawling:    source URL (specific) --- crawl --- URL --- save --- filter --- database --- read --- got info --- achieve goal
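
The two flows above can be sketched as one loop that differs only in the filter step. This is a minimal sketch, not a real crawler: `FAKE_WEB`, `LINK_PATTERN`, `crawl`, and the `keep` filter are hypothetical names, and the pages live in an in-memory dict instead of being fetched from a server.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page HTML.
# A real crawler would fetch each page over HTTP instead.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/python">py</a> '
                           '<a href="http://example.com/news">news</a>',
    "http://example.com/python": '<a href="http://example.com/python/re">re</a>',
    "http://example.com/python/re": "regular expressions",
    "http://example.com/news": '<a href="http://example.com/">home</a>',
}

LINK_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(seed, keep=lambda url: True):
    """Breadth-first crawl from seed; follow only links accepted by keep."""
    database = {}                      # stands in for the "database" step
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in database or url not in FAKE_WEB:
            continue                   # already saved, or page unknown
        page = FAKE_WEB[url]           # "server" step: get the page
        database[url] = page           # "save" step
        for link in LINK_PATTERN.findall(page):
            if keep(link):             # "filter" step (focused crawling)
                queue.append(link)
    return database


general = crawl("http://example.com/")
focused = crawl("http://example.com/", keep=lambda url: "python" in url)
print(sorted(general))
print(sorted(focused))
```

With the default filter every reachable page is saved; with the topic filter the news page is never followed.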

**********************************

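\w: letter, digit, or "_"

\d: decimal digit (0-9)

\s: whitespace character (space, tab, newline, ...)

\W: any character that "\w" does not match

\D, \S: likewise, any character that "\d" or "\s" does not match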
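
A quick way to check the classes above is to run each one over the same string. The `sample` string below is made up for illustration.

```python
import re

# Made-up sample with letters, digits, "_", punctuation, and whitespace
sample = "id_42 costs $3.50\tnow"

words = re.findall(r"\w+", sample)      # runs of letters, digits, "_"
digits = re.findall(r"\d+", sample)     # runs of decimal digits
spaces = re.findall(r"\s", sample)      # each whitespace character
non_words = re.findall(r"\W", sample)   # everything \w does not match

print(words)      # ['id_42', 'costs', '3', '50', 'now']
print(digits)     # ['42', '3', '50']
print(spaces)     # [' ', ' ', '\t']
print(non_words)  # [' ', ' ', '$', '.', '\t']
```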

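*****************

".": any character except "\n"

"^": match at the start of the string

"$": match at the end of the string

"*": zero or more times

"?": zero or one time, e.g. "ss?" matches "s" or "ss"

"+": one or more times

t{7}: exactly 7 times (ttttttt)

t{7,}: 7 or more times

t{4,7}: 4 <= times <= 7

t|s: t or s

(): group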
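
The quantifiers and anchors above can be checked one by one. A small sketch: the helper `matches` is a hypothetical name, and it uses `re.fullmatch` so that each pattern has to cover the whole test string.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# "?": zero or one time
optional = (matches(r"ss?", "s"), matches(r"ss?", "ss"), matches(r"ss?", "sss"))
# "*": zero or more times, "+": one or more times
star_plus = (matches(r"ab*", "a"), matches(r"ab+", "a"), matches(r"ab+", "abbb"))
# {n}, {n,}, {m,n}
counts = (
    matches(r"t{7}", "t" * 7),    # exactly 7
    matches(r"t{7,}", "t" * 6),   # fewer than 7 fails
    matches(r"t{4,7}", "t" * 4),  # lower bound is included
    matches(r"t{4,7}", "t" * 8),  # above the upper bound fails
)
# "|" alternation and "()" grouping
either = (matches(r"t|s", "t"), matches(r"t|s", "s"))
grouped = re.search(r"(ab)+", "ababab").group()
# "^" and "$" anchors
anchors = (re.findall(r"^p", "pp"), re.findall(r"y$", "yy"))

print(optional)   # (True, True, False)
print(star_plus)  # (True, False, True)
print(counts)     # (True, False, True, False)
print(either)     # (True, True)
print(grouped)    # ababab
print(anchors)    # (['p'], ['y'])
```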

************************

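I: ignore case (A matches a)

M: multi-line matching ("^" and "$" match at each line)

L: locale-dependent matching

U: match according to the Unicode character set

S: "." also matches "\n"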
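
A short sketch of the I, M, and S flags on a made-up two-line string. L and U are left out: in Python 3, str patterns are Unicode-aware by default, and re.L is only accepted with bytes patterns.

```python
import re

# Made-up two-line text
text = "Python\npython"

ignore_case = re.findall(r"python", text, re.I)    # both spellings match
start_only = re.findall(r"^p\w+", text)            # "^" matches only at position 0
each_line = re.findall(r"^p\w+", text, re.M)       # "^" matches at every line start
dot_plain = re.search(r"n.p", text)                # "." stops at "\n"
dot_all = re.search(r"n.p", text, re.S).group()    # "." also matches "\n"

print(ignore_case)    # ['Python', 'python']
print(start_only)     # []
print(each_line)      # ['python']
print(dot_plain)      # None
print(repr(dot_all))  # 'n\np'
```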

************************************

eg:

pat7="p.*y"
string7="pppppppsssspsyyyyyy"
pat7_1="p.?y"
res8=re.search(pat7,string7)
print(res8)
res8_1=re.search(pat7_1,string7)
print(res8_1)

<_sre.SRE_Match object; span=(0, 19), match='pppppppsssspsyyyyyy'>
<_sre.SRE_Match object; span=(11, 14), match='psy'>
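
The two results above differ because "p.*y" is greedy while "p.?y" allows at most one character between "p" and "y". For comparison, the sketch below also tries the lazy form "p.*?y", which is not in the notes above and is added here only as an assumption about what is worth contrasting.

```python
import re

string7 = "pppppppsssspsyyyyyy"

greedy = re.search(r"p.*y", string7).group()    # as many characters as possible
lazy = re.search(r"p.*?y", string7).group()     # as few characters as possible
optional = re.search(r"p.?y", string7).group()  # at most one character in between

print(greedy)    # pppppppsssspsyyyyyy
print(lazy)      # pppppppsssspsy
print(optional)  # psy
```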

***************************************************************

re.match: returns one result; the pattern must match from the first character of the string, or the result will be None

res8_3=re.compile(pat7_1).findall(string7)
print(res8_3)
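
The difference between match, search, and findall can be seen by running the same compiled pattern three ways. A small sketch reusing string7 from above; the other variable names are chosen here for illustration.

```python
import re

string7 = "pppppppsssspsyyyyyy"
pattern = re.compile(r"p.?y")

at_start = pattern.match(string7)           # must match at position 0
first_hit = pattern.search(string7).span()  # first match anywhere in the string
all_hits = pattern.findall(string7)         # every non-overlapping match, as a list
leading_ps = re.match(r"p+", string7).group()  # match succeeds when the start fits

print(at_start)    # None
print(first_hit)   # (11, 14)
print(all_hits)    # ['psy']
print(leading_ps)  # ppppppp
```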

 

*************************************************************

#!/usr/bin/env python
# Author: Mini
import re
pat="hao"
string="http://2345.hao3603.com/"
res1=re.search(pat,string)
print(res1)
pat1="\n"
string1="""you
u"""
res2=re.search(pat1,string1)
print(res2)
pat2=r"\w\dp\w"
string2="abd3p13spe3p3p4ap3"
res3=re.search(pat2,string2)
print(res3)
pat3="pyth[jsz]n"
string3="pathpythsnpythznpythzn"
res4=re.search(pat3,string3)
print(res4)
pat4=".pat..."
string4="tpatttttt"
res5=re.search(pat4,string4)
print(res5)
pat5="abc|aaa"
string5="abdsdfabc"
res6=re.search(pat5,string5)
print(res6)
pat6="ppppp"
string6="PPPPPPP"
res7=re.search(pat6,string6,re.I)
print(res7)
pat7="p.*y"
string7="pppppppsssspsyyyyyy"
pat7_1="p.?y"
res8=re.search(pat7,string7)
print(res8)
res8_1=re.search(pat7_1,string7)
print(res8_1)
res8_2=re.match(pat7_1,string7)
print(res8_2)
res8_3=re.compile(pat7_1).findall(string7)
print(res8_3)
# (?:\.com|\.cn) is a group of alternatives; [.com|.cn] would match only a single character
pat8=r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"
string8='<a href="http://2345.hao3603.com">hasghj</a>'
res9=re.compile(pat8).findall(string8)
print(res9)
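from urllib.request import urlopen
# read() returns bytes; decode to text before matching (assuming the page is UTF-8)
string8_1=urlopen("https://www.baidu.com").read().decode("utf-8",errors="ignore")
res10=re.compile(pat8).findall(string8_1)
print("you know",string8_1,"\n",res10)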
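
A note on URL patterns like the one in the script: square brackets define a set of single characters, so "[.com|.cn]" matches one character out of ".", "c", "o", "m", "|", "n", not the suffix ".com" or ".cn". The sketch below compares that form with a non-capturing group; the string `other` is a made-up example, and the variable names are chosen here for illustration.

```python
import re

class_pat = re.compile(r"[a-zA-Z]+://[^\s]*[.com|.cn]")      # one character from a set
group_pat = re.compile(r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)")  # ".com" or ".cn"
capture_pat = re.compile(r"[a-zA-Z]+://[^\s]*(\.com|\.cn)")  # capturing group

string8 = '<a href="http://2345.hao3603.com">hasghj</a>'
other = "see http://example.org/doc now"   # made-up text with no .com/.cn URL

# On string8 both forms happen to return the same URL
same_class = class_pat.findall(string8)
same_group = group_pat.findall(string8)
# On a .org URL the character class still matches, because "doc" ends in "c"
wrong_hit = class_pat.findall(other)
no_hit = group_pat.findall(other)
# With a capturing group, findall returns only the group's text
only_suffix = capture_pat.findall(string8)

print(same_class)   # ['http://2345.hao3603.com']
print(same_group)   # ['http://2345.hao3603.com']
print(wrong_hit)    # ['http://example.org/doc']
print(no_hit)       # []
print(only_suffix)  # ['.com']
```

The non-capturing form "(?:...)" is used because findall returns the group contents, not the whole match, when the pattern has a capturing group.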





 

Reposted from: https://www.cnblogs.com/rabbittail/p/7616101.html
