Python Web Crawler Study Notes

Latest recommended article published on 2023-12-03 15:39:19

liying700


Article link: https://blog.csdn.net/liying700/article/details/73087536


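I. Crawler workflow:
For a typical article site, the workflow is as follows.
1. Fetch the home page source from the home page URL, and extract the "title" links from it (for example, to scrape news from Zhihu, collect the news links on the home page).
2. Fetch the source of each "title" page through its link, and extract the content from it. When the listing spans multiple pages, write down the part of the URL that every page shares, then append the page number in a loop. Example (fanli_infoemation.py):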

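import urllib2  # Python 2 standard library

fanly_url = 'http://zhide.fanli.com/p'  # home page URL; the part shared by every page
start_page, end_page = 1, 3  # example values; set your own first and last page
for i in range(start_page, end_page + 1):
    rt = urllib2.Request(fanly_url + str(i))  # append the page number to complete the URL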
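
For readers on Python 3, the page loop above can be sketched without any network access. This is only an illustration, not the author's code: the helper name `build_page_urls` is hypothetical and the page range is arbitrary; only the base URL comes from the notes.

```python
def build_page_urls(base_url, start_page, end_page):
    """Return one URL per page by appending the page number to base_url."""
    return [base_url + str(page) for page in range(start_page, end_page + 1)]


# Base URL taken from the notes; the page range is an arbitrary example.
page_urls = build_page_urls('http://zhide.fanli.com/p', 1, 3)
for page_url in page_urls:
    print(page_url)
```

Building the list first keeps URL construction separate from fetching, so each part can be checked on its own.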

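3. Next, fetch the home page source. There are two ways to do this:
(1) urllib2.urlopen(home page URL).read() (urllib2 exists only in Python 2; Python 3 provides the same functions in urllib.request)
Example (zhihu_news.py):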

import urllib2

def get_html(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    request = urllib2.Request(url, headers=header)  # attach a browser-like User-Agent
    response = urllib2.urlopen(request)  # open the page
    text = response.read()  # read the page source
    return text
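
A rough Python 3 counterpart of the snippet above might look like the following sketch. It assumes only the standard library; the `build_request` helper and the `timeout` parameter are additions for illustration and do not appear in the original notes.

```python
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')


def build_request(url):
    """Attach a browser-like User-Agent header to the request (hypothetical helper)."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def get_html(url, timeout=10):
    """Fetch the page and return its source as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling `get_html('http://daily.zhihu.com/')` would perform the actual download; nothing is fetched until that call is made.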

(2) requests.get(url).content
Example (thread.py):

import requests

def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    response = requests.get(url=url, headers=headers)  # send the GET request
    return response.content  # page source (raw bytes)
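
Note that `response.content` holds raw bytes (requests also exposes the decoded string as `response.text`). Under Python 3, bytes have to be decoded before they can be matched against a string pattern. A minimal sketch, assuming the page is UTF-8 encoded; the helper name `to_text` and the sample bytes are illustrative only:

```python
def to_text(raw, encoding='utf-8'):
    """Decode raw page bytes, replacing any bytes that cannot be decoded."""
    return raw.decode(encoding, errors='replace')


# Stand-in for response.content; not real site data.
sample = '<a href="/story/1">知乎日报</a>'.encode('utf-8')
print(to_text(sample))
```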

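4. Match the "title" links (URLs) in the home page source (HTML), then pass each title URL back into get_html (the function that fetches page source). There are several ways to match the "title" links:
(1) Regular expressions (re):
Example (zhihu_news.py):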

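import re

def get_urls(html):  # function name assumed; the original snippet omits the def line
    pattern = re.compile('<a href="/story/(.*?)"')  # compile once for efficiency
    items = re.findall(pattern, html)
    # print items  # prints a list
    urls = []
    for item in items:
        urls.append('http://daily.zhihu.com/story/' + item)
    # print urls
    return urls  # return after the loop so that every link is collected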
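
To see what the pattern captures, it can be run against a small hand-written fragment. The following Python 3 sketch reuses the regular expression and URL prefix from the notes; the sample HTML is made up and merely stands in for the real home page.

```python
import re

STORY_LINK = re.compile(r'<a href="/story/(.*?)"')  # pattern from the notes


def get_urls(html):
    """Return the full URL of every /story/ link found in html."""
    return ['http://daily.zhihu.com/story/' + story_id
            for story_id in STORY_LINK.findall(html)]


# Made-up HTML fragment standing in for the real home page.
sample_html = '<a href="/story/101">A</a><a href="/story/102">B</a>'
print(get_urls(sample_html))
```

The non-greedy `(.*?)` stops at the first closing quote, so each match holds only the story identifier.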

(2) BeautifulSoup
Example (thread.py):

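from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # parse the page with bs4, using the lxml parser
all_a = soup.find_all('a', class_='list-group-item')  # find the matching <a> tags
# print all_a
for i in all_a:
    img_html = get_html(i['href'])  # fetch the source of each inner page
    # print img_html
    get_img(img_html)  # get_img is not shown in these notes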
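
The same idea (collect the `href` of every `<a>` tag with a given class) can be tried without installing bs4. The sketch below swaps in the standard library's `html.parser.HTMLParser` instead of BeautifulSoup, so it is a stand-in rather than the method used in thread.py. The class name `LinkCollector`, the base URL and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and self.css_class in classes and attrs.get('href'):
            # urljoin turns a relative href into an absolute URL
            self.links.append(urljoin(self.base_url, attrs['href']))


# Made-up fragment and base URL, for illustration only.
sample_html = ('<a class="list-group-item" href="/post/1">one</a>'
               '<a class="other" href="/post/2">two</a>')
collector = LinkCollector('http://example.com/', 'list-group-item')
collector.feed(sample_html)
print(collector.links)
```

Resolving each link with `urljoin` matters when the page uses relative links, because a fetch function needs a complete URL.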

(3) XPath
Example (thread.py):

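import lxml.etree

soup = lxml.etree.HTML(html)  # build an element tree from the page source
items = soup.xpath('//div[@class="artile_des"]')  # @ selects an attribute
for item in items:  # walk down the tree level by level until reaching the image
    imgurl_list = item.xpath('table/tbody/tr/td/a/img/@onerror')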
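
The level-by-level walk can be tried on a small fragment using only the standard library. This sketch uses `xml.etree.ElementTree`, which supports just a subset of XPath and requires well-formed markup, so it stands in for lxml rather than replacing it on real pages. The sample markup is made up to mirror the structure the XPath in the notes expects.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment mirroring the structure assumed by the notes.
sample = (
    '<html><body><div class="artile_des"><table><tbody><tr><td>'
    '<a href="#"><img src="a.jpg" onerror="fallback.jpg"/></a>'
    '</td></tr></tbody></table></div></body></html>'
)

root = ET.fromstring(sample)
imgurl_list = []
for item in root.findall(".//div[@class='artile_des']"):
    # ElementTree paths cannot end in @attribute, so read the attribute with .get()
    for img in item.findall('table/tbody/tr/td/a/img'):
        imgurl_list.append(img.get('onerror'))
print(imgurl_list)
```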

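5. In the HTML of each "title" page, match the information you need, such as the body text and the title.
6. Print this information or write it to a file.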
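
Steps 5 and 6 can be sketched as follows in Python 3. The function names `extract_title` and `save_text`, the sample HTML and the output file name are all hypothetical; a real page would need a pattern that matches its own markup.

```python
import os
import re
import tempfile


def extract_title(html):
    """Return the text of the <title> element, or '' if there is none."""
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    return match.group(1).strip() if match else ''


def save_text(path, text):
    """Write text to path as UTF-8."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# Made-up article page, for illustration only.
sample_html = '<html><head><title> Daily News </title></head><body>...</body></html>'
title = extract_title(sample_html)
print(title)

# Write into the system temp directory so the example leaves the working directory untouched.
out_path = os.path.join(tempfile.gettempdir(), 'crawler_demo_title.txt')
save_text(out_path, title)
```

Writing with an explicit `encoding='utf-8'` avoids platform-dependent defaults, which matters for pages containing Chinese text.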
