Python Web Crawlers: Study Notes from Teacher Heibanke's Course

dianwei0041

Original article: http://www.cnblogs.com/shixisheng/p/5926415.html

Program:

  Target URL

  Content extraction

  Presentation
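
As a rough illustration of these three parts, here is a minimal sketch that uses only the Python 3 standard library. The names `fetch`, `LinkCollector`, `extract_links`, `present` and `SAMPLE_HTML` are made up for this example and do not come from the course; the demo runs on an inline snippet so that it works offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Part 1, target URL: download the page and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkCollector(HTMLParser):
    """Part 2, content extraction: gather (text, href) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def present(links):
    """Part 3, presentation: one 'text -> href' line per link."""
    return "\n".join(f"{text} -> {href}" for text, href in links)


# Offline demo on a made-up snippet; use fetch(url) instead for a real page.
SAMPLE_HTML = '<p><a href="/flights">Flights</a> and <a href="/houses">Houses</a></p>'
print(present(extract_links(SAMPLE_HTML)))
```

Keeping the three parts in separate functions means the extraction step can be tested on saved HTML without touching the network.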

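Why:

  Big data: data keeps ballooning, and there is so much information that you cannot tell which of it suits you. General-purpose search engines such as Google address this problem.

  Vertical (industry-specific) search: search within a single industry. The biggest difference from a general search engine is that a general search engine tells you which web pages suit you, whereas a vertical search engine tells you which data suits you. For example, Qunar (去哪儿网) tells you which flights suit you, and Lianjia (链家网) tells you which homes suit you.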

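What to learn:

  A crawler is essentially get && show: fetch the data, then display it.

  Installing the libraries

  pip install beautifulsoup4

  pip install requests

  pip install selenium

  beautifulsoup4: treats an HTML document as a tree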
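
To make the "HTML as a tree" idea concrete, here is a small sketch in Python 3 that walks down, up and sideways through a parsed document. The HTML snippet and the variable names are invented for illustration, and the example assumes beautifulsoup4 has been installed as shown above.

```python
from bs4 import BeautifulSoup

# Made-up document: a <div> containing a list of two links.
SAMPLE_HTML = """
<html><body>
<div id="menu"><ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

menu = soup.find("div", id="menu")
first_link = menu.find("a")

# Walking down: the direct children of the <ul> node.
item_names = [child.name for child in menu.ul.children]

# Walking up: every ancestor of the first link, nearest first.
ancestors = [parent.name for parent in first_link.parents]

# Walking sideways: the <li> that follows the first link's <li>.
second_href = first_link.parent.next_sibling.a["href"]

print(item_names)      # ['li', 'li']
print(ancestors[:3])   # ['li', 'ul', 'div']
print(second_href)     # /b
```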

#!/usr/bin/env python3
# coding: utf-8
# Copyright by heibanke

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the content of this address with urlopen
html = urlopen('http://baike.baidu.com/view/284853.htm')
# Build a BeautifulSoup object from the response
bs_obj = BeautifulSoup(html, "html.parser")

# find_all(tag, attributes, recursive, text, limit, keywords)
# find(tag, attributes, recursive, text, keywords)
# recursive=False searches direct children only; the default, True, searches the whole subtree.
# find_all("a")
# find_all("a", href="")
# find_all("div", class_="")   # class is a Python keyword, so bs4 spells it class_
# find_all("button", id="")

# a_list = bs_obj.find_all("a")
# Keep only the links whose href matches the regular expression
a_list = bs_obj.find_all("a", href=re.compile(r"\.baidu\.com\w?"))
# "a" is an HTML tag: <a> defines a hyperlink from one page to another.
# Its most important attribute is href, which gives the link target.
print(a_list)

for aa in a_list:
    if not aa.find("img"):  # links that wrap an image are of no use here
        if aa.attrs.get('href'):
            print(aa.text, aa.attrs['href'])

  This is only an introduction. To go deeper, you also need to learn the beautifulsoup4 library itself, for example through its documentation and blog posts.
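
As a starting point for that further study, the sketch below tries out the search options mentioned in the code comments above (attribute filters, `recursive`, `limit`) using `find_all`, the newer spelling of `findAll`. It is written for Python 3 and runs on a made-up HTML snippet, so the URLs and names in it are illustrative assumptions, not part of the course material.

```python
import re

from bs4 import BeautifulSoup

# Invented snippet: three direct links, one nested link, and a button.
SAMPLE_HTML = """
<div class="nav">
  <a href="http://baike.baidu.com/view/1.htm">Entry one</a>
  <a href="http://example.com/">Elsewhere</a>
  <a href="http://www.baidu.com/"><img src="logo.png"></a>
  <p><a href="http://tieba.baidu.com/">Nested</a></p>
</div>
<button id="submit">Go</button>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class is a reserved word in Python, so the keyword argument is class_.
nav = soup.find("div", class_="nav")

# Attribute filter with a regular expression (matched anywhere in the href).
baidu_links = nav.find_all("a", href=re.compile(r"\.baidu\.com"))

# recursive=False looks at direct children only, so the nested link is skipped.
direct_links = nav.find_all("a", recursive=False)

# limit stops the search after the given number of matches.
first_only = nav.find_all("a", limit=1)

# Same filtering idea as in the course code: drop links that wrap an image.
text_links = [
    (a.get_text(strip=True), a["href"])
    for a in baidu_links
    if a.find("img") is None
]

print(len(baidu_links), len(direct_links), len(first_only))  # 3 3 1
print(text_links)
print(soup.find("button", id="submit").get_text())  # Go
```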

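  Level 1: visiting URLs in a loop

  http://www.heibanke.com/lesson/crawler_ex00/

  Strangely, although the code below comes from Teacher Heibanke, it raised an error when I ran it, and I am not sure why (a Python 2 versus Python 3 mismatch is one possibility).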

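# -*- coding: utf-8 -*-
# CopyRight by heibanke

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup


url = 'http://www.heibanke.com/lesson/crawler_ex00/'
number = ['']  # suffix appended to the URL; empty for the first request
loops = 0

while True:
    content = urlopen(url + number[0])

    bs_obj = BeautifulSoup(content, "html.parser")
    tag_number = bs_obj.find("h3")
    if tag_number is None:  # no <h3> on the page, so there is nothing to follow
        break

    # Pull the digits out of the <h3> text; they form the next URL suffix
    number = re.findall(r'\d+', tag_number.get_text())

    # Stop when no number is found, or after 100 rounds as a safety limit
    if not number or loops > 100:
        break
    else:
        print(number[0])

    loops += 1


print(bs_obj.text)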
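
The loop can also be tried without any network access. The following sketch, written for Python 3, replaces the real site with a small dictionary; `FAKE_PAGES`, `follow_numbers` and the page texts are hypothetical stand-ins made up for this example and say nothing about what the actual site returns.

```python
import re

# Hypothetical stand-in for the site: URL suffix -> text of that page's <h3>.
FAKE_PAGES = {
    "": "Enter the number 111 after the URL",
    "111": "The next number is 222",
    "222": "The next number is 333",
    "333": "Well done, you reached the end",
}


def follow_numbers(fetch_heading, max_loops=100):
    """Follow the number found in each heading until a heading has none.

    fetch_heading(suffix) must return the heading text for that suffix.
    Returns the list of suffixes visited and the final heading.
    """
    suffix = ""
    visited = []
    for _ in range(max_loops):
        heading = fetch_heading(suffix)
        numbers = re.findall(r"\d+", heading)
        if not numbers:
            return visited, heading
        suffix = numbers[0]
        visited.append(suffix)
    raise RuntimeError("no final page reached within max_loops requests")


visited, final_heading = follow_numbers(FAKE_PAGES.__getitem__)
print(visited)        # ['111', '222', '333']
print(final_heading)  # Well done, you reached the end
```

Passing the fetch step in as a function keeps the loop logic separate from the download, so the same function could later be given a real fetcher built on `urlopen` and `BeautifulSoup`.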