Python web scraping notes

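from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it; naming the parser avoids bs4's "no parser specified" warning
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)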
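
To try the same attribute-style lookup without a network connection, here is a minimal sketch that parses an inline string instead of a downloaded page. The HTML below is a made-up stand-in, not the real content of the exercise page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real exercise1.html)
html = (
    "<html><head><title>A Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Some text.</div></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# soup.h1 returns the first <h1> tag found anywhere in the document
first_heading = soup.h1
print(first_heading)             # the whole tag, markup included
print(first_heading.get_text())  # only the text inside the tag
```

Printing the tag itself keeps the markup, while get_text() drops it, which is usually what you want when collecting data.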

On my local machine I found urllib2 instead (the Python 2 module).

BeautifulSoup4 is a parsing library.

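import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://python.org/')
# The response can only be read once; fetch the page again if the content is needed a second time
html = response.read()
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.h1)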
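
Since the two snippets above differ only in where urlopen comes from, one possible way to write a script that runs on either interpreter is a guarded import. This is a sketch, not something from the book; the helper name fetch is made up for illustration.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch(url):
    """Return the raw bytes of the page at url (hypothetical helper)."""
    response = urlopen(url)
    try:
        # read() consumes the response, so keep the returned bytes
        # in a variable and reuse them instead of reading twice
        return response.read()
    finally:
        response.close()
```

Keeping the bytes returned by fetch() in a variable sidesteps the "read once" problem noted above, because the string can be parsed or printed as many times as needed.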

Scrape the contents of span tags whose class is green

#! /usr/bin/env python
#coding=utf-8

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
# Collect every <span class="green"> and print its text
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
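
The same filter can be tried offline on a small inline document. This is a sketch under assumptions: the HTML is a made-up fragment in the spirit of the War and Peace page, not its actual markup, and it uses find_all, which is bs4's current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Made-up fragment: names in green spans, a quotation in a red span
html = """
<div id="text">
  <span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
  <span class="green">Anna Pavlovna Scherer</span> greeted
  <span class="green">Prince Vasili Kuragin</span>.
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only <span> tags whose class attribute is "green"
green_spans = soup.find_all("span", {"class": "green"})
names = [span.get_text() for span in green_spans]

for name in names:
    print(name)
```

The red span is skipped because the attribute dictionary must match, not just the tag name.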

find() and findAll() with BeautifulSoup

BeautifulSoup’s find() and findAll() are the two functions you will likely use the most.
With them, you can easily filter HTML pages to find lists of desired tags, or a single tag,
based on their various attributes.

The two functions are extremely similar, as evidenced by their definitions in the
BeautifulSoup documentation:

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

In all likelihood, 95% of the time you will find yourself only needing to use the first two
arguments: tag and attributes. However, let’s take a look at all of the arguments in
greater detail.
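
Before going through the arguments, it may help to see how the two functions differ in what they return. The sketch below uses a made-up list and bs4's find_all spelling of findAll; it also touches the limit and recursive arguments from the signatures above.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")               # a single tag: the first match
all_items = soup.find_all("li")            # a list-like result with every match
two_items = soup.find_all("li", limit=2)   # stop after the first two matches

missing = soup.find("table")               # no match: None
missing_all = soup.find_all("table")       # no match: empty result

# recursive=False looks only at direct children of the object being searched
top_level = soup.find_all("li", recursive=False)       # <li> is nested inside <ul>
direct_items = soup.ul.find_all("li", recursive=False)  # <li> are children of <ul>
```

Because find() returns None when nothing matches, it is worth checking the result before calling get_text() on it.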

The tag argument is one that we’ve seen before: you can pass a string name of a tag or
even a Python list of string tag names. For example, the following will return a list of all
the header tags in a document:

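.findAll(["h1", "h2", "h3", "h4", "h5", "h6"])

# attributes: dict of attribute name to value(s); a repeated "class" key would keep only "red"
.findAll("span", {"class": ["green", "red"]})

# text: match on the text content of tags rather than on their properties
nameList = bsObj.findAll(text="the prince")
print(len(nameList))

# keyword: select tags that have a particular attribute value
allText = bsObj.findAll(id="text")
print(allText[0].get_text())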
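
The tag, attributes and text arguments can be tried together on one small document. This is a sketch with made-up HTML. It assumes a reasonably recent bs4, where find_all is the current name of findAll and string= is the newer name of the text= argument used in these notes.

```python
from bs4 import BeautifulSoup

# Made-up fragment with headers and colored spans
html = """
<h1>Title</h1>
<h2>Subtitle</h2>
<p><span class="green">the prince</span> spoke to
<span class="red">Anna</span> and <span class="blue">Pierre</span>;
<span class="green">the prince</span> smiled.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# tag: a list of names matches any of them
headers = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

# attributes: a list of values matches spans whose class is green OR red
green_or_red = soup.find_all("span", {"class": ["green", "red"]})

# text: returns the matching strings themselves, not the enclosing tags
prince_strings = soup.find_all(string="the prince")

print([tag.name for tag in headers])
print(len(green_or_red), len(prince_strings))
```

Note that the text search needs an exact match of the whole string; a span containing "the prince smiled" would not be counted.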

A Caveat to the keyword Argument
The keyword argument can be very helpful in some situations. However, it is technically redundant as a
BeautifulSoup feature. Keep in mind that anything that can be done with keyword can also be accomplished
using techniques we will discuss later in this chapter (see Regular Expressions and Lambda Expressions).
For instance, the following two lines are identical:

bsObj.findAll(id="text")
bsObj.findAll("",   {"id":"text"})
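
The equivalence can be checked on a small inline document. The sketch below is an assumption-labeled variant: it passes the dictionary through the attrs= parameter rather than after an empty tag name, and the HTML is made up.

```python
from bs4 import BeautifulSoup

html = '<div id="text">War and Peace</div><div id="footer">fin</div>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form: the attribute name is used as an argument name
by_keyword = soup.find_all(id="text")

# Dictionary form: the same filter expressed as attribute -> value
by_attrs = soup.find_all(attrs={"id": "text"})

print(by_keyword[0].get_text())
print(by_keyword[0] is by_attrs[0])  # both forms find the same tag object
```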

In addition, you might occasionally run into problems using keyword, most notably when searching for
elements by their class attribute, because class is a protected keyword in Python. That is, class is a
reserved word in Python that cannot be used as a variable or argument name (no relation to the
BeautifulSoup.findAll() keyword argument, previously discussed). For example, if you try the following
call, you’ll get a syntax error due to the nonstandard use of class:

bsObj.findAll(class="green")

Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:

bsObj.findAll(class_="green")

Alternatively, you can enclose class in quotes:

bsObj.findAll("", {"class":"green"})
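
The three spellings can be compared side by side. This sketch uses made-up HTML and bs4's find_all name; the failing call is handed to compile() as a string so that the rest of the script still runs, which is only a demonstration trick and not something from the book.

```python
from bs4 import BeautifulSoup

html = '<p><span class="green">Kutuzov</span> <span class="red">a quote</span></p>'
soup = BeautifulSoup(html, "html.parser")

# The underscore form and the dictionary form select the same tags
with_underscore = soup.find_all(class_="green")
with_dict = soup.find_all(attrs={"class": "green"})

# class= is rejected by Python's parser before any BeautifulSoup code runs
try:
    compile('soup.find_all(class="green")', "<example>", "eval")
    raised_syntax_error = False
except SyntaxError:
    raised_syntax_error = True

print(with_underscore[0].get_text(), raised_syntax_error)
```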