Introduction to Python Crawlers and HTML Data Extraction

Latest recommended article published on 2024-06-24 06:32:00

大道说说

Tags: python, crawler, beautifulsoup

Article link: https://blog.csdn.net/BestDD/article/details/48470663

1. A simple Python crawler that simulates a browser

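import urllib.request  # Python 3; the original Python 2 code used urllib2

def readHeiKe(url):
    # Send a browser-like User-Agent so the request looks like a normal page visit
    req_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'}
    req_timeout = 5  # seconds
    req = urllib.request.Request(url, None, req_header)
    resp = urllib.request.urlopen(req, None, req_timeout)
    return resp.read()  # raw response body as bytes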
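
The "simulate a browser" part comes down to the request headers. As a minimal sketch that runs without network access, the snippet below builds a request and inspects what would be sent. The helper name `build_request`, the shortened User-Agent string, and the `example.com` URL are illustrative assumptions, not part of the original code.

```python
import urllib.request

def build_request(url, user_agent):
    """Build a GET request that carries a browser-like User-Agent header."""
    # data=None means no request body, so urllib issues a GET
    return urllib.request.Request(url, None, {"User-Agent": user_agent})

# Hypothetical URL and shortened User-Agent, for illustration only
req = build_request("http://example.com/", "Mozilla/5.0 (Windows NT 6.1; WOW64)")

# urllib stores header names capitalized, so the lookup key is "User-agent"
print(req.get_header("User-agent"))
print(req.get_method())
```

Nothing is sent until the request is passed to `urllib.request.urlopen`, so this is a convenient way to check headers before crawling.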

2. HTML data extraction

Beautiful Soup is a Python library for extracting data from HTML or XML files.

For a tutorial, see: http://beautifulsoup.readthedocs.org/zh_CN/latest/#

Code example:

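    from bs4 import BeautifulSoup

    url = "#"  # placeholder: replace with the address of the page to crawl
    html = readHeiKe(url)
    soup = BeautifulSoup(html, "lxml")  # the lxml parser must be installed
    # Each entry on the page is a <div class="newitem clearfix">
    arr = soup.find_all("div", class_="newitem clearfix")
    for term in arr:
        link = str(term.a['href'])    # href of the first <a> in the entry
        title = str(term.a.string)    # text of that link
        time = str(term.find("div", class_="col-sm-3 text-right").string)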
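
To try the extraction step without fetching a real page, here is a minimal sketch that runs the same `find_all` lookup on an inline HTML string. The sample markup, its values, and the helper name `extract_items` are hypothetical; the markup only mirrors the class names used in the example. It uses the built-in `html.parser` backend instead of `lxml`, so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mirrors the class names in the example
SAMPLE_HTML = """
<div class="newitem clearfix">
  <a href="/post/1">First post</a>
  <div class="col-sm-3 text-right">2015-09-15</div>
</div>
<div class="newitem clearfix">
  <a href="/post/2">Second post</a>
  <div class="col-sm-3 text-right">2015-09-16</div>
</div>
"""

def extract_items(html):
    """Return a list of (link, title, time) tuples, one per entry."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for term in soup.find_all("div", class_="newitem clearfix"):
        link = term.a["href"]                # first <a> inside the entry
        title = str(term.a.string)           # link text as a plain str
        time_div = term.find("div", class_="col-sm-3 text-right")
        items.append((link, title, str(time_div.string)))
    return items

items = extract_items(SAMPLE_HTML)
for item in items:
    print(item)
```

Passing a multi-class string to `class_` matches the exact value of the `class` attribute, so it would miss an entry whose classes appear in a different order; a CSS selector such as `soup.select("div.newitem.clearfix")` is one order-independent alternative.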

Inside the loop, each record can be written to a database.
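
The original does not say which database is used. As one possible sketch, the snippet below stores records with Python's built-in `sqlite3` module. The table name `articles`, its columns, the helper name `save_items`, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

def save_items(conn, items):
    """Insert (url, title, published) records, skipping URLs already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    for url, title, published in items:
        # Parameter placeholders avoid SQL injection from scraped text
        conn.execute(
            "INSERT OR IGNORE INTO articles (url, title, published) VALUES (?, ?, ?)",
            (url, title, published),
        )
    conn.commit()

# Hypothetical records standing in for values extracted in the loop
rows = [
    ("/post/1", "First post", "2015-09-15"),
    ("/post/2", "Second post", "2015-09-16"),
]

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
save_items(conn, rows)
save_items(conn, rows)  # a repeated crawl adds no duplicate rows
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)
```

Using the URL as the primary key together with `INSERT OR IGNORE` keeps repeated crawls from storing the same entry twice.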