Introduction to Python Web Crawling 1

1. Three libraries that ship with Python

Basic but powerful: urllib, urllib2, and cookielib (these are Python 2 standard-library modules; Python 3 reorganized them into urllib.request, urllib.parse, and http.cookiejar).
Below are a few simple scraping snippets.

## Fetch a static page (Python 2 syntax)
import urllib, urllib2
url = "http://www.baidu.com/s"
data = {
    'wd':'Katherine'
}
data = urllib.urlencode(data) # encode the dict into a query string
full_url = url+'?'+data # append it to the URL to send a GET request
response = urllib2.urlopen(full_url)
print response.read()

# Login required (no captcha): POST the form to Douban
# The expected keys can be read off the Form Data panel of the target page
import urllib, urllib2
url = "http://www.douban.com"
data = {
    'form_email':'xxxx',
    'form_password':'xxxx',
}
data = urllib.urlencode(data)
req = urllib2.Request(url = url, data = data)
response = urllib2.urlopen(req)
print response.read()

# Use a cookie jar to keep the session, so later requests need no re-login
import urllib, urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
opener.open("https://www.douban.com")
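
All of the above is Python 2 only (urllib2 and cookielib no longer exist in Python 3). For reference, a rough Python 3 equivalent of the same three snippets, assuming only the standard urllib.request and http.cookiejar modules, would look like this:

# Python 3 sketch of the snippets above
from urllib import parse, request
from http import cookiejar

# GET a static page with query parameters
params = parse.urlencode({'wd': 'Katherine'})
response = request.urlopen("http://www.baidu.com/s?" + params)
print(response.read().decode('utf-8'))

# POST a login form (the request body must be bytes in Python 3)
data = parse.urlencode({'form_email': 'xxxx', 'form_password': 'xxxx'}).encode('utf-8')
req = request.Request("http://www.douban.com", data=data)
print(request.urlopen(req).read().decode('utf-8'))

# Carry cookies across requests with a CookieJar-backed opener
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar.CookieJar()))
opener.open("https://www.douban.com")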

2. The Scrapy framework

Scrapy (/ˈskreɪpi/ skray-pee)[1] is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler.[2] It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.

Scrapy's project architecture is built around ‘spiders’, self-contained crawlers that are given a set of instructions. Following the spirit of other don’t repeat yourself frameworks, such as Django,[3] it makes it easier to build and scale large crawling projects by allowing developers to re-use their code. Scrapy also provides a web crawling shell which developers can use to test their assumptions about a site’s behavior.[4]
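
For instance, the shell can be pointed at the recruitment page used later in this post to try out selectors interactively (a quick illustration; recent Scrapy versions expose .css()/.xpath() directly on the response object, while very old versions expose a separate sel selector instead):

scrapy shell "http://hr.tencent.com/position.php"
# then, inside the interactive shell:
>>> response.xpath('//title/text()').extract()
>>> response.css('table.tablelist tr.even')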

Some well-known companies and products using Scrapy are: Lyst,[5] CareerBuilder,[6] Parse.ly,[7] Sciences Po Medialab,[8] Data.gov.uk’s World Government Data site.[9]

A ready-made Scrapy example: scraping data from Tencent's recruitment pages

Official documentation (for reference): http://doc.scrapy.org/en/0.20/
The example code comes from:
http://blog.csdn.net/HanTangSongMing/article/details/24454453
In this post I will walk through that example in detail and explain how to download the code and what to change (many readers of the original blog ran into runtime errors, etc.).

1 Downloading and patching the code

If Scrapy is not installed yet, run this from the command line:

pip install scrapy

Then, from the command line, in the folder where you want the project to live, run the following to download the code:

git clone https://github.com/maxliaops/scrapy-itzhaopin.git

This creates a new folder named scrapy-itzhaopin (what each file inside is for, and how it gets generated, is explained further below).
Inside it, locate the tencent_spider.py file at
scrapy-itzhaopin->itzhaopin->itzhaopin->spiders->tencent_spider.py
Open it and replace its contents with the following snippet:

import re
import json

from scrapy.selector import Selector
try:
    from scrapy.spiders import Spider
except ImportError:
    # fall back to BaseSpider on Scrapy versions that lack Spider
    from scrapy.spiders import BaseSpider as Spider

from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle

from itzhaopin.items import *
from itzhaopin.misc.log import *

class TencentSpider(CrawlSpider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php"
    ]
    # follow the pagination links (/position.php?&start=NN#a) and parse every listing page
    rules = [
        Rule(sle(allow=(r"/position.php\?&start=\d{,4}#a")), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        items = []
        sel = Selector(response)
        base_url = get_base_url(response)
        # rows of the job table alternate between the .even and .odd classes; scrape both
        sites_even = sel.css('table.tablelist tr.even')
        for site in sites_even:
            item = TencentItem()
            item['name'] = site.css('.l.square a').xpath('text()').extract()[0]
            relative_url = site.css('.l.square a').xpath('@href').extract()[0]
            item['detailLink'] = urljoin_rfc(base_url, relative_url)
            item['catalog'] = site.css('tr > td:nth-child(2)::text').extract()[0]
            item['workLocation'] = site.css('tr > td:nth-child(4)::text').extract()[0]
            item['recruitNumber'] = site.css('tr > td:nth-child(3)::text').extract()[0]
            item['publishTime'] = site.css('tr > td:nth-child(5)::text').extract()[0]
            items.append(item)
            #print repr(item).decode("unicode-escape") + '\n'

        sites_odd = sel.css('table.tablelist tr.odd')
        for site in sites_odd:
            item = TencentItem()
            item['name'] = site.css('.l.square a').xpath('text()').extract()[0]
            relative_url = site.css('.l.square a').xpath('@href').extract()[0]
            item['detailLink'] = urljoin_rfc(base_url, relative_url)
            item['catalog'] = site.css('tr > td:nth-child(2)::text').extract()[0]
            item['workLocation'] = site.css('tr > td:nth-child(4)::text').extract()[0]
            item['recruitNumber'] = site.css('tr > td:nth-child(3)::text').extract()[0]
            item['publishTime'] = site.css('tr > td:nth-child(5)::text').extract()[0]
            items.append(item)
            #print repr(item).decode("unicode-escape") + '\n'

        info('parsed ' + str(response))
        return items

    def _process_request(self, request):
        info('process ' + str(request))
        return request
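
Note that this snippet targets the old Scrapy 0.x API: the scrapy.contrib paths, SgmlLinkExtractor and urljoin_rfc were removed in Scrapy 1.x. If you are on a recent Scrapy, the rough equivalents (a sketch, not tested against this repo) are:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor as sle
# and inside parse_item, build absolute URLs from the response itself:
#     item['detailLink'] = response.urljoin(relative_url)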

Now run the following from the command line:

scrapy crawl tencent

This runs the project's spider; the scraped data is written to the tencent.json file under the spiders folder.
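
Alternatively, if you do not want to rely on the project's pipeline, Scrapy's built-in feed export can dump the items directly (assuming a Scrapy version that supports the -o option):

scrapy crawl tencent -o tencent.json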

2 Walking through the example

1 Goal: scrape the job postings from Tencent's recruitment site and save them as JSON
http://hr.tencent.com/position.php
2 Steps
1) create a project
In a new folder for the project, run:

scrapy startproject itzhaopin

This creates a new directory itzhaopin under the current directory, with the following structure:

├── itzhaopin
│   ├── itzhaopin
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │      └── __init__.py
│   └── scrapy.cfg

scrapy.cfg is the project configuration file (you can leave it alone)
settings.py holds the crawler settings (you need to register the pipeline here; open pipelines.py, which contains the corresponding comments and a link to the official docs)
items.py defines the data structures to be extracted (we write this ourselves)
pipelines.py defines the pipelines that further process the extracted items, for example to save them (we write this ourselves; see the sketch after this list)
spiders: the core spider code
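
The repository ships its own pipeline for writing tencent.json; purely as an illustration, a minimal JSON-writing pipeline could look roughly like the following (a sketch with a made-up class name, not the repo's exact code). It also has to be registered in settings.py via the ITEM_PIPELINES setting (a list in old Scrapy, a dict with priorities in newer versions).

import json
import codecs

class JsonWriterPipeline(object):
    # illustrative pipeline: append each scraped item as one JSON line to tencent.json
    def __init__(self):
        self.file = codecs.open('tencent.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()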
2) declare items
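
For the item declaration, a minimal items.py matching the fields that parse_item above fills in might look like this (a sketch derived from the field names used in the spider; the file in the repo may differ slightly):

from scrapy.item import Item, Field

class TencentItem(Item):
    name = Field()           # job title
    catalog = Field()        # job category
    recruitNumber = Field()  # number of openings
    workLocation = Field()   # work location
    detailLink = Field()     # link to the job detail page
    publishTime = Field()    # publish date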
