Python Web Scraping Notes

无敌策哥

Category: Big Data. Tags: python

Article link: https://blog.csdn.net/qq_39481696/article/details/82499191

Basics

Python string formatting

- Number formatting
  print("{:.2f}".format(3.1415926))  # two decimal places
  print("{:.2%}".format(0.25))       # percentage
  print("{:^10}".format("aaaaa"))    # centered
  print("{:<10}".format("aaaaa"))    # left-aligned
  print("{:>10}".format("aaaaa"))    # right-aligned
  Result
  3.14
  25.00%
    aaaaa
  aaaaa
       aaaaa

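Aligning Chinese output

# chr(12288) is the full-width space (U+3000); using it as the fill character keeps Chinese text aligned
print("{0:{1}^10}".format("感悟",chr(12288)))  # centered
print("{0:{1}^10}".format("你是不是",chr(12288)))  # centered
print("{0:{1}^10}".format("哈哈哈",chr(12288)))  # centered
print("{0:{1}^10}".format("你是环境吗",chr(12288)))  # centered
print("{0:{1}^10}".format("很额额哈鸡儿哈酒",chr(12288)))  # centered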
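The fill-character trick can be wrapped in a small helper. This is only a sketch: the name `center_cjk` and its `width` parameter are hypothetical and not part of the original notes.

```python
# Hypothetical helper illustrating the full-width fill character.
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one Chinese character


def center_cjk(text, width=10):
    """Center text in `width` characters, padding with full-width spaces."""
    # Nested replacement fields supply the fill character and the width.
    return "{0:{1}^{2}}".format(text, FULL_WIDTH_SPACE, width)


for word in ["感悟", "你是不是", "哈哈哈"]:
    # The bars make the padded width visible in the terminal.
    print("|" + center_cjk(word) + "|")
```

Because every padding character is as wide as a Chinese character, the closing bars line up in a monospaced CJK font, which does not happen with the default ASCII space.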

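Python web scraping topics

The Beautiful Soup 4 library

Installing and importing the BeautifulSoup4 library

- Install: pip install beautifulsoup4
- Import: from bs4 import BeautifulSoup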

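Regular expressions

A regular expression is an expression that concisely describes a set of strings.
It is mainly used for string matching.
Common regular expression operators
- . matches any single character (except a newline, by default)
- [] character set, gives the allowed values for a single character, e.g. [abc] matches one of a, b, or c, and [a-z] matches one character from a to z
- [^] negated character set, gives the excluded values for a single character, e.g. [^abc] matches a single character that is not a, b, or c
- * zero or more repetitions of the preceding character, e.g. abc* matches ab, abc, abcc, ...
- + one or more repetitions of the preceding character, e.g. abc+ matches abc, abcc, ...
- ? zero or one repetition of the preceding character, e.g. abc? matches ab, abc
- | either the left or the right expression, e.g. abc|def matches abc or def
- {m} exactly m repetitions of the preceding character, e.g. ab{2}c matches abbc
- {m,n} m to n repetitions of the preceding character (n inclusive), e.g. ab{1,2}c matches abc, abbc
- ^ matches the start of the string, e.g. ^abc matches abc only at the start of a string
- $ matches the end of the string
- () grouping, typically combined with | inside, e.g. (abc) matches abc, and (abc|def) matches abc or def
- \d a digit, equivalent to [0-9] for ASCII text
- \w a word character, equivalent to [A-Za-z0-9_] for ASCII text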
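The operators listed above can be checked with a short script. This is a minimal sketch: the helper `all_match` and the sample strings are made up for illustration, and `re.fullmatch` is used so that each pattern must cover the whole string.

```python
import re

# Each pair is (pattern, strings the pattern should match in full).
examples = [
    (r"abc*", ["ab", "abc", "abcc"]),
    (r"abc+", ["abc", "abcc"]),
    (r"abc?", ["ab", "abc"]),
    (r"abc|def", ["abc", "def"]),
    (r"ab{2}c", ["abbc"]),
    (r"ab{1,2}c", ["abc", "abbc"]),
    (r"[^abc]", ["d", "1"]),
    (r"\d\w", ["7_", "3x"]),
]


def all_match(pattern, strings):
    """Return True if every string is matched in full by the pattern."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)


for pattern, strings in examples:
    print(pattern, all_match(pattern, strings))
```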

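The re library

import re
Pattern type: raw strings, written as r'text'
Main functions:
- re.search()
- re.match()
- re.findall()
- re.split()
- re.finditer()
- re.sub()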
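A short sketch of commonly used functions in the standard `re` module, with the pattern written as a raw string. The sample text and the six-digit postcode pattern are made up for illustration.

```python
import re

text = "BIT 100081, TSU 100084"   # made-up sample text
postcode = r"[1-9]\d{5}"          # raw string: the backslash reaches re unchanged

first = re.search(postcode, text)      # first match anywhere in the string
at_start = re.match(postcode, text)    # matches only at the start of the string
codes = re.findall(postcode, text)     # every match, as a list of strings
parts = re.split(postcode, text)       # the text split around the matches
masked = re.sub(postcode, "[zip]", text)  # replace every match
positions = [m.start() for m in re.finditer(postcode, text)]  # match offsets

print(first.group(0), at_start, codes, parts, masked, positions)
```

Note that `re.match` returns `None` here because the text does not begin with a postcode, while `re.search` scans the whole string.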

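The yield keyword

Generator: a function that keeps producing values.
A function that contains a yield statement is a generator.
Each time the generator produces a value (at the yield statement), the function is suspended; when it is resumed, it produces the next value.
yield is usually used together with a for loop.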

def gen(n):
    for i in range(n):
        yield i**2
# Each iteration resumes the generator once, and it continues from where it was last suspended
for i in gen(5):
    print(i)
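To make the suspend-and-resume behaviour visible, the sketch below steps a generator by hand with `next()`. The names `squares` and `calls` are hypothetical; `calls` only records when the function body actually runs.

```python
calls = []  # records each time the generator body executes an iteration


def squares(n):
    for i in range(n):
        calls.append(i)   # runs only when the generator is resumed
        yield i ** 2      # suspend here and hand the value to the caller


g = squares(3)            # creating the generator runs none of the body
before = list(calls)      # snapshot: still empty

first = next(g)           # run up to the first yield
after_first = list(calls)

rest = list(g)            # resume repeatedly until the generator is exhausted
print(before, first, after_first, rest)
```

A for loop does the same thing implicitly: it calls `next()` on each iteration and stops when the generator is exhausted.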

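Package management in PyCharm:

A PyCharm project can contain packages, directories, and so on.
For a .py file in a directory to call a function from another .py file, first mark the directory as a sources root:
Directory > right-click > Mark Directory as > Sources Root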

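The difference between loads and dumps in the json module

dumps: serialization (encoding); converts a Python object into a JSON string (the result is an ordinary str)
loads: deserialization (decoding); converts a JSON-formatted string into a Python data object, typically a dict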
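A minimal round trip through both functions; the sample record is made up for illustration.

```python
import json

record = {"name": "张三", "views": 299, "tags": ["python"]}  # made-up data

text = json.dumps(record, ensure_ascii=False)  # Python object -> JSON string
restored = json.loads(text)                    # JSON string -> Python object

escaped = json.dumps(record)  # default: non-ASCII characters become \uXXXX escapes

print(type(text).__name__, text)
print(type(restored).__name__, restored == record)
```

`ensure_ascii=False` keeps Chinese characters readable in the output, which is why the pipeline code in these notes passes it.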

The Scrapy library

Simple extraction without a pipeline

# -*- coding: utf-8 -*-
import scrapy
from scrapyTest.items import ScrapytestItem

class IdcastSpider(scrapy.Spider):
    name = 'idcast'
    allowed_domains = ['itcast.cn']  # domain names only, not full URLs
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        items = []
        # XPath can be used directly on the response, without importing lxml yourself (a benefit of the framework)
        node_list = response.xpath("//div[@class='li_txt']")
        for node in node_list:
            item = ScrapytestItem()
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            items.append(item)
        return items

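Saving data: -o writes a file in the given format (for items returned directly from parse, without a pipeline)
- scrapy crawl spiderName -o filename.json
- scrapy crawl spiderName -o filename.csv (can be opened in Excel)
- scrapy crawl spiderName -o filename.xml

Pipeline example (several pipelines can be written, but each must be configured in the settings file)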

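import json

class ScrapytestPipeline(object):
    def __init__(self):
        # Open in text mode with an explicit encoding so non-ASCII text is written correctly
        self.f = open("idcast_pip", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; in text mode the str is written directly, without .encode()
        content = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(content + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()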

Storing items in a JSON file

This one was a pain to get working.

# settings file
ITEM_PIPELINES = {
   'idcast.pipelines.IdcastPipeline': 300,
}
# pipeline file
import json
import codecs

class IdcastPipeline(object):
    def __init__(self):
        self.f = codecs.open("aaaaa","w",encoding="utf-8")
    def process_item(self, item, spider):
        content = json.dumps(dict(item) ,ensure_ascii=False) + "\n"
        self.f.write(content)
        return item
    def close_spider(self,spider):
        self.f.close()

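The difference between self.parse and self.parse()

The former is a reference to the method itself, and nothing is executed; the latter calls the method and evaluates to its return value.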
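The distinction can be shown without Scrapy. In this sketch the class `Demo` and its `schedule` method are hypothetical stand-ins for a framework that stores a callback and calls it later.

```python
class Demo:
    def parse(self):
        return "parsed"

    def schedule(self, callback):
        # A framework keeps the callable and invokes it when it is ready.
        return callback()


d = Demo()

ref = d.parse        # bound method object; the method body has not run
result = d.parse()   # the method runs now; result is its return value

print(callable(ref), result, d.schedule(d.parse))
```

Passing `d.parse()` to `schedule` would hand over the string "parsed" instead of something callable, which is why a Scrapy callback is written as `callback=self.parse` without parentheses.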

Crawling multiple pages by working out the URL pattern yourself

import scrapy
from tencent.items import TencentItem

class TencentSpider(scrapy.Spider):
    name = 'Tencent'
    allowed_domains = ['tencent.com']
    baseURL =  "https://hr.tencent.com/position.php?&start="
    offset = 0
    start_urls = [baseURL+str(offset)]
    def parse(self, response):
        node_list = response.xpath("//div[@id='position']//tr[@class='odd'] | //div[@id='position']//tr[@class='even']")
        for node in node_list:
            item = TencentItem()
            item["positionName"] = node.xpath(".//a/text()").extract()[0]
            item["positionLink"] = node.xpath(".//a/@href").extract()[0]
            if len(node.xpath(".//td[2]/text()").extract()) == 0 :
                item["positionType"]=""
            else:
                item["positionType"] = node.xpath(".//td[2]/text()").extract()[0]
            item["workLocation"] = node.xpath(".//td[4]/text()").extract()[0]
            item["peopleNumber"] = node.xpath(".//td[3]/text()").extract()[0]
            item["publishTime"] = node.xpath(".//td[5]/text()").extract()[0]
            yield item
        if self.offset < 3600:
            self.offset += 10
            url = self.baseURL+str(self.offset)
            # callback is the callback function; do not add parentheses after parse
            yield scrapy.Request(url,callback=self.parse)

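Following the "next page" link to get the next URL

This continues the example above: remove the final if block and use the following instead.

        # Check that a "next" link was actually extracted before following it
        next_href = response.xpath("//a[@id='next']/@href").extract()
        if len(next_href) > 0:
            url = "https://hr.tencent.com/" + next_href[0]
            yield scrapy.Request(url, callback=self.parse)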
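Instead of concatenating strings, a relative link can be resolved against the current page URL. The sketch below uses the standard library's `urllib.parse.urljoin`; the helper name `next_page_url` is hypothetical. Inside a spider, Scrapy's `response.urljoin(href)` serves the same purpose.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Return the absolute URL of the next page, or None if there is no link."""
    if not next_href:
        return None  # nothing was extracted, so stop paginating
    return urljoin(current_url, next_href)


current = "https://hr.tencent.com/position.php?&start=0"
print(next_page_url(current, "position.php?&start=10"))
print(next_page_url(current, ""))
```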

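Saving Scrapy items to a database

The names must match exactly (presumably the columns in the INSERT statement and those of the database table), otherwise an error is raised. The pipeline must also be enabled in the settings file.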

import pymysql

class TencentMysqlPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host="192.168.48.1",
                                    user="wei",
                                    password="123456",
                                    db="gaodb",
                                    charset="utf8",
                                    use_unicode=False
                                    )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):

        positionName = item["positionName"]
        positionLink = item['positionLink']
        positionType = item['positionType']
        peopleNumber = item['peopleNumber']
        workLocation = item['workLocation']
        publishTime = item['publishTime']
        sql = "insert into tencent (positionName,positionLink,positionType," \
              "peopleNumber,workLocation,publishTime)VALUES(%s,%s,%s,%s,%s,%s)"
        lis = (positionName,positionLink,positionType,peopleNumber,workLocation,publishTime)
        self.cursor.execute(sql,lis)
        self.conn.commit()

        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
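To try the same pipeline shape without a MySQL server, the sketch below swaps in the standard library's `sqlite3` module. This is a stand-in, not the pipeline above: the class name `SqlitePipelineSketch` is hypothetical, the table is created in memory, and sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3


class SqlitePipelineSketch:
    """Stand-in for the MySQL pipeline, using sqlite3 (an assumption)."""

    FIELDS = ("positionName", "positionLink", "positionType",
              "peopleNumber", "workLocation", "publishTime")

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        columns = ", ".join(f"{name} TEXT" for name in self.FIELDS)
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS tencent ({columns})")

    def process_item(self, item, spider=None):
        # Parameterized query: values are passed separately from the SQL text.
        placeholders = ", ".join("?" for _ in self.FIELDS)
        sql = f"INSERT INTO tencent ({', '.join(self.FIELDS)}) VALUES ({placeholders})"
        self.conn.execute(sql, tuple(item[name] for name in self.FIELDS))
        self.conn.commit()
        return item

    def close_spider(self, spider=None):
        self.conn.close()


pipeline = SqlitePipelineSketch()
sample = {  # made-up item for illustration
    "positionName": "Backend Engineer", "positionLink": "position_detail.php?id=1",
    "positionType": "Technology", "peopleNumber": "2",
    "workLocation": "Shenzhen", "publishTime": "2018-09-07",
}
pipeline.process_item(sample)
rows = pipeline.conn.execute(
    "SELECT positionName, workLocation FROM tencent").fetchall()
print(rows)
```

Deriving both the column list and the value tuple from one `FIELDS` tuple keeps the names consistent between the item and the table.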
