A beginner-level Python web scraper project
Suppose we want the postal codes of every city in China as supporting data for some project; we can write a scraper to collect them.
The site I scraped postal codes from is http://www.ip138.com/post/postal22/
Which Python libraries does the scraper need?
The commonly used HTML parsing libraries are lxml and bs4:
from lxml import etree
from bs4 import BeautifulSoup
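As a minimal sketch of the difference between the two, the invented table row below (not taken from ip138) can be parsed by either library, XPath-style with lxml or tag-search-style with bs4:

```python
from lxml import etree
from bs4 import BeautifulSoup

# A made-up table row, standing in for a fragment of a real page
html = '<table><tr><td>Beijing</td><td>100000</td></tr></table>'

# lxml: XPath-based extraction of the cell text nodes
cells_lxml = etree.HTML(html).xpath("//td/text()")

# bs4: find the <td> tags, then read their text
soup = BeautifulSoup(html, "html.parser")
cells_bs4 = [td.get_text() for td in soup.find_all("td")]

print(cells_lxml)  # ['Beijing', '100000']
```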
Some sites return data in JSON format, which calls for the json library to parse it:
import json
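For instance, a sketch with a made-up JSON payload of the kind an API endpoint might return:

```python
import json

# An invented JSON string; real endpoints would return something similar
raw = '{"province": "Beijing", "postnumber": "100000"}'
record = json.loads(raw)  # parse the string into a dict
print(record["postnumber"])  # 100000
```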
Some information is too irregular for an HTML parser alone, so regular expressions are needed to extract it, which means importing the re library:
import re
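A sketch of the kind of extraction the scraper below relies on; the anchor tag here is invented for illustration:

```python
import re

# An invented anchor tag, shaped like the province links on the index page
fragment = '<a href="demo-province.html">Beijing</a>'

# Capture groups pull out just the pieces we want
href = re.findall(r'"(\S+\.html)"', fragment)[0]
name = re.findall(r'>(\S+)<', fragment)[0]
print(href, name)
```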
Sometimes we want the publication time of a page's content, but sites format dates inconsistently; to normalize them into a uniform format we need the datetime library:
import datetime
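As a sketch, assuming a few invented date formats a site might use, one way to normalize them all to YYYY-MM-DD:

```python
from datetime import datetime

# Invented examples of inconsistent date strings from different pages
raw_dates = ["2019/06/01", "2019-6-1 12:30", "01.06.2019"]
patterns = ["%Y/%m/%d", "%Y-%m-%d %H:%M", "%d.%m.%Y"]

def normalize(date_string):
    """Try each known pattern and return a uniform YYYY-MM-DD string."""
    for pattern in patterns:
        try:
            return datetime.strptime(date_string, pattern).strftime("%Y-%m-%d")
        except ValueError:
            continue  # pattern did not match, try the next one
    return None  # format not recognized

normalized = [normalize(d) for d in raw_dates]
print(normalized)  # ['2019-06-01', '2019-06-01', '2019-06-01']
```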
And requests, used to issue the HTTP requests themselves, is indispensable:
import requests
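A minimal sketch of a requests Session carrying default headers; the request is only prepared, never sent, so nothing goes over the network:

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Prepare (but do not send) a request, to inspect what would go on the wire
req = session.prepare_request(
    requests.Request("GET", "http://www.ip138.com/post/postal22/"))
print(req.headers["User-Agent"])
```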
A scraper fetches its data by requesting someone else's site, and overly frequent requests put pressure on their server, which is little different from attacking the site; so we should space requests out, using the time library:
import time
Some sites also have anti-scraping measures: requests arriving at a fixed rate get you blacklisted, so the interval between requests should be randomized:
import random
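Combining time and random, a sketch of a politeness delay (the 3 to 5 second range matches what the scraper below uses):

```python
import random
import time

def polite_sleep(low=3, high=5):
    """Sleep for a random whole number of seconds in [low, high]."""
    delay = random.randint(low, high)
    time.sleep(delay)
    return delay
```

Called between page fetches, this keeps the request rate both low and irregular.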
Scrapers are fragile and can die at any moment; to keep track of the scraper's state, import the logging library and write logs:
import logging
The complete code is as follows:
import datetime
import json
import logging
import random
import re
import time

import requests
from bs4 import BeautifulSoup
from lxml import etree


class GetPostNum:
    def __init__(self):
        self.url = "http://www.ip138.com/post/postal22/%s"
        self.header = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Host": "www.ip138.com",
            "Referer": "http://www.ip138.com/post/postal22/indexdbc4.html",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        self.file_path = "d:/post_data.txt"

    def start(self):
        session = requests.Session()
        session.headers = self.header
        data = session.get(self.url % "top.js")
        if data.status_code == 200:
            logging.info("index request succeeded")
            # Pull the per-province links out of the index page
            urls = re.findall(r'<a href="\S+html">\S+</a>', data.text)
            for url in urls:
                href = re.findall(r'"(\S+\.html)"', url)[0]
                province = re.findall(r'>(\S+)<', url)[0]
                results = []
                detail = session.get(self.url % href)
                detail.encoding = 'gb2312'  # the site serves GB2312-encoded pages
                trs = etree.HTML(detail.text).xpath("//table//tr")
                if len(trs) > 0:
                    city = ""
                    for tr in trs[2:]:  # skip the two header rows
                        tds = tr.xpath("./td")
                        if len(tds) == 4:
                            if len(tds[0].xpath("./b/text()")) > 0:
                                # A bold first cell starts a new city block
                                # (municipalities have no separate city level)
                                if "市" not in province:
                                    city = tds[0].xpath("./b/text()")[0]
                                results.append({"province": province, "city": city,
                                                "district": "",
                                                "disnum": tds[1].text,
                                                "postnumber": tds[2].text})
                            else:
                                results.append({"province": province, "city": city,
                                                "district": tds[0].text,
                                                "disnum": tds[1].text,
                                                "postnumber": tds[2].text})
                        elif len(tds) >= 6:
                            # Wide rows hold two district entries side by side
                            results.append({"province": province, "city": city,
                                            "district": tds[0].text,
                                            "disnum": tds[1].text,
                                            "postnumber": tds[2].text})
                            results.append({"province": province, "city": city,
                                            "district": tds[3].text,
                                            "disnum": tds[4].text,
                                            "postnumber": tds[5].text})
                time.sleep(random.randint(3, 5))  # randomized politeness delay
                logging.info("%s: %d records", province, len(results))
                self.save_to_file(results)

    def save_to_file(self, results):
        # Append one JSON object per line
        with open(self.file_path, "a", encoding='utf-8') as f:
            for record in results:
                f.write(json.dumps(record, ensure_ascii=False))
                f.write("\n")


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    GetPostNum().start()