Python crawler: a fix for content inside <record><![CDATA[ ... ]]> that cannot be extracted

Latest recommended article published on 2024-04-26 20:45:09

浅_v


Tags: python, crawler

Article link: https://blog.csdn.net/xxxxx222222/article/details/119252001


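import scrapy
from bs4 import BeautifulSoup

from gv_config.items import Item


class SpiderSpider(scrapy.Spider):
    name = "gv_config"
    # Domain the spider is allowed to crawl
    allowed_domains = ["xxxxx"]

    def start_requests(self):
        # URL of the page to crawl
        url = "xxxxxx"
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        # Strip the <record> and CDATA wrappers so the inner HTML becomes
        # real markup instead of opaque character data.
        # Note: the CDATA opener is '<![CDATA[' with no spaces inside it.
        text = (
            response.text
            .replace('<record>', '')
            .replace('</record>', '')
            .replace('<![CDATA[', '')
            .replace(']]>', '')
        )
        # Debug output: inspect the cleaned markup
        print(text)
        # Parse the cleaned markup with BeautifulSoup (the 'xml' parser requires lxml)
        soup = BeautifulSoup(text, 'xml')
        data = soup.find_all("li", class_='wip_col_listli')
        for x in data:
            # Item declares the fields, much like an entity class in Java
            item = Item()
            # Fill in the scraped fields
            item['title'] = x.find_all('a')[0].string
            item['date'] = x.find_all('span')[0].string
            # link = x.find_all('a')
            # print(name, data, link)
            yield item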
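
Why the replace trick above is needed: a parser treats everything inside <![CDATA[ ... ]]> as plain text, so the <li> elements in there never show up as tags you can select. As an alternative to string replacement, here is a rough sketch that lets an XML parser hand back the CDATA text of each <record> and then parses that text as HTML. It uses only the standard library so it runs on its own. The sample response, the datastore/recordset wrapper names, and the helper names LinkDateParser and extract_items are all assumptions made for illustration; the real site's markup is not shown in this post.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample response; the real site's markup may differ.
SAMPLE = """<datastore><recordset>
<record><![CDATA[<li class="wip_col_listli"><a href="/a.html">First notice</a><span>2021-07-30</span></li>]]></record>
<record><![CDATA[<li class="wip_col_listli"><a href="/b.html">Second notice</a><span>2021-07-29</span></li>]]></record>
</recordset></datastore>"""


class LinkDateParser(HTMLParser):
    """Collect the text of the first <a> and the first <span> in a fragment."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag whose text we are waiting for
        self.fields = {}      # tag name -> captured text

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "span") and tag not in self.fields:
            self._current = tag

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        # Stop waiting if the tag closed without any text
        if tag == self._current:
            self._current = None


def extract_items(xml_text):
    """Return one dict per <record>, built from the HTML inside its CDATA."""
    root = ET.fromstring(xml_text)
    items = []
    for record in root.iter("record"):
        # record.text is the CDATA content as an ordinary string
        parser = LinkDateParser()
        parser.feed(record.text or "")
        parser.close()
        items.append({
            "title": parser.fields.get("a"),
            "date": parser.fields.get("span"),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item)
```

In a Scrapy spider, the same idea would mean calling something like extract_items(response.text) inside parse and yielding an Item per result. This only works if the response is well-formed XML; if it is not, the string replacement shown above is the more forgiving route.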
