gou (1): Some simple problems I ran into on my first page crawl

Latest recommended article published on 2022-04-10 17:46:50

会编程的漂亮小姐姐

Categories: Python, Web Crawlers

Article link: https://blog.csdn.net/u014229742/article/details/82895209

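Mistake 1: I wrote item = TyunItem() outside the for loop. Every iteration then wrote into the same item object, so the records stored in the database all had identical content. The item has to be created inside the loop: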

def parse(self, response):
    # Each <tr> of the listing table holds one demand entry.
    li_list = response.xpath('/html/body/section/div[2]/div[2]/table/tbody/tr')
    for li in li_list:
        # Create a fresh item for every row instead of reusing one object.
        item = TyunItem()
        item['demand_title'] = li.xpath('td[1]/a/text()').extract_first()
        # ... the other fields are filled in here ...
        yield item
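
One way this kind of symptom arises can be reproduced without Scrapy. The following is a minimal sketch, assuming a plain dict as a stand-in for TyunItem and a list as a stand-in for the database; the names `rows`, `collect_shared` and `collect_fresh` and the sample titles are hypothetical.

```python
rows = ["title A", "title B", "title C"]  # hypothetical scraped titles


def collect_shared(rows):
    """Buggy pattern: a single item object created outside the loop."""
    item = {}  # stand-in for TyunItem()
    stored = []
    for title in rows:
        item["demand_title"] = title  # overwrites the same object each time
        stored.append(item)           # stores another reference to that object
    return stored


def collect_fresh(rows):
    """Fixed pattern: a new item object for every row."""
    stored = []
    for title in rows:
        item = {}  # stand-in for TyunItem(), created per row
        item["demand_title"] = title
        stored.append(item)
    return stored


shared_titles = [item["demand_title"] for item in collect_shared(rows)]
fresh_titles = [item["demand_title"] for item in collect_fresh(rows)]
print(shared_titles)
print(fresh_titles)
```

In the shared version every stored entry is the same object, so all of them show whatever the last row wrote; in the fresh version each row keeps its own value.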

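Mistake 2: Instead of calling start_urls.append() inside the loop, I wrote start_urls = ['http://1360.com/index.php?m=need&a=index&p=' + str(i) + '&keyword=建站'] directly. That assignment replaces the whole list on every iteration, so only the last page's URL is left. The correct version appends to the list: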

start_urls = []
# range(1, 365) yields pages 1 to 364; the keyword is the percent-encoded form of 建站.
for i in range(1, 365):
    start_urls.append('http://1360.com/index.php?m=need&a=index&p=' + str(i) + '&keyword=%E5%BB%BA%E7%AB%99')
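
The same list can also be built without hand-encoding the keyword. This is a minimal sketch, assuming the query parameters seen in the URL above; the helper name `build_start_urls`, the constant `BASE_URL` and the default page range are illustrative choices, not part of the original spider.

```python
from urllib.parse import urlencode

BASE_URL = "http://1360.com/index.php"  # taken from the URL in the post


def build_start_urls(first_page=1, last_page=364, keyword="建站"):
    """Return one listing URL per page, with the keyword percent-encoded."""
    urls = []
    for page in range(first_page, last_page + 1):
        # urlencode percent-encodes non-ASCII values as UTF-8.
        query = urlencode({"m": "need", "a": "index", "p": page, "keyword": keyword})
        urls.append(f"{BASE_URL}?{query}")
    return urls


start_urls = build_start_urls()
print(len(start_urls))
print(start_urls[0])
```

Letting urlencode handle the keyword avoids mixing a raw and an encoded form of the same parameter in different places.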

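Mistake 3: When a td cell contains nested tags such as span, I did not know that a helper called remove_tags can strip the tags directly. Until now I had extracted each piece of text one by one and concatenated them.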

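import re

from w3lib.html import remove_tags  # assumption: remove_tags comes from w3lib, a Scrapy dependency


def go_remove_tag(value):
    # Strip the HTML tags and keep only the text.
    content = remove_tags(value)
    # Remove all whitespace, including spaces, tabs and line breaks.
    return re.sub(r'\s', '', content)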

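# Do not select text() here: the cell holds several text nodes and
# extract_first() on text() would return only one of them.
# Select the whole element instead, then strip its tags.
rest_time = li.xpath('td[4]/time').extract_first()
# Strip the tags and whitespace.
item['rest_time'] = go_remove_tag(rest_time)
demand_process = li.xpath('td[5]/p').extract_first()
item['demand_process'] = go_remove_tag(demand_process)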
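
To show what this kind of helper produces, here is a minimal sketch that runs without Scrapy installed. It swaps remove_tags for a rough regular-expression stand-in, and it adds a None check because extract_first() returns None when nothing matches. The function name `strip_tags_and_whitespace` and the sample markup are hypothetical.

```python
import re


def strip_tags_and_whitespace(fragment):
    """Return the text of an HTML fragment with tags and whitespace removed.

    The tag regex is only a rough stand-in for a real tag stripper.
    None is passed through unchanged so a missing cell does not raise.
    """
    if fragment is None:
        return None
    text = re.sub(r"<[^>]+>", "", fragment)  # drop anything that looks like a tag
    return re.sub(r"\s+", "", text)          # drop spaces, tabs and line breaks


# Hypothetical cell markup with nested span tags.
sample = "<time>\n  <span>3</span> days <span>5</span> hours\n</time>"
cleaned = strip_tags_and_whitespace(sample)
print(cleaned)
```

Selecting the whole element and cleaning it once keeps all the nested text together, which is what the one-by-one extraction was trying to achieve by hand.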

Mistake 4: I had enabled the following settings (shown commented out here), which made the crawl extremely slow.

# DOWNLOAD_DELAY = 3
# HTTPCACHE_ENABLED = False
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
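
Of these settings, DOWNLOAD_DELAY is the one that directly throttles requests: Scrapy waits roughly that many seconds between consecutive requests to the same site. The following is a rough back-of-the-envelope sketch, assuming the 364 listing pages from mistake 2 and ignoring download and parsing time; the function name `estimated_wait_seconds` is hypothetical.

```python
def estimated_wait_seconds(page_count, download_delay):
    """Rough lower bound on time spent waiting between sequential requests."""
    gaps = max(page_count - 1, 0)  # one wait between each pair of requests
    return gaps * download_delay


pages = len(range(1, 365))  # the page range used for start_urls
wait = estimated_wait_seconds(pages, 3)
print(f"{wait} seconds of waiting, about {wait / 60:.0f} minutes")
```

So a three-second delay alone adds on the order of a quarter of an hour to a crawl of this size, before any page is actually downloaded or parsed.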

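Mistake 5: When the data is spread over more than one page, I tend to scrape the first page first and go to the second page only for what the first page lacks. In practice the second page also contains everything shown on the first page, so all fields can be extracted from the second page alone.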

Mistake 6: My code is hard to maintain. Once the page layout changes, keeping the spider working will become very difficult.
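
One possible way to soften this, sketched under assumptions, is to keep every path expression in a single mapping so a layout change means editing one table instead of hunting through the parsing code. The example uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors so that it runs on its own; the mapping `FIELD_PATHS`, the helper `extract_fields` and the sample row are hypothetical.

```python
import xml.etree.ElementTree as ET

# Field name -> path relative to one table row.
# If the page layout changes, only this mapping needs to be updated.
FIELD_PATHS = {
    "demand_title": "td[1]/a",
    "rest_time": "td[4]/time",
}


def extract_fields(row, field_paths):
    """Apply every configured path to one row element and return a dict."""
    return {name: row.findtext(path) for name, path in field_paths.items()}


# Hypothetical row markup with the title in column 1 and the time in column 4.
row = ET.fromstring(
    "<tr><td><a>Build a site</a></td><td/><td/><td><time>3 days</time></td></tr>"
)
record = extract_fields(row, FIELD_PATHS)
print(record)
```

In a Scrapy spider the same idea would apply to the XPath strings passed to li.xpath(); the parsing loop stays unchanged while the mapping carries the page-specific details.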