Web Scraping Notes

鱼鱼9901

Category: Python. Tags: web scraping

Article link: https://blog.csdn.net/weixin_72100405/article/details/135745273

Fetching the page content:

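import requests

def getHTMLText(url):
    try:
        # A response should be closed once you are done with it, either by calling
        # r.close() yourself or, as here, by letting the with statement do it.
        with requests.get(url, timeout=30, stream=False) as r:
            r.raise_for_status()  # raise an exception on 4xx/5xx status codes
            r.encoding = r.apparent_encoding  # guess the encoding from the body
            return r.text
    except requests.RequestException:
        # Network errors, timeouts, and bad status codes all end up here.
        return " "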

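Then use regular expressions to extract the content you want. Press F12 on the original page to open the developer tools and find the markup to match.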
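
As a minimal sketch of this matching step, the snippet below pulls link targets and link text out of an HTML string with re.findall. The sample markup, the pattern, and the name extract_links are assumptions made for illustration; the real pattern depends on what the developer tools show for the page you are scraping.

```python
import re

# Hypothetical markup, standing in for the text returned by getHTMLText.
SAMPLE_HTML = """
<ul>
  <li><a class="title" href="/post/1">First post</a></li>
  <li><a class="title" href="/post/2">Second post</a></li>
</ul>
"""

# Non-greedy groups capture the href value and the link text.
LINK_PATTERN = re.compile(r'<a class="title" href="(.*?)">(.*?)</a>', re.S)

def extract_links(html):
    """Return a list of (href, text) tuples found in html."""
    return LINK_PATTERN.findall(html)

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

Regular expressions work well for small, regular fragments like this one. For deeply nested or irregular markup an HTML parser is usually less fragile.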

After scraping each page, remember to pause:

import time
time.sleep(30)  # throttle requests so that your IP does not get banned
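
One way to build the pause into a crawl loop is sketched below. The function name crawl and its parameters (fetch, delay, sleep) are hypothetical. This version waits between requests rather than after the last one, and the sleep function is passed in so the loop can be tried out without actually waiting.

```python
import time

def crawl(urls, fetch, delay=30, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

With the function from the start of the post, a call might look like crawl(url_list, getHTMLText), where url_list is whatever list of page URLs you have collected.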

And when saving the content:

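ttttt = f[:-2]  # f is assumed to be a string here; drop its last two characters
with open("/your_path/{}.txt".format(ttttt), "w", encoding="utf-8") as fout:
    fout.write(text)
# Build the file name before the with statement, as above. A slice such as f[:-2] is
# valid Python inside the open() call, so an error there most likely comes from using
# f for both the string and the file handle: after one pass f is a file object, which
# cannot be sliced. Giving the file handle its own name (fout) avoids this.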
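
The saving step can also be wrapped in a small helper, sketched here under a few assumptions: the name save_text, its parameters, and the use of os.path.join for the output directory are all hypothetical and not part of the original code.

```python
import os

def save_text(name, text, out_dir):
    """Write text to <out_dir>/<name minus its last two characters>.txt."""
    stem = name[:-2]  # compute the file name before opening the file
    path = os.path.join(out_dir, "{}.txt".format(stem))
    # The file handle gets its own name so it never shadows the name string.
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
    return path
```

Returning the path makes it easy to log or check where each page was written.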