A Beginner's Web Crawler

夏安code

Last modified: 2022-07-16 11:56:47


Category: python. Tags: python, crawler, programming language

First published: 2019-09-19 08:47:11

Article link: https://blog.csdn.net/Xu_programmer/article/details/84887333


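The basic workflow of a Python crawler is: send a request, parse the page, and process the page data. Python has many libraries for fetching pages, such as the third-party requests and the standard library's urllib. For parsing, the simplest approach is regular expressions with re, and there are also dedicated parsing libraries such as lxml and BeautifulSoup. The examples below use requests and regular expressions.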

I. Send the request and get the page text. Code first:

import requests

# Send a GET request
response = requests.get('http://www.baidu.com')
print(response.status_code)  # print the status code
print(response.url)          # print the request URL
print(response.headers)      # print the response headers
print(response.cookies)      # print the cookies
html = response.text         # page source as text
print(response.content)      # print the page source as raw bytes
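The introduction also names urllib as an option for fetching pages. As a point of comparison with the requests code above, here is a minimal sketch using the standard library's urllib.request. The helper name fetch_text, the 10-second timeout, and the UTF-8 fallback are illustrative choices, not something from the original post.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body decoded as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Use the charset declared in the Content-Type header, if any;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset, errors='replace')
```

Calling fetch_text('http://www.baidu.com') would return the page source as a string, roughly what response.text gives you with requests.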

II. Next, parse the page

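There are quite a few techniques for parsing a page; choose one based on the kind of data you got back. If the response is a JSON string, parse it with the json module, which is fairly simple:

import json

data = json.loads(html)  # parse the JSON text into Python objects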
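To make the JSON case concrete, here is a small sketch. The sample payload and its field names (code, data, items, title) are invented for illustration; a real API will have its own structure.

```python
import json

# A made-up response body standing in for response.text
html = '{"code": 0, "data": {"items": [{"title": "first"}, {"title": "second"}]}}'

parsed = json.loads(html)  # JSON text -> Python dict/list objects

# Use .get() so a missing key yields a default instead of raising KeyError
items = parsed.get('data', {}).get('items', [])
titles = [item['title'] for item in items]
print(titles)
```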

Other parsing approaches include BeautifulSoup and lxml. For the details, see my other posts:

1. Parsing with BeautifulSoup

2. Parsing with lxml
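The introduction mentions regular expressions as the simplest way to parse a page. Here is a minimal sketch on a made-up HTML fragment. The markup and the pattern are assumptions for illustration; regex matching tends to be brittle on real-world HTML, which is where BeautifulSoup or lxml help.

```python
import re

# A made-up HTML fragment standing in for response.text
html = '''
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a></li>
</ul>
'''

# Capture the href value and the link text of each <a> tag.
# The non-greedy (.*?) stops at the first possible match.
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

links = link_pattern.findall(html)  # list of (href, text) tuples
for href, text in links:
    print(href, text)
```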

After that, you can process the data and store it in a database.
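As one way to do the storage step, here is a minimal sketch using the standard library's sqlite3 module. The table name pages, its columns, and the sample rows are hypothetical; the original post does not say which database it uses.

```python
import sqlite3

# Made-up rows standing in for parsed crawl results: (url, title)
rows = [
    ('/post/1', 'First post'),
    ('/post/2', 'Second post'),
]

# ':memory:' keeps the demo self-contained; pass a file path to persist data
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)')

# Parameterized placeholders (?) keep scraped text from being run as SQL;
# INSERT OR REPLACE keeps re-crawled URLs from raising a uniqueness error.
conn.executemany('INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)', rows)
conn.commit()

stored = conn.execute('SELECT url, title FROM pages ORDER BY url').fetchall()
print(stored)
conn.close()
```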

Below is an example of a POST request. Run it with the command python filename.py. The script can also take arguments from the command line: python filename.py arg1 arg2

import json

import requests

# Target URL, left blank in the original; fill in the endpoint to call
detail_url = ''


def get_detail(param1, param2):
    # Request headers
    headers = {
        'Host': 'tjcanger.com',
        'Connection': 'keep-alive',
        'content-type': 'application/x-www-form-urlencoded',
        'Accept-Encoding': 'gzip, deflate, br',
    }

    # Form fields to send
    data = {
        'param1': param1,
        'param2': param2,
    }
    # Passing a dict via data= sends it form-encoded, which matches the
    # content-type header above. To send a JSON body, use json=data instead.
    response = requests.post(url=detail_url, data=data, headers=headers)
    response.encoding = 'utf-8'
    # print(response.status_code)
    # print(response.url)
    # print(response.headers)
    # print(response.cookies)
    # response.content
    html = response.text
    print(html)
    # Assumes the response is JSON with a top-level 'data' key
    result = json.loads(html)['data']
    return result


def main():
    # Placeholder values; replace with the real parameters
    param1 = ''
    param2 = ''
    get_detail(param1, param2)


if __name__ == '__main__':
    main()
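The script above passes a dict through data=, which requests sends as a form-encoded body; passing the same dict through json= would send a JSON body instead. The sketch below uses only the standard library to show what the two body formats look like. The parameter values are made up.

```python
import json
from urllib.parse import urlencode

# Made-up parameter values for illustration
data = {'param1': 'abc', 'param2': '123'}

# Form encoding, the format that goes with
# Content-Type: application/x-www-form-urlencoded
form_body = urlencode(data)

# JSON encoding, the format that goes with Content-Type: application/json
json_body = json.dumps(data)

print(form_body)
print(json_body)
```

Which one to use depends on what the server expects, so the content-type header and the body format should agree.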

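import json
import sys

import requests

# Command-line variant: get_detail() is the same as in the script above;
# only main() changes.


def main():
    # Read the two arguments passed after the script name
    param1 = sys.argv[1]
    param2 = sys.argv[2]
    get_detail(param1, param2)


if __name__ == '__main__':
    main()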
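Reading sys.argv directly raises IndexError when an argument is missing. One alternative, sketched here under the assumption that both parameters are required strings, is the standard library's argparse, which prints a usage message instead. The function name parse_args and the help texts are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(description='Fetch detail data via POST')
    parser.add_argument('param1', help='first request parameter')
    parser.add_argument('param2', help='second request parameter')
    # argv=None makes argparse read sys.argv[1:]
    return parser.parse_args(argv)


# Passing a list explicitly is handy for trying the parser without a shell
args = parse_args(['abc', '123'])
print(args.param1, args.param2)
```

In main(), calling parse_args() with no arguments and then get_detail(args.param1, args.param2) would take the place of the two sys.argv lines.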
