Scraping the "Hot Topics" section of Zhihu's "Explore" page with Python

Goal: save each question and its answer as plain text.

import requests
from pyquery import PyQuery as pq
 
url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
html = requests.get(url, headers=headers).text
doc = pq(html)
items = doc('.explore-tab .feed-item').items()
# Open the output file once, in append mode, instead of reopening it for every item
with open('explore.txt', 'a', encoding='utf-8') as file:
    for item in items:
        question = item.find('h2').text()
        author = item.find('.author-link-line').text()
        answer = pq(item.find('.content').html()).text()
        # '\n'.join(...) returns a new string made of the list elements, separated by newlines
        file.write('\n'.join([question, author, answer]))
        file.write('\n' + '=' * 50 + '\n')

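Background:

1. Scraping with the pyquery library. For basic pyquery usage (thanks to Mr. Cui), see https://cuiqingcai.com/5551.html

2. Using join (thanks to the Runoob tutorial): http://www.runoob.com/python/att-string-join.html

str.join() accepts a tuple, list, dict, or string whose elements are strings (for a dict, the keys are joined), and it always returns a string.

So whenever you have built a tuple, list, or dict of strings, you can use join() to turn it into a single string.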
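A short sketch of the join() behavior described above. The sample values are made up for illustration; note that non-string elements have to be converted with str() first, otherwise join() raises a TypeError.

```python
# Sample values are made up for illustration
fields = ['question', 'author', 'answer']

# List -> one string, one element per line (the pattern used in the scraper)
as_lines = '\n'.join(fields)

# Tuple -> string
from_tuple = ', '.join(('a', 'b', 'c'))

# Dict -> string: only the keys are joined
from_dict = '-'.join({'x': 1, 'y': 2})

# String -> string: each character is treated as one element
from_str = '.'.join('abc')

# Non-string elements must be converted before joining
from_numbers = ','.join(str(n) for n in [1, 2, 3])

print(as_lines)
print(from_tuple, from_dict, from_str, from_numbers)
```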

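3. A simple way to save the output to a TXT file: open it in append mode ('a') with encoding='utf-8' and write each record, as the script above does.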
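A minimal sketch of that saving pattern, separated from the scraping code. The helper name save_records and the sample record are hypothetical and only here for illustration; the with statement closes the file automatically, even if a write fails.

```python
import os
import tempfile


def save_records(path, records):
    """Append each record (a list of strings) to a UTF-8 text file.

    Hypothetical helper, not part of the original script.
    """
    with open(path, 'a', encoding='utf-8') as f:
        for fields in records:
            f.write('\n'.join(fields))
            f.write('\n' + '=' * 50 + '\n')


# Made-up sample data standing in for scraped results
sample = [['question one', 'author one', 'answer one']]

# Write into a fresh temporary directory so repeated runs do not pile up
out_path = os.path.join(tempfile.mkdtemp(), 'explore_demo.txt')
save_records(out_path, sample)

with open(out_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

Opening the file once and writing every record inside the same with block avoids reopening it on each loop iteration.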

 

Scraping results (appended to explore.txt):

 

Extension: scraping book information from Douban Books with the pyquery library

import requests
from pyquery import PyQuery as pq
 
url = 'https://book.douban.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
html = requests.get(url, headers=headers).text
doc = pq(html)
items = doc('.list-col .list-col5 .list-express .slide-item').items()

for item in items:
    # The .info block holds the book details, so name the variable accordingly
    info = item.find('.info').text()
    print(info)
    

'''
    author = item.find('.author-link-line').text()
    answer = pq(item.find('.content').html()).text()
    file = open('explore_test_one.txt', 'a', encoding='utf-8')
    file.write('\n'.join([question, author, answer]))
    file.write('\n' + '=' * 50 + '\n')
    file.close()
'''

This keeps failing. I'll come back to it in a while.
