I picked a blogger's CSDN profile page at random.
First, send a request and print the returned HTML.
import requests
from pyquery import PyQuery as pq
import pandas as pd

url = 'https://me.csdn.net/wushaowu2014'
# Send a browser-like User-Agent header with the request.
headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}
response = requests.get(url, headers=headers)
print(response.text)
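One way to judge whether a page depends on Ajax is to check whether the elements you want already appear in the raw HTML. The sketch below does this with the standard library only. The helper names `ClassFinder` and `count_class` and the sample markup are made up for illustration; only the class name `tab_page_list` comes from this post.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in html carry class_name."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.count


# Hypothetical markup standing in for response.text.
sample_html = (
    '<ul>'
    '<li class="tab_page_list"><a>Post A</a></li>'
    '<li class="tab_page_list clearfix"><a>Post B</a></li>'
    '</ul>'
)
print(count_class(sample_html, "tab_page_list"))
```

If the count for the real `response.text` is above zero, the entries are already in the static HTML, so a plain request plus an HTML parser is probably enough.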
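The post list is already present in the returned HTML, so the page does not rely on Ajax and we can parse it directly with pyquery.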
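doc = pq(response.text)
rows = []
# Each .tab_page_list element holds one post entry.
for result in doc('.tab_page_list').items():
    rows.append({
        'title': result('a').text(),
        'num': result('em').text(),
        'time': result('.fr').text(),
    })
# DataFrame.append was removed in pandas 2.0, so collect the rows first.
# The explicit column order leaves title last, as in the output discussed next.
df = pd.DataFrame(rows, columns=['num', 'time', 'title'])
df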
The output is as follows:
Next, let's move the title column to the first position.
title = df['title']
df.drop(labels=['title'], axis=1, inplace=True)
df.insert(0, 'title', title)
df
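As an alternative sketch, the same reordering can be done without modifying the frame in place, by selecting the columns in the desired order. The helper name `move_column_first` and the sample row are made up for illustration and are not scraped data.

```python
import pandas as pd


def move_column_first(frame, column):
    """Return a new DataFrame with `column` moved to the first position."""
    others = [c for c in frame.columns if c != column]
    return frame[[column] + others]


# Hypothetical one-row frame with the same column names as above.
sample = pd.DataFrame(
    {"num": ["10"], "time": ["2019-01-01"], "title": ["Example post"]}
)
reordered = move_column_first(sample, "title")
print(list(reordered.columns))
```

Because column selection returns a new object, `sample` keeps its original column order, which can be handy if the original layout is still needed.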
OK, done!