A first attempt at Python web scraping — scraping the Douban Top 250
Prerequisite: User-Agent
Python scraping basics: parsing page content with BeautifulSoup
Using find and find_all in BeautifulSoup
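A minimal sketch of the difference between find and find_all, using a made-up HTML snippet (not the real Douban markup): find returns only the first match, find_all returns a list of every match.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only
html = '<div><span class="title">A</span><span class="title">B</span></div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('span', attrs={'class': 'title'})          # first match only
all_spans = soup.find_all('span', attrs={'class': 'title'})  # list of all matches

print(first.string)                    # A
print([s.string for s in all_spans])   # ['A', 'B']
```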
Douban Top 250
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

for i in range(0, 250, 25):  # 10 pages, 25 movies per page
    print(i)
    response = requests.get(f"https://movie.douban.com/top250?start={i}", headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_titles = soup.find_all(name="span", attrs={"class": "title"})
    for title in all_titles:
        title_string = title.string
        # skip the alternate-title span, which starts with "/"
        if '/' not in title_string:
            print(title_string)
            with open('data.txt', 'a', encoding='utf-8') as file:
                file.write(title_string + '\n')
Note: prefixing a Python string with f marks it as an f-string, which lets you embed expressions directly in the string inside braces.
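A quick f-string example (the values here are placeholders): anything inside {} is evaluated and its result is embedded in the string.

```python
name = "Douban"
start = 25
# The expression inside {} is evaluated and inserted into the string
url = f"https://movie.douban.com/top250?start={start}"
print(url)                            # https://movie.douban.com/top250?start=25
print(f"{name} has {10 * 25} entries")  # Douban has 250 entries
```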
Saving to a txt file
data = ["data1", "data2", "data3"]
with open('data.txt', 'w', encoding='utf-8') as file:
    for item in data:
        file.write(item + '\n')
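A sketch of the write/read round trip (demo.txt is a placeholder filename): 'w' overwrites the file on each run, while the 'a' mode used in the scraper appends, so the scraper accumulates titles across pages.

```python
data = ["data1", "data2", "data3"]
with open('demo.txt', 'w', encoding='utf-8') as f:  # 'w' overwrites any existing content
    for item in data:
        f.write(item + '\n')

# Read it back to confirm what was written
with open('demo.txt', 'r', encoding='utf-8') as f:
    lines = [line.strip() for line in f]
print(lines)  # ['data1', 'data2', 'data3']
```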
Saving to an xlsx file
import pandas as pd

data = {'a': [1, 2, 3], 'b': ['a', 'b', 'c']}
df = pd.DataFrame(data)
df.to_excel('data.xlsx', index=False)  # requires the openpyxl package
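Since to_excel depends on an extra engine (openpyxl), to_csv is a dependency-free alternative worth knowing; a sketch of the round trip, with a filename chosen for illustration:

```python
import pandas as pd

data = {'a': [1, 2, 3], 'b': ['a', 'b', 'c']}
df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)  # no extra engine needed

# Read it back to check the round trip
df2 = pd.read_csv('data.csv')
print(df2.equals(df))  # True
```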