I'm not very familiar with Python web scraping and have run into the problem below; I'd appreciate it if anyone could point me in the right direction.
I need to scrape the words from a wordbook on the Shanbay vocabulary site. Wordbooks without a chapter directory work fine, but when a book has one level of chapters my script can't follow them: I have to open each chapter by hand and feed its URL in before the words under that chapter can be scraped.
Here is my code:
from bs4 import BeautifulSoup
import requests

file = open("vocabulary.txt", "w")

def spider(url):
    f = requests.get(url)
    soup = BeautifulSoup(f.content, "lxml")
    # The words sit inside <strong> tags; get_text() is more reliable than
    # running a regex over str(tag), which also matched the tag name itself.
    for tag in soup.select('strong'):
        text = tag.get_text(strip=True)
        if text:  # skip empty tags instead of indexing into an empty result
            print(text)
            file.writelines((text, "\n"))

url_list = ["https://www.shanbay.com/wordlist/80770/87931/",
            "https://www.shanbay.com/wordlist/80770/89734/"]
unit = 1
for url in url_list:
    file.write("\n# Chapter " + str(unit) + "\n")
    unit += 1
    for i in range(1, 11):
        spider(url + "?page=" + str(i))
file.close()
Wordbook URL: https://www.shanbay.com/wordlist/80770
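Since the chapter pages share the URL shape `/wordlist/<book-id>/<chapter-id>/`, one direction is to fetch the wordbook page once, collect every link matching that shape, and then run the existing spider over the collected URLs instead of hard-coding them. The sketch below uses only the standard library; the inline `book_html` is a made-up stand-in for the real wordbook page, whose actual markup I haven't verified, so the link-matching rule is an assumption you'd need to check against the live page.

```python
import re
from html.parser import HTMLParser

class ChapterLinkParser(HTMLParser):
    """Collect hrefs that look like /wordlist/<book-id>/<chapter-id>/."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Assumed URL shape for chapter links; verify on the real page.
        if re.fullmatch(r"/wordlist/\d+/\d+/", href):
            self.urls.append("https://www.shanbay.com" + href)

# Hypothetical stand-in for the HTML of the wordbook page.
book_html = (
    '<a href="/wordlist/80770/87931/">Unit 1</a>'
    '<a href="/wordlist/80770/89734/">Unit 2</a>'
    '<a href="/about/">not a chapter</a>'
)

parser = ChapterLinkParser()
parser.feed(book_html)
print(parser.urls)
```

In the real script you would replace `book_html` with `requests.get("https://www.shanbay.com/wordlist/80770").text` and then loop `for url in parser.urls: ...` exactly as the existing code does with its hard-coded `url_list`.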