No matter how much theory you learn, you still can't write code unless you actually get hands-on, so today I took a bit of time and wrote two common examples.
1. Scraping images from Baidu Tieba
import requests
from lxml import etree
import json


class Tieba():
    def __init__(self, name):
        self.name = name
        self.header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko)"}

    def get_url_list(self):
        # Build the list-page URLs for the first 5 pages (50 threads per page)
        url = "https://tieba.baidu.com/f?kw=" + self.name + "&ie=utf-8&pn={}&"
        url_list = []
        for i in range(5):
            url_list.append(url.format(i * 50))
        return url_list

    def parse_url(self):
        # Collect the thread links from every list page
        link_list = []
        for url in self.get_url_list():
            response = requests.get(url, headers=self.header)
            html = etree.HTML(response.text)
            links = html.xpath("//div[@class='t_con cleafix']//div[@class='threadlist_lz clearfix']/div/a/@href")
            link_list.extend(links)
        return link_list

    def get_img(self):
        # Visit each thread and collect the image URLs it contains
        img_list = []
        for link in self.parse_url():
            url = "https://tieba.baidu.com" + link
            html = etree.HTML(requests.get(url, headers=self.header).text)
            imgs = html.xpath('//div/img[@class="BDE_Image"]/@src')
            print(imgs)
            img_list.extend(imgs)
        return img_list

    def save(self):
        # Dump the collected image URLs to a text file as JSON
        img_list = self.get_img()
        with open('test.txt', 'w') as f:
            f.write(json.dumps(img_list, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    tieba = Tieba("六学")
    tieba.save()
I ran into a few problems along the way. At first the crawl kept returning empty lists; a search on Baidu turned up the cause, and switching to a different User-Agent fixed it.
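Note that save() above only stores the image URLs in test.txt; the image files themselves are never downloaded. If you also want the files, a sketch like the following could be bolted on (the download_images helper, the tieba_imgs directory, and the index-based file names are my own additions for illustration, not part of the original script):

import os
import requests

def download_images(img_urls, out_dir="tieba_imgs"):
    # Sketch only: fetch each collected image URL and write the bytes to disk.
    os.makedirs(out_dir, exist_ok=True)
    header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko)"}
    for i, img_url in enumerate(img_urls):
        resp = requests.get(img_url, headers=header)
        # Name files by index; fall back to .jpg if the URL has no extension.
        ext = os.path.splitext(img_url)[1] or ".jpg"
        with open(os.path.join(out_dir, "{}{}".format(i, ext)), "wb") as f:
            f.write(resp.content)

# e.g. download_images(Tieba("六学").get_img())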
2. Scraping jokes from Qiushibaike
import requests
from lxml import etree


class Qiushi():
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko)"}
        self.url = "https://www.qiushibaike.com/text/page/{}/"

    def parse_url(self, url):
        # Fetch the page and decode the raw bytes as UTF-8 explicitly
        response = requests.get(url, headers=self.headers)
        html = response.content.decode('utf-8')
        return etree.HTML(html)

    def parse_content(self, url):
        # Extract each joke's text and append it to a UTF-8 encoded file
        html = self.parse_url(url)
        contents = html.xpath("//div[@class='content']/span/text()")
        with open('test1.txt', 'a', encoding='utf-8') as f:
            for content in contents:
                print(content)
                f.write(content)
                f.write('\n')

    def run(self):
        url = self.url.format(1)
        self.parse_content(url)


if __name__ == '__main__':
    qiushi = Qiushi()
    qiushi.run()
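One thing to note: run() only formats page 1, even though self.url is a page template. A paged version could look like this drop-in replacement for run() (just a sketch; looping over 5 pages is an arbitrary choice to mirror the Tieba example):

    def run(self):
        # Fetch the first 5 pages instead of only page 1 (the page count is arbitrary)
        for page in range(1, 6):
            url = self.url.format(page)
            self.parse_content(url)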
With this one, the scraped text kept coming out garbled. Printing content directly with print(content) looked fine, but the text written to the file was garbled, so opening the file with encoding='utf-8' solved it.
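For reference, the issue comes down to which encoding is used at each step. A minimal standalone sketch, assuming the page really is UTF-8 encoded (demo.txt is just a throwaway file name for illustration):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko)"}
response = requests.get("https://www.qiushibaike.com/text/page/1/", headers=headers)

# response.content is the raw bytes; decoding it yourself is explicit,
# instead of letting requests guess the encoding from the response headers.
text = response.content.decode("utf-8")

# Without encoding=..., open() uses the platform default (often GBK on a
# Chinese Windows install), so the file can end up in a different encoding
# than the UTF-8 a viewer expects and the characters look garbled.
with open("demo.txt", "w", encoding="utf-8") as f:
    f.write(text[:200])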
I'm still a beginner, so please point out any problems you find.