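Learning goal:
Python study notes 22: daily data-scraping practice
Contents:
1. Scrape the novel summaries from the Biquge (笔趣阁) home page
2. Scrape posts from Qiushibaike (糗事百科) using starts-with
3. Fetch images from uppsd.com (优图网): a leading // matches img tags at any depth, and the image URL is read from the data-original attribute
4. Scrape the non-image content of Anjuke (安居客) listings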
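1. Scrape the novel summaries from the Biquge (笔趣阁) home page
import requests
from lxml import etree

# Assumed request headers; the original does not show how `headers` was defined.
headers = {'User-Agent': 'Mozilla/5.0'}

source = requests.get('http://www.xbiquge.la', headers=headers).text
# Each <li> under #newscontent is one novel entry.
rows = etree.HTML(source).xpath('//*[@id="newscontent"]/div[1]/ul/li')
for row in rows:
    category = row.xpath('span[1]/text()')  # avoids shadowing the built-in type()
    books = row.xpath('span[2]/a/text()')
    chapter = row.xpath('span[3]/a/text()')
    author = row.xpath('span[4]/text()')
    print(category, books, chapter, author)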
Output:
['[都市小说]'] ['摆个摊就能成神豪'] ['第160章 完全没有用武之地'] ['小老叔']
['[其他小说]'] ['荒野的黑客'] ['第四十三章 楞憨哼,揍我'] ['云外一声鸡']
['[其他小说]'] ['冷宫皇后皆寂寞'] ['第99章:母子之间的较量'] ['非也大人']
['[修真小说]'] ['西游之开局拒绝大闹天宫'] ['第二百八十九章 最弱的圣人'] ['我气化三清']
['[都市小说]'] ['人在末世也种田'] ['35、你老公和一个女人在一起呐'] ['小风猴猴']
.........
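Each xpath() call returns a list, which is why every field in the output above is wrapped in brackets. A minimal sketch of flattening one row into plain strings, assuming each field holds at most one text node; the helper name `first_text` is hypothetical and not part of the original script:

```python
def first_text(nodes, default=''):
    """Return the first xpath text result, stripped, or a default if the list is empty."""
    return nodes[0].strip() if nodes else default


# One row in the shape produced by the four xpath() calls in the loop above.
row = (['[都市小说]'], ['摆个摊就能成神豪'], ['第160章 完全没有用武之地'], ['小老叔'])
category, book, chapter, author = (first_text(field) for field in row)
print(category, book, chapter, author)
```

Returning a default for an empty list also keeps the loop from failing on an entry that lacks one of the four spans.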
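2. Scrape posts from Qiushibaike (糗事百科) using starts-with
XPaths copied from the browser for a few posts; the ids differ only in their numeric suffix:
//*[@id="qiushi_tag_123983036"]/div[1]/a[2]/h2
//*[@id="qiushi_tag_123884600"]/a[1]/div/span
//*[@id="qiushi_tag_124000602"]
//*[@id="qiushi_tag_124000602"]/div[1]/a[2]/h2
//*[@id="qiushi_tag_124002094"]/a[1]/div/span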
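import requests
from lxml import etree

source = requests.get('https://www.qiushibaike.com/text/').text
# Match every element whose id begins with "qiushi_tag_", i.e. one node per post.
posts = etree.HTML(source).xpath('//*[starts-with(@id, "qiushi_tag_")]')
for post in posts:
    text = post.xpath('a/div/span[1]/text()')
    name = post.xpath('div[1]/a[2]/h2/text()')
    # Fall back to an empty author so name[0] cannot raise IndexError.
    author = name[0].replace('\n', '') if name else ''
    for line in text:
        print(author, line)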
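The starts-with() match can be tried offline on a small document. The sketch below uses hypothetical, heavily simplified markup (the real page is nested more deeply) to show that only elements whose id begins with the prefix are selected:

```python
from lxml import etree

# Hypothetical markup for illustration only; not the real page structure.
html = '''
<div id="content">
  <div id="qiushi_tag_1"><h2>alice</h2></div>
  <div id="qiushi_tag_2"><h2>bob</h2></div>
  <div id="sidebar"><h2>ads</h2></div>
</div>
'''

tree = etree.HTML(html)
# starts-with(@id, prefix) keeps only elements whose id begins with the prefix.
posts = tree.xpath('//*[starts-with(@id, "qiushi_tag_")]')
# Relative path: h2 is looked up inside each matched element.
names = [post.xpath('h2/text()')[0] for post in posts]
print(names)
```

The elements with id "content" and "sidebar" are skipped because their ids do not begin with the prefix.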
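3. Fetch images from uppsd.com (优图网): a leading // matches img tags at any depth, and the image URL is read from the data-original attribute
import requests
from lxml import etree
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; not shown in the original

for page in range(1, 2):  # range(1, 2) covers page 1 only
    source = requests.get('http://www.uppsd.com/search-0-20-0-0-1-p' + str(page), headers=headers).text
    # //img matches img tags at any depth; @data-original returns the attribute value.
    urls = etree.HTML(source).xpath('//img[@class="lazy"]/@data-original')
    for url in urls:
        pic = requests.get(url).content  # raw image bytes
        print(pic)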
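Printing the raw bytes only confirms that the download worked. A minimal sketch of writing them to disk instead, assuming the last component of the URL path is a usable file name; the function `save_image` and the default directory `images` are hypothetical names, not part of the original script:

```python
import os
from urllib.parse import urlparse


def save_image(data, url, out_dir='images'):
    """Write raw image bytes to out_dir, naming the file after the URL path."""
    os.makedirs(out_dir, exist_ok=True)
    # Fall back to a fixed name if the URL path has no final component.
    name = os.path.basename(urlparse(url).path) or 'image.bin'
    path = os.path.join(out_dir, name)
    with open(path, 'wb') as f:
        f.write(data)
    return path
```

Inside the download loop this would be called with the bytes returned by requests.get(...).content and the URL they came from. Two images that share a file name would overwrite each other, so a real script might add a counter or a hash to the name.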
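4. Scrape the non-image content of Anjuke (安居客) listings
import requests
from lxml import etree
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; not shown in the original

source = requests.get('https://tianjin.anjuke.com/sale/?from=navigation', headers=headers).text
listings = etree.HTML(source).xpath('//*[@id="__layout"]/div/section/section[3]/section[1]/section[2]/div')
for item in listings:
    title = item.xpath('a/div[2]/div[1]/div[1]/h3/text()')
    print(title)
    # Text inside the <span> children of the matched <p> elements.
    txt = item.xpath('a/div[2]/div[1]/section/div[1]/p/span/text()')
    print(txt)
    # Text nodes that sit directly in the same <p> elements, outside the spans.
    content = item.xpath('a/div[2]/div[1]/section/div[1]/p/text()')
    print(content)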
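Lists returned by text() often carry the page's indentation and newlines along with the actual values. A small sketch of tidying such a list, assuming the results contain that kind of whitespace; the helper `clean` and the sample values are hypothetical:

```python
def clean(texts):
    """Strip surrounding whitespace and drop entries that are empty afterwards."""
    return [t.strip() for t in texts if t.strip()]


# Hypothetical raw result, in the shape that xpath('.../text()') returns.
raw = ['\n      3 rooms  ', '\n', ' 120 m2 ']
print(clean(raw))
```

Applying the same helper to the title, span text, and paragraph text lists would make the three printed lines easier to compare.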