A Hug for My Humble Little Self
The two-week training program is finally over! In one word: exhausting, especially last week! But all that hard work paid off. Last week I built a front-end project, and this week a Python web-scraping project, so I have to give myself some credit, haha! Shameless, I know. Source code below!
Here is the source code for scraping 电影天堂 (Movie Heaven, dytt8.net):
import requests
from lxml import etree
import sqlite3
import datetime

conn = sqlite3.connect('movie.db')
cursor = conn.cursor()
# create the table on first run, otherwise the insert below fails
cursor.execute('create table if not exists movie (m_id text, m_name text, m_img text)')
sql = 'insert into movie (m_id, m_name, m_img) values (:m_id, :m_name, :m_img)'

url_news = 'https://www.dytt8.net/html/gndy/dyzz/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
response = requests.get(url_news, headers=headers)
print(response.status_code)  # HTTP status code, 200 on success
response.encoding = 'gb2312'  # the site serves its Chinese pages in gb2312

if response.status_code == 200:
    print('Starting to scrape...')
    domroot = etree.HTML(response.text)  # root node of the parsed page
    movielist = domroot.xpath('//td[@height="26"]/b/a')  # XPath rule for the movie links
    print('Number of movies:', len(movielist))
    # iterate over the movie links on the listing page
    for x in movielist:
        title = x.xpath('text()')
        if title:
            title = title[0]
        else:
            continue
        print('Movie title:', title)
        url = x.xpath('@href')
        if url:
            url = url[0]
        else:
            continue
        url = 'https://www.dytt8.net' + url  # hrefs are relative, prepend the domain
        print('Movie page:', url)
        # fetch the detail page and pull out the first poster image
        response = requests.get(url, headers=headers)
        response.encoding = 'gb2312'
        page_2 = etree.HTML(response.text)
        page_2_img_url = page_2.xpath("//div[@id='Zoom']//img[1]/@src")
        if page_2_img_url:
            page_2_img_url = page_2_img_url[0]
        else:
            continue
        print('Movie poster:', page_2_img_url)
        # current time (with microseconds) doubles as a simple unique id
        time_now = datetime.datetime.now().strftime('%H:%M:%S.%f')
        cursor.execute(sql, {'m_id': time_now, 'm_name': title, 'm_img': page_2_img_url})
        conn.commit()

cursor.close()
conn.close()