June 2018_真你假我

[Original] Crawler: hyperlinks on the CSDN home page

import re
import requests
def getlink(url):
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36'}
    req=req...

2018-06-21 09:46:29 409
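
The preview above is cut off before the link extraction itself, so the following is only a minimal sketch of one way to do it. It assumes the links of interest are absolute `http`/`https` URLs in `href` attributes, and it uses `urllib.request` from the standard library instead of `requests` to avoid a third-party dependency. The names `fetch_html`, `extract_links`, and `HREF_PATTERN` are illustrative and do not come from the original post.

```python
import re
import urllib.request

# Any browser-like User-Agent string works here; this one is illustrative.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Captures the value of href attributes that hold absolute http/https URLs.
HREF_PATTERN = re.compile(r'href=["\'](https?://[^"\']+)["\']')


def fetch_html(url):
    """Download a page, sending a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Return the distinct absolute links found in html, in first-seen order."""
    return list(dict.fromkeys(HREF_PATTERN.findall(html)))
```

Calling `extract_links(fetch_html("https://www.csdn.net/"))` would then list the links on the home page. Relative links are deliberately ignored in this sketch.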

[Original] Crawler: 糗事百科 (Qiushibaike)

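# Approach
# 1. Request and fetch the page
# 2. Extract the key content with regular expressions
# 3. Parse out the usernames and post contents
# 4. Loop over the results and print them
import urllib.request
import re
def getcontent(url,page):
    headers=('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...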

2018-06-21 09:46:15 288
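
Steps 2 to 4 of the approach listed in this entry can be sketched as follows. The preview does not show the real page markup or the author's patterns, so `SAMPLE_HTML`, both regular expressions, and the function names are assumptions made purely for illustration.

```python
import re

# Hypothetical markup: the real page structure is not shown in the excerpt.
SAMPLE_HTML = """
<div class="article">
  <h2>alice</h2>
  <div class="content"><span>First joke.</span></div>
</div>
<div class="article">
  <h2>bob</h2>
  <div class="content"><span>Second
joke.</span></div>
</div>
"""

# re.S lets "." match newlines, since post text can span several lines.
USER_PATTERN = re.compile(r"<h2>(.*?)</h2>", re.S)
CONTENT_PATTERN = re.compile(r'<div class="content">\s*<span>(.*?)</span>', re.S)


def parse_posts(html):
    """Return (username, content) pairs, with whitespace collapsed."""
    users = [user.strip() for user in USER_PATTERN.findall(html)]
    contents = [" ".join(text.split()) for text in CONTENT_PATTERN.findall(html)]
    return list(zip(users, contents))


def format_posts(posts):
    """Build one printable, numbered string per post."""
    return [
        f"User {index}: {user}\nContent {index}: {content}"
        for index, (user, content) in enumerate(posts, start=1)
    ]
```

Printing each item of `format_posts(parse_posts(SAMPLE_HTML))` gives the numbered output described in step 4. Pairing the two match lists with `zip` assumes every post has exactly one username and one content block.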

[Original] Crawler: scraping phone images from JD.com

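# Approach
# 1. Fetch the page
# 2. Extract the key content with a regular expression
# 3. Within that content, match the image URLs with a second regular expression
# 4. Save the images
#
import urllib.request
import re
import urllib.error
def craw(url,page):
    html1=urllib.request.urlopen(url).read()
    html1=str(...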

2018-06-21 09:46:03 1205
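
The two-stage matching in steps 2 and 3 of this entry might look roughly like the sketch below: the first pattern isolates the product-list block, and the second pulls image URLs out of that block only. The markup in `SAMPLE_HTML`, the attribute names, and both patterns are assumptions, since the preview stops before the author's expressions appear.

```python
import re

# Hypothetical markup standing in for a product-list page.
SAMPLE_HTML = """
<img src="//img.example.com/banner.jpg">
<div id="plist">
  <img width="220" height="220" src="//img.example.com/phone1.jpg">
  <img width="220" height="220" data-lazy-img="//img.example.com/phone2.jpg">
</div>
<div class="page clearfix"></div>
"""

# Stage 1: keep only the product list, so unrelated images are excluded.
BLOCK_PATTERN = re.compile(r'<div id="plist">(.*?)<div class="page clearfix">', re.S)
# Stage 2: capture protocol-relative .jpg URLs inside that block.
IMAGE_PATTERN = re.compile(r'<img[^>]*?(?:src|data-lazy-img)="//([^"]+?\.jpg)"')


def extract_image_urls(html):
    """Return the image URLs found inside the product-list block."""
    block = BLOCK_PATTERN.search(html)
    if block is None:
        return []
    return ["http://" + path for path in IMAGE_PATTERN.findall(block.group(1))]


def plan_downloads(image_urls, page, directory="."):
    """Pair each URL with a target file name such as '<page>_<index>.jpg'."""
    return [
        (url, f"{directory}/{page}_{index}.jpg")
        for index, url in enumerate(image_urls, start=1)
    ]
```

For step 4, each `(url, path)` pair could be passed to `urllib.request.urlretrieve(url, path)`, wrapped in a `try`/`except urllib.error.URLError` so that one failed image does not stop the run.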

[Original] Word cloud

from wordcloud import WordCloud,ImageColorGenerator
import matplotlib.pyplot as plt
from scipy.misc import imread
import pymysql
# Fix garbled Chinese characters
plt.rcParams['font.sans-serif']=['SimHei'] # set the default font
plt.rcPar...

2018-06-21 09:45:49 365

[Original] Crawler: top 100 films on 熊猫电影: title, lead actors, rating, release date, stored in a database

import urllib.request
from lxml import etree
import pymysql
# Get the HTML of every page
def get_all_html(url):
    for i in range(0,100,10):
        Url=''
        Url=url+str(i)
        html=urllib.request.urlope...

2018-06-21 09:45:36 821
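
The paging loop in this entry appends the offsets 0, 10, ..., 90 to a base URL, and the title says the results are stored in a database. A minimal sketch of those two pieces follows. It uses the standard-library `sqlite3` module as a stand-in for `pymysql` so that it runs without a database server. The table name, column names, and function names are assumptions.

```python
import sqlite3


def page_urls(base_url, total=100, page_size=10):
    """Yield one URL per page by appending the offsets 0, 10, ..., 90."""
    for offset in range(0, total, page_size):
        yield base_url + str(offset)


def save_movies(connection, movies):
    """Insert (title, stars, score, release_date) rows into a movies table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "title TEXT, stars TEXT, score REAL, release_date TEXT)"
    )
    # Parameterized queries keep scraped text from being read as SQL.
    connection.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", movies)
    connection.commit()
```

With `pymysql` the same idea applies, but the placeholders are written `%s` rather than `?`, and the connection comes from `pymysql.connect(...)`.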

[Original] Crawler example: 菜鸟教程 (Runoob)

1. First method
# Method 1: requests combined with lxml
import requests
from lxml import etree
# 1. Collect all page links and use yield to return the full hyperlinks
def get_html(url):
    # Get the page HTML
    html=requests.get(url)
    # Parse the HTML with etree
    seq=etree....

2018-06-21 09:45:26 16269
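
The first step of this entry collects every page link and uses `yield` to return complete URLs. The sketch below shows that generator pattern, with two deliberate substitutions: the standard-library `html.parser` replaces `lxml`, and the HTML is passed in as a string rather than fetched with `requests`, so the example has no third-party dependencies. The class and function names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def iter_full_links(html, base_url):
    """Yield each link in html as an absolute URL resolved against base_url."""
    collector = _HrefCollector()
    collector.feed(html)
    for href in collector.hrefs:
        # urljoin handles both root-relative and page-relative hrefs.
        yield urljoin(base_url, href)
```

Because `iter_full_links` is a generator, a caller can start fetching the first tutorial page while later links are still being resolved, instead of building the full list up front.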

[Original] Crawler example: scraping images from Baidu Tieba

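# 1. Get the page HTML
# 2. Analyze the tag features and capture the URLs of all images
# 3. Save the images
#
import urllib.request
import re
# Get the page HTML
def get_html(url):
    html=urllib.request.urlopen(url)
    return html.read().decode('utf-8')
# Capture the image URLs with a regular expression,c...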

2018-06-21 09:44:40 301