Getting Started with Python (Web Scraping + Web Development)

冷夏LX

Category: Skills, Crawler Code (技能-爬虫代码). Tag: python

Article link: https://blog.csdn.net/u014119694/article/details/74937279

#Getting started with Python

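I started graduate-school life earlier than planned and was caught a little off guard. My two most recent projects both involve Python: the first uses a web crawler and the second is a visualization, for which I want to write the backend in Python and the frontend in JS + HTML. So I spent some time learning Python.
Experts, please go easy on me.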

###Web scraping journey

requests/urllib

The requests library wraps basic HTTP requests. Its official documentation is at http://cn.python-requests.org/zh_CN/latest/

import requests

url = 'https://www.zhihu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
}
# Send a GET request with a browser-like User-Agent
resp = requests.get(url=url, headers=headers)
# Use the encoding guessed from the body to avoid garbled text
resp.encoding = resp.apparent_encoding
print(resp.text)

To simulate a login:

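import requests

# "#" marks a placeholder; replace each one with a real value
url = "#"  # login URL
header = {
    "User-Agent": "#",
    "Referer": "#",
}

s = requests.Session()
postData = {
    'username': "#",
    'password': "#",
}
# Simulate the login by sending a POST request through the session
s.post(url, data=postData, headers=header)
# Reuse the same session (and its cookies) to visit other pages
resp = s.get('#', headers=header)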

Adding a proxy (after getting banned):

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

Proxies can be taken from the lists published on proxy websites in China, or collected by writing your own crawler.
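Once several proxies are available, one simple way to use them is to pick one at random for each request. The following is a minimal sketch of that idea, not a tested rotation strategy: `PROXY_POOL` and `pick_proxies` are hypothetical names, and the addresses are placeholders in the same style as the example above.

```python
import random

# Hypothetical proxy pool; in practice the entries would come from a proxy
# list or from your own crawler. The addresses are placeholders.
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]


def pick_proxies(pool, rng=random):
    """Return a requests-style proxies dict built from one random pool entry."""
    address = rng.choice(pool)
    # The same HTTP proxy is used for both http:// and https:// targets
    return {"http": address, "https": address}


if __name__ == "__main__":
    proxies = pick_proxies(PROXY_POOL)
    print(proxies)
    # With requests installed:
    # requests.get("http://example.org", proxies=proxies)
```

The returned dict has the same shape as the `proxies` argument shown above, so it can be passed to `requests.get` directly.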

Multithreading, multiprocessing, coroutines, distributed crawling

Process pool

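from multiprocessing import Pool

# `main` (the per-URL worker) and `mon` (the MongoDB wrapper) are defined
# elsewhere in the project
if __name__ == '__main__':
    pool = Pool()  # defaults to one worker process per CPU core
    groups = [u['url'] for u in mon.db.source.find()]
    pool.map(main, groups)  # call main(url) for every URL across the pool
    pool.close()
    pool.join()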

An image-scraping demo using threads

# -*- coding: utf-8 -*-
import config
import requests
import threading
from bs4 import BeautifulSoup
import time
ps=0
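def getHtml(url):
    """Download a page and return its text, or None on failure."""
    try:
        r = requests.get(url=url, headers=config.get_header(), timeout=config.TIMEOUT)
        r.encoding = 'utf-8'

        # Treat error statuses and suspiciously short bodies as failures
        if (not r.ok) or len(r.content) < 500:
            print('Failed to fetch page')
        else:
            print('Fetched page successfully')
            return r.text
    except Exception as e:
        print(e)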
def getma(html):
    """Download every image inside the 'swiper-slide' block of a detail page."""
    try:
        soup = BeautifulSoup(html, 'lxml')
        lis = soup.find(class_='swiper-slide')
        lik = lis.find_all('img')
        for li in lik:
            url = li['src']
            r = requests.get(url=url, headers=config.get_header(), timeout=config.TIMEOUT)
            # Take the file extension from the image URL
            leixing = url.split('.')[-1]
            # The 'image' directory must already exist
            filepath = '{0}/{1}.{2}'.format('image', 'tu' + str(time.time()), leixing)
            with open(filepath, 'wb') as f:
                f.write(r.content)
    except Exception:
        print('Error while processing image page')
def detilehtml(html):
    """Fetch a detail page by URL and hand its HTML to getma()."""
    try:
        r = requests.get(url=html, headers=config.get_header(), timeout=config.TIMEOUT)
        r.encoding = 'utf-8'
        if (not r.ok) or len(r.content) < 500:
            print('Failed to fetch image page')
        else:
            print('Fetched image page successfully')
            getma(r.text)
    except Exception as e:
        print(e)
def delhtml(html):
    """Follow every 'list-group-item' link found on a list page."""
    try:
        soup = BeautifulSoup(html, 'lxml')
        lis = soup.find_all(class_='list-group-item')
        for li in lis:
            detilehtml(li['href'])
    except Exception:
        print('Error while processing list page')
def main(p):
    """Crawl list page number p and download the images it links to."""
    global ps
    start_url = 'http://www.doutula.com/article/list/?page='
    try:
        url = start_url + str(p)
        html = getHtml(url)
        delhtml(html)
        print(str(ps) + ' finished')
        ps = ps + 1
    except Exception as e:
        print(e)


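if __name__ == '__main__':
    sopage = 300
    threads = []
    for i in range(sopage):
        i = i + 100
        t = threading.Thread(target=main, args=(i,))
        t.start()
        threads.append(t)
    # Join only after every thread has started; joining inside the loop
    # would crawl the pages one at a time
    for t in threads:
        t.join()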

Configuration file (config.py):

# coding:utf-8
import  random
DB_CONFIG = {
    'DB_CONNECT_STRING':'mongodb://localhost:27017/'
}
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

]
def get_header():
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate',
    }

TIMEOUT = 5  # socket timeout in seconds
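The threaded demo above creates one thread per page. A common alternative is a bounded thread pool, which caps how many requests run at once. The sketch below uses the standard library's `concurrent.futures.ThreadPoolExecutor`; `crawl_page` is a hypothetical stand-in for `main(p)` that only builds the URL, so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_page(page):
    """Stand-in for main(p): here it only builds the list-page URL."""
    return 'http://www.doutula.com/article/list/?page=' + str(page)


def crawl_all(pages, max_workers=8):
    """Run crawl_page over all pages with at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the order of the input pages
        return list(executor.map(crawl_page, pages))


if __name__ == '__main__':
    results = crawl_all(range(100, 105))
    print(results[0])
```

In a real crawler, `crawl_page` would do the fetching and parsing; the value of `max_workers` here is an arbitrary choice and would need tuning against the target site's tolerance.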

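Extracting content

There are usually three options: Beautiful Soup, lxml, or regular expressions. See the official documentation of each for detailed tutorials.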
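As a small illustration of the regular-expression option, the sketch below pulls image URLs out of an HTML fragment using only the standard library. `SAMPLE_HTML` is a made-up fragment and `find_image_urls` is a hypothetical helper; regular expressions are brittle on real-world HTML, so a parser such as Beautiful Soup or lxml is usually the safer choice.

```python
import re

# Hypothetical HTML fragment, loosely shaped like an image detail page
SAMPLE_HTML = '''
<div class="swiper-slide">
  <img src="http://example.org/a.jpg" alt="a">
  <img src="http://example.org/b.png" alt="b">
</div>
'''

# Capture the value of the src attribute of every <img> tag
IMG_SRC = re.compile(r'<img[^>]*\ssrc="([^"]+)"')


def find_image_urls(html):
    """Return all <img> src values in document order."""
    return IMG_SRC.findall(html)


if __name__ == '__main__':
    for url in find_image_urls(SAMPLE_HTML):
        print(url)
```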

The Scrapy framework

To understand Scrapy fully you have to read the official documentation. Scrapy is well suited to crawling an entire site.

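Analyzing dynamic sites

Chrome's developer tools (F12) are enough: inspect the requests the page makes, then send equivalent requests yourself. There are two ways to get a site's data: scrape the rendered page, or send the underlying request to obtain the raw data, usually JSON, and store it directly.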
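For the second approach, handling the raw data is mostly a matter of parsing JSON. The sketch below assumes a response body of a particular shape: `SAMPLE_BODY`, the `data.items` path, and the `id`/`title` fields are all hypothetical, and a real endpoint found in the Network tab will have its own structure.

```python
import json

# Hypothetical response body, as might be copied from the Network tab
SAMPLE_BODY = (
    '{"data": {"items": ['
    '{"id": 1, "title": "first"}, '
    '{"id": 2, "title": "second"}]}}'
)


def extract_items(body):
    """Parse a JSON body and return (id, title) pairs from data.items."""
    payload = json.loads(body)
    # Fall back to empty containers when the expected keys are missing
    items = payload.get("data", {}).get("items", [])
    return [(item["id"], item["title"]) for item in items]


if __name__ == "__main__":
    for item_id, title in extract_items(SAMPLE_BODY):
        print(item_id, title)
```

With requests, the body would typically come from `resp.text`, and the resulting records can be stored as they are.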

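Selenium automation

My earlier article on JD.com used Selenium for scraping, but at the time I did not understand why a delay was needed. This approach drives a real browser: if you scrape immediately without waiting, the page may not have finished loading and the scraped data will be incomplete, so the browser has to be given some time to respond.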
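Instead of a fixed sleep, the delay can be expressed as "poll until the content is there, or give up after a timeout". The sketch below shows that idea with the standard library only. `wait_until` and `FakePage` are hypothetical names used for illustration and are not Selenium's API; Selenium provides its own explicit-wait helper (`WebDriverWait`) for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a browser page that finishes loading later."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed

    def content(self):
        # Pretend the page needs a few polls before its content is ready
        self.polls_left -= 1
        return '<html>done</html>' if self.polls_left <= 0 else None


if __name__ == '__main__':
    page = FakePage(polls_needed=3)
    print(wait_until(page.content, timeout=2.0, interval=0.1))
```

Compared with a fixed delay, polling returns as soon as the page is ready and fails loudly when it never is.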

###Databases

MongoDB

A non-relational, document-based database. I will write about it later.

###python web

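The Flask framework

To build the visualization I recently studied Python web development. Flask is widely recommended online, so I gave it a try.

Use a virtual environment when learning Python; it makes projects easy to port.
Blueprints are the way to split an app into modules.
Learn the database ORM.
flask_script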

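###Summary and outlook
Having only just started doing real projects, I inevitably felt somewhat anxious and nervous, but I gained a lot over these ten-odd days: at the very least I learned web scraping and Python web development. The next goal is to build a complete front end and back end with Python (learning Django along the way), covering data visualization and front-end development. After that I will move on to machine learning, deep learning, and optimization algorithms.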
