Data Collection and Cleaning (Part 1)

Data Download

Practice 1 – Setting Up a Python Development Environment

Installing and Configuring Python 3 on Linux (Multiple Versions)

  • Before compiling and installing Python 3 on CentOS, it is recommended to switch the yum repositories to a domestic (China-based) mirror first.
  1. Install the tools required for compilation
    yum install -y gcc zlib zlib-devel libffi-devel openssl openssl-devel
  2. Download, compile, and install Python 3
wget https://www.python.org/ftp/python/3.6.2/Python-3.6.2.tgz
tar -xvf Python-3.6.2.tgz 
cd Python-3.6.2
./configure    # SSL support is detected automatically once openssl-devel is installed
make && make install
  3. Verify the installation
python3
pip3

Creating an Isolated Python Environment with Virtualenv

  • For pip installs, it is recommended to use a domestic PyPI mirror; otherwise module installation often fails due to network timeouts.
  1. Install virtualenv
    pip3 install -i https://pypi.doubanio.com/simple/ virtualenv
  2. Create a Python virtual environment
mkdir myspace  # working directory
cd myspace
virtualenv -p python3 venv
  3. Activate the virtual environment
    . venv/bin/activate
  4. Exit (deactivate) the virtual environment
    deactivate
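
After activation, a quick check confirms that python now points at the venv's interpreter. A minimal sketch (the file name check_venv.py is arbitrary), run inside the activated environment:

# check_venv.py -- verify the active interpreter (illustrative sketch)
import sys


# When the venv is active, sys.prefix points inside the venv directory (e.g. .../myspace/venv)
print(sys.prefix)
print(sys.version)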

Practice 2 – Using urllib and Requests

Using urllib

  1. Send a GET request
# http_get.py
from urllib import request


response = request.urlopen("http://www.baidu.com/")
print(response.read())  # read() returns the raw response body as bytes
  2. Send a request with query parameters
# http_params.py
from urllib import request, parse


url = 'http://www.baidu.com/s?'
params = {'word': 'Python爬虫',
          'tn': 'baiduhome_pg',
          'ie': 'utf-8'}
url = url + parse.urlencode(params)
# print(url)
with request.urlopen(url) as response:
    with open("response.html", "wb") as file:
        file.write(response.read())
  3. Send a POST request
# http_post.py
from urllib import request, parse


data = parse.urlencode({'terms': 'here is test'}).encode()
req = request.Request('http://httpbin.org/post?q=Python', data=data)
with request.urlopen(req) as response:
    print(response.read())
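
Some sites reject urllib's default User-Agent. The following sketch is an added supplement (not part of the original practice) showing how a custom header can be attached via request.Request:

# http_headers.py -- custom User-Agent with urllib (illustrative sketch)
from urllib import request


req = request.Request('http://httpbin.org/headers',
                      headers={'User-Agent': 'Mozilla/5.0'})
with request.urlopen(req) as response:
    # httpbin echoes the request headers back as JSON
    print(response.read().decode('utf-8'))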

Using requests

  1. Send HTTP requests
# req_http.py
import requests


# send a GET request and pass query parameters
r1 = requests.get('https://httpbin.org/get', params={'terms': 'here is test'})
print(r1.url)
print(r1.status_code)
print(r1.content)
# send a POST request
r2 = requests.post('https://httpbin.org/post', data={'terms': 'here is test'})
print(r2.content)
  2. Common settings
# req_header.py
import requests


# custom request headers
headers = {'user-agent': 'Mozilla/5.0'}
r1 = requests.get('http://httpbin.org/headers', headers=headers)
print(r1.text)

# set cookies
cookies = {'from-my': 'browser'}
r2 = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(r2.text)

# set a timeout
r3 = requests.get('https://www.baidu.com', timeout=5)

# set proxies (free proxies can be found at sites such as Xici: https://www.xicidaili.com/)
proxy = {
    'http': 'http://112.85.170.175:9999',
    'https': 'https://118.190.73.168:808',
}
r4 = requests.get('http://www.kuaidaili.com/free/', proxies=proxy, timeout=2)
print(r4.content)

# Session
s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict({"a": "c"})
r5 = s.get('http://httpbin.org/cookies')
print(r5.text)
r5 = s.get('http://httpbin.org/cookies')
print(r5.text)
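
Requests to external sites can time out or return error status codes, so in practice it helps to wrap calls in basic error handling. A minimal sketch (an added illustration, not part of the original settings list):

# req_errors.py -- basic error handling around requests (illustrative sketch)
import requests


try:
    r = requests.get('https://httpbin.org/status/404', timeout=5)
    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
except requests.exceptions.Timeout:
    print('request timed out')
except requests.exceptions.HTTPError as e:
    print('bad status:', e.response.status_code)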

Practice 3 – Using Beautiful Soup

Parsing Web Pages with bs4

  1. Searching by tag
# douban_top250.py
import requests
from bs4 import BeautifulSoup


headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
rsq = requests.get('https://movie.douban.com/top250', headers=headers)
html = rsq.text
soup = BeautifulSoup(html, 'html.parser')

# search by tag
divs = soup.find_all('div', class_='hd')
for div in divs:
    print(div.a.span.string)

next_link = soup.find('span', class_='next')
if next_link is not None:
    print(next_link.a['href'])
  2. Searching with CSS selectors
# douban_top250.py
import requests
from bs4 import BeautifulSoup


headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
rsq = requests.get('https://movie.douban.com/top250', headers=headers)
html = rsq.text
soup = BeautifulSoup(html, 'html.parser')

# CSS selectors
div_css = soup.select('.item a .title:first-child')
for name in div_css:
    print(name.get_text())

link_css = soup.select_one('.next a')
if link_css is not None:
    print(link_css['href'])
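
The "next" link found above can drive a simple pagination loop over all ten pages of the Top 250 list. The sketch below is an illustrative extension of the two examples (the 1-second delay is an arbitrary politeness value):

# douban_top250_all.py -- follow the "next" link through every page (illustrative sketch)
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


headers = {'user-agent': 'Mozilla/5.0'}
url = 'https://movie.douban.com/top250'
while url:
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
    for div in soup.find_all('div', class_='hd'):
        print(div.a.span.string)
    next_link = soup.find('span', class_='next')
    # on the last page the "next" span has no <a> child, which ends the loop
    url = urljoin(url, next_link.a['href']) if next_link and next_link.a else None
    time.sleep(1)  # be polite between page requests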

Practice 4 – MongoDB Basics

Basic MongoDB Database Operations

  1. Start and connect to the database
cd /path/for/mongodb/bin
./mongod --dbpath /path/for/data/
./mongo
  2. Create a database
use spider
db
show dbs
db.dropDatabase()
  3. Create a collection
db.createCollection('douban')
show collections
db.douban.drop()
  4. Create documents
db.douban.insert({'title': '豆瓣'}) # the douban collection is created automatically if it does not exist
db.douban.find()
db.douban.update({'title': '豆瓣'}, {$set:{'title': '豆瓣爬虫'}})
db.douban.remove({})

Working with the Database via pymongo

  • The pymongo module must be installed before use

pip install -i https://pypi.doubanio.com/simple/ pymongo

  1. Operate on the database with pymongo
import pymongo


client = pymongo.MongoClient(host="127.0.0.1", port=27017)
db = client["jobs"]                # database
collection = db["jobs_bigdata"]    # collection
data = { "title": "肖申克的救赎", "star": 1000, "url": "https://movie.douban.com/subject/1292052/" }
doc = collection.insert_one(data)  # insert a single document
print(doc.inserted_id)

for x in collection.find():        # iterate over every document in the collection
    print(x)
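
The mongo shell commands from the previous subsection have direct pymongo counterparts. A minimal sketch (reusing the database and collection names from the example above):

# pymongo_update.py -- update and delete, mirroring the shell commands (illustrative sketch)
import pymongo


client = pymongo.MongoClient(host="127.0.0.1", port=27017)
collection = client["jobs"]["jobs_bigdata"]

# equivalent of db.collection.update({...}, {$set: {...}})
collection.update_one({"title": "肖申克的救赎"}, {"$set": {"star": 2000}})
# equivalent of db.collection.remove({...}) for a single matching document
collection.delete_one({"title": "肖申克的救赎"})
print(collection.count_documents({}))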

Practice 5 – Crawling Dynamic Web Pages

Crawling Douban Movie Top 250

  • The selenium module must be installed before use

pip install -i https://pypi.doubanio.com/simple/ selenium

  1. Simulate a login
# login.py
from selenium import webdriver


browser = webdriver.Chrome(executable_path='f:/bigdata/chromedriver.exe')
browser.get('http://www.baidu.com')
elem = browser.find_element_by_id("kw")   # the search input box
elem.clear()
elem.send_keys('python爬虫')
btn = browser.find_element_by_id("su")    # the search button
btn.click()
  2. Load JavaScript-rendered content
# pulldown.py
from selenium import webdriver
import time


browser = webdriver.Chrome(executable_path='f:/bigdata/chromedriver.exe')
browser.get('https://www.oschina.net/home/login?goto_page=https%3A%2F%2Fwww.oschina.net%2Fblog')
# if the page has not finished loading, the elements below cannot be found
time.sleep(5)
browser.find_element_by_css_selector("#userMail").send_keys("******")
browser.find_element_by_css_selector("#userPassword").send_keys("******")
browser.find_element_by_css_selector(".btn.btn-green.block.btn-login").click()

# scroll to the bottom several times so that lazily loaded content is fetched
for i in range(3):
    script = "window.scrollTo(0, document.body.scrollHeight); var lenOfPage=document.body.scrollHeight; return lenOfPage;"
    browser.execute_script(script)
    time.sleep(3)
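
Once the page has finished loading and scrolling, the rendered HTML can be handed to BeautifulSoup for parsing; the same approach applies to the Douban Top 250 page. A minimal sketch (the chromedriver path is the same placeholder used above; the selectors assume the page structure from Practice 3):

# parse_rendered.py -- parse the page after Selenium has rendered it (illustrative sketch)
import time
from selenium import webdriver
from bs4 import BeautifulSoup


browser = webdriver.Chrome(executable_path='f:/bigdata/chromedriver.exe')
browser.get('https://movie.douban.com/top250')
time.sleep(3)  # wait for the page (and any JS-loaded content) to finish rendering

# page_source holds the HTML after JavaScript has run
soup = BeautifulSoup(browser.page_source, 'html.parser')
for div in soup.find_all('div', class_='hd'):
    print(div.a.span.string)
browser.quit()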