title: Problems Encountered While Writing Web Crawlers
date: 2018-07-17 22:58:29
categories: Web crawling
tags:
- regular expressions
- bs4
Problems encountered while writing web crawlers: regular expressions, parsing pages with bs4, and more.
#1. Web crawling
A crawler = fetching pages + parsing pages. Ways to fetch a page:
Method 1: requests.get(url)
Method 2: use selenium's webdriver to simulate browser clicks
Ways to parse a page:
parse with bs4's BeautifulSoup(page_source, 'html.parser')
A crawler I wrote myself (Python 2; indentation restored, the dead `page_source.replace(...)` call fixed, and the inner loop variable renamed so it no longer shadows the outer `i`):
```python
# -*- coding:utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from bs4 import BeautifulSoup
import urllib
import re

syspath = sys.path[0]
wfile = open(syspath + "/wfile", "a")
wfile.write("url_id\t景点名称\t景点地址\t门票信息\t开放时间\t交通指南\t简介\n")
for i in range(355, 1099):
    print i
    url = "http://m.cncn.com/jingdian/" + str(i)
    page_source = urllib.urlopen(url).read().decode('gbk')
    # str.replace returns a new string, so the result must be assigned back
    # (this also strips the ^M carriage returns discussed below)
    page_source = page_source.replace("\r", "")
    soup = BeautifulSoup(page_source, 'lxml')
    try:
        id_info = str(i)
        name_info = str(soup.header.p.string)
        # default every field to "None" so a missing field still yields a column
        address_info = price_info = time_info = trans_info = station_info = "None"
        # each field sits after a <b>label</b> tag: locate the labels with bs4,
        # then cut the value out of the raw HTML with a regex
        for tag in soup.find_all("b"):
            if str(tag) == "<b>景点地址</b>":
                m = re.search(u"<b>景点地址</b>.*?>([\s\S]*?)</div>", page_source)
                if m:
                    address_info = str(m.group(1))
            elif str(tag) == "<b>门票信息</b>":
                m = re.search(u"<b>门票信息</b>.*?>([\s\S]*?)</div>", page_source)
                if m:
                    price_info = str(m.group(1))
            elif str(tag) == "<b>开放时间</b>":
                m = re.search(u"<b>开放时间</b>.*?>([\s\S]*?)</div>", page_source)
                if m:
                    time_info = str(m.group(1))
            elif str(tag) == "<b>交通指南</b>":
                m = re.search(u"<b>交通指南</b>.*?>([\s\S]*?)<", page_source)
                if m:
                    trans_info = str(m.group(1)).replace(" ", "").replace("\n", "")
            elif str(tag) == "<b>简介</b>":
                # the pattern must be a unicode literal so \u4E00-\u9FA5 is
                # interpreted as a character range, not a literal backslash-u
                m = re.search(u"<b>简介</b>.*>(.*[^\u4E00-\u9FA5]*)。", page_source)
                if m:
                    station_info = str(m.group(1)).replace(" ", "") + "。"
        wfile.write(id_info + "\t" + name_info + "\t" + address_info + "\t" +
                    price_info + "\t" + time_info + "\t" + trans_info + "\t" +
                    station_info + "\n")
    except Exception:
        print "Error"
```
The second fetch method: drive a real browser with selenium's webdriver, which handles pages that need JavaScript or a login, then hand `driver.page_source` to bs4. This version was left commented out in the original script; the broken `soup.find('class=...')` string is fixed to pass `class_` as a keyword argument:

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
import re

# ofile is assumed to be an open file with one URL per line
for url_line in ofile:
    driver = webdriver.Chrome()
    driver.get(url_line)
    time.sleep(2)
    # fill in the login form and submit it
    phone = driver.find_element_by_xpath("//div[@class='pb30 position-rel']/input")
    phone.send_keys("15652965942")
    passw = driver.find_element_by_xpath("//div[@class='pb40 position-rel']/input")
    passw.send_keys("abc123456")
    passw.send_keys(Keys.RETURN)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # match on the class attribute, not a literal 'class=...' string
    data = soup.find(class_=re.compile("js-full-container hidden"))
    print data
    break
```
#2. Regular expressions
###1. Matching any character (including newlines) with a regex
In a regular expression, "." matches any character except the newline "\n". The manual suggests a pattern like "[.\n]" to match any character including "\n", but that still captures nothing useful: inside a character class "." is just a literal dot, so "[.\n]" matches only a dot or a newline. A pattern that really matches any run of characters, newlines included, is:
([\s\S]*)
(The same effect can be had by compiling the pattern with the re.DOTALL flag, which makes "." match newlines too.)
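A quick Python 3 check of both behaviours, on a made-up HTML snippet:

```python
import re

html = "<b>景点地址</b><div>line one\nline two</div>"

# "." stops at the newline, so this pattern never reaches </div>
assert re.search(r"<div>(.*)</div>", html) is None

# [\s\S] matches every character, newlines included
m = re.search(r"<div>([\s\S]*)</div>", html)
assert m.group(1) == "line one\nline two"

# equivalently, re.DOTALL makes "." match newlines as well
m2 = re.search(r"<div>(.*)</div>", html, re.DOTALL)
assert m2.group(1) == "line one\nline two"
```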
###2. Matching Chinese characters with a regex
[\u4E00-\u9FA5] matches one Chinese (CJK) character. Note the negated class used in the crawler above, [^\u4E00-\u9FA5], matches the opposite: any character that is not a Chinese character.
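A short Python 3 demonstration of both classes, on a made-up string:

```python
import re

s = u"北京故宫 Palace Museum 2018"

# runs of Chinese (CJK) characters
assert re.findall(u"[\u4e00-\u9fa5]+", s) == [u"北京故宫"]

# the negated class: runs of everything that is NOT a CJK character
assert re.findall(u"[^\u4e00-\u9fa5]+", s) == [u" Palace Museum 2018"]
```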
###3. Parsing the fetched page with bs4
See the crawler above: it mixes regex extraction with bs4 parsing via lxml. BeautifulSoup accepts several parsers, including 'lxml' and the built-in 'html.parser'; see the bs4 documentation for the differences between them.
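A minimal sketch of the bs4 side, using the built-in 'html.parser' (which ships with Python, whereas 'lxml' needs the lxml package installed) on a made-up snippet shaped like the pages above:

```python
from bs4 import BeautifulSoup

html = "<header><p>Scenic spot</p></header><b>景点地址</b><div>Some address</div>"
soup = BeautifulSoup(html, "html.parser")

# navigate tag-by-tag, as the crawler does for the title
assert soup.header.p.string == "Scenic spot"

# collect all <b> label tags, as the crawler does for the fields
assert str(soup.find_all("b")[0]) == "<b>景点地址</b>"
```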
###4. Getting rid of ^M when scraping pages with Python
^M is a carriage return ("\r"). Strip it with s = s.replace("\r", "") — note that replace returns a new string rather than modifying in place, so the result must be assigned back.
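A one-line check that the result really must be assigned back:

```python
s = "line one\r\nline two\r\n"
s.replace("\r", "")          # result discarded: s is unchanged
assert "\r" in s
s = s.replace("\r", "")      # assigned back: carriage returns gone
assert s == "line one\nline two\n"
```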
###5. Replacing ^M in vim
:1,$s/word1/word2/g searches for word1 from the first line to the last and replaces it with word2.
So type :1,$s/<Ctrl+v><Ctrl+m>//g and press Enter.
Copyright notice: this is an original article; please credit the source when reposting. Thanks for your support.
Reposted from: [Braincao's Blog](https://braincao.github.io)