Web Information Scraping with Python

Latest recommended article published on 2022-09-11 15:51:53

FungJL

Article tags: python selenium chrome

Article link: https://blog.csdn.net/FungJL/article/details/107281066

Introduction

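This was my first real project: helping someone download news about China from the RIA Novosti website. My skills are limited, so some steps still have to be done by hand.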

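1. Preparation

Pick the date range, or any other filters, on the page.
The first time, the page shows a load option that you have to click yourself; after that, scrolling down loads more items dynamically.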

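2. Scrolling automatically and saving the loaded page

I found that when Selenium itself opens the page, selects the date, and fetches the data, the browser closes unexpectedly. So the browser has to be opened first and the page set up by hand; Selenium then takes over that session, and loading works normally.
1. Controlling an already-open page with Python + Selenium
Reference link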

Win:
    chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\Users\Administrator\Desktop\ria_ru"
Mac:
    Chrome executable directory: /Applications/Google Chrome.app/Contents/MacOS/
    From that directory, run:
    ./Google\ Chrome --remote-debugging-port=9222 --user-data-dir="/Users/lee/Documents/selenum/AutomationProfile"
Parameters:
    --remote-debugging-port
    Any free port can be used; Selenium needs this same port when it attaches.
    --user-data-dir
    The directory in which a new Chrome profile is created. This makes Chrome start with a separate profile, so your default profile is not affected.
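
Before attaching Selenium, it can help to confirm that something is actually listening on the debugging port. The following is only a minimal sketch using the standard library; the helper name `debug_port_open` is made up for illustration, and the host and port simply mirror the values used in this post.

```python
import socket

def debug_port_open(host="127.0.0.1", port=9222, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False

# True only if Chrome (or anything else) is listening on 9222
print(debug_port_open())
```

A `False` result usually means Chrome was not started with `--remote-debugging-port`, or was started with a different port.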

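2. Following the preparation step above, manually select the page you need in the browser you just opened
3. Scroll automatically and save the loaded page
I didn't know how to choose a stop condition for this kind of dynamically loaded page. Since the news is sorted from newest to oldest, I select one extra, earlier date (month) beyond the range I need and stop as soon as that month shows up.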

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def Stop_Month(browser):
    # Read the month of the last (oldest) loaded item to decide whether loading is done;
    # this is why one extra, earlier date has to be selected beforehand
    htmldateElems = browser.find_elements(By.CLASS_NAME, 'list-item__date')
    month_str = htmldateElems[-1].text.split()
    return month_str[1]

def mouse_move(browser, stop_month):       # keep scrolling to the bottom of the page
    htmlElem = browser.find_element(By.TAG_NAME, 'html')
    while True:
        htmlElem.send_keys(Keys.END)
        time.sleep(1)
        month = Stop_Month(browser)
        print(month)
        if stop_month == month:
            print('****Arrived at the specified month interface****')
            break

options = Options()
# Attach to the Chrome instance started with --remote-debugging-port=9222
options.add_experimental_option('debuggerAddress', "127.0.0.1:9222")
browser = webdriver.Chrome(options=options)
browser.implicitly_wait(3)
stop_month = 'декабря'    # "December" as it appears in the page's Russian date text
mouse_move(browser, stop_month)
with open('0631-0207.html', 'wb') as f:
    f.write(browser.page_source.encode("utf-8", "ignore"))
print('****html is written successfully****')
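
The loop above stops on a month name. As an alternative, here is a rough sketch of a different stop condition: give up once the number of loaded items has not grown for several scrolls in a row. The function `scroll_until_idle` and its two callback parameters are hypothetical names introduced for this example; they are not part of Selenium.

```python
import time

def scroll_until_idle(scroll_once, count_items, max_idle=3, pause=1.0, max_rounds=1000):
    """Scroll until the item count stops growing for max_idle rounds in a row.

    scroll_once: callable that triggers one scroll (hypothetical hook)
    count_items: callable returning the number of loaded items (hypothetical hook)
    Returns the last observed item count.
    """
    last = count_items()
    idle = 0
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)
        current = count_items()
        if current > last:
            # New items arrived, so reset the idle counter
            last, idle = current, 0
        else:
            idle += 1
            if idle >= max_idle:
                break
    return last

# With Selenium the hooks could be wired up roughly like this (not run here):
# scroll_once = lambda: htmlElem.send_keys(Keys.END)
# count_items = lambda: len(browser.find_elements(By.CLASS_NAME, 'list-item__date'))
```

This condition loads everything the page is willing to serve, so the month check is still the better fit when only a bounded date range is wanted.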

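3. Extracting every news link, title, and date from the page into an Excel sheet

The saved page already contains all the news links, titles, and dates; the question is how to extract them.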
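
As a sketch of one way to do the extraction with only the standard library, the parser below collects each link together with its title in a single pass, so the two can never get out of step. It assumes that each title element is an `<a>` tag carrying both the `list-item__title` class and the `href`; the class name `TitleLinkParser` and the sample URL are made up for illustration.

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags whose class list contains a marker."""

    def __init__(self, marker="list-item__title"):
        super().__init__()
        self.marker = marker
        self.items = []       # finished (href, title) pairs
        self._href = None     # href of the <a> currently being read, if any
        self._text = []       # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.marker in classes:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical snippet imitating one entry of the saved page
sample = ('<a class="list-item__title color-font-hover-only" '
          'href="https://example.com/20200707/news-1.html">Example title</a>')
parser = TitleLinkParser()
parser.feed(sample)
print(parser.items)
```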

import re
import bs4
import openpyxl

def Links_Get(path):
    '''Get the links'''
    with open(path, encoding='utf-8') as downloadFile:
        webdata = bs4.BeautifulSoup(downloadFile.read(), 'html.parser')
    elems = webdata.find_all(attrs={'class': 'list-item__title color-font-hover-only'})
    # Non-greedy, so the match ends at the first "html" after "http"
    link_regex = re.compile(r'http.*?html')
    links = []
    for elem in elems:
        a = link_regex.search(str(elem))
        links.append(a.group())
    return links

def Titles_Get(path):
    '''Get the titles'''
    with open(path, encoding='utf-8') as downloadFile:
        webdata = bs4.BeautifulSoup(downloadFile.read(), 'html.parser')
    # Find every tag that has this class attribute
    elems = webdata.find_all(attrs={'class': 'list-item__title color-font-hover-only'})
    titles = []
    for elem in elems:
        titles.append(elem.text)
    return titles

def Get_Link_to_Title(link, title, excel, i):
    '''Write one row to the Excel sheet'''
    excel['A%s' % i] = i
    # The first run of digits in the link is the date
    date_regex = re.compile(r'\d+')
    a = date_regex.search(link)
    excel['B%s' % i] = a.group()
    excel['C%s' % i] = title
    excel['D%s' % i] = link
    print("****%s successful****" % i)

links = Links_Get('0631-0207.html')  # the page saved earlier, in the working directory
titles = Titles_Get('0631-0207.html')
nums1 = len(links)
nums2 = len(titles)
if nums1 == nums2:  # the two lists should normally line up; if not, investigate
    # Create the Excel file beforehand, then load it and write to it
    time_title_link = openpyxl.load_workbook('time_title_link.xlsx')
    time_title_link.create_sheet('0631-0207')
    sheet = time_title_link['0631-0207']
    for i, (link, title) in enumerate(zip(links, titles), start=1):
        Get_Link_to_Title(link, title, sheet, i)
        print(str(i), str(nums1))
    # Save once, after every row has been written
    time_title_link.save('time_title_link.xlsx')
    print('Successful save')
    print('****Successful all****')
else:
    print('Error, titles != links')
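
The date column above holds the raw digit run taken from the link. If a real date value is preferred, a sketch along the following lines could convert it. It assumes the link contains an 8-digit `YYYYMMDD` segment; the helper name `date_from_link` and the sample URL are hypothetical.

```python
import re
from datetime import datetime

def date_from_link(link):
    """Return the first standalone YYYYMMDD run in the link as a date, or None."""
    match = re.search(r'(?<!\d)\d{8}(?!\d)', link)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(), '%Y%m%d').date()
    except ValueError:
        # Eight digits that do not form a valid calendar date
        return None

print(date_from_link('https://example.com/20200707/news-1.html'))
```

Returning `None` instead of raising keeps one odd link from aborting the whole export; the caller can then decide whether to skip the row or store the raw string.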

4. Fetching the content of each link in the generated list and building a docx

import re
import bs4
import docx
import openpyxl
import requests

def Get_News(url, doc):
    # Download one article and append its title, date, and text to the document
    res = requests.get(url)
    res.raise_for_status()
    NewsFile = bs4.BeautifulSoup(res.text, 'html.parser')
    elems_titles = NewsFile.select('.article__title')
    # The first run of digits in the URL is the date; the date link on the page points to "/<date>/"
    date_regex = re.compile(r'\d+')
    a = date_regex.search(url)
    date_str = 'a[href=' + '"/' + a.group() + '/"]'
    elems_dates = NewsFile.select(date_str)
    elems_texts = NewsFile.select('.article__text')
    head0 = doc.add_heading('', 0)
    for title in elems_titles:
        head0.add_run(title.getText() + ' ')
    print('title write succeed')
    head2 = doc.add_heading('', 2)
    for date in elems_dates:
        head2.add_run(date.getText())
    print('date write succeed')
    for text in elems_texts:
        doc.add_paragraph(text.getText())
    print('text write succeed')
    doc.add_page_break()

workbook = openpyxl.load_workbook(r'time_title_link.xlsx')
sheet = workbook['0631-0207']
doc = docx.Document()
i = 1
for cell in sheet['D']:
    if cell.value == 'URL':   # skip a header row, if there is one
        continue
    if not cell.value:        # an empty cell marks the end of the list
        break
    Get_News(cell.value, doc)
    print(str(i))
    i += 1
# Save after the loop so the file is written even when no empty cell is reached
doc.save('0631-0207.docx')
print('****Successful save****')
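
Downloading many articles in a row means that one failed request can abort the whole run. The following is a small, generic retry sketch, not something the original script does; `with_retries` is a hypothetical helper name, and the attempt count and delay are arbitrary choices.

```python
import time

def with_retries(func, attempts=3, delay=2.0):
    """Call func() up to `attempts` times, sleeping `delay` seconds after each failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: re-raise the last error seen
    raise last_error

# Possible use around the download step (not run here):
# res = with_retries(lambda: requests.get(url, timeout=10))
```

Wrapping only the request, rather than the whole of `Get_News`, avoids writing the same article into the document twice when a retry happens.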