Having read through *Web Scraping with Python*, the content is fairly introductory overall: it mostly crawls with urllib, BeautifulSoup, and requests, and only briefly touches on frameworks such as Scrapy. Below are three representative example programs.
Collecting links from Wikipedia pages with urllib and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Find links whose href starts with /wiki/
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A page we have not seen yet: record it and recurse
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Output
/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_permissions
/wiki/Wikipedia:Requesting_copyright_permission
/wiki/Wikipedia:User_access_levels
/wiki/Wikipedia:Requests_for_adminship
/wiki/Wikipedia:Protection_policy#extended
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
/wiki/Wikipedia:Protection_policy#move
/wiki/Wikipedia:WPPP
/wiki/File:People_icon.svg
/wiki/Special:WhatLinksHere/File:People_icon.svg
/wiki/Help:What_links_here
/wiki/Wikipedia:Policies_and_guidelines
/wiki/Wikipedia:Shortcut
/wiki/Wikipedia:Keyboard_shortcuts
/wiki/Wikipedia:WikiProject_Kansas
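One caveat with the recursive crawl above: it recurses once per new link, so on a site the size of Wikipedia it will eventually hit Python's recursion limit (about 1000 frames by default). A minimal sketch of a depth-capped variant is below; it runs offline against a hypothetical in-memory "site" (a dict standing in for urlopen, with a regex standing in for BeautifulSoup's findAll), so the names SITE and maxDepth are assumptions for illustration, not part of the original program.

```python
import re

# Hypothetical in-memory "site": page path -> HTML body, standing in
# for urlopen + BeautifulSoup on real Wikipedia pages
SITE = {
    "": '<a href="/wiki/A">A</a> <a href="/wiki/B">B</a>',
    "/wiki/A": '<a href="/wiki/B">B</a> <a href="/wiki/C">C</a>',
    "/wiki/B": '<a href="/wiki/A">A</a>',
    "/wiki/C": '',
}

pages = set()

def getLinks(pageUrl, depth=0, maxDepth=3):
    # Stop recursing past a fixed depth instead of running until
    # Python raises RecursionError
    if depth > maxDepth:
        return
    html = SITE.get(pageUrl, "")
    # Same idea as findAll("a", href=re.compile("^(/wiki/)"))
    for newPage in re.findall(r'href="(/wiki/[^"]*)"', html):
        if newPage not in pages:
            pages.add(newPage)
            getLinks(newPage, depth + 1, maxDepth)

getLinks("")
print(sorted(pages))  # ['/wiki/A', '/wiki/B', '/wiki/C']
```

An iterative version with an explicit queue would remove the depth limit entirely; the visited set alone already guarantees each page is fetched at most once.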
A login form example: the form fields are collected in params and sent to the server as an HTTP POST request.
import requests

session = requests.Session()
params = {'username': 'TomatoSir', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print('------------------')
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
Output
Cookie is set to:
{'loggedin': '1', 'username': 'TomatoSir'}
------------------
Going to profile page...
Hey TomatoSir! Looks like you're still logged into the site!
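On the wire, what requests sends for `data=params` is simply the dict form-encoded into the POST body (with a `Content-Type: application/x-www-form-urlencoded` header). A quick sketch with the standard library's urlencode shows the exact body the server receives; no network access is needed:

```python
from urllib.parse import urlencode

# The same form fields as in the example above
params = {'username': 'TomatoSir', 'password': 'password'}

# requests form-encodes the dict into the POST body, roughly like this
body = urlencode(params)
print(body)  # username=TomatoSir&password=password
```

The Session object then stores the `loggedin` and `username` cookies from the response and attaches them to the follow-up GET automatically, which is why the profile page recognizes the user.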
Just for fun: generating text automatically with a Markov chain. The idea is to use bigram (2-gram) frequencies as the probability distribution over the next word, given the current word.
from urllib.request import urlopen
from random import randint

def wordListSum(wordList):
    # Total count of all words that follow a given word
    total = 0
    for word, value in wordList.items():
        total += value
    return total

# Pick the next word at random, weighted by frequency
def retrieveRandomWord(wordList):
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    # Replace newlines with spaces and strip quotation marks
    text = text.replace("\n", " ").replace("\"", "")
    # Pad punctuation with spaces so it is kept as a "word"
    # rather than discarded
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, " " + symbol + " ")
    words = text.split(" ")
    # Filter out empty strings
    words = [word for word in words if word != ""]
    # Build the dictionary, counting 2-gram occurrences
    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict

text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
wordDict = buildWordDict(text)

# Generate the Markov chain
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])
print(chain)
Output
I deem the framers of every patriot . And although there was intended to be to me in principle of the spoils and I have been known to their affections changed . Amongst the Constitution has produced . I can unmake , and knowing the days of government and a reference to create or classed with whose situation could not appear to a full participation in as well understood , they looked with every other consequences than a necessary burdens to be applied upon their hoards and his objections . But with these grants of correction . It would ,
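The same bigram machinery is easier to see on a toy sentence than on a full speech. The sketch below re-implements the two core functions in compact form (the cleanup steps are dropped, and the sample text is an invented one chosen so that every word has a successor), and seeds the RNG so the walk is reproducible:

```python
import random

def buildWordDict(text):
    # Same 2-gram counting as the listing above, minus the text cleanup
    words = text.split()
    wordDict = {}
    for i in range(1, len(words)):
        wordDict.setdefault(words[i-1], {})
        wordDict[words[i-1]][words[i]] = wordDict[words[i-1]].get(words[i], 0) + 1
    return wordDict

def retrieveRandomWord(wordList):
    # Weighted random choice: frequent successors are picked more often
    randIndex = random.randint(1, sum(wordList.values()))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

wordDict = buildWordDict("the cat sat on the mat and the cat ran to the")
# "the" is followed by "cat" twice and "mat" once
print(wordDict["the"])  # {'cat': 2, 'mat': 1}

random.seed(0)
chain = ["the"]
for _ in range(5):
    chain.append(retrieveRandomWord(wordDict[chain[-1]]))
print(" ".join(chain))
```

Starting from "the", the walk picks "cat" with probability 2/3 and "mat" with probability 1/3, which is exactly how the inauguration-speech generator chooses each next word.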