Using Python to Access Web Data

Coursera: Python for Everybody Specialization
Starting the third course in this series, and I'm already confused.

Regular Expressions
A quick way of writing regexes that I find convenient:
Compiled and reposted from: https://blog.csdn.net/weixin_43593303/article/details/90704433
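
As a tiny self-contained illustration of the kind of pattern used in the solutions below (a toy snippet of my own, not from the linked post): re.findall() returns every non-overlapping match as a list of strings.

import re

# '[0-9]+' matches one or more consecutive digits
line = '<span class="comments">90</span>'
print(re.findall('[0-9]+', line))   # prints ['90']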

Reposting an assignment write-up (an original article by CSDN blogger "zjt1027"), original link: https://blog.csdn.net/zjt1027/article/details/104221343
Sigh, I couldn't do this one on my own.

Assignment: Scraping Numbers from HTML using BeautifulSoup

In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
Actual data: http://py4e-data.dr-chuck.net/comments_269099.html (Sum ends with 62)
You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.

Data Format

The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:

<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>

You are to find all the span tags in the file, pull out the number from each tag, and sum the numbers.

Look at the sample code provided. It shows how to find all of a certain kind of tag, loop through the tags and extract the various aspects of the tags.

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

You need to adjust this code to look for span tags, pull out the text content of each span tag, convert it to an integer, and add the integers up to complete the assignment.

Sample Execution

$ python3 solution.py
Enter - http://py4e-data.dr-chuck.net/comments_42.html
Count 50
Sum 2...

My solution:

import re
import urllib.request
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/comments_269099.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('span')
nums = list()
for tag in tags:
    # pull the digits out of each rendered tag, e.g. '<span ...>90</span>' -> ['90']
    num = re.findall('[0-9]+', str(tag))
    nums += num     # merge the lists with +
numbers = [int(x) for x in nums]
print(sum(numbers))

What I want to note:

1. A review point: two lists can be merged into one with + (see the short demo after this list).

2. More review: the list comprehension numbers = [ int(x) for x in nums ] is just so handy.

3. At first I unzipped bs4.zip into the same directory as my .py file, and then got the error AttributeError: module 'html5lib' has no attribute 'treebuilders'. I had no idea what to do; the pip commands and library-file edits suggested online didn't help either. After asking an expert, I realized that Anaconda keeps its environments isolated, so its packages are not shared with other Python installs. I opened Anaconda Prompt and ran conda install html5lib and conda install bs4 (the first installed, the second was already present), and after that the script ran without errors and printed the correct answer.

4. I also looked up some references: a post on urlopen() (https://blog.csdn.net/qq_41856814/article/details/99658108) and the bs4 documentation (https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#), both of which helped me a lot.
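
A tiny demo of points 1 and 2 (toy values, not the assignment data):

nums = ['90'] + ['88', '87']        # + merges two lists into one
numbers = [int(x) for x in nums]    # list comprehension: str -> int
print(numbers, sum(numbers))        # [90, 88, 87] 265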

But I couldn't get that blogger's code to run... so here comes a superb collection of assignment solutions: https://wanakiki.github.io/2019/python-assignments/

# Scrape the numbers from the page and compute their sum
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup

url = 'http://python-data.dr-chuck.net/comments_275913.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
span = soup('span')     # all the <span> tags on the page
res = 0
for aspan in span:
    res = res + int(aspan.get_text())   # get_text() returns the text inside the tag
print('Result:', res)

How get_text() works:
Suppose you have a big chunk of HTML as a string and want to extract just the text from it. Noticing that the text lives inside a textarea, we use BeautifulSoup:

import urllib.request
from bs4 import BeautifulSoup

def get_content(url):
    resp = urllib.request.urlopen(url)
    html = resp.read()
    bs = BeautifulSoup(html, "html.parser")
    return bs.textarea.get_text()   # clean text of the first <textarea>

First, initialize a BeautifulSoup object with that HTML string; bs.textarea then returns the first textarea it finds, and calling get_text() on it strips away all the HTML tag elements and returns the clean text.
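
The same idea as a self-contained sketch (a made-up HTML string, no network access needed):

from bs4 import BeautifulSoup

html = '<html><body><textarea>Some <b>bold</b> text</textarea></body></html>'
bs = BeautifulSoup(html, 'html.parser')
print(bs.textarea.get_text())   # prints: Some bold text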

ex_13_02:
Following Links in Python

In this assignment you will write a Python program that expands on (http://www.py4e.com/code3/urllinks.py).

# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is at a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.

We provide two files for this assignment. One is a sample file where we give you the name for your testing and the other is the actual data you need to process for the assignment.

Sample problem: Start at http://py4e-data.dr-chuck.net/known_by_Fikret.html Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve. Sequence of names: Fikret Montgomery Mhairade Butchi Anayah Last name in sequence: Anayah
Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Cohen.html Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: The first character of the name of the last page that you will load is: A
Strategy
The web pages tweak the height between the links and hide the page after a few seconds to make it difficult for you to do the assignment without writing a Python program. But frankly, with a little effort and patience you can overcome these attempts to make it a little harder to complete the assignment without writing a Python program. But that is not the point. The point is to write a clever Python program to solve the problem.

Sample execution
Here is a sample execution of a solution:

$ python3 solution.py
Enter URL: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Enter count: 4
Enter position: 3
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html

The answer to the assignment for this execution is “Anayah”.

Program requirements: starting from the given link, parse the HTML page, collect the href attributes of all the anchor tags, find the link at the given position, and repeat this process count times. Finally, print the name you find.

Assignment answers:
1.

# Follow the hyperlinks to find the target name
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup

# assume the input is well-formed
count = int(input('Enter Count:'))
position = int(input('Enter Position:')) - 1    # positions are counted from 1
flag = int(input('''Select Url:
1: http://py4e-data.dr-chuck.net/known_by_Fikret.html
2: http://py4e-data.dr-chuck.net/known_by_Cohen.html
'''))
if flag == 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
else:
    url = 'http://py4e-data.dr-chuck.net/known_by_Cohen.html'
while True:
    html = urllib.request.urlopen(url).read()   # fetch the page
    soup = BeautifulSoup(html, 'html.parser')   # parse the page
    # links on the page have the form <a href="...">name</a>, so extract the a tags;
    # soup('a') is shorthand for soup.find_all('a') and supports indexing like a_coll[0]
    a_coll = soup('a')
    href = a_coll[position].get('href', None)
    if href is None:
        print('runtime error')
        quit()
    if count > 1:
        count = count - 1
        url = href  # update url and keep following links
    else:
        res = a_coll[position].get_text()   # the name text inside the tag
        break
print(res)
2.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

def findUrl(url,position):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Retrieve all of the anchor tags
    tags = soup('a')
    return tags[position].get('href',None)


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

count = int(input('Enter count - '))

position = int(input('Enter position - '))-1

url_now = input('Enter - ')
print(url_now)
for i in range(count):
    url_now = findUrl(url_now, position)
    print(url_now)

Reposted from: https://blog.csdn.net/u012348774/article/details/78106407

The exercises for the last chapter! So annoying! But I fixed the code myself and got it right; none of the other answers worked. I'm done.

Calling a JSON API
In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geojson.py.
The program will prompt for a location, contact a web service, retrieve JSON from the web service, parse that data, and retrieve the first place_id from the JSON. A place ID is a textual identifier that uniquely identifies a place within Google Maps.

API End Points
To complete this assignment, you should use this API endpoint that has a static subset of the Google Data:

http://py4e-data.dr-chuck.net/json?

This API uses the same parameter (address) as the Google API. This API also has no rate limit, so you can test as often as you like. If you visit the URL with no parameters, you get a “No address…” response.

To call the API, you need to provide the address that you are requesting as the address= parameter that is properly URL encoded using the urllib.parse.urlencode() function as shown in http://www.py4e.com/code3/geojson.py
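
To see what the encoding looks like, here is a minimal illustration (using the test location from the next section):

import urllib.parse

params = {'address': 'South Federal University'}
print(urllib.parse.urlencode(params))   # address=South+Federal+University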

Test Data / Sample Execution
You can test to see if your program is working with a location of “South Federal University” which will have a place_id of “ChIJ9e_QQm0sDogRhUPatldEFxw”.

$ python3 solution.py
Enter location: South Federal University
Retrieving http://...
Retrieved 2291 characters
Place id ChIJ9e_QQm0sDogRhUPatldEFxw

The blogger's answer: call the Google-style API to look up the place_id of a given location. Since the Google Maps API now requires authentication, it's best to use the copy provided by PY4E. Note that the py4e copy requires an extra key parameter, which the assignment description doesn't mention; only after reading the code in the sample package did I discover its value is 42. See http://www.py4e.com/code3/geojson.py for details.

import urllib.error, urllib.request, urllib.parse
import json

target = 'http://py4e-data.dr-chuck.net/json?'  # this endpoint requires a key parameter with the value 42
local = input('Enter location: ')
url = target + urllib.parse.urlencode({'address': local, 'key': 42})
# urlencode() takes a dict of parameter names and values and URL-encodes the query string
print('Retrieving', url)
data = urllib.request.urlopen(url).read()
print('Retrieved', len(data), 'characters')
js = json.loads(data)
# print(json.dumps(js, indent=4))   # inspect the structure of the received JSON
print('Place id', js['results'][0]['place_id'])

My answer:

import urllib.request, urllib.parse, urllib.error
import json
import ssl

api_key = False
# If you have a Google Places API key, enter it here
# api_key = 'AIzaSy___IDByT70'
# https://developers.google.com/maps/documentation/geocoding/intro

if api_key is False:
    api_key = 42
    serviceurl = 'http://py4e-data.dr-chuck.net/json?'
else:
    serviceurl = 'https://maps.googleapis.com/maps/api/geocode/json?'

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

while True:
    address = input('Enter location: ')
    if len(address) < 1: break

    parms = dict()
    parms['address'] = address
    if api_key is not False: parms['key'] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)

    print('Retrieving', url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

    try:
        js = json.loads(data)
    except:
        js = None

    # guard against unparseable or empty responses instead of crashing
    if js is None or not js.get('results'):
        print('==== Failure to retrieve place id ====')
        continue

    print('Place id', js['results'][0]['place_id'])

Come back and review this from time to time~

This got long, so let me bookmark the two bloggers' original posts:
https://wanakiki.github.io/2019/python-assignments/

https://blog.csdn.net/u012348774/article/details/78106407
