Web Scraping with Python and AWS | Energy Saving App, Part 1


The idea behind this article came to me a while back. While writing a previous Medium article about predicting energy demand, I had the idea to build an app that would notify users when electricity prices are high. Originally this was an addition to the previous application, but I quickly realized that regular people would get more benefit out of it.


The app itself is explained a bit further in the video above, but to summarize: it will gather real-time electricity prices and notify users when the price rises in their location. This allows people on time-of-use utility plans to reduce their bills and helps offload demand on the grid during peak times. The backend of the mobile app (all the stuff the user doesn't see) will require a web-scraping script to gather this price data from the various ISO websites. This article will discuss how I did that using only Python and some AWS services.


Web scraping is just the process of gathering data from a website using a software program. Although using APIs is preferable, it's not uncommon to find that some web services don't provide them (or only allow stakeholders to access them). This means we need to find alternative ways to get this data.


The scripts themselves are relatively simple. Once they are written, though, we need a way to run them automatically and in a scalable way. This is where AWS services come in handy.


Requirements

  • Scripts that will gather the price data from ISO websites.
  • Store the price data in a database for further use.
  • Automate the scripts so they run every 5 minutes.

The script I wrote uses the requests Python library to make a GET request to the ISO website. The response contains the typical HTML that can be seen using inspect element on the webpage. Another library called Beautiful Soup is used to parse this HTML and makes it easy to find elements. The code for one of the ISOs can be seen below.


import requests
import logging
from bs4 import BeautifulSoup

# Pricing zones shown in the PJM homepage price table
lmp_dict = {
    "DPL": "",
    "COMED": "",
    "AEP": "",
    "EKPC": "",
    "PEP": "",
    "JC": "",
    "PL": "",
    "DOM": ""
}


def main():
    try:
        URL = 'https://www.pjm.com/'
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')

        # Each zone name in the price table is followed by its price
        # in the next div, so walk through the divs in order
        for i in soup.find_all('ul', class_='lmp-price-table'):
            divs = i.findChildren('div')
            j = 0
            while j < len(divs):
                value = divs[j].text
                if value in lmp_dict:
                    lmp_dict[value] = divs[j + 1].text
                j += 1

        print(lmp_dict)
    except Exception as e:
        raise e


if __name__ == "__main__":
    main()

A script was created for a few different ISOs, each of which has a different website. The scripts can be viewed in my GitHub repo below.


Once the scripts worked, they needed to be hosted on AWS. Lambda is a service well suited to simple scripts: it allows you to invoke them from other services or on a schedule with CloudWatch. More information can be found here.

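As a rough illustration of how one of these scrapers could be deployed, the sketch below uses boto3 to create a Lambda function from a zipped deployment package and sets the environment variables the script reads (time_addition, aws_region, table_name). The function name, role ARN, file path, and table name are placeholders for illustration, not values from the project.

import boto3

# Minimal deployment sketch (assumed names/ARNs, not the project's actual values).
# The zip file must bundle the script plus its dependencies (requests, bs4).
lambda_client = boto3.client('lambda', region_name='us-east-1')

with open('iso_ne_scraper.zip', 'rb') as f:
    zipped_code = f.read()

lambda_client.create_function(
    FunctionName='iso-ne-price-scraper',   # hypothetical function name
    Runtime='python3.8',
    Role='arn:aws:iam::123456789012:role/lambda-dynamodb-role',  # placeholder role ARN
    Handler='lambda_function.lambda_handler',
    Code={'ZipFile': zipped_code},
    Timeout=30,
    Environment={'Variables': {
        'time_addition': '-5',              # UTC-to-EST offset used by get_timestamp()
        'aws_region': 'us-east-1',
        'table_name': 'electricity-prices'  # hypothetical table name
    }}
)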

We also needed to store the price data in a database. I decided to use DynamoDB, a NoSQL database, because it is much cheaper for read operations (we'll only be writing new data every 5 minutes) and, if the schema needs to change in the future, there won't be any major issues.

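For reference, the table the scripts write to could be created along these lines with boto3. The key schema (region_name as partition key, timestamp as sort key), the table name, and the billing mode are my assumptions for illustration; the article doesn't specify them.

import boto3

# Sketch of creating the price table (key schema and table name are assumptions)
dynamodb = boto3.client('dynamodb', region_name='us-east-1')

dynamodb.create_table(
    TableName='electricity-prices',   # hypothetical table name
    AttributeDefinitions=[
        {'AttributeName': 'region_name', 'AttributeType': 'S'},
        {'AttributeName': 'timestamp', 'AttributeType': 'S'},
    ],
    KeySchema=[
        {'AttributeName': 'region_name', 'KeyType': 'HASH'},   # partition key
        {'AttributeName': 'timestamp', 'KeyType': 'RANGE'},    # sort key
    ],
    BillingMode='PAY_PER_REQUEST'   # on-demand billing suits a write every 5 minutes
)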

So the Lambda function needed some additional handlers so that it could be run, plus an additional function for putting the data into DynamoDB. One of the scripts can be seen in the code section below.


import json
import boto3
import os
import requests
from datetime import datetime, timedelta
import logging
from bs4 import BeautifulSoup


def get_timestamp():
    # Current date and time
    now = datetime.now()

    # Convert the hour from UTC to EST using an environment variable
    hour_value = int(os.environ['time_addition'])
    now += timedelta(hours=hour_value)
    return now.strftime('%Y%m%d%H%M')


def get_value():
    # Scrape the current price from the ISO-NE homepage
    URL = 'https://www.iso-ne.com/'
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')

    td_element = soup.find('td', class_='homepage-price')
    lpm_text = td_element.text
    return lpm_text[1:]  # strip the leading currency symbol


def put_item(lpm_value, time_stamp, region):
    # Write one price record to the DynamoDB table
    client = boto3.client('dynamodb',
                          region_name=os.environ['aws_region'])
    return client.put_item(
        TableName=os.environ['table_name'],
        Item={
            'price': {"S": lpm_value},
            'region_name': {"S": region},
            'timestamp': {"S": time_stamp}
        }
    )


def lambda_handler(event, context):
    region = "iso-ne"

    try:
        lpm_value = get_value()
        time_stamp = get_timestamp()

        response = put_item(lpm_value, time_stamp, region)
        return response
    except Exception as e:
        raise e
Once the scripts were working and storing data in the DynamoDB table, the only thing left to do was to automate them. CloudWatch is a service that is really handy for collecting logs from events. Lambda functions publish logs to CloudWatch by default whenever they run. Under the Events section of the CloudWatch dashboard there is an option to create a schedule. I set the schedule to 5 minutes and the targets to my two Lambda scripts. This means we don't have to worry about running the scripts ourselves, and failure/success events can be monitored if you need to know about them (such as sending an email to you when a run fails).

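The same schedule can also be set up programmatically. The sketch below uses boto3 to create a CloudWatch Events rule that fires every 5 minutes, points it at a Lambda function, and grants CloudWatch Events permission to invoke it. The rule name and function ARN are placeholders, not the project's actual values.

import boto3

events = boto3.client('events', region_name='us-east-1')
lambda_client = boto3.client('lambda', region_name='us-east-1')

# Hypothetical ARN of the scraper function deployed earlier
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:iso-ne-price-scraper'

# Rule that fires every 5 minutes
rule = events.put_rule(
    Name='scrape-prices-every-5-minutes',
    ScheduleExpression='rate(5 minutes)',
    State='ENABLED'
)

# Point the rule at the Lambda function
events.put_targets(
    Rule='scrape-prices-every-5-minutes',
    Targets=[{'Id': 'iso-ne-scraper', 'Arn': function_arn}]
)

# Allow CloudWatch Events to invoke the function
lambda_client.add_permission(
    FunctionName='iso-ne-price-scraper',
    StatementId='allow-cloudwatch-events',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)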

There is a lot more to do with this project, but for the time being, that's everything to do with the data collection. I'll be discussing the mobile development in a future article.


Twitter: https://twitter.com/CoogyEoin


Translated from: https://towardsdatascience.com/web-scraping-with-python-and-aws-energy-saving-app-part-1-5c12eb78770a
