Scraping an Excel Online read-only file with requests

In a previous article https://medium.com/@hanqi_47643/scraping-excel-online-read-only-file-with-selenium-and-javascript-in-python-7bb549f05d66, I used Selenium to scrape this Excel Online file, but that felt a little indirect and slow, so here is a new attempt with new tools and knowledge gained. Full notebook at https://gist.github.com/gitgithan/b9f48e1b23e88f1fb1c56ad9b739adef

Creating the request

In the previous article, the strategy was to scroll, find, parse, scroll, find, parse, … Now, the goal is to send requests with the Python requests library to directly target the information we want.

Begin by pressing F12 to open Developer Tools → Network tab in Chrome, then load http://www.presupuesto.pr.gov/PRESUPUESTOPROPUESTO2020-2021/_layouts/15/WopiFrame.aspx?sourcedoc=%7B566feecf-1e0d-46b8-a505-7cd762665268%7D&action=edit&source=http%3A%2F%2Fwww%2Epresupuesto%2Epr%2Egov%2FPRESUPUESTOPROPUESTO2020%2D2021%2FFOMB%2520Budget%2520Requirements%2520FY%25202021%2FForms%2FAllItems%2Easpx%3FRootFolder%3D%252FPRESUPUESTOPROPUESTO2020%252D2021%252FFOMB%2520Budget%2520Requirements%2520FY%25202021 (or press F5 to reload the page) to see the list of Network Requests being recorded. We want to focus/filter on the GetRangeContent requests (discovered by manually scrolling and observing the new requests made) because they contain the data.

You will see 7 different GetRangeContent requests being made on page load. One of them contains the column names; the other 6 represent the permutations of 3 starting row positions (0, 28, 56) × 2 starting column positions (0, 30). The 3 requests starting at column 30 do not provide useful data because there are only 14 columns in this dataset.

As in the screenshot above, we can right-click → Copy as cURL (bash) and paste it into https://curl.trillworks.com/ to generate the requests code required to emulate the call made by the browser. Copy as cURL (cmd) is possible if you want to run it in Windows cmd directly, but after using the code generator you'll have to replace ^%^ with % and ^& with & before the code works (maybe other converters will not include so many erroneous ^ characters that are not used for line breaking).
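
For reference, a quick way to strip those stray carets from the generated code, assuming you saved the converter's output to a file (the filename here is just an example):

    # Hypothetical cleanup of code generated from a "Copy as cURL (cmd)" command:
    # Chrome escapes % and & with ^ for cmd, and those escapes leak into the generated code.
    with open("generated_request_code.py") as f:   # assumed file holding the generator's output
        raw = f.read()
    cleaned = raw.replace("^%^", "%").replace("^&", "&")
    print(cleaned)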

Not all the Request Headers you see in Chrome will be copied (e.g. Accept-Encoding: gzip, deflate), but as we'll see later, much less information is required to successfully make a request.

[Image: Request code generated by https://curl.trillworks.com/]

We can see the code generator nicely organizes the headers, params, and requests.get call, ready to be run.

By slowly deleting information and manipulating the request, we can learn by experiment what’s not important and what is.

Not important:

Everything in the headers except Content-Type: application/json can be deleted. If no Content-Type is supplied, the server will return an error. You can verify this by opening the GetRangeContent URL in a new tab and refreshing it, which does not supply a Content-Type. It is strange that this is required, as it is usually a Response Header rather than a Request Header, and even as a Request Header it is given to describe a payload (https://stackoverflow.com/questions/5661596/do-i-need-a-content-type-header-for-http-get-requests#:~:text=In%20that%20sense%2C%20a%20Content,there%20needs%20no%20Content%2DType%20.), but we have no payload here.

Cookies can be deleted too, as the server always sends the same cookie no matter how you refresh.

Important:

context should contain updated information for each session; the session times out after ~10 mins, so the whole refresh-page + copy-cURL process must be repeated, and the relevant code blocks updated, before things run with correct responses.

row specifies the starting row of the requested data. This is varied in both the single-threaded and multi-threaded make_requests functions.

rowCount specifies the number of rows of data to get, beginning from row. We do not have to stick to the default 28 that Chrome/Selenium sends and can actually get all 15253 rows at once. However, such requests take much longer to finish. Experimentation shows 100–6000 rows per request is a good range.

rowCountrow开始指定要获取的数据row 。 我们不必遵守Chrome / Selenium发送的默认值28,实际上可以一次获取所有15253行。 但是,此类请求需要更长的时间才能完成。 实验表明,每个请求100-6000行是一个很好的范围。

blockPosition PaneType: PaneType 1 returns data only; PaneType 3 contains the column names (sent by the 7th GetRangeContent we see after refresh).
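
Putting these pieces together, a minimal sketch of the trimmed request might look like the following. The URL is a placeholder to be copied from DevTools, the context value must come from a freshly refreshed page, and the exact parameter names should be checked against your own Copy as cURL output rather than taken from this sketch:

    import requests

    # Placeholder endpoint: copy the real GetRangeContent URL from the Network tab.
    GET_RANGE_CONTENT_URL = "https://<copied-from-devtools>/GetRangeContent"

    headers = {
        "Content-Type": "application/json",   # the only header the server actually insists on
    }

    params = {
        "context": "<copied from the freshly refreshed session>",  # expires after ~10 minutes
        "row": 0,          # starting row of the requested block
        "rowCount": 500,   # 100-6000 per request was a good range in experiments
        "PaneType": 1,     # 1 = data, 3 = column names (key name/location per your copied request)
    }

    response = requests.get(GET_RANGE_CONTENT_URL, headers=headers, params=params)
    print(response.status_code, len(response.text))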

Defining Functions

def make_requests

This is a while loop with row and rowCount parameterized so that each call's request starts at a later row, and so that we can experiment with run speeds for different requested block sizes.
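
A minimal sketch of that loop, assuming a hypothetical fetch_block wrapper around the requests.get call above and the parse_response helper described under its own heading below:

    def make_requests(start_row, row_count, total_rows=15253):
        """Request blocks of `row_count` rows starting at `start_row` until
        `total_rows` is reached, parsing each response as it arrives."""
        rows = []
        row = start_row
        while row < total_rows:
            response = fetch_block(row=row, row_count=row_count)  # hypothetical wrapper around requests.get
            rows.extend(parse_response(response))                 # sketched below
            row += row_count
        return rows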

def clean_data

This function uses similar logic to the previous Selenium article to fill empty strings in the 1st column, but it is more accurate because it also checks the last block, which the Selenium article's implementation did not (so missing values in the last block would not have been identified/filled).
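
A sketch of that filling logic, assuming each parsed row is a list of cell strings and that an empty first cell means the value from the row above should be carried down:

    def clean_data(rows):
        """Carry the last seen non-empty value down the first column."""
        last_seen = ""
        for row in rows:
            if row and row[0] == "":
                row[0] = last_seen    # fill the gap with the value from the row above
            elif row:
                last_seen = row[0]    # remember the latest non-empty first-column value
        return rows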

def parse_response

This function extracts the valuable information in the CDATA section manually with string slicing [166:-25] to avoid XMLSyntaxError: CData section too big found. The resulting string is parsed into a dictionary, and the relevant data inside GridHtml is extracted. These paths were found by manually inspecting response.text to see where the data is.
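
A sketch of that extraction, assuming the CDATA wrapper always occupies the same character positions and that a hypothetical find_grid_html helper walks the parsed dictionary down to the GridHtml string (the real path should be confirmed by inspecting response.text yourself):

    import json
    from lxml import html

    def parse_response(response):
        """Strip the CDATA wrapper by position, parse the JSON inside,
        and return the cell text found in the GridHtml fragment."""
        payload = response.text[166:-25]      # slice off the XML/CDATA wrapper
        data = json.loads(payload)            # what remains is JSON
        grid_html = find_grid_html(data)      # hypothetical helper: walk the dict to the GridHtml string
        tree = html.fromstring(grid_html)
        return tree.xpath('//text()')         # every cell's text content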

[Image: response.text from GetRangeContent]

[Image: Identifying unique classes to use in XPath searches for data extraction and cleaning]

Once we get inside GridHtml, we can use the Developer Tools live DOM to analyze which class to target in our XPath search for data extraction and cleaning. We can see that every row starts with ewr-nglr (Ctrl+F shows 28 rows), so that class was used to identify and clean rows in def clean_data. Since every cell has ewr-vab, that class could be used for XPath data searches too, but it's more convenient to directly do gridhtml.xpath('//text()').
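
A hedged example of using those classes to sanity-check a block, counting rows and cells in the GridHtml fragment extracted above:

    from lxml import html

    def summarize_grid(grid_html):
        """Count rendered rows and cells using the class names observed in the live DOM."""
        tree = html.fromstring(grid_html)
        row_divs = tree.xpath("//*[contains(@class, 'ewr-nglr')]")   # one element per rendered row
        cell_divs = tree.xpath("//*[contains(@class, 'ewr-vab')]")   # one element per cell
        texts = tree.xpath("//text()")                               # simplest: grab every text node
        return len(row_divs), len(cell_divs), texts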

When you reach the missing data beginning at row 9242, you can see there is no left:0px; the data starts from left:64px (every cell is 64px wide), which is visually displayed in the 2nd column, leaving the 1st column empty. This proves that the Excel Online client does not know in which columns the data is missing; it only knows the correct horizontal positions to show the values among the gridlines so the user can interpret the missingness. These positions could be used for more complex missing-data filling if it were not only the 1st column that is missing, as in this simple case.
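
Under the assumption of uniform 64px columns, the inline left offset maps directly to a zero-based column index, which is what would make more general positional gap-filling possible:

    import re

    def column_index(style, cell_width=64):
        """Derive the rendered column from a cell's inline style, e.g. 'left:64px' -> 1."""
        match = re.search(r"left:\s*(\d+)px", style)
        return int(match.group(1)) // cell_width if match else None

    print(column_index("left:0px"))    # 0 -> value sits in the first column
    print(column_index("left:64px"))   # 1 -> first column is empty, data starts in the second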

Speeding up with Multi-threading

[Image: Multi-threading speeds the scrape up from 30 seconds to 5 seconds]

Since requests are I/O-bound tasks, there is no need to wait for one block of rows to be returned and parsed before the next request is sent (as in the single-threaded while loop in the notebook). The sweet spot is around 300–500 rows for rowCount. range is used to create a different starting row position for each function call in each thread.
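
A minimal sketch of that fan-out, reusing the make_requests sketch above and assuming 15253 total rows with a 500-row block per task:

    from concurrent.futures import ThreadPoolExecutor

    TOTAL_ROWS = 15253
    BLOCK = 500
    starts = range(0, TOTAL_ROWS, BLOCK)   # one starting row per task

    with ThreadPoolExecutor(max_workers=8) as executor:
        # each task fetches and parses a single block; map returns results in submission order
        blocks = list(executor.map(lambda r: make_requests(r, BLOCK, min(r + BLOCK, TOTAL_ROWS)), starts))

    rows = [row for block in blocks for row in block]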

Errors to Watch

If it has been a while since the page was refreshed, a refresh is needed to get a working session. Scrolling in the window to revive it will not work, as it will refresh anyway and give a new SessionId.

The 4 errors shown at the end can occur when the rowCount is changed/reduced. Both the single- and multi-threaded code will break at different rowCount ranges. At a rowCount of 298 → 299, the response changes from a known Content-Length to Transfer-Encoding: chunked, and ChunkedEncodingError starts creeping in; however, at much higher rowCounts in the 1000s this error goes away, and the same goes for the ContentDecodingError.

These seem like issues with the server or the requests library implementation. The ContentDecodingError can be resolved by specifying the request header Accept-Encoding: br or identity so the server uses a different compression scheme than the default gzip, but the time to get the request will double. JSONDecodeError happens more rarely, and I interpret it as bad data returned from the server.
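
A hedged example of that workaround, extending the request sketch from earlier; asking for an uncompressed (identity) response trades speed for reliability:

    import requests

    headers = {
        "Content-Type": "application/json",
        "Accept-Encoding": "identity",   # or "br"; sidesteps the gzip path that triggers ContentDecodingError
    }
    # GET_RANGE_CONTENT_URL and params as in the request sketch above
    response = requests.get(GET_RANGE_CONTENT_URL, headers=headers, params=params)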

I guess Jupyter being stuck at the running symbol (*) when running executor.map is due to my lack of understanding of how ThreadPoolExecutor really works.

Comparison with Selenium

Selenium Pros

  1. Selenium allows the user to observe the open window scrolling to track progress. Also, it will never error out with these parsing errors if sufficient waits are given.

  2. Selenium does not require the handling of different starting rows

  3. More intuitive, mimics how a user actually interacts with the browser

Selenium Cons

  1. A lot slower (overhead in scrolling new elements into DOM)

  2. Multi-threading is difficult/ineffective. To multi-thread, one has to scroll each thread to a different starting point before reading. All this scrolling is unnecessary overhead.

  3. Cannot get more than 28 rows at a time because information is constrained by what is attached to the DOM at any moment

Lessons

  1. Observe requests and compare their differences to learn about what information is returned (column names vs actual data)

  2. Do not trust Copy as cURL blindly; it may require cleaning and filling in missing headers

  3. Manually slicing strings to remove outer layers of nested information can hack through lxml limitations on CDATA parsing

  4. Experiment with new request parameters (do not stick to the default request of 28 rows; delete useless headers)

  5. The client does not know where data is missing; missingness is inferred by the human reader from the inline CSS styles, but these positions could be useful for programmatically filling in missing data

  6. Changing the number of rows requested can change the way the server passes back data, and how efficiently multi-threading works

Translated from: https://medium.com/@hanqi_47643/scraping-excel-online-read-only-file-with-requests-142b70e7b40d
