Reading & Interacting With HTML Table Using Python

I'm trying to scrape information from an HTML table that can be stepped interactively through various time periods. An example table is located at this URL: http://quotes.freerealtime.com/dl/frt/M?IM=quotes&type=Time%26Sales&SA=quotes&symbol=IBM&qm_page=45750.

I'd like to start at 9:30 and then interact with the table by jumping forward 1 minute at a time. I want to export all of the data to a DataFrame. I've tried pandas.read_html() and also BeautifulSoup; neither has worked for me, although I am inexperienced with BeautifulSoup. Is my request possible, or has the website protected this information from web scraping? Any help would be appreciated!

Source: https://stackoverflow.com/questions/41581616

Updated: 2020-02-22 17:08

Accepted answer

The page is quite dynamic (and terribly slow, at least on my side): it relies on JavaScript and multiple asynchronous requests to fetch the data. Approaching that with requests would not be easy, and you will probably need to fall back on browser automation via, for example, selenium.
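A quick way to see why pandas.read_html() and BeautifulSoup come up empty on a page like this is to count the table tags in the raw HTML the server returns, before any JavaScript runs. The sketch below uses only the standard library. The helper name `count_table_tags` and both sample strings are made up for illustration and are not taken from the actual site.

```python
from html.parser import HTMLParser
from typing import Tuple


class TableTagCounter(HTMLParser):
    """Count <table> and <tr> start tags in a raw HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.tables = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag == "tr":
            self.rows += 1


def count_table_tags(html: str) -> Tuple[int, int]:
    """Return (number of <table> tags, number of <tr> tags) found in html."""
    parser = TableTagCounter()
    parser.feed(html)
    parser.close()
    return parser.tables, parser.rows


# Hypothetical samples: what a JavaScript-driven page might send initially,
# and what the DOM might contain after the browser has rendered it.
static_shell = '<div id="app"></div><script src="app.js"></script>'
rendered = '<table id="t"><tr><td>09:30</td></tr></table>'

print(count_table_tags(static_shell))  # (0, 0)
print(count_table_tags(rendered))      # (1, 1)
```

If the HTML fetched without a browser reports zero tables, parsers that work on that static response have nothing to extract, which points toward driving a real browser.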

Here is something to get you started. Note the use of Explicit Waits here and there:

import time
from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.maximize_window()
driver.get("http://quotes.freerealtime.com/dl/frt/M?IM=quotes&type=Time%26Sales&SA=quotes&symbol=IBM&qm_page=45750")

wait = WebDriverWait(driver, 400)  # 400 seconds timeout

# wait for select element to be visible
time_select = Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[name=time]"))))

# select 9:30 and go
time_select.select_by_visible_text("09:30")
# find_element(By.ID, ...) replaces find_element_by_id, which was removed in Selenium 4
driver.execute_script("arguments[0].click();", driver.find_element(By.ID, "go"))
time.sleep(2)

while True:
    # wait for the table to appear and load to pandas dataframe
    table = wait.until(EC.presence_of_element_located((By.ID, "qmmt-time-and-sales-data-table")))
    # read_html returns a list of DataFrames; newer pandas expects a file-like object, hence StringIO
    df = pd.read_html(StringIO(table.get_attribute("outerHTML")))
    print(df[0])

    # wait for offset select to be present and forward it 1 min
    offset_select = Select(wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "select[name=timeOffset]"))))
    offset_select.select_by_value("1")
    time.sleep(2)

    # TODO: think of a break condition

Note that this runs really, really slowly on my machine, and I am not sure how well it will run on yours. It keeps advancing 1 minute forward in an endless loop, so you will need to stop it at some point.
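One possible break condition, offered as a sketch rather than as part of the original answer: stop when the table no longer changes after advancing, or after a fixed number of steps. The generator below keeps the stopping logic separate from Selenium so it can be tried without a browser. The names `iter_snapshots`, `fetch_html` and `advance` are hypothetical. The default of 390 steps assumes a regular 9:30 to 16:00 session (6.5 hours of one-minute steps), and it is also an assumption that an unchanged table means the data has run out.

```python
from typing import Callable, Iterator, Optional


def iter_snapshots(fetch_html: Callable[[], str],
                   advance: Callable[[], None],
                   max_steps: int = 390) -> Iterator[str]:
    """Yield table snapshots until one repeats or max_steps is reached."""
    previous: Optional[str] = None
    for _ in range(max_steps):
        html = fetch_html()
        if html == previous:
            # The table did not change after advancing: assume no more data.
            break
        yield html
        previous = html
        advance()


# Stand-in for the browser: a fixed list of fake table snapshots.
pages = ["<table>09:30</table>", "<table>09:31</table>", "<table>09:31</table>"]
position = {"index": 0}


def fake_fetch() -> str:
    return pages[position["index"]]


def fake_advance() -> None:
    position["index"] = min(position["index"] + 1, len(pages) - 1)


snapshots = list(iter_snapshots(fake_fetch, fake_advance))
print(len(snapshots))  # 2
```

With Selenium, `fetch_html` would wrap the wait for the table plus `get_attribute("outerHTML")`, and `advance` would wrap the `timeOffset` selection and the sleep. Each yielded snapshot could then be parsed with pd.read_html and the resulting frames combined with pd.concat into a single DataFrame.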
