beautifulsoup4: getting an href but it returns "#"

Latest recommended article published 2022-10-04 18:47:02

亭中意



I am using bs4 to get some hrefs from a site.

data-track="HOT:SR:HotelModule" tabindex="0">

some text here

The HTML looks like the above. I can get most of the URLs using this code:

for URL in res.select('.someClass'):
    print(URL.select('a')[0]['href'])

But some of the returned values are "#". I have checked the source code of the website, and the href really is there, and it is not "#".

What is going wrong that makes me get "#" instead of the URL?

Here's the website that I am trying. My problem occurs with the hotels that have a +VIP tag.
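A quick way to investigate is to list every anchor inside each matched block rather than only the first one. The sketch below uses made-up markup (the real page's HTML is not shown in the question) and assumes, purely for illustration, that a placeholder link with href="#" sits in front of the real one. first_real_href is a hypothetical helper, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second block has a placeholder link before the real one.
SAMPLE_HTML = """
<div class="someClass"><a href="/hotel/1">Hotel One</a></div>
<div class="someClass"><a href="#">+VIP</a><a href="/hotel/2">Hotel Two</a></div>
"""


def first_real_href(container):
    """Return the first href in container that is not a bare '#', else None."""
    for a in container.select('a[href]'):
        href = a['href']
        if href != '#':
            return href
    return None


res = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The approach from the question: always take the first <a> in each block.
naive = [block.select('a')[0]['href'] for block in res.select('.someClass')]

# Skip placeholder anchors and take the first href that is not '#'.
robust = [first_real_href(block) for block in res.select('.someClass')]

print(naive)   # ['/hotel/1', '#']
print(robust)  # ['/hotel/1', '/hotel/2']
```

If the naive and robust lists differ for the real page, the "#" most likely comes from an extra anchor in the block, not from a parsing problem.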

Source: https://stackoverflow.com/questions/42959591

Updated: 2021-01-15 21:01

Top answer

Maybe they're using a link such as "Link" whose href really is "#". Are you sure there is no href like that in the markup? Also, different parsers can give different results. Try each of them (xml, html5lib, html.parser) and compare the output. See: Difference between parsers
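A minimal sketch of that comparison, assuming deliberately sloppy made-up markup: it parses the same string with every parser that happens to be installed and collects the hrefs each one finds. hrefs_by_parser is a hypothetical helper name. 'lxml' is included as an extra candidate that the answer does not mention.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately sloppy markup: the <a> tag is never closed.
BROKEN_HTML = '<div class="someClass"><a href="/hotel/1">Hotel One</div>'


def hrefs_by_parser(html, parsers=('html.parser', 'html5lib', 'lxml', 'xml')):
    """Map each installed parser name to the list of hrefs it finds in html."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The backing library (lxml or html5lib) is not installed; skip it.
            continue
        results[name] = [a.get('href') for a in soup.find_all('a')]
    return results


results = hrefs_by_parser(BROKEN_HTML)
for name, hrefs in results.items():
    print(name, hrefs)
```

html.parser ships with Python, so it is always available. html5lib and lxml (which also backs the 'xml' feature) have to be installed separately, which is why the sketch skips parsers that raise FeatureNotFound.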

Related Q&A

The site is dynamic, which means you will need a browser automation tool such as selenium. Then, extract the text from the several class names for each search:

import urllib
import re
from bs4 import BeautifulSoup as soup
from selenium import webdriver

def get_table():
    d = webdriver.Chrome('path/to/driver') #or webdriver.Firefox(), depending on your

...

Regular expressions should be fine! Try table = soup.find_all("div",{ "id": re.compile('content-body-*')})
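Note that in a regular expression 'content-body-*' means "content-body" followed by zero or more hyphens, not a shell-style wildcard. It still works because bs4 searches for the pattern anywhere in the attribute value. A slightly stricter sketch, using made-up markup and an anchored pattern:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with ids that share a common prefix.
HTML = """
<div id="content-body-1">first</div>
<div id="content-body-2">second</div>
<div id="sidebar">other</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# '^content-body-' only matches ids that start with the prefix.
table = soup.find_all('div', {'id': re.compile(r'^content-body-')})
ids = [div['id'] for div in table]
print(ids)  # ['content-body-1', 'content-body-2']
```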


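The problem is that the site is built with ReactJS, which creates a virtual DOM to populate the data. BeautifulSoup, on the other hand, looks for DOM elements. Since no DOM has been created for those elements, it gets empty values. The best solution is to use casperjs ( http://casperjs.org/ ). The only reason I suggest something like casperjs is that it is much simpler than the Python-supported scraping modules such as selenium. If you are really committed to the pythonic way, Selenium should work for you, but it is hard to configure the first time. Use npm install -g phant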

...

I believe the main problem is a bug in Beautiful Soup 4. I've filed it and a fix will be released in the next version. Thanks for finding this. That said, I don't know why your profile mentions the HTMLParser class, since you're using lxml.

...

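First, make sure you actually see these "missing tags" in the HTML that goes into BeautifulSoup for parsing. The problem may not be in how BeautifulSoup parses the HTML, but in how you retrieve the HTML data to parse. I suspect you are downloading the Google home page via urllib2 or requests and comparing what you see in str(soup) with what you see in a real browser. If that is the case, you cannot compare the two, because neither urllib2 nor requests is a browser, and neither can execute javascript, manipulate the DOM after the page loads, or make asynchronous requests. With urllib2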

...

If you call div.find('a') on a div that does not contain an anchor, it returns None. Your code has to handle this. For example, you could do:

entries = []
for div in vlad_div:
    a = div.find('a')
    img = div.find('img')
    if a is not None and img is not None:
        entry = {
            'text': div.text,
            'hre

...
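Since the snippet above is cut off, here is a self-contained sketch of the same None check. The markup, the 'entry' class, and the 'href' and 'src' dictionary keys are assumptions made for illustration, not taken from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div has neither an anchor nor an image.
HTML = """
<div class="entry"><a href="/a"><img src="a.png"></a>Alpha</div>
<div class="entry">No link here</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

entries = []
for div in soup.find_all('div', class_='entry'):
    a = div.find('a')
    img = div.find('img')
    # find() returns None when nothing matches, so check before indexing.
    if a is None or img is None:
        continue
    entries.append({
        'text': div.get_text(strip=True),
        'href': a.get('href'),  # .get() returns None instead of raising KeyError
        'src': img.get('src'),
    })

print(entries)  # [{'text': 'Alpha', 'href': '/a', 'src': 'a.png'}]
```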

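I solved this by removing the print statements I had been using for error checking and by specifying the encoding of both the HTML file being scraped and the csv output file, adding encoding="utf-8" to the with open calls.

from bs4 import BeautifulSoup
import requests
import sys
import csv
import re
from datetime import datetime
from pytz import timezone

url = input("Enter the name of th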

...
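As a rough illustration of the encoding fix described above (the original script is truncated, so the file name and rows here are invented), passing encoding="utf-8" to open lets non-ASCII text round-trip through a csv file:

```python
import csv
import os
import tempfile

# Hypothetical scraped rows containing non-ASCII characters.
rows = [['name', 'city'], ['Café Zoë', 'Zürich']]

path = os.path.join(tempfile.mkdtemp(), 'out.csv')

# newline='' is what the csv module expects; encoding makes the output explicit.
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read the file back with the same encoding to confirm nothing was mangled.
with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(read_back == rows)  # True
```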

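First, newboston appears to be a screencast, so having the code from there would help. Second, I suggest writing the page out to a local file so you can open it in a browser and inspect it with the web developer tools to find what you want. I also suggest using ipython to try BeautifulSoup against the local file rather than scraping the page every time. If you drop this in, you can do that:

plainText = sourceCode.text
f = open('something.html', 'w')
f.write(sourceCode.text.encode('utf8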

...
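A Python 3 flavoured sketch of the same save-then-inspect workflow. In Python 3, writing the result of .encode(...) to a file opened in text mode raises TypeError, so this version writes the text directly and lets open handle the encoding. page_text is a stand-in for sourceCode.text, and the markup is invented.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Stand-in for sourceCode.text; in real use this would be the downloaded page.
page_text = '<html><body><a href="/caf\u00e9">Caf\u00e9</a></body></html>'

path = os.path.join(tempfile.mkdtemp(), 'something.html')

# Write text (not bytes) and let open() do the encoding.
with open(path, 'w', encoding='utf-8') as f:
    f.write(page_text)

# Later, for example in an ipython session: parse the saved copy
# instead of downloading the page again.
with open(path, encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup.a['href'])
```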

What you need is dictionary-like access to element attributes:

[a['href'] for a in item('a')]

Also, as a side note, you can improve the way you find the li elements. Instead of:

data = soup.find_all('li',{"class":"more-data"})+soup.findAll('li', {"class":"more-data topten"})
for item in data:
    print(item('a'))

you can do:

links = soup.select("li.mor

...
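The selector in the snippet above is cut off, so the following is only a guess at the general idea, using invented markup and an assumed selector: a single CSS selector can find the anchors inside the matching li elements, and dictionary-style access then reads each href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names mentioned in the answer.
HTML = """
<ul>
  <li class="more-data"><a href="/one">One</a></li>
  <li class="more-data topten"><a href="/two">Two</a></li>
  <li class="other"><a href="/three">Three</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# li.more-data matches any <li> whose class list includes "more-data",
# so it also covers class="more-data topten". a[href] skips anchors
# that have no href attribute at all.
links = [a['href'] for a in soup.select('li.more-data a[href]')]
print(links)  # ['/one', '/two']
```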
