Meta tag in HTML

The head element contains general information, also called meta-information, about a document. The prefix "meta" means "about".

In other words, meta-data is data about data, and meta-information is information about information.
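As a small illustration, the sketch below pulls the meta tags out of an HTML head with Python's standard html.parser module (the sample document and its attribute values are made up for the example):

```python
from html.parser import HTMLParser

# A made-up document head with a few typical meta tags.
SAMPLE = """<html><head>
  <meta charset="utf-8">
  <meta name="description" content="A short glossary of web terms">
  <meta name="keywords" content="HTML, meta, glossary">
</head><body></body></html>"""

class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag encountered."""
    def __init__(self):
        super().__init__()
        self.meta_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_tags.append(dict(attrs))

parser = MetaCollector()
parser.feed(SAMPLE)
for meta in parser.meta_tags:
    print(meta)  # e.g. {'name': 'description', 'content': 'A short glossary of web terms'}
```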

Meta Data
Data that describes other data.

MIME (Multipurpose Internet Mail Extensions)
An Internet standard for defining document (media) types. MIME type examples: text/plain, text/html, image/gif, image/jpeg.
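As a quick check of these types, Python's standard mimetypes module maps file-name extensions to MIME types (the file names below are just placeholders):

```python
import mimetypes

# Guess the MIME type of a file from its extension.
for name in ("notes.txt", "page.html", "logo.gif", "photo.jpeg"):
    mime_type, encoding = mimetypes.guess_type(name)
    print(f"{name}: {mime_type}")

# Prints text/plain, text/html, image/gif and image/jpeg respectively.
```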

SGML (Standard Generalized Markup Language)
An international standard for markup languages. The basis for HTML and XML.

SMIL (Synchronized Multimedia Integration Language)
A W3C recommended language for creating multimedia presentations.

SOAP (Simple Object Access Protocol)
A standard protocol for letting applications communicate with each other using XML.
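As a rough sketch of what such an XML message looks like, the snippet below builds a minimal SOAP 1.1 request envelope with Python's standard xml.etree.ElementTree; the GetPrice operation, its namespace, and the endpoint mentioned in the comment are made-up placeholders:

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace

# Build <Envelope><Body><GetPrice><StockName>ACME</StockName></GetPrice></Body></Envelope>
envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
request = ET.SubElement(body, "GetPrice", {"xmlns": "http://example.com/stock"})
ET.SubElement(request, "StockName").text = "ACME"

# In practice this XML is sent as the body of an HTTP POST
# (Content-Type: text/xml) to the service's endpoint URL.
print(ET.tostring(envelope, encoding="unicode"))
```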

SSI (Server Side Include)
A type of HTML comment inserted into a web page to instruct the web server to generate dynamic content. The most common use is to include a standard header or footer on the page.

SSL (Secure Sockets Layer)
A protocol that secures and protects web site communication through encrypted transmission of data.
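For illustration, here is a minimal sketch using Python's standard ssl and socket modules to open an encrypted connection (example.com is just a placeholder host; modern servers actually negotiate TLS, the successor of SSL):

```python
import socket
import ssl

context = ssl.create_default_context()  # verifies the server certificate by default

# Open a TCP connection and wrap it in an encrypted TLS/SSL session.
with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as secure_sock:
        print(secure_sock.version())                 # negotiated protocol, e.g. "TLSv1.3"
        print(secure_sock.getpeercert()["subject"])  # owner of the server certificate
```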

VPN (Virtual Private Network)
A private network between two remote sites, over a secure encrypted virtual Internet connection (a tunnel).

Web Spider
A computer program that automatically browses the Internet and collects web pages. Well-known web spiders include those used by search engines such as Google and AltaVista to index the web. Web spiders are also called web robots or wanderers.
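A minimal sketch of the idea, using only the Python standard library (the start URL is a placeholder, and a real spider would also respect robots.txt and limit its request rate):

```python
import re
import urllib.request
from urllib.parse import urljoin

def crawl(start_url, max_pages=5):
    """Fetch pages breadth-first and return the URLs that were visited."""
    to_visit, seen = [start_url], set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that cannot be fetched
        # Naive link extraction; a real spider would use a proper HTML parser.
        for href in re.findall(r'href="(.*?)"', html):
            to_visit.append(urljoin(url, href))
    return seen

print(crawl("https://example.com/"))
```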

```python
import os
import re
from bs4 import BeautifulSoup

# Path to the folder with the HTML files
folder_path = r'C:\Users\test\Desktop\DIDItest'

# Extract the required data from an HTML file
def extract_html_info(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the HTML source
        html = file.read()

    soup = BeautifulSoup(html, 'html.parser')

    # Extract all <p> tags
    p_tags = soup.find_all('p')

    for p_tag in p_tags:
        # Extract the ID
        talk_id = p_tag.find_previous(string=lambda text: isinstance(text, str) and '[talkid:' in text)
        talk_id = talk_id.strip('[talkid:]')

        # Extract the timestamp
        timestamp = p_tag.find_previous('body').find_previous('head').find('meta', {'http-equiv': 'Content=-Type'})['content']

        # Extract the sender number
        send_number = p_tag.find('span', {'class': 'hint-success'}).text

        # Extract the receiver number
        receive_number = p_tag.find_all('span', {'class': 'hint-success'})[1].text

        # Extract the message content
        message = p_tag.previous_sibling.strip()

        # Walk the folder and its subfolders and extract info from the HTML files
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                if file.endswith('.html'):
                    file_path = os.path.join(root, file)
                    extract_html_info(file_path)

        # Check whether the message is an audio message
        if '音频' in message:
            file_url = p_tag.find('a')['href']
            print(f"ID: {talk_id}, 时间: {timestamp}, 发送号码: {send_number}, 接收号码: {receive_number}, 音频文件地址: {file_url}")
        else:
            print(f"ID: {talk_id}, 时间: {timestamp}, 发送号码: {send_number}, 接收号码: {receive_number}, 信息内容: {message}")
```
Your code has some indentation errors that prevent it from running correctly. Here is the corrected code:

```python
import os
import re
from bs4 import BeautifulSoup

# Path to the folder with the HTML files
folder_path = r'C:\Users\test\Desktop\DIDItest'

# Extract the required data from an HTML file
def extract_html_info(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the HTML source
        html = file.read()

    soup = BeautifulSoup(html, 'html.parser')

    # Extract all <p> tags
    p_tags = soup.find_all('p')

    for p_tag in p_tags:
        # Extract the ID
        talk_id = p_tag.find_previous(string=lambda text: isinstance(text, str) and '[talkid:' in text)
        talk_id = talk_id.strip('[talkid:]')

        # Extract the timestamp
        timestamp = p_tag.find_previous('body').find_previous('head').find('meta', {'http-equiv': 'Content=-Type'})['content']

        # Extract the sender number
        send_number = p_tag.find('span', {'class': 'hint-success'}).text

        # Extract the receiver number
        receive_number = p_tag.find_all('span', {'class': 'hint-success'})[1].text

        # Extract the message content
        message = p_tag.previous_sibling.strip()

        # Check whether the message is an audio message
        if '音频' in message:
            file_url = p_tag.find('a')['href']
            print(f"ID: {talk_id}, 时间: {timestamp}, 发送号码: {send_number}, 接收号码: {receive_number}, 音频文件地址: {file_url}")
        else:
            print(f"ID: {talk_id}, 时间: {timestamp}, 发送号码: {send_number}, 接收号码: {receive_number}, 信息内容: {message}")

# Walk the folder and its subfolders and extract info from every HTML file
for root, dirs, files in os.walk(folder_path):
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(root, file)
            extract_html_info(file_path)
```

In this corrected code, I wrapped the part that extracts information from an HTML file in a function called `extract_html_info`. Then I used `os.walk` to traverse the given folder and its subfolders, find all HTML files, and call `extract_html_info` on each of them to extract the information.

Hope it runs smoothly this time! If you have any other questions, feel free to ask.
