Python for Everybody: 3 Web Data

Article link: https://blog.csdn.net/followUrheart6/article/details/103830292

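Files

open(): opens a file and returns a file handle that can be iterated line by line; it is not the data itself
read(): reads the whole file and returns it as a single string
\n: the newline character; it counts as one character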

# handle = open(filename, mode)
handle = open('myfile', 'r')
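
The difference between iterating over the handle and calling read() can be sketched as follows. This is only a minimal sketch: the file name `demo.txt` and its contents are made up for illustration, and the file is written first so the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, created so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as out:
    out.write('Hello\nWorld\n')

# Iterating over the handle yields one line at a time (newline included).
with open(path, 'r') as handle:
    lines = [line for line in handle]

# read() returns the entire file as a single string.
with open(path, 'r') as handle:
    text = handle.read()

print(lines)      # ['Hello\n', 'World\n']
print(len(text))  # 12, because each '\n' counts as one character
```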

Building a Search Engine

Page Rank

Goals: build a simple web crawler, compute Google's PageRank algorithm, and visualize the network
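
To give a feel for the ranking step, here is a simplified PageRank iteration. It is a sketch under stated assumptions, not necessarily the implementation used in the course: the three-page link graph is made up, the damping factor 0.85 is the commonly quoted default, and every page is assumed to have at least one outgoing link.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
}

def page_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank: repeatedly redistribute rank along links."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page splits its current rank evenly among its outgoing links.
            share = damping * ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

ranks = page_rank(links)
# Print pages from highest to lowest rank.
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C ends up ranked highest because it receives links from both A and B.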

Search Engine Architecture

Web Crawling

A web crawler is mainly used to create copies of the pages it visits, for later data processing and searching
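
The crawl loop can be sketched without touching the network. Everything here is an assumption for illustration: `FAKE_WEB` is a made-up in-memory stand-in for real pages, `crawl` is a hypothetical helper, and the link extraction is a deliberately naive regular expression. A real crawler would fetch pages over HTTP and use a proper HTML parser.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML.
FAKE_WEB = {
    'http://example.com/': '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    'http://example.com/a': '<a href="http://example.com/b">B</a>',
    'http://example.com/b': '<a href="http://example.com/">home</a>',
}

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl that keeps a copy of every page it visits."""
    copies = {}                    # URL -> stored page content
    queue = deque([start_url])
    while queue and len(copies) < max_pages:
        url = queue.popleft()
        if url in copies:
            continue               # already visited
        html = fetch(url)
        if html is None:
            continue               # page could not be retrieved
        copies[url] = html
        # Queue every link found on the page for a later visit.
        queue.extend(re.findall(r'href="(.+?)"', html))
    return copies

pages = crawl('http://example.com/', FAKE_WEB.get)
print(sorted(pages))
```

Passing the fetch function as a parameter keeps the loop itself independent of how pages are actually downloaded.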

Index Building

Searching

Unicode Characters and Strings

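ASCII: a character encoding that maps each character to a number. ord('c') returns the numeric code of a single character. UTF-8 is the recommended encoding for data exchanged between systems.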

print(ord('H'))		# 72
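
A short sketch of how a string and its UTF-8 bytes relate. The sample string is made up; it mixes one ASCII character with one character outside ASCII to show that the number of characters and the number of bytes can differ.

```python
text = 'H€'                    # one ASCII character, one non-ASCII character
data = text.encode('utf-8')    # str -> bytes

print(ord('H'), chr(72))             # 72 H
print(len(text), len(data))          # 2 4 ('€' takes three bytes in UTF-8)
print(data.decode('utf-8') == text)  # True, decoding reverses encoding
```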

2 Regular Expressions

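Accessing web data
Regular expression quick reference
Greedy matching vs. non-greedy matching: a greedy quantifier (+ or *) matches as much text as possible; adding ? (+? or *?) makes it match as little as possible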

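import re  # regular expression module
x = "My 2 favorite numbers are 19 and 49"
y = re.findall('[0-9]+', x)    # ['2', '19', '49']
z = re.findall('[AEIOU]+', x)  # [] (x contains no uppercase vowels)
p = re.findall('f.+d', x)      # ['favorite numbers are 19 and'] (greedy match)
# re.search(): returns a match object if the pattern is found anywhere in the
#              string, otherwise None (so it can be used as a True/False test)
# re.findall(): returns a list of all substrings that match the pattern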
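
Greedy and non-greedy matching differ only by a `?` after the quantifier. A minimal sketch, using a made-up sample line that contains two colons:

```python
import re

line = 'From: Using the : character'

# Greedy: .+ expands as far as it can, so the match ends at the LAST colon.
greedy = re.findall('^F.+:', line)

# Non-greedy: .+? stops as soon as it can, so the match ends at the FIRST colon.
non_greedy = re.findall('^F.+?:', line)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```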

3 Networked Technology


HTTP: Hypertext Transfer Protocol
A web server typically answers an HTTP request for a page with an HTML document
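
The shape of an HTTP exchange can be shown without a network connection. In this sketch the request text, the host `example.com`, and the hard-coded response are all placeholders for illustration; a real program would send the request over a socket or use urllib.

```python
# Text of a minimal GET request; host and path are placeholders.
request = 'GET /index.html HTTP/1.0\r\nHost: example.com\r\n\r\n'

# Hypothetical response, hard-coded so the sketch runs offline.
response = (
    'HTTP/1.0 200 OK\r\n'
    'Content-Type: text/html\r\n'
    '\r\n'
    '<html><body>Hello</body></html>'
)

# Headers and body are separated by a blank line.
head, body = response.split('\r\n\r\n', 1)
status_line, *header_lines = head.split('\r\n')
headers = dict(line.split(': ', 1) for line in header_lines)

print(status_line)              # HTTP/1.0 200 OK
print(headers['Content-Type'])  # text/html
print(body)                     # <html><body>Hello</body></html>
```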

4 Programs That Surf the Web

The urllib module

import urllib.request, urllib.parse
fhand = urllib.request.urlopen('https://www.baidu.com/')
counts = dict()
for line in fhand:
    # Each line arrives as bytes; decode() turns it into a str (UTF-8 by default).
    # strip() removes the given characters from both ends of a string
    # (whitespace, including newlines, by default).
    # print(line.decode().strip())
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

Parsing web pages with the Beautiful Soup module

# Parse a web page with BeautifulSoup
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.zhihu.com/"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags: <a href='...'>
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

5 Web Services and XML

Data formats on the web: XML, HTML, JSON
eXtensible Markup Language (XML)
Its purpose is to share structured data
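Basic syntax:
start tag / end tag: the matching opening and closing markers of an element
text content: the text between the start tag and the end tag
attribute: a name="value" pair placed inside the start tag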
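
The three pieces of syntax above map directly onto what `xml.etree.ElementTree` exposes. A minimal sketch; the sample document, including the phone number and the attribute names, is made up for illustration.

```python
import xml.etree.ElementTree as ET

data = '''
<person>
    <name>Sam</name>
    <phone type="intl">+1 734 303 4456</phone>
    <email hide="yes"/>
</person>
'''
person = ET.fromstring(data)
phone = person.find('phone')

print(phone.tag)          # phone (the element name from the start/end tags)
print(phone.text)         # +1 734 303 4456 (the text content)
print(phone.get('type'))  # intl (an attribute on the start tag)
print(person.find('email').get('hide'))  # yes (self-closing tag with an attribute)
```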
XML as a Tree

XML as Paths

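XML Schema: itself written as a tag-based language; each declaration in the schema corresponds one-to-one to an element of the XML document
Parsing XML: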

import xml.etree.ElementTree as ET

data = '''
<person>
    <name>Sam</name>
</person>
'''
tree = ET.fromstring(data)
print('name: ', tree.find('name').text)  # name:  Sam
name_list = tree.findall('name')         # list of every <name> child element
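
When an element repeats, findall() accepts a path and returns every match, which ties in with the "XML as Paths" view. A minimal sketch with a made-up document containing two users:

```python
import xml.etree.ElementTree as ET

data = '''
<stuff>
    <users>
        <user x="2"><id>001</id><name>Sam</name></user>
        <user x="7"><id>009</id><name>Alex</name></user>
    </users>
</stuff>
'''
stuff = ET.fromstring(data)

# 'users/user' is a path relative to the root element <stuff>.
users = stuff.findall('users/user')
names = [user.find('name').text for user in users]

print(len(users))  # 2
print(names)       # ['Sam', 'Alex']
```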