Reading functions
There are two main functions:
read_csv: the default delimiter is a comma
read_table: the default delimiter is a tab ('\t')
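A minimal sketch of the difference between the two defaults. It reads in-memory text rather than real files; the sample data and variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data: the same table, once comma-separated, once tab-separated.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"

df_csv = pd.read_csv(io.StringIO(csv_text))    # default separator is ','
df_tsv = pd.read_table(io.StringIO(tsv_text))  # default separator is '\t'

# read_csv can also read tab-separated text when the separator is given explicitly.
df_tsv2 = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_csv.equals(df_tsv) and df_tsv.equals(df_tsv2))
```

All three calls should yield the same DataFrame, so the choice between the two functions mostly comes down to which default separator matches the file.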
Some commonly used parameters:
skiprows=[0,2,3]: skip the first, third, and fourth lines of the file
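na_values=sentinels: treat the listed strings as missing values (NaN), specified per column
import pandas as pd
sentinels={'column_a':['foo','NA'],'column_b':['two']}  # keys are column names; each key must be distinct
pd.read_csv('',na_values=sentinels)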
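The two parameters above can be tried together on a small in-memory example. This is only a sketch: the file content and the column names key and value are invented for the demonstration.

```python
import io
import pandas as pd

# Hypothetical file content; lines 0, 2 and 3 are junk lines to be skipped.
text = (
    "# exported by some tool\n"  # line 0 (skipped)
    "key,value\n"                # line 1 (becomes the header)
    "# comment\n"                # line 2 (skipped)
    "# comment\n"                # line 3 (skipped)
    "foo,1\n"
    "bar,two\n"
    "baz,3\n"
)

# Per-column sentinels: 'foo' counts as missing only in 'key', 'two' only in 'value'.
sentinels = {"key": ["foo"], "value": ["two"]}

df = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3], na_values=sentinels)
print(df)
print(df.isna())
```

Because 'two' is replaced by NaN, the value column can be parsed as numeric instead of falling back to strings.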
Collecting information from the web
from urllib.request import urlopen, Request
from lxml.html import parse
url = 'https://movie.douban.com/top250?start=%s&filter='
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
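ret = Request(url % 0, headers=headers)  # fill in the %s placeholder; 0 is assumed to be the offset of the first page
parsed = parse(urlopen(ret))  # parse the HTTP response into an element tree
doc = parsed.getroot()
links = doc.findall('.//a')  # all <a> elements in the document
print(links[15:20])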
How to obtain the User-Agent string used in headers: enter about:version in the browser's address bar.
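Printing the link elements only shows opaque objects. A rough sketch of pulling the URL and text out of each link follows; it parses a made-up HTML string with lxml.html.fromstring instead of downloading a page, so the markup and names here are assumptions for illustration only.

```python
from lxml.html import fromstring

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <a href="/movie/1">First</a>
  <a href="/movie/2">Second</a>
  <a>No link</a>
</body></html>
"""

doc = fromstring(html)
links = doc.findall('.//a')

# get() returns the attribute value (or None if absent);
# text_content() returns the text inside the element.
pairs = [(a.get('href'), a.text_content()) for a in links]
print(pairs)
```

The same two calls should work on the links list obtained from a real page, since both approaches produce lxml HTML elements.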
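Reading Excel files
xls_file = pd.ExcelFile('data.xls')  # the file path must be passed as a string
table = xls_file.parse('Sheet1')  # read the sheet named 'Sheet1' into a DataFrame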
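The two steps can be wrapped in a small helper. This is a sketch: the function names load_sheet and load_sheet_direct are hypothetical, and reading actually works only if an Excel engine is installed (xlrd for .xls files, openpyxl for .xlsx).

```python
import pandas as pd


def load_sheet(path, sheet_name="Sheet1"):
    """Return one worksheet of an Excel workbook as a DataFrame."""
    # ExcelFile works as a context manager, so the file is closed afterwards.
    with pd.ExcelFile(path) as xls:
        return xls.parse(sheet_name)


def load_sheet_direct(path, sheet_name="Sheet1"):
    """One-call alternative using pd.read_excel."""
    return pd.read_excel(path, sheet_name=sheet_name)
```

ExcelFile is mainly useful when several sheets are read from the same workbook, because the file is opened only once; for a single sheet, pd.read_excel is usually shorter.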