Scraping NBA player information with requests + XPath

Main goal:

I enjoy playing basketball, and I recently needed to build a basketball-related project that requires NBA player information. I scrape two websites here, because no single site has everything I need:
1. http://www.stat-nba.com/
2. The official NBA China website

Environment:

Windows 7
PyCharm 2019.3.3 (Professional Edition)
Python 3.7

Process

On the NBA China homepage, click "Active Players" (retired players can be fetched the same way; I only scrape active players here).
(screenshot omitted)
That leads to the player index page.
(screenshot omitted)
(screenshot omitted)
From here you can work out the page's basic design (my own reading): each letter on the page corresponds to a container holding the players whose names start with that letter, with its display set to none; each letter has a JavaScript handler that shows the matching container when clicked. When the page loads, an AJAX request fetches all the player data from the backend and fills those containers. The requests captured in the developer tools basically confirm this.
The screenshot also shows that playerlist.json is the JSON payload returned by that AJAX call, and the request URL can be read straight from that entry (shown below). So requesting that URL directly returns the data.
(screenshot omitted)
The request headers show that nothing extra needs to be carried, so a plain GET is enough.
(screenshot omitted)

    def get_page(self, url):
        response = requests.get(url=url, headers=self.headers)
        data = response.content.decode('gbk')
        return data

    def get_playerjson(self, url, file):
        # While exploring the site I found that the player data is exchanged as JSON,
        # so I request the JSON endpoint directly and parse it. The JSON is well formed.
        response = requests.get(url)
        json_loads = json.loads(response.text)  # parse the response body, not the Response object
        if not os.path.exists(file):
            with open(file, "w", encoding='utf8') as fp:
                fp.write(json.dumps(json_loads, indent=4, ensure_ascii=False))

The JSON data is then written to a file. Note that indent=4 adds four-space indentation so the file is easier to read, and ensure_ascii=False lets Chinese characters be written as-is; without it the file would contain the escaped (encoded) form instead.
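
As a quick illustration (not part of the original code), here is what ensure_ascii=False changes; the sample name is made up for the example:

import json

data = {"name": "勒布朗-詹姆斯"}
print(json.dumps(data))                      # {"name": "\u52d2\u5e03\u6717-\u8a79\u59c6\u65af"}
print(json.dumps(data, ensure_ascii=False))  # {"name": "勒布朗-詹姆斯"}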

Next, the fields I need are read back out of that JSON; it is ordinary Python dict access, and writing the rows into the database is routine as well, so I won't go into detail here. The full source is at the end.
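
As a minimal sketch of that dict access, assuming the playerlist.json layout that the full source below relies on (a payload.players list whose items carry playerProfile and teamProfile dicts):

import json

with open('tmp.json', 'r', encoding='utf8') as f:   # file name as in the post's get_player_information call
    data = json.load(f)

for item in data['payload']['players']:
    profile = item['playerProfile']
    print(profile['playerId'], profile['displayName'], profile['position'])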

Next, downloading the player photos.

You could download them with urlretrieve from urllib, or use the approach I take here: f.write(response.content). Here content is the raw, undecoded bytes fetched from the site, so writing them straight into a .png file produces the image. The pattern of the image URLs is also easy to spot.
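
For comparison, a minimal urlretrieve version might look like the sketch below; the player id is a placeholder, the player_imgs directory is assumed to exist, and note that urlretrieve does not send the custom headers the requests-based version uses:

from urllib import request

player_id = 'some_player_id'   # hypothetical value; the real ids come from the players table
url = 'https://china.nba.com/media/img/players/head/260x190/' + player_id + '.png'
request.urlretrieve(url, 'player_imgs/' + player_id + '.png')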

    def get_playerimg(self):
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')
        cursor = db.cursor()
        sql = "select playerId,name from players"
        try:
            # cursor.execute(sql) only returns the number of rows, so fetch the rows separately
            cursor.execute(sql)
            player_idnames = cursor.fetchall()
        except Exception as e:
            print('获取数据异常', e)
        db.close()
        os.makedirs('player_imgs', exist_ok=True)  # make sure the target directory exists
        for player_idname in player_idnames:
            print(player_idname)
            url = 'https://china.nba.com/media/img/players/head/260x190/'+player_idname[0]+'.png'
            if not os.path.exists('player_imgs/'+player_idname[0]+'.png'):
                response = requests.get(url, headers=self.headers)
                with open('player_imgs/'+player_idname[0]+'.png', 'wb') as f:
                    f.write(response.content)
                print(player_idname[1]+":下载完毕")
            else:
                print(player_idname[1]+":已存在")
That finishes the NBA China official site. Next comes the other site, which provides each player's per-season statistics; this step also relies on the data collected above.

(screenshot omitted)
Players on this site are indexed by the first letter of their name. Because it is not an official site, the markup is occasionally slightly malformed, so every time I extract with XPath I normalize the HTML first. The function below does that; the extract_function parameter chooses the return type: pass 'xpath' to get an element ready for XPath queries, or 'pyquery' to get a string that pyquery can parse.

    def request_page(self, url, extract_function):
        response = requests.get(url)
        original_data = response.content.decode('utf-8')
        standard_html = etree.HTML(original_data)                       # normalize the (possibly broken) HTML
        standard_data = etree.tostring(standard_html).decode('utf-8')   # serialize the normalized tree back to a string
        if extract_function == 'xpath':
            return standard_html
        elif extract_function == 'pyquery':
            return standard_data  # the caller can feed this string straight into pyquery
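
A short usage sketch, assuming the Spider class from the second source file below (its constructor takes no arguments); the URL is the site's player index page used later in the post:

from pyquery import PyQuery as pq

spider = Spider()
html = spider.request_page('http://www.stat-nba.com/playerList.php', 'xpath')
print(html.xpath('//title/text()'))   # lxml element, ready for XPath queries

text = spider.request_page('http://www.stat-nba.com/playerList.php', 'pyquery')
doc = pq(text)                        # plain string, ready for pyquery
print(doc('title'))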

Now the most important part: this site lists a huge number of people, active players, retired players, even coaches, while I only need the players already scraped from the official site, so the results have to be checked against the database.
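
Conceptually, the matching that the function below performs can be sketched like this: a simplified version (not the post's exact code) that keeps only the database players found on the site, looking them up by English or Chinese name; the field names follow the dicts used throughout the post:

def match_players(db_players, web_players):
    # Index the site's players by both spellings for constant-time lookups.
    by_en = {p['name']: p for p in web_players}
    by_cn = {p['name_cn']: p for p in web_players if p['name_cn']}

    matched = []
    for db_p in db_players:
        web_p = by_en.get(db_p['name']) or by_cn.get(db_p['name_cn'])
        if web_p:
            db_p['name_url'] = web_p['name_url']   # attach the stat-nba.com page URL
            matched.append(db_p)
        else:
            print('No match for:', db_p['name'])
    return matched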

    def request_letter_page(self, text):  # read the per-letter page URLs from the file written earlier
        letter_links = []  # the page address for each letter, loaded from the file
        with open(text, 'r', encoding='utf8') as fp:
            for link in fp.readlines():
                link = link.replace('\n', '')  # replace() returns a new string; it does not modify in place (this cost me a while)
                letter_links.append(link)
        player_names_and_urls = []  # collect every player name and URL first, then compare against the database
        for index, letter_link in enumerate(letter_links):
            # Gotcha: the x/y/z pages have no "coaches" section, so the XPath index differs for them.
            #
            # I first tried pyquery here:
            #     standard_data = self.request_page(letter_link, 'pyquery')
            #     doc = pq(standard_data)
            #     divs = doc("#background > div:nth-child(16)")
            #     print(divs)
            # but printing the selection failed with an error I could not resolve, so I switched to XPath:
            #     UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 1093: illegal multibyte sequence
            htmlElement = self.request_page(letter_link, 'xpath')
            if index < 23 and index != 20:
                original_names = htmlElement.xpath("//div[@class='playerList'][2]//span/text()")
                name_urls = htmlElement.xpath("//div[@class='playerList'][2]//div/a/@href")
            else:
                original_names = htmlElement.xpath("//div[@class='playerList'][1]//span/text()")
                name_urls = htmlElement.xpath("//div[@class='playerList'][1]//div/a/@href")
            for name_index, original_name in enumerate(original_names):
                person_name_and_url = {}
                # Some players have no Chinese name, so the "Chinese/English" pattern may not match.
                if re.search(r'.*?/(.*?)\n', original_name):
                    name = re.search(r'.*?/(.*?)\n', original_name).group(1)
                    name_cn = re.search(r'(.*?)/.*?\n', original_name).group(1)
                else:  # English name only; just strip the trailing newline
                    name = original_name.replace('\n', '')
                    name_cn = ''
                # group() and group(0) return the whole match; group(1) is the first capture group.
                name_url = re.sub(r'^.', 'http://www.stat-nba.com', name_urls[name_index])
                person_name_and_url['name'] = name
                person_name_and_url['name_url'] = name_url
                person_name_and_url['name_cn'] = name_cn
                player_names_and_urls.append(person_name_and_url)
            print(letter_link + "已经爬取完毕")
        self.write_dict_to_csv('csv文件/web_players.csv', player_names_and_urls)
        print("从网站上得到的数据长度为:" + str(len(player_names_and_urls)))

        # Compare the scraped names with the players already stored in the database.
        player_nameid_list = self.get_playername_fromdb()
        index = 0
        for _ in range(len(player_nameid_list)):  # the loop variable is unused; the range only drives the iteration count
            # The list cannot be iterated by value and pruned at the same time, because deletions shift the
            # indices, so a separate index is maintained by hand.
            for index1, player_name_and_url in enumerate(player_names_and_urls):
                if (player_name_and_url['name'] == player_nameid_list[index]['name']) or \
                        (player_name_and_url['name_cn'] == player_nameid_list[index]['name_cn']):
                    print('匹配到球员' + player_name_and_url['name'])
                    player_nameid_list[index]['name_url'] = player_name_and_url['name_url']
                    break
                elif index1 == len(player_names_and_urls) - 1:  # reached the last site entry without a match: drop this DB player
                    print('删除球员:' + player_nameid_list[index]['name'])
                    del player_nameid_list[index]  # the dict holds name_url, name, name_cn and playerId
                    index -= 1
            index += 1
        return player_nameid_list

The points to watch are noted in the comments as well: the x/y/z pages have no coach section, so the XPath expression differs for them and needs the conditional shown above.
The problem below I never solved. I originally wanted to try pyquery (purely to try it out), but ran into this encoding error I couldn't get past, so I switched to XPath.

# standard_data = self.request_page(letter_link)
# doc = pq(standard_data)
# print(doc('title'))
# break
# Fails at print(divs) with an error I could not resolve, so I switched to XPath:
# UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 1093: illegal multibyte sequence
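
For the record, the error happens because the string being printed contains the © character (\xa9), which the Windows console's gbk codec cannot encode. The post leaves it unsolved; one possible workaround (not verified against this exact page) is to switch stdout to UTF-8, or to drop the characters gbk cannot encode before printing:

import sys

# Option 1 (Python 3.7+): re-open stdout as UTF-8 and replace anything unencodable.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

# Option 2: sanitize the string first, keeping only what gbk can represent.
text = 'sample text containing \xa9'
print(text.encode('gbk', errors='ignore').decode('gbk'))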

A few other places need the index rather than the element: in Python, for i in list yields the elements themselves, so enumerate is used (as in the function above) whenever the position is needed as well.
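
A minimal reminder of the pattern (sample values only):

names = ['James', 'Curry', 'Durant']
for i, name in enumerate(names):
    print(i, name)   # 0 James / 1 Curry / 2 Durant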

The screenshot below shows that some players have no Chinese name at all, so the name-parsing logic has to branch on that case.
(screenshot omitted)
One more issue:
(screenshot omitted)
Some players play for several teams within a single season, which produces the duplicate rows shown above: the site stores one row per team plus a combined total. My fix is to scan the rows from top to bottom and compare each row's season with the previous one; if they match, the current row is dropped, so only the combined total (which comes first) survives and clean data goes into the database.
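
The same idea as a standalone sketch (operating on plain dicts rather than the lxml rows the real code walks; the values are made up):

def drop_repeated_seasons(rows):
    # Keep only the first row for each season (the combined total comes first on the site);
    # the per-team rows that follow carry the same season label and are discarded.
    kept = []
    for row in rows:
        if kept and row['season'] == kept[-1]['season']:
            continue
        kept.append(row)
    return kept

rows = [{'season': '18-19', 'team': 'ALL'},
        {'season': '18-19', 'team': 'CLE'},
        {'season': '18-19', 'team': 'LAL'},
        {'season': '19-20', 'team': 'LAL'}]
print(drop_repeated_seasons(rows))   # keeps the total row for 18-19 plus the 19-20 row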

Source code:

I split the code into two files:

  1. Scraper for the NBA China official site:
# -*- coding: utf-8 -*-
import requests  # requests + etree so the HTML can be normalized first; I was worried pyquery alone would choke on malformed pages
from lxml import etree  # used to fix up slightly malformed pages
from pyquery import PyQuery as pq
from urllib import request  # kept in case urlretrieve is used to download the images
import time
import sys
import json
import re
import pymysql
import os  # used to check whether an image has already been downloaded

class Spider:
    def __init__(self, file):
        self.to_file = file
        self.to_console = sys.stdout  # keep a reference so stdout can be restored after redirecting it
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
            'referer': 'https://china.nba.com/'
        }
    def write_console(self, *args):
        sys.stdout = self.to_console  # required because write_file changes sys.stdout
        for arg in args:
            print(arg)
    def write_file(self, *args):
        sys.stdout = open(self.to_file, 'a', encoding='utf8')
        for arg in args:
            print(arg)
    def run_nba_spider(self):
        # base_url = "https://china.nba.com"
        # player_url = 'https://china.nba.com/playerindex/'
        # original_data = self.get_page(player_url)
        # json_url = 'https://china.nba.com/static/data/league/playerlist.json'
        # self.get_playerjson(json_url,'tmp.txt')  # only needs to run once, when fetching the data
        # players = self.get_player_information('tmp.json')
        # self.write_playerinfo_table('nbadb',players)
        self.get_playerimg()
    def get_page(self, url):
        response = requests.get(url=url, headers=self.headers)
        data = response.content.decode('gbk')
        return data
    def get_playerjson(self, url, file):
        # I first tried normalizing the page with etree and parsing it with pyquery, but while exploring
        # the site I found the player data is exchanged as JSON, so I request the JSON endpoint directly
        # and parse it instead. The JSON is well formed.
        response = requests.get(url)
        json_loads = json.loads(response.text)  # parse the response body, not the Response object
        if not os.path.exists(file):
            with open(file, "w", encoding='utf8') as fp:
                fp.write(json.dumps(json_loads, indent=4, ensure_ascii=False))
    def get_player_information(self,file):
        with open(file, 'r',encoding='utf8') as f:
            b = f.read()
            json_loads = json.loads(b)
        players_list = json_loads['payload']['players']
        players = []
        for i in players_list:
            player = {}
            playerProfile = i['playerProfile']
            player['playerId'] = playerProfile["playerId"]
            player['code'] = playerProfile['code']
            player['name'] = playerProfile["displayName"].replace(" ",'-')
            player['displayNameEn'] = playerProfile['displayNameEn'].replace(" ",'-')
            player['position'] = playerProfile['position']
            player['height'] = playerProfile['height']
            player['weight'] = playerProfile['weight'].replace(" ",'')
            player['country'] = playerProfile["country"]
            player['jerseyNo'] = playerProfile['jerseyNo']
            player['draftYear'] = playerProfile['draftYear']
            player['team_abbr'] = i['teamProfile']['abbr']
            player['team_city'] = i['teamProfile']['city']
            player['team'] = i['teamProfile']['name']
            player['team_name'] = player['team_city']+player['team']
            print(player)
            players.append(player)
        return players
    def create_table(self,table_name):
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')
        cursor = db.cursor()
        sql = 'create table if not exists {0} (playerId varchar(10) primary key,code varchar(20) not null,name varchar(100) not null,displayNameEn varchar(20) not null,position varchar(10) not null,height varchar(10) not null,weight varchar(10) not null,country varchar(20) not null,jerseyNo varchar(10) not null,draftYear varchar(10) not null,team_abbr varchar(10) not null,team_name varchar(100) not null)'.format(table_name)
        cursor.execute(sql)
        db.close()
    def write_playerinfo_table(self,db_name,players_info):
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db=db_name)
        cursor = db.cursor()
        for player in players_info:
            print(player['code'])
            sql = 'insert into players(playerId,code,name,displayNameEn,position,height,weight,country,jerseyNo,draftYear,team_abbr,team_name) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'  # parameterized insert; safer than building the SQL string with format()
            try:
                # cursor.execute(sql)
                cursor.execute(sql,(player['playerId'],player['code'],player['name'],player['displayNameEn'],player['position'],player['height'],player['weight'],player['country'],player['jerseyNo'],player['draftYear'],player['team_abbr'],player['team_name']))
                db.commit()
            except Exception as e:
                print('插入数据出现异常',e)
                db.rollback()
        db.close()
    def get_playerimg(self):
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')
        cursor = db.cursor()
        sql = "select playerId,name from players"
        try:
            # cursor.execute(sql) only returns the number of rows, so fetch the rows separately
            cursor.execute(sql)
            player_idnames = cursor.fetchall()
        except Exception as e:
            print('获取数据异常', e)
        db.close()
        os.makedirs('player_imgs', exist_ok=True)  # make sure the target directory exists
        for player_idname in player_idnames:
            print(player_idname)
            url = 'https://china.nba.com/media/img/players/head/260x190/'+player_idname[0]+'.png'
            if not os.path.exists('player_imgs/'+player_idname[0]+'.png'):
                response = requests.get(url, headers=self.headers)
                with open('player_imgs/'+player_idname[0]+'.png', 'wb') as f:
                    f.write(response.content)
                print(player_idname[1]+":下载完毕")
            else:
                print(player_idname[1]+":已存在")
if __name__ == "__main__":
    spider = Spider('nba.txt')
    spider.run_nba_spider()

  2. Scraper for the stat-nba.com database site:
# -*- coding: utf-8 -*-
import requests
import pymysql
from lxml import etree
from pyquery import PyQuery as pq
import re
import sys
import csv
import json


class Spider:
    def __init__(self):
        self.url = ''
        self.headers = {
            'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
        # self.to_file = 'out/inputdb.txt'  # change this path if needed
        self.to_console = sys.stdout
    def write_console(self,*args):
        sys.stdout = self.to_console
        for arg in args:
            print(arg,end='')
    def write_file(self,file,*args):
        sys.stdout = open(file,'w',encoding='utf8')
        for arg in args:
            print(arg,end='')
        sys.stdout =self.to_console
    def get_playername_fromdb(self):  # pull playerId and both names out of the players table
        player_idnamelist=[]
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')
        cursor = db.cursor()
        sql = 'select playerId,displaynameEn,name from players'
        cursor.execute(sql)
        player_idnames = cursor.fetchall()
        for player_idname in player_idnames:
            player_dict = {}
            # player_idnamelist.append([player_idname[0],player_idname[1].replace('-',' ')])
            player_dict['name'] = player_idname[1].replace('-',' ')
            player_dict['playerId'] = player_idname[0]
            player_dict['name_cn'] = player_idname[2]
            player_idnamelist.append(player_dict)
        # print(player_idnamelist)
        return player_idnamelist
    def get_page_info(self, text, url):  # request the index page, collect each letter's page URL and save them to a file
        # My first attempt requested the page and printed the decoded HTML directly:
        #     response = requests.get('http://www.stat-nba.com/playerList.php', headers=self.headers)
        #     original_data = response.content.decode('utf-8')
        #     print(original_data)   # this print raised an (unresolved) error, but only the Chinese output is affected
        standard_data = self.request_page(url, 'pyquery')
        doc = pq(standard_data)
        # print(doc('title'))  # oddly, the title prints fine here, Chinese included
        # My take: pq() wants a str, while XPath works on an HtmlElement, which explains the difference;
        # etree.tostring() gives bytes and decode() turns them back into a str. Encodings are worth studying further.
        dom_as = doc('.pagination>div>a')
        letter_links = []
        for dom_a in dom_as.items():
            print(dom_a)
            letter_links.append(re.sub(r'^\.', '', dom_a.attr('href')))  # strip only the leading dot; my first attempt at removing the dot went wrong
            # (escaping the dot is optional here, since even an unescaped '.' only matches the first character)
        letter_links = letter_links[1::]
        with open(text, 'w', encoding='utf8') as fp:
            for letter_link in letter_links:
                fp.write('http://www.stat-nba.com' + letter_link + '\n')
        print(letter_links)  # the links are saved locally so repeated runs don't hammer the site (and risk an IP ban); it's faster too
    def request_letter_page(self, text):  # read the per-letter page URLs from the file written earlier
        letter_links = []  # the page address for each letter, loaded from the file
        with open(text, 'r', encoding='utf8') as fp:
            for link in fp.readlines():
                link = link.replace('\n', '')  # replace() returns a new string; it does not modify in place (this cost me a while)
                letter_links.append(link)
        player_names_and_urls = []  # collect every player name and URL first, then compare against the database
        for index, letter_link in enumerate(letter_links):
            # Gotcha: the x/y/z pages have no "coaches" section, so the XPath index differs for them.
            #
            # I first tried pyquery here:
            #     standard_data = self.request_page(letter_link, 'pyquery')
            #     doc = pq(standard_data)
            #     divs = doc("#background > div:nth-child(16)")
            #     print(divs)
            # but printing the selection failed with an error I could not resolve, so I switched to XPath:
            #     UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 1093: illegal multibyte sequence
            htmlElement = self.request_page(letter_link, 'xpath')
            if index < 23 and index != 20:
                original_names = htmlElement.xpath("//div[@class='playerList'][2]//span/text()")
                name_urls = htmlElement.xpath("//div[@class='playerList'][2]//div/a/@href")
            else:
                original_names = htmlElement.xpath("//div[@class='playerList'][1]//span/text()")
                name_urls = htmlElement.xpath("//div[@class='playerList'][1]//div/a/@href")
            for name_index, original_name in enumerate(original_names):
                person_name_and_url = {}
                # Some players have no Chinese name, so the "Chinese/English" pattern may not match.
                if re.search(r'.*?/(.*?)\n', original_name):
                    name = re.search(r'.*?/(.*?)\n', original_name).group(1)
                    name_cn = re.search(r'(.*?)/.*?\n', original_name).group(1)
                else:  # English name only; just strip the trailing newline
                    name = original_name.replace('\n', '')
                    name_cn = ''
                # group() and group(0) return the whole match; group(1) is the first capture group.
                name_url = re.sub(r'^.', 'http://www.stat-nba.com', name_urls[name_index])
                person_name_and_url['name'] = name
                person_name_and_url['name_url'] = name_url
                person_name_and_url['name_cn'] = name_cn
                player_names_and_urls.append(person_name_and_url)
            print(letter_link + "已经爬取完毕")
        self.write_dict_to_csv('csv文件/web_players.csv', player_names_and_urls)
        print("从网站上得到的数据长度为:" + str(len(player_names_and_urls)))

        # Compare the scraped names with the players already stored in the database.
        player_nameid_list = self.get_playername_fromdb()
        index = 0
        for _ in range(len(player_nameid_list)):  # the loop variable is unused; the range only drives the iteration count
            # The list cannot be iterated by value and pruned at the same time, because deletions shift the
            # indices, so a separate index is maintained by hand.
            for index1, player_name_and_url in enumerate(player_names_and_urls):
                if (player_name_and_url['name'] == player_nameid_list[index]['name']) or \
                        (player_name_and_url['name_cn'] == player_nameid_list[index]['name_cn']):
                    print('匹配到球员' + player_name_and_url['name'])
                    player_nameid_list[index]['name_url'] = player_name_and_url['name_url']
                    break
                elif index1 == len(player_names_and_urls) - 1:  # reached the last site entry without a match: drop this DB player
                    print('删除球员:' + player_nameid_list[index]['name'])
                    del player_nameid_list[index]  # the dict holds name_url, name, name_cn and playerId
                    index -= 1
            index += 1
        return player_nameid_list
    # Returns the players that exist in the data from the official NBA site but are missing from the stat-nba.com site.
    def get_missing_players(self,player_db,player_web):
        missing_players = []
        for i in player_db:
            for index,j in enumerate(player_web):
                if i['playerId'] == j['playerId']:
                    break
                elif index==len(player_web)-1:
                    missing_players.append(i)

        # with open('missing_players.txt','w',encoding='utf8')as fp:
        #     for i in missing_players:
        #         fp.write("name:"+i)
        #         fp.write(i['name']+' ')
        #         fp.write("playerId:" + i)
        #         fp.write(i['playerId'])
        #         fp.write('\n')
        return missing_players
    def request_page(self,url,extract_function):
        response = requests.get(url)
        original_data = response.content.decode('utf-8')
        standard_html = etree.HTML(original_data)
        standard_data = etree.tostring(standard_html).decode('utf-8')
        if extract_function=='xpath':
            return standard_html
        elif extract_function=='pyquery':
            return standard_data  # the caller can feed this string straight into pyquery
    # def write_list_tofile(self,list,file):
    #     with open(file,'w',encoding='utf8') as fp:
    #         for i in list:
    #             fp.write(i+'\n')
    def write_dict_to_csv(self, file, lists):  # write a list of dicts to a csv file, using the first dict's keys as the header row
        if re.search(r'\.csv$', file):  # re.search returns a match if the name ends in .csv, otherwise None
            fieldnames = list(lists[0].keys())  # (building the field names from explicit arguments also works, but is clumsier)
            with open(file, 'w', encoding='utf8') as csvfile:
                writer = csv.DictWriter(csvfile, lineterminator='\n', fieldnames=fieldnames)
                # lineterminator controls how each row is terminated
                writer.writeheader()
                for i in lists:
                    writer.writerow(i)
                    if i['name_cn']:
                        print('写入'+i['name_cn']+'数据成功')
                    else:
                        print('写入'+i['name']+'数据成功')
        else:
            self.write_console('请传入一个csv文件')
    def update_playersdb(self,new_player_list):
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')
        cursor = db.cursor()
        # sql = "alter table players add  column name_url varchar(200)"
        # cursor.execute(sql)
        for i in new_player_list:
            try:
                sql = "update players set name_url = %s where playerId = %s"
                cursor.execute(sql,(i['name_url'],i['playerId']))
                db.commit()
            except Exception as e:
                print(e)
                db.rollback()
        db.close()
        # player_idnames = cursor.fetchall()
    def read_data_from_csv(self,file):
        if re.search(r'\.csv$',file):
            with open(file,'r',encoding='utf8')as csvfile:
                reader = list(csv.reader(csvfile))
                data_list = reader[1:]
            player_list = []
            for i in data_list:
                player_dict = {}
                for j in range(len(reader[0])):
                    player_dict[reader[0][j]] = i[j]
                player_list.append(player_dict)
            return player_list
        else:
            print('请传入一个csv文件')
    def spider_player_page(self):
        db = pymysql.connect(host='localhost',user='root',password='root',port=3306,db='nbadb')
        cursor = db.cursor()
        sql = 'select name_url,playerId from players'
        cursor.execute(sql)
        data = cursor.fetchall()
        db.close()  # close only after the rows have been fetched
        player_url_id = []  # holds playerId and name_url pairs
        for i in data:
            player_dict = {}
            player_dict['name_url'] = i[0]
            player_dict['playerId'] = i[1]
            player_url_id.append(player_dict)
            # break
        print(player_url_id)
        print('从数据库中提取数据完毕')
        for index, player in enumerate(player_url_id):  # enumerate because the original list is modified in place
            print('正在处理:'+player['playerId']+'的数据')
            player_url_id[index]['playerSeasonData'] = []
            player_url_id[index]['playerCareerData'] = []
            name_url = player['name_url']
            standard_data = self.request_page(name_url,'xpath')
            oringinal_season_title = standard_data.xpath('//*[@id="stat_box_avg"]/thead/tr//text()')
            # the raw title cells contain newlines/whitespace, so a small helper function cleans them up
            title = self.clear_n_inlist(oringinal_season_title)
            # self.write_file('out/player_data.txt')
            player_url_id[index]['playerSeasonData'].append(title)
            oringinal_season_datas = standard_data.xpath('//*[@id="stat_box_avg"]/tbody/tr[@class="sort"]')

            # This loop removes duplicate rows: when a player played for two or more teams in one season, the site
            # stores each team's line separately plus a combined total. Only the total is needed, and it comes first,
            # so any row whose season equals the previous row's season is dropped.
            index1 = 0
            for i in range(len(oringinal_season_datas)):
                if (index1 != 0) & (oringinal_season_datas[index1].xpath('./td[2]//text()')[0] == oringinal_season_datas[index1 - 1].xpath('./td[2]//text()')[0]):
                    print('删除'+player_url_id[index]['playerId']+'的'+oringinal_season_datas[index1].xpath('./td[2]//text()')[0]+'数据')
                    del oringinal_season_datas[index1]
                    index1 -= 1
                index1 +=1
            print('还剩:'+str(index1)+'个赛季的数据')
            for i in oringinal_season_datas:
                oringinal_season_data = i.xpath('.//text()')
                season_data = self.clear_n_inlist(oringinal_season_data)
                player_url_id[index]['playerSeasonData'].append(season_data)
            # print(player['playerSeasonData'])
            # print(player_url_id)
            oringinal_career_datas = standard_data.xpath('//*[@id="stat_box_avg"]/tbody/tr[position()>last()-2][position()<last()+1]')
            for j in oringinal_career_datas:
                oringinal_season_data = j.xpath('.//text()')
                self.clear_n_inlist(oringinal_season_data)
                player_url_id[index]['playerCareerData'].append(oringinal_season_data)
            print(player['playerId']+'的数据处理完毕')
            # print(player_url_id)  # full dump was written to player_page.txt while debugging
            # break
        print('所有数据爬取完毕')
        self.write_file('out/inputdb.txt', player_url_id)  # write_file expects the file name first; ideally this would be dumped as JSON,
        # but even as-is the file can be read back and converted with json.loads() for the database insert, which saves
        # re-crawling the site every time (a full crawl takes roughly two minutes)
        return player_url_id

    def clear_n_inlist(self, list):  # remove the entries that are just newlines/whitespace from the extracted text
        index = 0
        for i in range(len(list)):
            # originally this compared against "\n" only, but ' \n' and '\n ' also appear, hence the regex
            if re.search(r'.*?\n.*?', list[index]):
                del list[index]
                index -= 1
            index += 1
        return list
    def write_player_season_data_todb(self):
        # First load the data back out of the file.
        with open('out/inputdb.txt', 'r', encoding='utf8') as fp:
            player_data_str = fp.read()  # this is a plain string with single quotes, which json.loads() rejects
            player_data_str = player_data_str.replace("'", '"')  # so convert single quotes to double quotes first
            player_data_json = json.loads(player_data_str)
        # Then write it into the database.
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')
        cursor = db.cursor()
        # Career (per-game average) data goes into one shared table.
        try:
            sql_create_table = 'CREATE TABLE if not exists playerCareerdata (playerId varchar(20) primary key,season VARCHAR(20),team VARCHAR(20),chuchang_times VARCHAR(20),starting_times VARCHAR(20),play_time VARCHAR(20),hit_rate VARCHAR(20),hit_times VARCHAR(20),shoot_times VARCHAR(20),three_hit_rate VARCHAR(20),three_hit_times VARCHAR(20),three_shoot_times VARCHAR(20),free_hit_rate VARCHAR(20),free_hit_times VARCHAR(20),free_shoot_times VARCHAR(20),rebound VARCHAR(20),offensive_rebound VARCHAR(20),defensive_rebound VARCHAR(20),assist VARCHAR(20),steal VARCHAR(20),block VARCHAR(20),fault VARCHAR(20),foul VARCHAR(20),score VARCHAR(20),win VARCHAR(20),lose VARCHAR(20)) character set utf8'
            cursor.execute(sql_create_table)
            print('球员生涯场均表建立完成!')

            for player_info in player_data_json:  # take each player's info in turn; one table is created per player
                # print(player_info)
                sql_create_table = 'CREATE TABLE if not exists `%s` (years VARCHAR(10),team_num VARCHAR(10),chuchang_times VARCHAR(20),starting_times VARCHAR(20),play_time VARCHAR(20),hit_rate VARCHAR(20),hit_times VARCHAR(20),shoot_times VARCHAR(20),three_hit_rate VARCHAR(20),three_hit_times VARCHAR(20),three_shoot_times VARCHAR(20),free_hit_rate VARCHAR(20),free_hit_times VARCHAR(20),free_shoot_times VARCHAR(20),rebound VARCHAR(20),offensive_rebound VARCHAR(20),defensive_rebound VARCHAR(20),assist VARCHAR(20),steal VARCHAR(20),block VARCHAR(20),fault VARCHAR(20),foul VARCHAR(20),score VARCHAR(20),win VARCHAR(20),lose VARCHAR(20)) character set utf8'%player_info["playerId"]
                cursor.execute(sql_create_table)
                print(player_info['playerId']+'赛季数据库已创建好')

                for player_season_data in player_info['playerSeasonData'][1:]:
                    print(len(player_info['playerSeasonData'][1:]))
                    print(player_season_data)
                    if len(player_season_data) != 25:
                        print(" "+player_info['playerId']+'缺少'+player_season_data[0]+'赛季数据')
                        break
                    sql_insert = 'insert into `%s`'%(player_info["playerId"]) +' values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
                    # print('sds')
                    cursor.execute(sql_insert, (player_season_data[0], player_season_data[1], player_season_data[2], player_season_data[3],player_season_data[4], player_season_data[5], player_season_data[6], player_season_data[7],player_season_data[8], player_season_data[9], player_season_data[10], player_season_data[11],player_season_data[12], player_season_data[13], player_season_data[14], player_season_data[15],player_season_data[16], player_season_data[17], player_season_data[18], player_season_data[19],player_season_data[20], player_season_data[21], player_season_data[22], player_season_data[23],player_season_data[24]))
                    print(" "+player_season_data[0]+'赛季插入数据库完成')
                for player_career_data in player_info['playerCareerData'][1:]:
                    if len(player_career_data) != 25:
                        print(" "+player_info['playerId'] + '缺少场均数据')
                        break
                    sql_insert = 'insert into playerCareerdata values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
                    cursor.execute(sql_insert, (
                    player_info['playerId'],
                    player_career_data[0], player_career_data[1], player_career_data[2], player_career_data[3],
                    player_career_data[4], player_career_data[5], player_career_data[6], player_career_data[7],
                    player_career_data[8], player_career_data[9], player_career_data[10], player_career_data[11],
                    player_career_data[12], player_career_data[13], player_career_data[14], player_career_data[15],
                    player_career_data[16], player_career_data[17], player_career_data[18], player_career_data[19],
                    player_career_data[20], player_career_data[21], player_career_data[22], player_career_data[23],
                    player_career_data[24]))
                    print(" "+player_info['playerId']+"插入场均数据库完毕")
            db.commit()
        except Exception as e:
            print('写入数据库出现异常', e)
            db.rollback()
        db.close()



if __name__ == "__main__":
    spider = Spider()
    # Run the stages below one at a time as needed:
    # spider.get_page_info('letter_link.txt','http://www.stat-nba.com/playerList.php')
    # db_players = spider.get_playername_fromdb()
    # print(len(db_players))
    # player_list = spider.request_letter_page('txt文件/letter_link.txt')  # yields the final player names and URLs
    # print(len(player_list))
    # spider.write_dict_to_csv('csv文件/player_list.csv',player_list)
    # print(len(player_list))  # these are the players present in both sources
    # Purely to check for data errors:
    # missing_players = spider.get_missing_players(db_players,player_list)
    # spider.write_dict_to_csv('csv文件/missing_players.csv',missing_players)
    # print(missing_players)  # a few players can't be matched even by combining the Chinese and English names; after dropping them, 488 active players remain
    # print(len(missing_players))
    # player_data_list = spider.read_data_from_csv('csv文件/player_list.csv')
    # spider.update_playersdb(player_data_list)  # once the database is updated, the per-player stats can be scraped
    # print(len(player_data_list))
    # player_game_data = spider.spider_player_page()  # fetches all the per-season data from the site
    spider.write_player_season_data_todb()


