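NLP (V): Hands-On Analysis of Text from Twitter and Several News Sites
Analyzing Wikipedia Entries
Getting the Data
Here we use BeautifulSoup to scrape Wikipedia pages. First install requests, together with beautifulsoup4 and tqdm, which the code below also imports.
pip install requests beautifulsoup4 tqdm
Then scrape the pages.
Below are a few small helper functions we will use.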
import requests
from bs4 import BeautifulSoup
import time  # used to pause between requests to the Wikipedia server
from tqdm import tqdm

# First, fetch a page from the Wikipedia server given a URL.
def getPageFromWiki(url):
    # Return a BeautifulSoup object for the page at the given URL.
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')
    return soup

# Second, get the title of the wiki page.
def getHeading(soup):
    # Return the <title> text of a soup object created by getPageFromWiki.
    heading = soup.title.string
    return heading

# Third, get the article text of the wiki page.
def getContent(page):
    # Return the article text of a soup object created by getPageFromWiki.
    # There are multiple paragraphs, so merge them into one long string.
    # Note: the text fragments are concatenated without any separator.
    content = ""
    paras = page.find_all('p')
    for para in paras:
        for string in para.stripped_strings:
            content = content + string
    return content

# Fourth, get the links in the article text that point to other wiki pages.
def getLinks(page):
    # Return a dictionary whose keys are the link texts (used as the titles of
    # the linked pages) and whose values are the absolute URLs.
    linksDict = {}
    paras = page.find_all('p')
    for para in paras:
        for link in para.find_all('a', href=True):
            link_value = link['href']
            if link_value.startswith('/'):  # site-relative link; also safe for an empty href
                linksDict[link.string] = 'https://en.wikipedia.org' + link_value
    return linksDict
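The leading-slash check in getLinks also lets through protocol-relative URLs (those starting with `//`) and non-article pages such as `File:` or `Help:` pages. The following is a minimal sketch of a stricter filter using only the standard library. The helper name `resolve_wiki_link` and the namespace list are assumptions made for illustration; the list is not exhaustive and neither is part of the original code.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://en.wikipedia.org"

# Hypothetical, non-exhaustive list of non-article namespaces to skip.
NON_ARTICLE_PREFIXES = ("File:", "Help:", "Special:", "Wikipedia:",
                        "Category:", "Template:", "Talk:", "Portal:")

def resolve_wiki_link(href, base=BASE_URL):
    """Return an absolute article URL, or None if href is not an article link."""
    absolute = urljoin(base, href)
    parsed = urlparse(absolute)
    # Reject links that leave the site (including protocol-relative ones).
    if parsed.netloc != urlparse(base).netloc:
        return None
    # Reject anything that is not under /wiki/ (for example in-page anchors).
    if not parsed.path.startswith("/wiki/"):
        return None
    page_name = parsed.path[len("/wiki/"):]
    if page_name.startswith(NON_ARTICLE_PREFIXES):
        return None
    # Drop the fragment so that /wiki/X and /wiki/X#Section map to one URL.
    return absolute.split("#", 1)[0]

for href in ["/wiki/Pikachu", "/wiki/Pikachu#Design", "#cite_note-1",
             "//upload.wikimedia.org/x.png", "/wiki/File:X.png"]:
    print(href, "->", resolve_wiki_link(href))
```

Inside getLinks, the result could be stored only when it is not None. Doing so would change how many links are collected, so the counts shown later in this section would no longer match.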
Now we can scrape the Wikipedia page for a given entry:
pageDict = {}
page = getPageFromWiki('https://en.wikipedia.org/wiki/Pok%C3%A9mon')  # scrape the main page we want
header = getHeading(page)
content = getContent(page)
pageDict[header] = content
print(pageDict)
linksDict = getLinks(page)  # get the links contained in the article text of the page
print("a set of {} links are found.".format(len(linksDict)))
for title in tqdm(list(linksDict.keys())):  # fetch every linked page, pausing after each request
    url = linksDict[title]
    page = getPageFromWiki(url)
    header = getHeading(page)
    content = getContent(page)
    pageDict[header] = content
    time.sleep(1)  # keep a delay of at least 1 second so you do not overload the server
print("a size of {} content dictionary is built.".format(len(pageDict)))
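The crawl loop has no error handling, so a single failed request aborts a run that takes several minutes. Below is a small sketch of a retry wrapper. The name `fetch_with_retry` and its parameters are hypothetical, and a stand-in function replaces getPageFromWiki so that the example runs without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on an exception wait `delay` seconds and try again.

    Returns the result of fetch(url), or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad in this sketch
            print("attempt {} for {} failed: {}".format(attempt, url, error))
            if attempt < retries:
                time.sleep(delay)
    return None

# Stand-in for getPageFromWiki that fails twice before succeeding.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "page for " + url

result = fetch_with_retry(flaky_fetch, "https://en.wikipedia.org/wiki/Pikachu", delay=0)
print(result)
```

In the loop, `page = fetch_with_retry(getPageFromWiki, url)` followed by `if page is None: continue` would skip pages that never load. Note that requests.get raises on connection problems but not on HTTP error statuses unless raise_for_status() is called on the response.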
The output is as follows:
{'Pokémon - Wikipedia': 'Pokémon[a][1][2][3](an abbreviation forPocket Monsters[b]inJapan) is a Japanesemedia franchisemanaged byThe Pokémon Company, a company founded byNintendo,Game Freak, andCreatures. The franchise was created bySatoshi Tajiriin 1996,[4]and is centered on fictional creatures called"Pokémon". InPokémon, humans, known as Pokémon Trainers, catch and train Pokémon to battle other Pokémon for sport. All media works within the franchise are set in thePokémon universe. The Englishsloganfor the franchise is "Gotta Catch ‘Em All!".[5][6]There are currently 920Pokémon species.[7]The franchise began asPocket Monsters: RedandGreen(later released outside of Japan asPokémon RedandBlue), a pair of video games for the originalGame Boyhandheld system that were developed by Game Freak and published by Nintendo in February 1996. It soon became amedia mixfranchise adapted into various different media.[8]Pokémonis estimated to be thehighest-grossing media franchiseof all time. ThePokémon video game seriesis the thirdbest-selling video game franchiseof all time with more than440 millioncopies sold[9]and onebillionmobile downloads.[10]The Pokémon video game series spawned ananime television seriesthat has become the most successfulvideo game adaptationof all time[11]withover 20 seasons and 1,000 episodesin 192 countries.[9]ThePokémon Trading Card Gameis the highest-sellingtrading card gameof all time[12]with over 43.2billion cards sold. In addition, thePokémonfranchise includes the world\'s top-selling toy brand,[13]ananime film series, a live-action film (Detective Pikachu),books,manga comics,music, merchandise, and atemporary theme park. 
The franchise is also represented in other Nintendo media, such as theSuper Smash Bros.series, where variousPokémoncharacters are playable.In 1998, Nintendo spent $25 million promoting Pokémon in the United States in partnership withHasbro,KFC, and others.[14]Nintendo initially feared that Pokémon was too Japanese for Western tastes butAlfred Kahn, then CEO of 4Kids Entertainment convinced the company otherwise.[15]The one who spotted Pokémon\'s potential in the United States was Kahn\'s colleague Thomas Kenney.[16]In November 2005,4Kids Entertainment, which had managed the non-game related licensing ofPokémon, announced that it had agreed not to renew thePokémonrepresentation agreement. The Pokémon Company International oversees allPokémonlicensing outside Asia.[17]In 2006, the franchise celebrated its tenth anniversary.[18]In 2016, the Pokémon Company celebratedPokémon\'s 20th anniversary by airing an ad duringSuper Bowl 50in January and re-releasing the firstPokémonvideo games 1996 Game Boy gamesPokémon Red, Green(only in Japan), andBlue,and the 1998 Game Boy Color gamePokémon Yellowfor theNintendo 3DSon February 26, 2016.[19][20]The mobileaugmented realitygamePokémon Gowas released in July 2016.[21]The first live-action film in the franchise,Pokémon Detective Pikachu, based on the 2018 Nintendo 3DS spin-off gameDetective Pikachu, was released in 2019.[22]The eighth generation of core series games began withPokémon Sword and Shield, released worldwide on theNintendo Switchon November 15, 2019.To celebrate its25th anniversary, the company released two additional titles for the Nintendo Switch:Pokémon Brilliant DiamondandShining Pearl, remakes of theNintendo DSPokémon DiamondandPearlgames, on November 19, 2021, and its "premake"Pokémon Legends: Arceus, which was subsequently released on January 28, 2022.[23][24]The most recent games in the main series,Pokémon ScarletandVioletbegan the ninth and latest generation and will be released worldwide for the Nintendo 
Switch in late 2022.The namePokémonis asyllabic abbreviationof the Japanese brandPocket Monsters.[25]The term "Pokémon", in addition to referring to thePokémonfranchise itself, also collectively refers to themany fictional speciesthat have made appearances inPokémonmedia as of the release of the eighth generation titlesPokémon Sword and Shield. "Pokémon" isidentical in the singular and plural, as is each individual species name; it is and would be grammatically correct to say "one Pokémon" and "many Pokémon", as well as "onePikachu" and "many Pikachu".[26]Pokémonexecutive directorSatoshi Tajirifirst thought ofPokémon, albeit with a different concept and name, around 1989, when theGame Boywas released. The concept of thePokémon universe, in both the video games and the general fictional world ofPokémon, stems from the hobby ofinsect collecting, a popular pastime which Tajiri enjoyed as a child.[27]Players are designated asPokémon Trainersand have three general goals: to complete the regionalPokédexby collecting all of the available Pokémon species found in the fictional region where a game takes place, to complete the national Pokédex by transferring Pokémon from other regions, and to train a team of powerful Pokémon from those they have caught to compete against teams owned by other Trainers so they may eventually win the Pokémon League and become the regional Champion. These themes of collecting, training, and battling are present in almost every version of the Pokémon franchise, including thevideo games, theanimeand manga series, and thePokémon Trading Card Game(also known asTCG).In most incarnations of thePokémonuniverse, a Trainer who encounters a wild Pokémon has the ability to capture that Pokémon by throwing a specially designed, mass-producible spherical tool called aPoké Ballat it. If the Pokémon is unable to escape the confines of the Poké Ball, it is considered to be under the ownership of that Trainer. 
Afterwards, it will obey whatever commands it receives from its new Trainer, unless the Trainer demonstrates such a lack of experience that the Pokémon would rather act on its own accord. Trainers can send out any of their Pokémon to wage non-lethal battles against other Pokémon; if the opposing Pokémon is wild, the Trainer can capture that Pokémon with a Poké Ball, increasing their collection of creatures. InPokémon Go, and inPokémon: Let\'s Go, Pikachu!andLet\'s Go, Eevee!, wild Pokémon encountered by players can be caught in Poké Balls, but most cannot be battled. Pokémon already owned by other Trainers cannot be captured, except under special circumstances in certain side games. If a Pokémon fully defeats an opponent in battle so that the opponent is knocked out ("faints"), the winning Pokémon gainsexperience pointsand maylevel up. Beginning withPokémon XandY, experience points are also gained from catching Pokémon in Poké Balls. When leveling up, the Pokémon\'s battling aptitude statistics ("stats", such as "Attack" and "Speed") increase. At certain levels, the Pokémon may also learn newmoves, which are techniques used in battle. In addition, many species of Pokémon can undergo a form ofmetamorphosisand transform into a similar but stronger species of Pokémon, a process calledevolution; this process occurs spontaneously under differing circumstances, and is itself a central theme of the series. Some species of Pokémon may undergo a maximum of two evolutionary transformations, while others may undergo only one, and others may not evolve at all. 
For example, the Pokémon Pichu may evolve into Pikachu, which in turn may evolve into Raichu, following which no further evolutions may occur.Pokémon XandYintroduced the concept of "Mega Evolution," by which certain fully evolved Pokémon may temporarily undergo an additional evolution into a stronger form for the purpose of battling; this evolution is considered a special case, and unlike other evolutionary stages, is reversible.In the main series, each game\'s single-player mode requires the Trainer to raise a team of Pokémon to defeat manynon-player character(NPC) Trainers and their Pokémon. Each game lays out a somewhat linear path through a specific region of thePokémonworld for the Trainer to journey through, completing events and battling opponents along the way (including foiling the plans of anevilteam of Pokémon Trainers who serve as antagonists to the player). ExcludingPokémon SunandMoonandPokémon Ultra SunandUltra Moon, the games feature eight powerful Trainers, referred to asGym Leaders, that the Trainer must defeat in order to progress. As a reward, the Trainer receives a Gym Badge, and once all eight badges are collected, the Trainer is eligible to challenge the region\'s Pokémon League, where four talented trainers (referred to collectively as the "Elite Four") challenge the Trainer to four Pokémon battles in succession. If the trainer can overcome this gauntlet, they must challenge the Regional Champion, the master Trainer who had previously defeated the Elite Four. Any Trainer who wins this last battle becomes the new champion.Pokémonis set in the fictionalPokémonuniverse. There are numerous regions that have appeared in the various media of thePokémonfranchise. There are 8 main series regions set in the main series games: Kanto, Johto, Hoenn, Sinnoh/Hisui, Unova, Kalos, Alola, and Galar. Each of the eight generations of the main series releases focuses on a new region. 
Every region consists of several cities and towns that the player must explore in order to overcome many waiting challenges, such asGyms,Contestsandvillainous teams. At different locations within each region, the player can find different types of Pokémon, as well as helpful items and characters. Different regions are not accessible from one another at all within a single game, only with the exception of Kanto and Johto being linked together inPokémon Gold,Silver,Crystal,HeartGoldandSoulSilverversions. There are also regions set inspinoffgames and two islands in thePokémonanime(Orange Islands and Decolore Islands), all still set within the samefictional universe.Each main series region in thePokémonuniverse is based on a real world location. The first four regions introduced are based off of locations inJapan, beingKantō,Kansai,Kyushu, andHokkaidō, with later regions being based on parts onNew York City,France,Hawaii, theUnited Kingdom, and theIberian Peninsula.[28][29]All of the licensedPokémonproperties overseen bythe Pokémon Company Internationalare divided roughly by generation. These generations are roughlychronologicaldivisions by release; every several years, when a sequel to the 1996role-playing video gamesPokémon RedandGreenis released that features new Pokémon, characters, and gameplay concepts, that sequel is considered the start of a new generation of the franchise. ThemainPokémonvideo gamesand their spin-offs, the anime, manga, and trading card game are all updated with the new Pokémon properties each time a new generation begins.[30]Some Pokémon from the newer games appear in anime episodes or films months, or even years, before the game they were programmed for came out. The first generation began in Japan withPokémon RedandGreenon the Game Boy. As of 2022, there are nine generations of main series video games. 
The most recent games in the main series,Pokémon ScarletandVioletbegan the ninth and latest generation and will be released worldwide for the Nintendo Switch on November 18, 2022.[31][32][33]Kanto regionJohto regionKanto regionHoenn regionKanto regionSinnoh regionJohto regionKanto regionUnova regionKalos regionHoenn regionAlola regionKanto regionGalar regionSinnoh/Hisui regionPaldea regionPokémon, also known asPokémon the Seriesto Western audiences since the year 2013, is an anime television series based on thePokémonvideo game series. It was originally broadcast onTV Tokyoin 1997. More than 1,000 episodes of the anime has been produced and aired,[39]divided into 7 series in Japan and 22 seasons internationally. It is one of the longest currently running anime series.[39]The anime follows the quest of the main character,Ash Ketchum(known as Satoshi in Japan), a Pokémon Master in training, as he and a small group of friends travel around the world of Pokémon along with their Pokémon partners.[40]Various children\'s books, collectively known asPokémon Junior, are also based on the anime.[41]An eight part anime series calledPokémon: Twilight Wingsaired on YouTube in 2020.[42]The series was animated byStudio Colorido.[43]In July 2021, it was announced that a live action Pokémon series is in early development at Netflix withJoe Hendersonattached to write and executive produce.[44]An eight part anime series in celebration of the Pokémon 25th anniversary calledPokémon Evolutionsaired onYouTubein 2021.[45]There have been 23 animated theatricalPokémonfilms(latest film to be released on December 25, 2020[46]), which have been directed byKunihiko Yuyamaand Tetsuo Yajima, and distributed in Japan byTohosince 1998. The pair of films,Pokémon the Movie: Black—Victini and ReshiramandWhite—Victini and Zekromare considered together as one film. Collectibles, such as promotional trading cards, have been available with some of the films. 
Since the20th film, the films have been set in an alternate continuity separate from the anime series.PokémonCDs have been released in North America, some of them in conjunction with the theatrical releases of the first three and the 20thPokémonfilms. These releases were commonplace until late 2001. On March 27, 2007, a tenth anniversary CD was released containing 18 tracks from the English dub; this was the first English-language release in over five years. Soundtracks of thePokémonfeature films have been released in Japan each year in conjunction with the theatrical releases. In 2017, a soundtrack album featuring music from the North American versions of the 17th through 20th movies was released..mw-parser-output .citation{word-wrap:break-word}.mw-parser-output .citation:target{background-color:rgba(0,127,255,0.133)}^The exact date of release is unknown.^Featuring music fromPokémon the Movie: Diancie and the Cocoon of Destruction,Pokémon the Movie: Hoopa and the Clash of Ages,Pokémon the Movie: Volcanion and the Mechanical Marvel, andPokémon the Movie: I Choose You!The Pokémon Trading Card Game (TCG) is acollectible card gamewith a goal similar to a Pokémon battle in the video game series. Players use Pokémon cards, with individual strengths and weaknesses, in an attempt to defeat their opponent by "knocking out" their Pokémon cards.[49]The game was published in North America byWizards of the Coastin 1999.[50]With the release of theGame Boy Advancevideo gamesPokémon RubyandSapphire, the Pokémon Company took back the card game from Wizards of the Coast and started publishing the cards themselves.[50]The Expedition expansion introduced thePokémon-e Trading Card Game, where the cards (for the most part) were compatible with theNintendo e-Reader. Nintendo discontinued its production of e-Reader compatible cards with the release ofFireRedandLeafGreen. 
In 1998, Nintendo released a Game Boy Color version of the trading card game in Japan;Pokémon Trading Card Gamewas subsequently released to the US and Europe in 2000. The game included digital versions of cards from the original set of cards and the first two expansions (Jungle and Fossil), as well as several cards exclusive to the game. A sequel was released in Japan in 2001.[51]There are various Pokémonmangaseries, four of which were released in English byViz Media, and seven of them released in English byChuang Yi. The manga series vary from game-based series to being based on the anime and the Trading Card Game. Original stories have also been published. As there are several series created by different authors, mostPokémonmanga series differ greatly from each other and other media, such as the anime.[example needed]Pokémon Pocket MonstersandPokémon Adventuresare the two manga in production since the first generation.APokémon-styledMonopolyboard gamewas released in August 2014.[60]In July 2021, it was announced that a live-actionPokémonseries is reportedly in development atNetflix. Joe Henderson, showrunner ofLucifer, is signed on as writer and executive producer.[61]Pokémonhas been criticized by somefundamentalist Christiansover perceivedoccultandviolentthemes and the concept of "Pokémon evolution", which they feel goes against the Biblical creation account in Genesis.[62]Sat2000, a satellite television station based inVatican City, has countered that the Pokémon Trading Card Game and video games are "full of inventive imagination" and have no "harmful moral side effects".[63][64]In the United Kingdom, the "Christian Power Cards" game was introduced in 1999 by David Tate who stated, "Some people aren\'t happy with Pokémon and want an alternative, others just want Christian games." 
The game was similar to the Pokémon Trading Card Game but used Biblical figures.[65]In 1999, Nintendo stopped manufacturing the Japanese version of the "Koga\'s Ninja Trick" trading card because it depicted amanji, a traditionallyBuddhistsymbol with no negative connotations. TheJewishcivil rights groupAnti-Defamation Leaguecomplained because the symbol is the reverse of aswastika, aNazisymbol. The cards were intended for sale in Japan only, but the popularity ofPokémonled to import into the United States with approval from Nintendo. The Anti-Defamation League understood that the portrayed symbol was not intended to offend and acknowledged the sensitivity that Nintendo showed by removing the product.[66][67]In 1999, two nine-year-old boys fromMerrick, New York, sued Nintendo because they claimed the Pokémon Trading Card Game caused theirproblematic gambling.[68]In 2001,Saudi ArabiabannedPokémongames and the trading cards, alleging that the franchise promotedZionismby displaying theStar of Davidin the trading cards (theColorless energyfrom thePokémon Trading Card Gameresembles a six-pointed star) as well as other religious symbols such ascrossesthey associated with Christianity and triangles they associated withFreemasonry; the games also involved gambling, which is in violation ofMuslimdoctrine.[69][70]Pokémonhas also been accused of promotingmaterialism.[71]In 2012,PETAcriticized the concept ofPokémonas supportingcruelty to animals. PETA compared the game\'s concept, of capturing animals and forcing them to fight, tocockfights,dog fightingrings and circuses, events frequently criticized for cruelty to animals. 
PETA released a game spoofingPokémonwhere the Pokémon battle their trainers to win their freedom.[72]PETA reaffirmed their objections in 2016 with the release ofPokémon Go, promoting the hashtag #GottaFreeThemAll.[73]On December 16, 1997, more than 635 Japanese children were admitted to hospitals with epilepticseizures.[74]It was determined the seizures were caused by watching an episode of Pokémon "Dennō Senshi Porygon", (most commonly translated "Electric Soldier Porygon", season 1, episode 38); as a result, this episode has not been aired since. In this particular episode, there were bright explosions with rapidly alternating blue and red color patterns.[75]It was determined in subsequent research that these strobing light effects cause some individuals to have epileptic seizures, even if the person had no previous history ofepilepsy.[76]This incident is a common focus of Pokémon-related parodies in other media, and was lampooned byThe Simpsonsepisode "Thirty Minutes over Tokyo" in a shortcameo[77]and theSouth Parkepisode "Chinpokomon",[78]among others.In March 2000, Morrison Entertainment Group, a toy developer based atManhattan Beach, California, sued Nintendo over claims thatPokémoninfringed on its ownMonster in My Pocketcharacters. A judge ruled there was no infringement and Morrison appealed the ruling. On February 4, 2003, the U.S. Court of Appeals for the Ninth Circuit affirmed the decision by the District Court to dismiss the suit.[79]Within its first two days of release,Pokémon Goraised safety concerns among players. 
Multiple people also suffered minor injuries from falling while playing the game due to being distracted.[80]Multiple police departments in various countries have issued warnings, sometongue-in-cheek, regarding inattentive driving, trespassing, and being targeted by criminals due to being unaware of one\'s surroundings.[81][82]People have suffered various injuries from accidents related to the game,[83][84][85][86]and Bosnian players have been warned to stay out of minefields left over from the 1990sBosnian War.[87]On July 20, 2016, it was reported that an 18-year-old boy inChiquimula,Guatemala, was shot and killed while playing the game in the late evening hours. This was the first reported death in connection with the app. The boy\'s 17-year-old cousin, who was accompanying the victim, was shot in the foot. Police speculated that the shooters used the game\'s GPS capability to find the two.[88]Pokémon, being a globally popular franchise, has left a significant mark on today\'spopular culture. Thevarious species ofPokémonhave become pop culture icons; examples include two different Pikachu balloons in theMacy\'s Thanksgiving Day Parade,Pokémon-themed airplanes operated by All Nippon Airways, merchandise items, and atraveling theme parkthat was inNagoya, Japanin 2005 and inTaipeiin 2006.Pokémonalso appeared on the cover of the U.S. magazineTimein 1999.[89]The Comedy Central showDrawn Togetherhas a character namedLing-Lingwho is a parody of Pikachu.[90]Several other shows such asThe Simpsons,[91]South Park[92]andRobot Chicken[93]have made references and spoofs ofPokémon, among other series.Pokémonwas featured onVH1\'sI Love the \'90s: Part Deux. A live action show based on the anime calledPokémon Live!toured the United States in late 2000.[94]Jim ButchercitesPokémonas one of the inspirations for theCodex Aleraseries of novels.[95]Pokémon has even made its mark in the realm of science. 
This includes animals named after Pokémon, such asStentorceps weedlei(named after the Pokémon Weedle for its resemblance) andChilicola charizard(named after the PokémonCharizard) as well asBinburrum articuno,Binburrum zapdos, andBinburrum moltres(named after the Pokémon Articuno, Zapdos, and Moltres, respectively).[96][97]There is also a protein named after Pikachu, calledPikachurin.In November 2001, Nintendo opened a store called the Pokémon Center in New York, inRockefeller Center,[98]modeled after the two other Pokémon Center stores in Tokyo andOsakaand named after a staple of the video game series. Pokémon Centers are fictional buildings where Trainers take their injured Pokémon to be healed after combat.[99]The store sold Pokémon merchandise on a total of two floors, with items ranging from collectible shirts to stuffed Pokémonplushies.[100]The store also featured a Pokémon Distributing Machine in which players would place their game to receive an egg of a Pokémon that was being given out at that time. The store also had tables that were open for players of the Pokémon Trading Card Game to duel each other or an employee. The store was closed and replaced by theNintendo World Storeon May 14, 2005.[101]Four Pokémon Center kiosks were put in malls in the Seattle area.[102]The Pokémon Center online store was relaunched on August 6, 2014.[103]Professor of educationJoseph Tobintheorizes that the success of the franchise was due to the long list of names that could be learned by children and repeated in their peer groups. Its rich fictional universe provides opportunities for discussion and demonstration of knowledge in front of their peers. The names of the creatures were linked to its characteristics, which converged with the children\'s belief that names have symbolic power. 
Children can pick their favourite Pokémon and affirm their individuality while at the same time affirming their conformance to the values of the group, and they can distinguish themselves from others by asserting what they liked and what they did not like from every chapter.Pokémongained popularity because it provides a sense of identity to a wide variety of children, and lost it quickly when many of those children found that the identity groups were too big and searched for identities that would distinguish them into smaller groups.[104][page\xa0needed]Pokémon\'s history has been marked at times by rivalry with theDigimonmedia franchise that debuted at a similar time. Described as "the other \'mon\'" byIGN\'s Juan Castro,Digimonhas not enjoyedPokémon\'s level of international popularity or success, but has maintained a dedicated fanbase.[105]IGN\'s Lucas M. Thomas stated thatPokémonisDigimon\'s "constant competition and comparison", attributing the former\'s relative success to the simplicity of its evolution mechanic as opposed toDigivolution.[106]The two have been noted for conceptual and stylistic similarities by sources such as GameZone.[107]A debate among fans exists over which of the two franchises came first.[108]In actuality, the firstPokémonmedia,Pokémon RedandGreen, were released initially on February 27, 1996;[109]whereas theDigimonvirtual petwas released on June 26, 1997.WhilePokémon\'s target demographic is children, early purchasers ofPokémon Omega RubyandAlpha Sapphirewere in their 20s.[110]Many fans are adults who originally played the games as children and had later returned to the series.[110]Numerousfan sitesexist for the Pokémon franchise, including.mw-parser-output .vanchor>:target~.vanchor-text{background-color:#b1d2ff}Bulbagarden, a site hosting thewiki-based encyclopedia Bulbapedia,[111][112][113]and Serebii,[114]a news and reference website.[115]Large fan communities exist on other platforms, such as thesubredditr/pokemon, which has over 4 
million subscribers.[116]Asignificant community around thePokémonvideo games\' metagamehas existed for a long time, analyzing the best ways to use each Pokémon to their full potential in competitive battles. The most prolific competitive community is Smogon University, which has created a widely accepted tier-based battle system.[117]Smogon is affiliated with an onlinePokémongame calledPokémon Showdown, in which players create a team and battle against other players around the world using the competitive tiers created by Smogon.[118]In early 2014, an anonymous video streamer onTwitchlaunchedTwitch PlaysPokémon, a small experiment trying tocrowdsourceplaying subsequentPokémongames, that started with the gamePokémon Redand has since included subsequent games in the series.[119][120]A study at Stanford Neurosciences published inNatureperformed magnetic resonance imaging scans of 11 Pokémon experts and 11 controls, finding that seeing Pokémon stimulated activity in the visual cortex, in a different place than is triggered by recognizing faces, places, or words, demonstrating the brain\'s ability to create such specialized areas.[121][122]A challenge called the Nuzlocke Challenge allows players to only capture the first Pokémon encountered in each area. Using rules from a webcomic originally named "Pokémon Hard-Mode", if they do not succeed in capturing that Pokémon, there are no second chances. When a Pokémon faints, it is considered "dead" and must be released or stored in the PC permanently.[123][124]If the player faints, the game is considered over, and the player must restart.[125]The original idea consisted of 2 to 3 rules that the community has built upon. There are manyfan madePokémongames that contain agame modesimilar to the Nuzlocke Challenge, such asPokémon Uranium.'}
a set of 184 links are found.
100%|██████████| 184/184 [04:02<00:00, 1.32s/it]
a size of 160 content dictionary is built.
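In the printed article text, many words are fused together (for example "forPocket Monsters" and "Japanesemedia franchise"). This happens because getContent concatenates the stripped text fragments with no separator. The sketch below shows one possible fix, using a plain list of strings as a stand-in for `para.stripped_strings`; the helper name `join_fragments` is hypothetical.

```python
import re

def join_fragments(fragments):
    """Join text fragments with single spaces, then tidy spaces before punctuation."""
    text = " ".join(fragment for fragment in fragments if fragment)
    # The join puts a space before closing punctuation; remove it again.
    return re.sub(r"\s+([,.;:!?)\]])", r"\1", text)

# Fragments shaped like the ones BeautifulSoup yields around inline links.
fragments = ["Pokémon", "(an abbreviation for", "Pocket Monsters", "in",
             "Japan", ") is a Japanese", "media franchise"]

print("".join(fragments))         # what plain concatenation produces
print(join_fragments(fragments))  # with separators restored
```

Using such a join in getContent would change the scraped text, so the tokens and word counts later in this section would differ from the ones shown.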
Use the following code to save the data to a CSV file:
import csv
driveFolderDirectory = './'  # if you are not using Google Colab, edit the value directly here
savedFileName = 'wikiContents.csv'
pathToSave = driveFolderDirectory + savedFileName
# utf-8 is set explicitly because the titles and texts contain non-ASCII characters such as "é"
with open(pathToSave, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['idx', 'wikiTitle', 'wikiContents']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for i, wikiContentKey in enumerate(pageDict.keys()):
        writer.writerow({'idx': i, 'wikiTitle': wikiContentKey, 'wikiContents': pageDict[wikiContentKey]})
Loading the Data
The scraped data is stored in a CSV file; next we read it back into memory.
import sys
import csv
csv.field_size_limit(sys.maxsize)  # allow very long fields (each field holds a whole article)
def loadWikiTexts(csvPath):
    # Return a dictionary that maps each wiki title to its raw article text.
    wikiRawTextDict = {}
    with open(csvPath, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            wikiRawTextDict[row['wikiTitle']] = row['wikiContents']
    return wikiRawTextDict
# Load wiki text data here
wikiRawTextDict = loadWikiTexts('wikiContents.csv')
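On platforms where a C long is 32 bits wide (notably Windows), `csv.field_size_limit(sys.maxsize)` raises an OverflowError. The following is a hedged sketch of a fallback that shrinks the requested limit until the csv module accepts it; the helper name `raise_field_size_limit` is hypothetical.

```python
import csv
import sys

def raise_field_size_limit():
    """Set the csv field size limit as high as the platform allows and return it."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit into a C long on this platform; try a smaller one.
            limit //= 10

limit = raise_field_size_limit()
print("csv field size limit set to", limit)
```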
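Preprocessing with re and spaCy
Install spaCy:
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
Use re to strip the reference markers (the parts enclosed in square brackets, "[...]") from the text, then use spaCy to split the text into sentences and tokens, and finally lowercase every token. Note two things about the code below: the regular expression matches only numeric markers such as [4], and each token is stored in its surface form (token.text) and not as its lemma (token.lemma_), so inflected forms such as "game" and "games" stay distinct.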
import re
import spacy
def preprocess(wikiTextDict):
    # Input: a wiki text dictionary with keys are titles and values are the corresponding texts.
    # Output: a wiki text dictionary with keys are the titles and the values are the preprocessed texts
    # (sentences - tokens).
    # sub-task 1: remove all the references texts "[...]"
    # sub-task 2: segment all the sentences in the wiki texts.
    # sub-task 3: tokenize the sentences from sub-task 2.
    # sub-task 4: lemmatize the tokens from sub-task 3.
    # sub-task 5: lower-case the tokens from sub-task 3/4.
    # You don't need to follow the order of the sub-tasks.
    wikiTextDictPure = {}
    nlp = spacy.load("en_core_web_sm")
    for wiki_title, wiki_content_raw in wikiTextDict.items():
        pure_tokens = []
        # Only purely numeric markers such as "[12]" are removed.
        wiki_content = re.sub(r'\[[0-9]+\]', '', wiki_content_raw)
        doc = nlp(wiki_content)
        for sent in doc.sents:
            sent_tokens = []
            sent_doc = nlp(sent.text)
            for token in sent_doc:
                # NOTE: token.text keeps the surface form, so sub-task 4 is not applied here;
                # use token.lemma_.lower() instead to lemmatize.
                sent_tokens.append(token.text.lower())
            pure_tokens.append(sent_tokens)
        wikiTextDictPure[wiki_title] = pure_tokens
    return wikiTextDictPure
wikiPureTextDict = preprocess(wikiRawTextDict)
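As a quick check of the reference-stripping step, here is a minimal sketch that applies the same regular expression to a made-up sentence (the sample string is an assumption for illustration, not scraped data). It suggests that non-numeric markers such as "[a]" are left in place by this pattern.

```python
import re

# Hypothetical sample text, not taken from the scraped wiki data.
sample = "Pokémon[1] is a media franchise[23] created in 1996[a]."

# Same pattern as in preprocess(): only purely numeric markers are removed.
cleaned = re.sub(r'\[[0-9]+\]', '', sample)
print(cleaned)
```

If letter-style footnotes should go as well, the pattern would need to be widened, at the cost of possibly removing legitimate bracketed text.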
Building the vocabulary
From the preprocessed wiki text we build a dictionary in which each word is a key and its number of occurrences is the value.
def computeFreq(wikiTextDict):
    # Input: a wiki text dictionary with keys are titles and values are the preprocessed corresponding texts.
    # Output: a dictionary with keys are the word types, and the values are the appearance counts of the word types
    freqDict = {}
    for _, sents in wikiTextDict.items():
        for sent in sents:
            for token in sent:
                if token in freqDict:
                    freqDict[token] += 1
                else:
                    freqDict[token] = 1
    return freqDict
# Compute the frequency dictionary here.
freqDictWiki = computeFreq(wikiPureTextDict)
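The same counting can probably be written more compactly with collections.Counter from the standard library. The sketch below assumes the input has the shape returned by preprocess() (title mapped to a list of token lists); the function name compute_freq_counter and the toy data are hypothetical.

```python
from collections import Counter

def compute_freq_counter(text_dict):
    # text_dict maps title -> list of sentences, each sentence a list of tokens.
    counts = Counter()
    for sents in text_dict.values():
        for sent in sents:
            counts.update(sent)  # add one count per token in the sentence
    return counts

# Toy input, made up for illustration.
toy = {"page": [["the", "game", "is", "a", "game"], ["the", "end"]]}
toy_counts = compute_freq_counter(toy)
print(dict(toy_counts))
```

A Counter behaves like a dictionary, so it should work as a drop-in input for the functions that follow.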
With this we can find the most frequent words:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
def computeTop20Words(freqDict):
    # Input: a dictionary with keys are the word types, and the values are the appearance counts of the word types
    # Output: a list of 20 words that appear most frequently in all the preprocessed scraped texts.
    word_freq_list_sorted = sorted(freqDict.items(), key=lambda x: x[1], reverse=True)
    stop_words = set(stopwords.words('english'))  # build the stop-word set once, not on every iteration
    chars = set('abcdefghijklmnopqrstuvwxyz')     # keep only words that start with an ASCII letter
    top20 = []
    for word, _ in word_freq_list_sorted:
        if len(top20) == 20:
            break
        if word not in stop_words and word[0] in chars:
            top20.append(word)
    return top20
# Print your top 20 words here.
print(computeTop20Words(freqDictWiki))
The output is:
['game', 'nintendo', 'pokémon', 'games', 'also', 'switch', 'released', 'japan', 'new', 'first', 'one', 'used', 'company', 'million', 'player', 'players', 'system', 'video', 'would', 'time']
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Generating a word cloud
To inspect the word distribution more intuitively, we can generate a word cloud.
First install the wordcloud package:
pip install wordcloud
Then proceed as follows:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
def plotWordCloud(image):
    # Input: word cloud image
    # Output: (display the cloud image in the output)
    plt.figure(figsize=(40, 30))
    plt.imshow(image)
    plt.axis("off")
def generateWordCloud(text):
    # Input: all texts in the scraped wiki data.
    # Output: word cloud image.
    wordcloud = WordCloud(width=3000, height=2000, random_state=1, background_color='black',
                          colormap='Set2', collocations=False, stopwords=STOPWORDS).generate(text)
    return wordcloud
# Draw word cloud here
rawText = ""
for _, wikiText in wikiRawTextDict.items():
    rawText += wikiText
plotWordCloud(generateWordCloud(rawText))
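The output is a word cloud image of the scraped wiki text.
Twitter data analysis
Preprocessing
Here we process the tweet data the same way we processed the wiki data.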
import csv
def loadTweetTextFromCSV(csvPath):
    # Read the tweets into a {idx: tweet text} dictionary.
    tweetDict = {}
    with open(csvPath, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            tweetDict[int(row['idx'])] = row['tweetText']
    return tweetDict
tweetDict = loadTweetTextFromCSV('./tweetsFishing.csv')
processedTweetData = preprocess(tweetDict)
Computing the OOV rate
First we compute what fraction of the word types that appear in the tweet data do not appear in the wiki data:
def computeOOVWordTypes(tweetVocabDict, wikiVocabDict):
    # Input: a dictionary of tweet data vocabulary, a dictionary of wiki data vocabulary.
    # Output: the ratio of word types in tweets that are out-of-vocabulary w.r.t. wiki vocabulary
    # v.s. total number of word types in tweet data.
    # The value returned is a fraction in [0, 1]; the caller multiplies by 100 to get a percentage.
    cnt = 0
    for word, _ in tweetVocabDict.items():
        if word not in wikiVocabDict:
            cnt += 1
    return cnt/len(tweetVocabDict)
print(str(computeOOVWordTypes(computeFreq(processedTweetData), freqDictWiki)*100)+"%")
66.0857223956855%
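Note that the code above does not compute the proper OOV rate. The correct approach is to compute the ratio over tokens rather than types, as follows: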
def computeOOVWordTokens(tweetVocabDict, wikiVocabDict):
    # Input: a dictionary of tweet data vocabulary, a dictionary of wiki data vocabulary. (E.g. computed from task 3)
    # Output: the ratio of word tokens in tweets that are out-of-vocabulary w.r.t. wiki vocabulary
    # v.s. total number of word tokens in tweet data.
    cnt = 0
    total = 0
    for word, value in tweetVocabDict.items():
        total += value
        if word not in wikiVocabDict:
            cnt += value
    return cnt/total
print(str(computeOOVWordTokens(computeFreq(processedTweetData), freqDictWiki)*100)+"%")
15.167456297990526%
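To make the gap between the two numbers concrete, here is a minimal sketch on toy frequency dictionaries (all words and counts below are made up for illustration, and the helper names are hypothetical). Rare words are typically the ones missing from the reference vocabulary, so they weigh heavily when counting types but little when counting tokens.

```python
def oov_type_rate(tweet_freq, wiki_freq):
    # Fraction of distinct tweet words that are missing from the wiki vocabulary.
    oov_types = [w for w in tweet_freq if w not in wiki_freq]
    return len(oov_types) / len(tweet_freq)

def oov_token_rate(tweet_freq, wiki_freq):
    # Fraction of tweet word occurrences that are missing from the wiki vocabulary.
    total = sum(tweet_freq.values())
    oov = sum(count for word, count in tweet_freq.items() if word not in wiki_freq)
    return oov / total

# Toy data: two frequent in-vocabulary words, two rare out-of-vocabulary words.
tweet_freq = {"the": 6, "game": 2, "lol": 1, "omg": 1}
wiki_freq = {"the": 100, "game": 50}

type_rate = oov_type_rate(tweet_freq, wiki_freq)    # 2 of 4 types are OOV
token_rate = oov_token_rate(tweet_freq, wiki_freq)  # 2 of 10 tokens are OOV
print(type_rate, token_rate)
```

In this toy case half of the word types are out of vocabulary, yet they account for only a fifth of the tokens, which mirrors the direction of the 66% versus 15% difference above.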
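News website text analysis
Scraping the data
Here we scrape pages from ABC News and Fox News. The approach is the same as for the wiki pages. The code is as follows:
- Get the sitemaps: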
import requests
from bs4 import BeautifulSoup
# def getPageFromWiki(url): use the getPageFromWiki implemented from the wikipedia section
abcNewsSitemap = getPageFromWiki('https://abcnews.go.com/xmlLatestVideos')
# A similar one can be found in the Fox news robots.txt,
foxNewsSitemap = getPageFromWiki('https://www.foxnews.com/sitemap.xml?type=news')
- Get the page URLs:
def getUrlList(sitemap):
    # This function should return a list of URLs of news contained in the sitemap page.
    url_list = []
    url_raw_list = sitemap.find_all('loc')[:100]  # keep at most the first 100 <loc> entries
    for raw_url in url_raw_list:
        url_list.append(raw_url.string)
    return url_list
foxUrlList = []
abcUrlList = []
foxUrlList = getUrlList(foxNewsSitemap)
abcUrlList = getUrlList(abcNewsSitemap)
# Test here if the list contains the URLs you want.
print("foxUrlList",foxUrlList)
print("abcUrlList",abcUrlList)
print(len(foxUrlList))
print(len(abcUrlList))
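For readers who want to see what the <loc> extraction does without hitting the network, here is a small offline sketch. It swaps BeautifulSoup for the standard-library xml.etree.ElementTree, and the sitemap snippet, its URLs, and the function name get_url_list_from_xml are all assumptions made for illustration. Sitemaps following the sitemaps.org protocol declare the namespace used below.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap snippet; the URLs are placeholders, not real news pages.
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/a</loc></url>
  <url><loc>https://example.com/news/b</loc></url>
  <url><loc>https://example.com/news/c</loc></url>
</urlset>"""

# Namespace prefix mapping for the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def get_url_list_from_xml(xml_text, limit=100):
    # Parse the XML and return the text of at most `limit` <loc> elements.
    root = ET.fromstring(xml_text)
    locs = root.findall(".//sm:loc", NS)
    return [loc.text.strip() for loc in locs[:limit]]

urls = get_url_list_from_xml(sitemap_xml, limit=2)
print(urls)
```

A real XML parser is stricter than html.parser, so malformed sitemaps that BeautifulSoup tolerates may raise a parse error here.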
- Parse the pages with newspaper3k:
Install newspaper3k:
pip install newspaper3k
Parse:
from newspaper import Article
from tqdm import tqdm
def getNewsDict(url_list):
    # key should be the news title and value should be the article text of the news.
    newsDict = {}
    for news_url in tqdm(url_list):
        article = Article(news_url)
        article.download()
        article.parse()
        newsDict[article.title] = article.text
    return newsDict
abcNews = getNewsDict(abcUrlList)
foxNews = getNewsDict(foxUrlList)
- Save the data to a CSV file:
import csv
print(len(abcNews), len(foxNews))
driveFolderDirectory = './' # if you are not using Google Colab, edit the value directly here.
savedFileName = 'newsContents.csv'
pathToSave = driveFolderDirectory + savedFileName
# size check
assert len(abcNews)>=100 and len(foxNews)>=100, "the size of both news dictionary should be no less than 100. got {} for abc news and {} for fox news instead.".format(len(abcNews),len(foxNews))
with open(pathToSave, 'w', newline='') as csvfile:
    fieldnames = ['idx', 'newsSource', 'newsTitle', 'newsContents']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for i, newsDictKey in enumerate(abcNews.keys()):
        writer.writerow({'idx': i, 'newsSource': 'ABCNews', 'newsTitle': newsDictKey, 'newsContents': abcNews[newsDictKey]})
    for i, newsDictKey in enumerate(foxNews.keys()):
        writer.writerow({'idx': i, 'newsSource': 'FoxNews', 'newsTitle': newsDictKey, 'newsContents': foxNews[newsDictKey]})
Loading the data
def loadNewsTexts(csvPath):
    # the function returns two dictionaries, one for ABC news text data and one for Fox news text data
    abcNewsRawTextDict = {}
    foxNewsRawTextDict = {}
    with open(csvPath, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if (row['newsSource'] == "ABCNews"):
                abcNewsRawTextDict[row['newsTitle']] = row['newsContents']
            else:
                foxNewsRawTextDict[row['newsTitle']] = row['newsContents']
    return abcNewsRawTextDict, foxNewsRawTextDict
abcNewsDict,foxNewsDict = loadNewsTexts('./newsContents.csv')
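The write-then-read round trip can be checked in memory before touching real files. The sketch below uses io.StringIO in place of a file on disk, and the two rows are made up for illustration. It is meant to show that csv.DictWriter quotes commas, newlines, and double quotes inside article text so that csv.DictReader can recover them.

```python
import csv
import io

fieldnames = ["idx", "newsSource", "newsTitle", "newsContents"]
# Hypothetical rows containing characters that need CSV quoting.
rows = [
    {"idx": 0, "newsSource": "ABCNews", "newsTitle": "Title A", "newsContents": "Line one,\nline two"},
    {"idx": 0, "newsSource": "FoxNews", "newsTitle": "Title B", "newsContents": 'He said "hi"'},
]

# newline="" mirrors open(..., newline='') in the file-based code.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Rewind and split the rows back out by source, as loadNewsTexts does.
buffer.seek(0)
abc, fox = {}, {}
for row in csv.DictReader(buffer):
    target = abc if row["newsSource"] == "ABCNews" else fox
    target[row["newsTitle"]] = row["newsContents"]
print(abc, fox)
```

Because titles are used as dictionary keys, two articles with the same title collapse into one entry, which is worth keeping in mind when comparing the sizes before and after loading.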
Preprocessing the data
After preprocessing, we also draw a bar chart to look at the distribution of the vocabulary.
import matplotlib.pyplot as plt
def plotHistogram(wordType, wordTokens):
    # Input: a list of word types, a list of word token counts to the corresponding word types
    # Output: (display the histogram of word count from a news source)
    # X-axis should be (indexes) of the word type, and Y-axis should be the word counts of the word type.
    plt.bar(range(len(wordType)), wordTokens)
    plt.xticks(range(len(wordType)), wordType, rotation=270)
    plt.show()
# Preprocess the news data.
preprocessedAbcDict = preprocess(abcNewsDict)
preprocessedFoxDict = preprocess(foxNewsDict)
# Compute word type list and the word token list.
abcFreqDict = computeFreq(preprocessedAbcDict)
foxFreqDict = computeFreq(preprocessedFoxDict)
sortedAbc = sorted(abcFreqDict.items(), key=lambda x: x[1], reverse=True)
abcWords = []
abcFreqs = []
for p in sortedAbc:
    abcWords.append(p[0])
    abcFreqs.append(p[1])
sortedFox = sorted(foxFreqDict.items(), key=lambda x: x[1], reverse=True)
foxWords = []
foxFreqs = []
for p in sortedFox:
    foxWords.append(p[0])
    foxFreqs.append(p[1])
# Plot the histogram here.
plotHistogram(abcWords, abcFreqs)
plotHistogram(foxWords, foxFreqs)
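Plotting every word type puts thousands of tick labels on the x-axis, so the chart may be hard to read. One possible adjustment, sketched below under the assumption that only the head of the distribution is of interest, is to keep the k most frequent words before calling plotHistogram. The helper name top_k and the toy dictionary are hypothetical.

```python
def top_k(freq_dict, k=30):
    # Sort by count (descending) and split the k best entries into two parallel lists.
    ranked = sorted(freq_dict.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words = [word for word, _ in ranked]
    counts = [count for _, count in ranked]
    return words, counts

# Toy frequencies, made up for illustration.
toy_freq = {"a": 3, "b": 5, "c": 1}
words, counts = top_k(toy_freq, k=2)
print(words, counts)
```

With the real data this would be used as plotHistogram(*top_k(abcFreqDict)), and likewise for foxFreqDict.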
Drawing word clouds
Reusing the functions defined earlier, we draw the word clouds:
rawText = ""
for _, newsText in abcNewsDict.items():
    rawText += newsText
plotWordCloud(generateWordCloud(rawText))
rawText = ""
for _, newsText in foxNewsDict.items():
    rawText += newsText
plotWordCloud(generateWordCloud(rawText))