Guided Project: Winning Jeopardy

https://github.com/dataquestio/solutions/blob/master/Mission210Solution.ipynb

 

1: Jeopardy Questions

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. If you need help at any point, you can consult our solution notebook here.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here. Here's the beginning of the file:

[Image: preview of the first rows of jeopardy.csv]

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

  • Show Number -- the Jeopardy episode number of the show this question was in.
  • Air Date -- the date the episode aired.
  • Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
  • Category -- the category of the question.
  • Value -- the number of dollars answering the question correctly is worth.
  • Question -- the text of the question.
  • Answer -- the text of the answer.

Instructions

  • Read the dataset into a Dataframe called jeopardy using Pandas.
  • Print out the first 5 rows of jeopardy.
  • Print out the columns of jeopardy using jeopardy.columns.
  • Some of the column names have spaces in front.
    • Remove the spaces in each item in jeopardy.columns.
    • Assign the result back to jeopardy.columns to fix the column names in jeopardy.
  • Make sure you pay close attention to the format of each column.
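
The column clean-up described above can be sketched as follows. This is a minimal illustration, not the project's solution: it assumes a tiny hand-made Dataframe (the name sample and its values are made up) instead of the real jeopardy.csv, and uses the string accessor on the column index to strip whitespace from every name at once.

```python
import pandas as pd

# Hypothetical stand-in for jeopardy.csv: some column names start with a space.
sample = pd.DataFrame(
    [[4680, "2004-12-31", "$200"]],
    columns=["Show Number", " Air Date", " Value"],
)

# Strip leading/trailing whitespace from every column name and assign back.
sample.columns = sample.columns.str.strip()

print(list(sample.columns))
```

The same assignment should work on jeopardy.columns once the real file has been read with pandas.read_csv.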

 

2: Normalizing Text

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you lowercase words and remove punctuation so Don't and don't aren't considered to be different words when you compare them.

Instructions

  • Write a function to normalize questions and answers. It should:
    • Take in a string.
    • Convert the string to lowercase.
    • Remove all punctuation in the string.
    • Return the string.
  • Normalize the Question column.
    • Use the Pandas apply method to apply the function to each item in the Question column.
    • Assign the result to the clean_question column.
  • Normalize the Answer column.
    • Use the Pandas apply method to apply the function to each item in the Answer column.
    • Assign the result to the clean_answer column.
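
As a quick illustration of the idea, here is one possible normalization function, sketched under the assumption that "punctuation" means anything other than letters, digits, and whitespace. After normalization, Don't and don't compare as equal.

```python
import re

def normalize_text(text):
    """Lowercase a string and drop everything except letters, digits, and whitespace."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)

print(normalize_text("Don't"))  # dont
print(normalize_text("don't"))  # dont
```

On the Dataframe, the function would be applied with jeopardy["Question"].apply(normalize_text).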

3: Normalizing Columns

Now that you've normalized the text columns, there are also some other columns to normalize.

The Value column should also be numeric, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.

Instructions

  • Write a function to normalize dollar values. It should:
    • Take in a string.
    • Remove any punctuation in the string.
    • Convert the string to an integer.
    • If the conversion has an error, assign 0 instead.
    • Return the integer.
  • Normalize the Value column.
    • Use the Pandas apply method to apply the function to each item in the Value column.
    • Assign the result to the clean_value column.
  • Use the pandas.to_datetime function to convert the Air Date column to a datetime column.
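
A minimal sketch of both conversions, assuming a small made-up Dataframe (sample) that mimics the Value and Air Date formats in the dataset. The helper name normalize_value is hypothetical, and for brevity it only strips the dollar sign and thousands separator rather than all punctuation.

```python
import pandas as pd

# Hypothetical sample mirroring the formats seen in the dataset.
sample = pd.DataFrame({
    "Value": ["$200", "$2,000", "None"],
    "Air Date": ["2004-12-31", "2004-12-31", "2009-05-14"],
})

def normalize_value(text):
    """Strip '$' and ',' and convert to int; fall back to 0 when conversion fails."""
    cleaned = text.replace("$", "").replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return 0

sample["clean_value"] = sample["Value"].apply(normalize_value)
sample["Air Date"] = pd.to_datetime(sample["Air Date"])

print(sample["clean_value"].tolist())  # [200, 2000, 0]
print(sample["Air Date"].dtype)
```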

4: Answers In Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

  • How often the answer is deducible from the question.
  • How often new questions are repeats of older questions.

You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

Instructions

  • Write a function that takes in a row in jeopardy, as a Series. It should:
    • Split the clean_answer column on the space character (" "), and assign to the variable split_answer.
    • Split the clean_question column on the space character (" "), and assign to the variable split_question.
    • Create a variable called match_count, and set it to 0.
    • If "the" is in split_answer, remove it using the remove method on lists. "The" is commonly found in answers and questions, but doesn't have any meaningful use in finding the answer.
    • If the length of split_answer is 0, return 0. This prevents a division by zero error later.
    • Loop through each item in split_answer, and see if it occurs in split_question. If it does, add 1 to match_count.
    • Divide match_count by the length of split_answer, and return the result.
  • Count how many times terms in clean_answer occur in clean_question.
    • Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.
    • Pass the axis=1 argument to apply the function across each row.
    • Assign the result to the answer_in_question column.
  • Find the mean of the answer_in_question column using the mean method on Series.
  • Write up a markdown cell with a short explanation of how finding this mean might influence your studying strategy for Jeopardy.
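
The steps above can be sketched as a row-wise function. This is an illustrative version only: the function name count_matches and the two-row sample Dataframe are made up, and the sample text is assumed to be normalized already.

```python
import pandas as pd

def count_matches(row):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical sample with already-normalized text.
sample = pd.DataFrame({
    "clean_question": ["this uv index began in 1994", "cows regurgitate this"],
    "clean_answer": ["the uv index", "the cud"],
})
sample["answer_in_question"] = sample.apply(count_matches, axis=1)

print(sample["answer_in_question"].mean())  # 0.5
```

In the first sample row both remaining answer words appear in the question (a score of 1.0), and in the second none do (0.0), so the mean is 0.5.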

 

5: Recycled Questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:

  • Sort jeopardy in order of ascending air date.
  • Maintain a set called terms_used that will be empty initially.
  • Iterate through each row of jeopardy.
  • Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    • If it does, increment a counter.
    • Add each word to terms_used.

This will enable you to check if the terms in questions have been used previously or not. Only looking at words with 6 or more characters enables you to filter out words like the and than, which are commonly used, but don't tell you a lot about a question.

Instructions

  • Create an empty list called question_overlap.
  • Create an empty set called terms_used.
  • Use the iterrows Dataframe method to loop through each row of jeopardy.
    • Split the clean_question column of the row on the space character (" "), and assign to split_question.
    • Remove any words in split_question that are less than 6 characters long.
    • Set match_count to 0.
    • Loop through each word in split_question.
      • If the term occurs in terms_used, add 1 to match_count.
    • Add each word in split_question to terms_used using the add method on sets.
    • If the length of split_question is greater than 0, divide match_count by the length of split_question.
    • Append match_count to question_overlap.
  • Assign question_overlap to the question_overlap column of jeopardy.
  • Find the mean of the question_overlap column and print it.
  • Look at the value, and think about what this might mean for questions being recycled. Write up your thoughts in a markdown cell.
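
A minimal sketch of the loop described above, run on a made-up three-row Dataframe (sample) that is assumed to be sorted by air date and normalized already.

```python
import pandas as pd

# Hypothetical sample, assumed sorted by ascending air date.
sample = pd.DataFrame({
    "clean_question": [
        "this african country borders morocco",
        "the capital of this country is rabat",
        "morocco and algeria share this desert",
    ]
})

question_overlap = []
terms_used = set()

for _, row in sample.iterrows():
    # Keep only words with at least 6 characters.
    split_question = [w for w in row["clean_question"].split(" ") if len(w) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # Record this question's words only after counting its matches.
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

sample["question_overlap"] = question_overlap
print(sample["question_overlap"].mean())
```

The first question has no earlier terms to match, the second reuses country (1 of 2 long words), and the third reuses morocco (1 of 3).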

 

6: Low Value Vs High Value Questions

Let's say you only want to study high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

  • Low value -- Any row where Value is 800 or less.
  • High value -- Any row where Value is greater than 800.

You'll then be able to loop through each of the terms from the last screen, terms_used, and:

  • Find the number of low value questions the word occurs in.
  • Find the number of high value questions the word occurs in.
  • Find the percentage of questions the word occurs in.
  • Based on the percentage of questions the word occurs in, find expected counts.
  • Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

Instructions

  • Create a function that takes in a row from a Dataframe, and:
    • If the clean_value column is greater than 800, assign 1 to value.
    • Otherwise, assign 0 to value.
    • Return value.
  • Determine which questions are high and low value.
    • Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.
    • Pass the axis=1 argument to apply the function across each row.
    • Assign the result to the high_value column.
  • Create a function that takes in a word, and:
    • Assigns 0 to low_count.
    • Assigns 0 to high_count.
    • Loops through each row in jeopardy using the iterrows method.
      • Split the clean_question column on the space character (" ").
      • If the word is in the split question:
        • If the high_value column is 1, add 1 to high_count.
        • Else, add 1 to low_count.
    • Returns high_count and low_count. You can return multiple values by separating them with a comma.
  • Create an empty list called observed_expected.
  • Convert terms_used into a list using the list function, and assign the first 5 elements to comparison_terms.
  • Loop through each term in comparison_terms, and:
    • Run the function on the term to get the high value and low value counts.
    • Append the result of running the function (which will be a tuple of the two counts) to observed_expected.
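
The two functions described above might look like the following sketch. The names determine_value and count_usage, and the four-row sample Dataframe, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with clean_value and clean_question already in place.
sample = pd.DataFrame({
    "clean_value": [200, 800, 1000, 2000],
    "clean_question": [
        "this desert covers much of africa",
        "a famous desert in asia",
        "this desert is the largest on earth",
        "this river flows through egypt",
    ],
})

def determine_value(row):
    """Return 1 for questions worth more than 800, otherwise 0."""
    return 1 if row["clean_value"] > 800 else 0

sample["high_value"] = sample.apply(determine_value, axis=1)

def count_usage(word):
    """Count how many high and low value questions contain the word."""
    low_count = 0
    high_count = 0
    for _, row in sample.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

print(count_usage("desert"))  # (1, 2)
```

Note that a question worth exactly 800 falls into the low value group here, because the comparison is strictly greater than 800.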

7: Applying The Chi-Squared Test

Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.

Instructions

  • Find the number of rows in jeopardy where high_value is 1, and assign to high_value_count.
  • Find the number of rows in jeopardy where high_value is 0, and assign to low_value_count.
  • Create an empty list called chi_squared.
  • Loop through each item in observed_expected.
    • Add up both values in the item (high and low counts) to get the total count, and assign to total.
    • Divide total by the number of rows in jeopardy to get the proportion across the dataset. Assign to total_prop.
    • Multiply total_prop by high_value_count to get the expected term count for high value rows.
    • Multiply total_prop by low_value_count to get the expected term count for low value rows.
    • Use the scipy.stats.chisquare function to compute the chi-squared value and p-value given the expected and observed counts.
    • Append the results to chi_squared.
  • Look over the chi-squared values and the associated p-values. Are there any statistically significant results? Write up your thoughts in a markdown cell.
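
To make the arithmetic concrete, here is a plain-Python sketch of the calculation for a single term. All counts are made up for illustration. With two categories the test has one degree of freedom, so the p-value can be computed with math.erfc; scipy.stats.chisquare(observed, expected) should report the same statistic and p-value.

```python
import math

# Made-up counts for illustration only.
high_value_count = 6000   # rows where high_value is 1
low_value_count = 14000   # rows where high_value is 0
observed = (5, 5)         # (high_count, low_count) for one term

total = sum(observed)
total_prop = total / (high_value_count + low_value_count)
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With one degree of freedom, the upper-tail probability of the
# chi-squared distribution reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 3), round(p_value, 3))
```

Here the expected counts are 3 and 7, so the statistic is 4/3 + 4/7. With observed counts this small the chi-squared approximation is unreliable, which is worth keeping in mind when reading the p-values for rare terms.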

 

8: Next Steps

That's it for the guided steps! We recommend exploring the data more on your own.

Here are some potential next steps:

  • Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    • Manually create a list of words to remove, like the, than, etc.
    • Find a list of stopwords to remove.
    • Remove words that occur in more than a certain percentage (like 5%) of questions.
  • Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    • Use the apply method to make the code that calculates frequencies more efficient.
    • Only select terms that have high frequencies across the dataset, and ignore the others.
  • Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    • See which categories appear the most often.
    • Find the probability of each category appearing in each round.
  • Use the whole Jeopardy dataset (available here) instead of the subset we used in this mission.
  • Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
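
As one example of the first idea, the sketch below removes words that occur in more than a given share of questions. The question list and the max_share threshold are made up; on the real data a much lower threshold, such as the 5% mentioned above, would be a more sensible starting point.

```python
from collections import Counter

# Hypothetical normalized questions.
questions = [
    "this desert covers much of africa",
    "this river flows through egypt",
    "this author wrote hamlet",
    "a famous desert in asia",
]

# Count the number of questions each word appears in (document frequency).
doc_freq = Counter()
for question in questions:
    doc_freq.update(set(question.split(" ")))

max_share = 0.5  # assumed threshold, chosen only to suit this tiny sample
too_common = {w for w, n in doc_freq.items() if n / len(questions) > max_share}

print(too_common)  # {'this'}
```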

We recommend creating a Github repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on Github, you'll have the beginnings of a strong portfolio.

You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.

We hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work!

 

In [23]:

import pandas
import csv

jeopardy = pandas.read_csv("jeopardy.csv")

jeopardy

Out[23]:

 | Show Number | Air Date | Round | Category | Value | Question | Answer
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams
5 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $200 | In the title of an Aesop fable, this insect sh... | the ant
6 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $400 | Built in 312 B.C. to link Rome & the South of ... | the Appian Way
7 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $400 | No. 8: 30 steals for the Birmingham Barons; 2,... | Michael Jordan
8 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $400 | In the winter of 1971-72, a record 1,122 inche... | Washington
9 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $400 | This housewares store was named for the packag... | Crate & Barrel
10 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $400 | "And away we go" | Jackie Gleason
11 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $400 | Cows regurgitate this from the first stomach t... | the cud
12 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $600 | In 1000 Rajaraja I of the Cholas battled to ta... | Ceylon (or Sri Lanka)
13 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $600 | No. 1: Lettered in hoops, football & lacrosse ... | Jim Brown
14 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $600 | On June 28, 1994 the nat'l weather service beg... | the UV index
15 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $600 | This company's Accutron watch, introduced in 1... | Bulova
16 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $600 | Outlaw: "Murdered by a traitor and a coward wh... | Jesse James
17 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $600 | A small demon, or a mischievous child (who mig... | imp
18 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $800 | Karl led the first of these Marxist organizati... | the International
19 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $800 | No. 10: FB/LB for Columbia U. in the 1920s; MV... | (Lou) Gehrig
20 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $800 | Africa's lowest temperature was 11 degrees bel... | Morocco
21 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $800 | Edward Teller & this man partnered in 1898 to ... | (Paul) Bonwit
22 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $2,000 | 1939 Oscar winner: "...you are a credit to you... | Hattie McDaniel (for her role in Gone with the...
23 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $800 | In geologic time one of these, shorter than an... | era
24 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $1000 | This Asian political party was founded in 1885... | the Congress Party
25 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $1000 | No. 5: Only center to lead the NBA in assists;... | (Wilt) Chamberlain
26 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $1000 | The Kirschner brothers, Don & Bill, named this... | K2
27 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $1000 | Revolutionary War hero: "His spirit is in Verm... | Ethan Allen
28 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $1000 | A single layer of paper, or to perform one's c... | ply
29 | 4680 | 2004-12-31 | Double Jeopardy! | DR. SEUSS AT THE MULTIPLEX | $400 | <a href="http://www.j-archive.com/media/2004-1... | Horton
... | ... | ... | ... | ... | ... | ... | ...
19969 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1200 | In 1960 the last of these locomotives was reti... | steam engines
19970 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1200 | Kate: "if I be waspish, best beware my sting";... | Petruchio
19971 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1,500 | This private college in Northern California bo... | Stanford University
19972 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1200 | She voiced Princess Pea in "The Tale of Desper... | Emma Watson
19973 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1200 | It's the name of the long-awaited new White Ho... | Bo
19974 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1200 | Langdon in "Angels & Demons" is looking for <a... | an antimatter bomb
19975 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1600 | In the 1600s most of New York State was occupi... | the Iroquois
19976 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1600 | Marina's dad (need a hint? he rules Tyre) | Pericles
19977 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1600 | Presidential kids are welcome at this New Orle... | Tulane
19978 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1600 | She didn't vamp it up & did a bella job as Em ... | Kristen Stewart
19979 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1600 | Third syllable intoned by the giant who smells... | fo
19980 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1600 | Much of "Angels & Demons" takes place at one o... | a conclave
19981 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1,200 | In 1899 Secretary of State John Hay proclaimed... | open-door policy
19982 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $2000 | Fruity surname of Peter in "A Midsummer Night'... | Quince
19983 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $2000 | Quincy Jones, Kevin Eubanks & Branford Marsali... | Berklee
19984 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $2000 | In 2009 she returned to being "Fast & Furious"... | Michelle Rodriguez
19985 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $2000 | The book of Genesis says this ancient city "of... | Ur
19986 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $2000 | "Habakkuk and the Angel" is one of a series of... | Bernini
19987 | 5694 | 2009-05-14 | Final Jeopardy! | SCIENCE TERMS | None | In medieval England, it meant the smallest uni... | atom
19988 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $100 | This Texas city is the largest in the U.S. to ... | Houston (Lee Brown)
19989 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $100 | ...& the Crickets | Buddy Holly
19990 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $100 | In the 990s this son of Erik the Red brought C... | Leif Ericson
19991 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $100 | Concerning a failed Windows 98 demonstration, ... | Bill Gates
19992 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $100 | This llama product is used to make hats, blank... | Wool
19993 | 3582 | 2000-03-14 | Jeopardy! | DING DONG | $100 | In 1967 this company introduced its chocolate-... | Hostess
19994 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $200 | Of 8, 12 or 18, the number of U.S. states that... | 18
19995 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $200 | ...& the New Power Generation | Prince
19996 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $200 | In 1589 he was appointed professor of mathemat... | Galileo
19997 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $200 | Before the grand jury she said, "I'm really so... | Monica Lewinsky
19998 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $200 | Llamas are the heftiest South American members... | Camels

19999 rows × 7 columns

In [26]:

jeopardy.columns

Out[26]:

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [27]:

jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [31]:

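import re

def normalize_text(text):
    # Lowercase the text so "Don't" and "don't" compare as equal.
    text = text.lower()
    # Drop everything except letters, digits, and whitespace.
    # A raw string keeps \s from being treated as an invalid string escape.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip punctuation such as "$" and "," before converting.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        # Values that can't be converted (for example "None") become 0.
        text = 0
    return text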

In [40]:

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [41]:

jeopardy

Out[41]:

 | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200
5 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $200 | In the title of an Aesop fable, this insect sh... | the ant | in the title of an aesop fable this insect sha... | the ant | 200
6 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $400 | Built in 312 B.C. to link Rome & the South of ... | the Appian Way | built in 312 bc to link rome the south of ita... | the appian way | 400
7 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $400 | No. 8: 30 steals for the Birmingham Barons; 2,... | Michael Jordan | no 8 30 steals for the birmingham barons 2306 ... | michael jordan | 400
8 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $400 | In the winter of 1971-72, a record 1,122 inche... | Washington | in the winter of 197172 a record 1122 inches o... | washington | 400
9 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $400 | This housewares store was named for the packag... | Crate & Barrel | this housewares store was named for the packag... | crate barrel | 400
10 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $400 | "And away we go" | Jackie Gleason | and away we go | jackie gleason | 400
11 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $400 | Cows regurgitate this from the first stomach t... | the cud | cows regurgitate this from the first stomach t... | the cud | 400
12 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $600 | In 1000 Rajaraja I of the Cholas battled to ta... | Ceylon (or Sri Lanka) | in 1000 rajaraja i of the cholas battled to ta... | ceylon or sri lanka | 600
13 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $600 | No. 1: Lettered in hoops, football & lacrosse ... | Jim Brown | no 1 lettered in hoops football lacrosse at s... | jim brown | 600
14 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $600 | On June 28, 1994 the nat'l weather service beg... | the UV index | on june 28 1994 the natl weather service began... | the uv index | 600
15 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $600 | This company's Accutron watch, introduced in 1... | Bulova | this companys accutron watch introduced in 196... | bulova | 600
16 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $600 | Outlaw: "Murdered by a traitor and a coward wh... | Jesse James | outlaw murdered by a traitor and a coward whos... | jesse james | 600
17 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $600 | A small demon, or a mischievous child (who mig... | imp | a small demon or a mischievous child who might... | imp | 600
18 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $800 | Karl led the first of these Marxist organizati... | the International | karl led the first of these marxist organizati... | the international | 800
19 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $800 | No. 10: FB/LB for Columbia U. in the 1920s; MV... | (Lou) Gehrig | no 10 fblb for columbia u in the 1920s mvp for... | lou gehrig | 800
20 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $800 | Africa's lowest temperature was 11 degrees bel... | Morocco | africas lowest temperature was 11 degrees belo... | morocco | 800
21 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $800 | Edward Teller & this man partnered in 1898 to ... | (Paul) Bonwit | edward teller this man partnered in 1898 to s... | paul bonwit | 800
22 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $2,000 | 1939 Oscar winner: "...you are a credit to you... | Hattie McDaniel (for her role in Gone with the... | 1939 oscar winner you are a credit to your cra... | hattie mcdaniel for her role in gone with the ... | 2000
23 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $800 | In geologic time one of these, shorter than an... | era | in geologic time one of these shorter than an ... | era | 800
24 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $1000 | This Asian political party was founded in 1885... | the Congress Party | this asian political party was founded in 1885... | the congress party | 1000
25 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $1000 | No. 5: Only center to lead the NBA in assists;... | (Wilt) Chamberlain | no 5 only center to lead the nba in assists tr... | wilt chamberlain | 1000
26 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $1000 | The Kirschner brothers, Don & Bill, named this... | K2 | the kirschner brothers don bill named this sk... | k2 | 1000
27 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $1000 | Revolutionary War hero: "His spirit is in Verm... | Ethan Allen | revolutionary war hero his spirit is in vermon... | ethan allen | 1000
28 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $1000 | A single layer of paper, or to perform one's c... | ply | a single layer of paper or to perform ones cra... | ply | 1000
29 | 4680 | 2004-12-31 | Double Jeopardy! | DR. SEUSS AT THE MULTIPLEX | $400 | <a href="http://www.j-archive.com/media/2004-1... | Horton | a hrefhttpwwwjarchivecommedia20041231dj23mp3be... | horton | 400
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
19969 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1200 | In 1960 the last of these locomotives was reti... | steam engines | in 1960 the last of these locomotives was reti... | steam engines | 1200
19970 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1200 | Kate: "if I be waspish, best beware my sting";... | Petruchio | kate if i be waspish best beware my sting his ... | petruchio | 1200
19971 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1,500 | This private college in Northern California bo... | Stanford University | this private college in northern california bo... | stanford university | 1500
19972 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1200 | She voiced Princess Pea in "The Tale of Desper... | Emma Watson | she voiced princess pea in the tale of despere... | emma watson | 1200
19973 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1200 | It's the name of the long-awaited new White Ho... | Bo | its the name of the longawaited new white hous... | bo | 1200
19974 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1200 | Langdon in "Angels & Demons" is looking for <a... | an antimatter bomb | langdon in angels demons is looking for a hre... | an antimatter bomb | 1200
19975 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1600 | In the 1600s most of New York State was occupi... | the Iroquois | in the 1600s most of new york state was occupi... | the iroquois | 1600
19976 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1600 | Marina's dad (need a hint? he rules Tyre) | Pericles | marinas dad need a hint he rules tyre | pericles | 1600
19977 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1600 | Presidential kids are welcome at this New Orle... | Tulane | presidential kids are welcome at this new orle... | tulane | 1600
19978 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1600 | She didn't vamp it up & did a bella job as Em ... | Kristen Stewart | she didnt vamp it up did a bella job as em in... | kristen stewart | 1600
19979 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1600 | Third syllable intoned by the giant who smells... | fo | third syllable intoned by the giant who smells... | fo | 1600
19980 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1600 | Much of "Angels & Demons" takes place at one o... | a conclave | much of angels demons takes place at one of a... | a conclave | 1600
19981 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1,200 | In 1899 Secretary of State John Hay proclaimed... | open-door policy | in 1899 secretary of state john hay proclaimed... | opendoor policy | 1200
19982 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $2000 | Fruity surname of Peter in "A Midsummer Night'... | Quince | fruity surname of peter in a midsummer nights ... | quince | 2000
19983 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $2000 | Quincy Jones, Kevin Eubanks & Branford Marsali... | Berklee | quincy jones kevin eubanks branford marsalis ... | berklee | 2000
19984 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $2000 | In 2009 she returned to being "Fast & Furious"... | Michelle Rodriguez | in 2009 she returned to being fast furious as... | michelle rodriguez | 2000
19985 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $2000 | The book of Genesis says this ancient city "of... | Ur | the book of genesis says this ancient city of ... | ur | 2000
19986 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $2000 | "Habakkuk and the Angel" is one of a series of... | Bernini | habakkuk and the angel is one of a series of a... | bernini | 2000
19987 | 5694 | 2009-05-14 | Final Jeopardy! | SCIENCE TERMS | None | In medieval England, it meant the smallest uni... | atom | in medieval england it meant the smallest unit... | atom | 0
19988 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $100 | This Texas city is the largest in the U.S. to ... | Houston (Lee Brown) | this texas city is the largest in the us to ha... | houston lee brown | 100
19989 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $100 | ...& the Crickets | Buddy Holly | the crickets | buddy holly | 100
19990 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $100 | In the 990s this son of Erik the Red brought C... | Leif Ericson | in the 990s this son of erik the red brought c... | leif ericson | 100
19991 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $100 | Concerning a failed Windows 98 demonstration, ... | Bill Gates | concerning a failed windows 98 demonstration h... | bill gates | 100
19992 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $100 | This llama product is used to make hats, blank... | Wool | this llama product is used to make hats blanke... | wool | 100
19993 | 3582 | 2000-03-14 | Jeopardy! | DING DONG | $100 | In 1967 this company introduced its chocolate-... | Hostess | in 1967 this company introduced its chocolatec... | hostess | 100
19994 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $200 | Of 8, 12 or 18, the number of U.S. states that... | 18 | of 8 12 or 18 the number of us states that tou... | 18 | 200
1999535822000-03-14Jeopardy!POP MUSIC PAIRINGS$200...& the New Power GenerationPrincethe new power generationprince200
1999635822000-03-14Jeopardy!HISTORIC PEOPLE$200In 1589 he was appointed professor of mathemat...Galileoin 1589 he was appointed professor of mathemat...galileo200
1999735822000-03-14Jeopardy!1998 QUOTATIONS$200</td> <td>Before the grand jury she said, "I'm really so...</td> <td>Monica Lewinsky</td> <td>before the grand jury she said im really sorry...</td> <td>monica lewinsky</td> <td>200</td> </tr> <tr> <th>19998</th> <td>3582</td> <td>2000-03-14</td> <td>Jeopardy!</td> <td>LLAMA-RAMA</td> <td>$200Llamas are the heftiest South American members...Camelsllamas are the heftiest south american members...camels200

19999 rows × 10 columns

In [36]:

jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])

In [38]:

jeopardy.dtypes

Out[38]:

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [51]:

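def count_matches(row):
    # Return the fraction of answer terms that also appear in the question.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" is too common to be a meaningful match; drop its first occurrence.
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)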

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [53]:

jeopardy["answer_in_question"].mean()

Out[53]:

0.060493257069335872

Answer terms in the question

On average, only about 6% of the terms in an answer also appear in its question. That is too low to count on working out the answer just from hearing the question. We'll probably have to study.
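To make the score concrete, here is a minimal sketch that applies the same matching logic to three made-up rows. The rows are hypothetical and written as plain dicts rather than DataFrame rows, so the numbers say nothing about the real dataset.

```python
def count_matches(row):
    # Same logic as the notebook cell, repeated so this sketch runs on its own.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)


# Hypothetical rows for illustration only.
toy_rows = [
    {"clean_question": "this texas city is the largest in the us",
     "clean_answer": "houston"},
    {"clean_question": "the open door policy was proclaimed in 1899",
     "clean_answer": "the open door policy"},
    {"clean_question": "this llama product is used to make hats",
     "clean_answer": "llama wool"},
]

scores = [count_matches(row) for row in toy_rows]
mean_score = sum(scores) / len(scores)
print(scores)      # [0.0, 1.0, 0.5]
print(mean_score)  # 0.5
```

Each score is a fraction between 0 and 1, so the mean over all rows is an average share of matching answer terms, not the share of questions that contain their full answer.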

In [54]:

question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

Out[54]:

0.69087373156719623

Question overlap

On average, about 70% of the terms in a question (counting only words of six or more characters) had already appeared in an earlier question. This result covers only a small part of the full dataset, and it compares single terms rather than phrases, so it is weak evidence on its own. Still, it suggests that the recycling of questions is worth investigating further.
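The overlap measure depends on the order in which questions are processed, which is easy to miss. The sketch below reimplements the same idea on three made-up questions. The function name `question_overlap_scores` and its `min_length` parameter are hypothetical and do not appear in the notebook.

```python
def question_overlap_scores(questions, min_length=6):
    """For each question, in order, return the share of its long words
    that already appeared in an earlier question."""
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words with at least min_length characters (len > 5 in the notebook).
        words = [w for w in question.split(" ") if len(w) >= min_length]
        seen_before = sum(1 for w in words if w in terms_used)
        # Record the words only after counting, so a question never matches itself.
        terms_used.update(words)
        scores.append(seen_before / len(words) if words else 0)
    return scores


# Hypothetical cleaned questions for illustration only.
toy_questions = [
    "this company introduced its chocolate snack",
    "this company makes a famous chocolate bar",
    "he was appointed professor of mathematics",
]

overlap = question_overlap_scores(toy_questions)
print(overlap)
```

The first question always scores 0 because nothing has been seen yet. The second scores 2/3, since "company" and "chocolate" are repeats while "famous" is new. Early rows therefore pull the mean down, and the figure depends on how the rows are ordered.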

In [62]:

def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [84]:

def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

Out[84]:

[(1, 2), (0, 1), (1, 0), (0, 1), (1, 1)]

In [86]:

from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    # obs is a (high_count, low_count) pair for one term.
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term's usage were proportional to the number
    # of high-value and low-value rows.
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

Out[86]:

[(0.031881167234403623, 0.85828871632352932),
 (0.40196284612688399, 0.52607729857054686),
 (2.4877921171956752, 0.11473257634454047),
 (0.40196284612688399, 0.52607729857054686),
 (0.44487748166127949, 0.50477764875459963)]

Chi-squared results

None of the terms showed a significant difference in usage between high-value and low-value rows (the smallest p-value is about 0.11). In addition, every observed frequency was below 5, so the chi-squared test is unreliable here. It would be better to rerun the test using only terms with higher frequencies.
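One possible way to follow up is sketched below, using only the standard library. It first keeps terms that occur in at least a minimum number of questions, then computes the chi-squared statistic by hand for a single term. The helper names (`frequent_terms`, `chi_squared_two_groups`), the toy questions, and the counts are all hypothetical. The p-value uses the fact that, with one degree of freedom, the chi-squared tail probability equals `erfc(sqrt(stat / 2))`.

```python
import math
from collections import Counter


def frequent_terms(questions, min_count):
    """Return the terms that occur in at least min_count questions."""
    counts = Counter()
    for question in questions:
        # Count each term once per question, as count_usage does with `in`.
        counts.update(set(question.split(" ")))
    return sorted(term for term, n in counts.items() if n >= min_count)


def chi_squared_two_groups(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    share = (high_obs + low_obs) / (high_total + low_total)
    expected = [share * high_total, share * low_total]
    observed = [high_obs, low_obs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical cleaned questions for illustration only.
toy_questions = ["alpha beta", "alpha gamma", "alpha delta", "beta gamma"]
terms = frequent_terms(toy_questions, min_count=2)
print(terms)  # ['alpha', 'beta', 'gamma']

# Hypothetical counts: a term seen in 30 high-value and 20 low-value rows,
# out of 300 high-value and 700 low-value rows overall.
stat, p_value = chi_squared_two_groups(30, 20, 300, 700)
print(stat, p_value)
```

With these made-up counts the expected frequencies are 15 and 35, both comfortably above 5, which is the situation the test is better suited to.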

Reposted from: https://my.oschina.net/Bettyty/blog/750943
