Assign03: How to Break into Field

grinningGrace

Tags: python, pandas, programming language

Article link: https://blog.csdn.net/sinat_33418306/article/details/130901754

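This article explores survey results on how to break into the programming field. By analyzing the 'CousinEducation' column of the developer survey data, it shows the proportion of respondents behind each education suggestion. The data indicate that online courses and buying books are common learning methods. The article also stresses the importance of data cleaning and compares the education suggestions made by people with higher degrees against those made by everyone else.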

How to Break into Field

Now you have had a closer look at the data, and you saw how I approached looking at how the survey respondents think you should break into the field. Let’s recreate those results, as well as take a look at another question.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import HowToBreakIntoTheField as t
%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')
schema = pd.read_csv('./survey_results_schema.csv')
df.head()

Question 1

1. In order to understand how to break into the field, we will look at the CousinEducation field. Use the schema dataset to answer this question. Write a function called get_description that takes the schema dataframe and the column as a string, and returns a string of the description for that column.


def get_description(column_name, schema=schema):
    '''
    INPUT - schema - pandas dataframe with the schema of the developers survey
            column_name - string - the name of the column you would like to know about
    OUTPUT - 
            desc - string - the description of the column
    '''
    desc = list(schema[schema['Column'] == column_name]['Question'])[0]
    return desc

#test your code
#Check your function against solution - you shouldn't need to change any of the below code
get_description(df.columns[0]) # This should return a string of the first column description


#Check your function against solution - you shouldn't need to change any of the below code
descrips = set(get_description(col) for col in df.columns)
t.check_description(descrips)

The question we have been focused on has been around how to break into the field. Use your get_description function below to take a closer look at the CousinEducation column.

get_description('CousinEducation')
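
To make the lookup concrete, here is a minimal sketch on a made-up two-row schema. The `toy_schema` frame, its question texts, and the name `get_description_toy` are invented for illustration. Only the 'Column' and 'Question' column names come from the code above.

```python
import pandas as pd

# Hypothetical stand-in for survey_results_schema.csv (question texts are invented)
toy_schema = pd.DataFrame({
    'Column': ['Respondent', 'CousinEducation'],
    'Question': ['Respondent ID number', 'What would you recommend to your cousin?'],
})

def get_description_toy(column_name, schema=toy_schema):
    # Keep only the row whose 'Column' matches, then take its 'Question' text
    return schema.loc[schema['Column'] == column_name, 'Question'].iloc[0]

desc = get_description_toy('CousinEducation')
print(desc)
```

Using `.iloc[0]` on the filtered column does the same job as `list(...)[0]` in the function above without building an intermediate list.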

Question 2

2. Provide a pandas series of the different CousinEducation status values in the dataset. Store this pandas series in cous_ed_vals. If you are correct, you should see a bar chart of the proportion of individuals in each status. If it looks terrible, and you get no information from it, then you followed directions. However, we should clean this up!

cous_ed_vals = df.CousinEducation.value_counts()#Provide a pandas series of the counts for each CousinEducation status

cous_ed_vals # assure this looks right

# The below should be a bar chart of the proportion of individuals in your cous_ed_vals
# if it is set up correctly.

(cous_ed_vals/df.shape[0]).plot(kind="bar");
plt.title("Method of Educating Suggested");

We definitely need to clean this. Above is an example of what happens when you do not clean your data. Below I am using the same code you saw in the earlier video to take a look at the data after it has been cleaned.
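
The reason the raw chart is unreadable can be sketched on a few invented responses. This assumes each respondent could pick several options and that the picks are stored in one string joined by "; ". That separator is an assumption about the survey file, not something shown above.

```python
import pandas as pd

# Invented responses; the "; " separator is an assumption about the survey format
toy = pd.Series([
    "Take online courses; Bootcamp",
    "Take online courses",
    "Bootcamp; Take online courses",
    None,
])

# Raw counts treat every distinct combination (and ordering) as its own category
raw_counts = toy.value_counts()

# Splitting on the separator and exploding gives one row per selected option
option_counts = toy.dropna().str.split("; ").explode().value_counts()

print(raw_counts)
print(option_counts)
```

With only two real options, the raw counts already produce three bars, one per combination. Counting per option, which is what the cleaning step below does by substring search, recovers the two categories.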

possible_vals = ["Take online courses", "Buy books and work through the exercises", 
                 "None of these", "Part-time/evening courses", "Return to college",
                 "Contribute to open source", "Conferences/meet-ups", "Bootcamp",
                 "Get a job as a QA tester", "Participate in online coding competitions",
                 "Master's degree", "Participate in hackathons", "Other"]

def clean_and_plot(df, title='Method of Educating Suggested', plot=True):
    '''
    INPUT 
        df - a dataframe holding the CousinEducation column
        title - string the title of your plot
        plot - bool providing whether or not you want a plot back
        
    OUTPUT
        props_study_df - a dataframe with the proportion of suggestions for each method
        Displays a bar chart of those proportions when plot is True.
    '''
    study = df['CousinEducation'].value_counts().reset_index()
    # Name the columns directly: reset_index labels them differently across pandas versions
    study.columns = ['method', 'count']
    study_df = t.total_count(study, 'method', 'count', possible_vals)
    study_df.set_index('method', inplace=True)
    props_study_df = study_df/study_df.sum()
    if plot:
        props_study_df.plot(kind='bar', legend=None);
        plt.title(title);
        plt.show()
    return props_study_df
    
props_df = clean_and_plot(df)

The helper total_count(df, col1, col2, look_for), called above as t.total_count, is reproduced below:


from collections import defaultdict

def total_count(df, col1, col2, look_for):
    '''
    INPUT:
    df - the pandas dataframe you want to search
    col1 - the column name you want to look through
    col2 - the column you want to count values from
    look_for - a list of strings you want to search for in each row of df[col1]

    OUTPUT:
    new_df - a dataframe of each look_for with the count of how often it shows up
    '''
    new_df = defaultdict(int)
    #loop through list of ed types
    for val in look_for:
        #loop through rows
        for idx in range(df.shape[0]):
            #if the ed type is in the row, add that row's count
            if val in df[col1][idx]:
                new_df[val] += int(df[col2][idx])
    new_df = pd.DataFrame(pd.Series(new_df)).reset_index()
    new_df.columns = [col1, col2]
    #sort on the count column, whatever name was passed in as col2
    new_df.sort_values(col2, ascending=False, inplace=True)
    return new_df
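
For comparison, the same totals can be computed without the nested loops. The sketch below is one possible vectorized variant, not the course's helper. The name `total_count_vectorized` and the toy counts are invented for illustration. One deliberate difference: options that never match appear with a count of 0 instead of being omitted.

```python
import pandas as pd

# Invented counts in the same shape that clean_and_plot passes to total_count
study = pd.DataFrame({
    'method': ["Take online courses; Bootcamp", "Take online courses", "Bootcamp"],
    'count': [10, 25, 5],
})
look_for = ["Take online courses", "Bootcamp", "Return to college"]

def total_count_vectorized(df, col1, col2, look_for):
    # For each option, sum col2 over the rows whose col1 text contains it.
    # regex=False makes this a plain substring test, like `val in text`.
    totals = {
        val: int(df.loc[df[col1].str.contains(val, regex=False), col2].sum())
        for val in look_for
    }
    out = pd.Series(totals).rename_axis(col1).reset_index(name=col2)
    return out.sort_values(col2, ascending=False).reset_index(drop=True)

result = total_count_vectorized(study, 'method', 'count', look_for)
print(result)
```

Because the match is a substring test in both versions, a combined answer contributes its count to every option it mentions, so the totals can add up to more than the number of respondents.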

Question 4

4. I wonder if some of the individuals might have bias towards their own degrees. Complete the function below that will apply to the elements of the FormalEducation column in df.

def higher_ed(formal_ed_str):
    '''
    INPUT
        formal_ed_str - a string of one of the values from the Formal Education column
    
    OUTPUT
        return 1 if the string is  in ("Master's degree", "Doctoral", "Professional degree")
        return 0 otherwise
    
    '''
    if formal_ed_str in ("Master's degree", "Doctoral", "Professional degree"):
        return 1
    else:
        return 0
    

df["FormalEducation"].apply(higher_ed)[:5] #Test your function to assure it provides 1 and 0 values for the df

# Check your code here
df['HigherEd'] = df["FormalEducation"].apply(higher_ed)
higher_ed_perc = df['HigherEd'].mean()
t.higher_ed_test(higher_ed_perc)
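
The same 1/0 flag can also be built without `apply`. The following is a sketch on invented values, using the three labels from `higher_ed` above. `HIGHER_ED_LEVELS` and `formal_ed` are illustrative names, not part of the notebook.

```python
import pandas as pd

# The three labels checked by higher_ed above
HIGHER_ED_LEVELS = ["Master's degree", "Doctoral", "Professional degree"]

# Invented values standing in for df['FormalEducation']
formal_ed = pd.Series(["Master's degree", "Bachelor's degree", None, "Professional degree"])

# isin returns booleans (False for missing values); astype(int) maps them to 1/0
higher_ed_flags = formal_ed.isin(HIGHER_ED_LEVELS).astype(int)

print(higher_ed_flags.tolist())
print(higher_ed_flags.mean())
```

Taking the mean of a 1/0 column gives the share of rows flagged 1, which is why `df['HigherEd'].mean()` above is the proportion of respondents with higher formal education.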

Question 5

5. Now we would like to find out whether individuals who completed one of these three programs feel differently than those who did not. Store a dataframe of only the individuals who had HigherEd equal to 1 in ed_1. Similarly, store a dataframe of only the HigherEd equal to 0 values in ed_0.

Notice, you have already created the HigherEd column using the check code portion above, so here you only need to subset the dataframe using this newly created column.

ed_1 = df[df['HigherEd'] == 1] # Subset df to only those with HigherEd of 1
ed_0 = df[df['HigherEd'] == 0] # Subset df to only those with HigherEd of 0


print(ed_1['HigherEd'][:5]) #Assure it looks like what you would expect
print(ed_0['HigherEd'][:5]) #Assure it looks like what you would expect

#Check your subset is correct - you should get a plot that was created using pandas styling
#which you can learn more about here: https://pandas.pydata.org/pandas-docs/stable/style.html

ed_1_perc = clean_and_plot(ed_1, 'Higher Formal Education', plot=False)
ed_0_perc = clean_and_plot(ed_0, 'Max of Bachelors Higher Ed', plot=False)

comp_df = pd.merge(ed_1_perc, ed_0_perc, left_index=True, right_index=True)
comp_df.columns = ['ed_1_perc', 'ed_0_perc']
comp_df['Diff_HigherEd_Vals'] = comp_df['ed_1_perc'] - comp_df['ed_0_perc']
comp_df.style.bar(subset=['Diff_HigherEd_Vals'], align='mid', color=['#d65f5f', '#5fba7d'])
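
To see how the comparison frame should be read, here is a small sketch with invented proportions. The `ed_1_toy` and `ed_0_toy` frames mimic the shape that `clean_and_plot` returns (a single 'count' column indexed by method). The numbers are made up and are not survey results.

```python
import pandas as pd

methods = pd.Index(["Take online courses", "Master's degree", "Bootcamp"], name='method')

# Invented proportions for the two groups (each column sums to 1)
ed_1_toy = pd.DataFrame({'count': [0.5, 0.3, 0.2]}, index=methods)
ed_0_toy = pd.DataFrame({'count': [0.6, 0.1, 0.3]}, index=methods)

# Join on the shared method index; the clashing 'count' columns are then renamed
comp = pd.merge(ed_1_toy, ed_0_toy, left_index=True, right_index=True)
comp.columns = ['ed_1_perc', 'ed_0_perc']

# Positive values: the higher-ed group suggests the method more often
comp['Diff_HigherEd_Vals'] = comp['ed_1_perc'] - comp['ed_0_perc']
print(comp)
```

Since both columns are proportions that sum to 1, the differences sum to roughly 0: a method that one group favors more must be offset by methods it favors less.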

Question 6

6. What can you conclude from the above plot? Change the dictionary to mark True for the keys of any statements you can conclude, and False for any of the statements you cannot conclude.

sol = {'Everyone should get a higher level of formal education': False, 
       'Regardless of formal education, online courses are the top suggested form of education': True,
       'There is less than a 1% difference between suggestions of the two groups for all forms of education': False,
       'Those with higher formal education suggest it more than those who do not have it': True}

t.conclusions(sol)