Think Stats Reading Notes (Chapter 1 Exercises)

农夫山泉是糖水

Article link: https://blog.csdn.net/x1355399155/article/details/105275370

Exercises

Exercise 1-2

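This exercise requires downloading data from the NSFG; the rest of the book uses this data. Open http://thinkstats.com/nsfg.html, read the terms of use for the data, and click "I accept these terms" (assuming that you do).
Download two files, 2002FemResp.dat.gz and 2002FemPreg.dat.gz. The first is the respondent file, with one line per respondent and 7643 female respondents in total. The second contains information about each respondent's pregnancies.
Online documentation of the survey is at http://www.icpsr.umich.edu/nsfg6. Browse the sections of the survey in the left navigation bar to get a rough idea of what is there. You can also read the questionnaires at http://cdc.gov/nchs/data/nsfg/nsfg_2002_questionnaires.htm.
The book's companion website provides code for processing the NSFG data files. Download it from http://thinkstats.com/survey.py and run it in the directory that contains the data files.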
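
Before running survey.py, it can help to confirm that both downloads are intact. The following is only a minimal sketch, assuming the two files sit in the current directory; the helper name count_records is made up for illustration and is not part of survey.py. It counts the lines in a gzip-compressed file, which should match the number of records.

```python
import gzip


def count_records(path):
    """Count the lines in a gzip-compressed file (one record per line)."""
    # Binary mode is enough here because we only count line breaks.
    with gzip.open(path, "rb") as fp:
        return sum(1 for _ in fp)


if __name__ == "__main__":
    # File names are the ones given in the exercise text.
    for name in ("2002FemResp.dat.gz", "2002FemPreg.dat.gz"):
        try:
            print(name, count_records(name))
        except FileNotFoundError:
            print(name, "not found in the current directory")
```

According to the exercise text, the respondent file should contain 7643 records, so a different count for 2002FemResp.dat.gz suggests an incomplete download.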

Code for processing the NSFG data files

import sys
import gzip
import os

class Record(object):
    """Represents a record."""

class Respondent(Record): 
    """Represents a respondent."""

class Pregnancy(Record):
    """Represents a pregnancy."""

class Table(object):
    """Represents a table as a list of objects"""

    def __init__(self):
        self.records = []
        
    def __len__(self):
        return len(self.records)

    def ReadFile(self, data_dir, filename, fields, constructor, n=None):
        """Reads a compressed data file and builds one object per record.

        Args:
            data_dir: string directory name
            filename: string name of the file to read

            fields: sequence of (name, start, end, cast) tuples specifying 
            the fields to extract

            constructor: what kind of object to create

            n: optional maximum number of records to read (None reads all)
        """
        filename = os.path.join(data_dir, filename)

        if filename.endswith('gz'):
            fp = gzip.open(filename)
        else:
            fp = open(filename)

        for i, line in enumerate(fp):
            if i == n:
                break
            record = self.MakeRecord(line, fields, constructor)
            self.AddRecord(record)
        fp.close()

    def MakeRecord(self, line, fields, constructor):
        """Scans a line and returns an object with the appropriate fields.

        Args:
            line: string line from a data file

            fields: sequence of (name, start, end, cast) tuples specifying 
            the fields to extract

            constructor: callable that makes an object for the record.

        Returns:
            Record with appropriate fields.
        """
        obj = constructor()
        for (field, start, end, cast) in fields:
            try:
                s = line[start-1:end]
                val = cast(s)
            except ValueError:
                # If you are using Visual Studio, you might see an
                # "error" at this point, but it is not really an error;
                # I am just using try...except to handle not-available (NA)
                # data.  You should be able to tell Visual Studio to
                # ignore this non-error.
                val = 'NA'
            setattr(obj, field, val)
        return obj

    def AddRecord(self, record):
        """Adds a record to this table.

        Args:
            record: an object of one of the record types.
        """
        self.records.append(record)

    def ExtendRecords(self, records):
        """Adds records to this table.

        Args:
            records: a sequence of record objects
        """
        self.records.extend(records)

    def Recode(self):
        """Child classes can override this to recode values."""
        pass


class Respondents(Table):
    """Represents the respondent table."""

    def ReadRecords(self, data_dir='.', n=None):
        filename = self.GetFilename()
        self.ReadFile(data_dir, filename, self.GetFields(), Respondent, n)
        self.Recode()

    def GetFilename(self):
        return '2002FemResp.dat.gz'

    def GetFields(self):
        """Returns a list of tuples specifying the fields to extract.

        The elements of each tuple are field, start, end, cast.

                field is the name of the variable
                start and end are the indices as specified in the NSFG docs
                cast is a callable that converts the result to int, float, etc.
        """
        return [
            ('caseid', 1, 12, int),
            ]

class Pregnancies(Table):
    """Contains survey data about a Pregnancy."""

    def ReadRecords(self, data_dir='.', n=None):
        filename = self.GetFilename()
        self.ReadFile(data_dir, filename, self.GetFields(), Pregnancy, n)
        self.Recode()

    def GetFilename(self):
        return '2002FemPreg.dat.gz'

    def GetFields(self):
        """Gets information about the fields to extract from the survey data.

        Documentation of the fields for Cycle 6 is at
        http://nsfg.icpsr.umich.edu/cocoon/WebDocs/NSFG/public/index.htm

        Returns:
            sequence of (name, start, end, type) tuples
        """
        return [
            ('caseid', 1, 12, int),
            ('nbrnaliv', 22, 22, int),
            ('babysex', 56, 56, int),
            ('birthwgt_lb', 57, 58, int),
            ('birthwgt_oz', 59, 60, int),
            ('prglength', 275, 276, int),
            ('outcome', 277, 277, int),
            ('birthord', 278, 279, int),
            ('agepreg', 284, 287, int),
            ('finalwgt', 423, 440, float),
            ]

    def Recode(self):
        for rec in self.records:

            # divide mother's age by 100
            try:
                if rec.agepreg != 'NA':
                    rec.agepreg /= 100.0
            except AttributeError:
                pass

            # convert weight at birth from lbs/oz to total ounces
            # note: there are some very low birthweights
            # that are almost certainly errors, but for now I am not
            # filtering
            try:
                if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
                    rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
                    rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
                else:
                    rec.totalwgt_oz = 'NA'
            except AttributeError:
                pass


def main(name, data_dir='.'):
    resp = Respondents()
    resp.ReadRecords(data_dir)
    print('Number of respondents', len(resp.records))

    preg = Pregnancies()
    preg.ReadRecords(data_dir)
    print('Number of pregnancies', len(preg.records))

    
if __name__ == '__main__':
    main(*sys.argv)

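Each line in the respondent file represents one respondent. This line of information is called a record, the variables that make up a record are called fields, and a collection of records makes up a table.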
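If you read the code in survey.py, you will see the definitions of two classes, Record and Table: the former is an object that represents a record, and the latter is an object that represents a table.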
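Record has two subclasses, Respondent and Pregnancy, which hold the records for respondents and pregnancies respectively. For now these classes are empty; in particular, they have no __init__ method to initialize their attributes. Instead, the Table.MakeRecord method converts a line of text into a Record object.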
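
To make the conversion concrete, here is a minimal standalone sketch of the same idea as Table.MakeRecord. The function name make_record and the two-field, 8-column layout are invented for illustration; the real column positions are the ones listed in GetFields. Each field is cut out of the line by its 1-based start and end columns, cast, and attached to an otherwise empty object, and a failed cast is stored as 'NA'.

```python
class Record(object):
    """Empty container; attributes are attached after construction."""


def make_record(line, fields, constructor=Record):
    """Build one object from a fixed-width line of text."""
    obj = constructor()
    for name, start, end, cast in fields:
        try:
            # Columns are 1-based and inclusive, hence start - 1.
            value = cast(line[start - 1:end])
        except ValueError:
            # Blank or malformed fields are treated as not available.
            value = 'NA'
        setattr(obj, name, value)
    return obj


# Hypothetical layout: columns 1-4 hold caseid, columns 5-8 hold weight.
fields = [('caseid', 1, 4, int), ('weight', 5, 8, int)]
rec = make_record('  12    ', fields)
print(rec.caseid, rec.weight)  # prints: 12 NA
```

The weight columns in the sample line are blank, so int() raises ValueError and the attribute is set to 'NA', which is how the listing handles missing data.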
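Table also has two subclasses, Respondents and Pregnancies. The book describes the __init__ methods of these classes as setting the default name of the data file and the type of record to create; in the version of survey.py listed above, GetFilename supplies the file name and ReadRecords passes the record type to ReadFile. Each Table object has a records attribute, which is a list of Record objects.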
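
Because records is an ordinary Python list, it can be processed with ordinary list and dict code. As one possible illustration, the sketch below groups pregnancy records by caseid, the field that appears in both tables. The function name group_by_caseid and the three sample records are made up; with real data you would pass in the records list of a Pregnancies table after ReadRecords has run.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_caseid(records):
    """Map each caseid to the list of records that share it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.caseid].append(rec)
    return dict(groups)


# Made-up stand-ins for Pregnancy objects, not real NSFG data.
records = [
    SimpleNamespace(caseid=1, prglength=39),
    SimpleNamespace(caseid=1, prglength=40),
    SimpleNamespace(caseid=2, prglength=38),
]
by_case = group_by_caseid(records)
print({caseid: len(recs) for caseid, recs in by_case.items()})
# prints: {1: 2, 2: 1}
```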
Each Table's GetFields method returns a list of tuples that specify the fields of a record; these fields become the attributes of the Record objects.
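
Once the fields are attributes, new values can be derived from them, which is what Pregnancies.Recode does in the listing. The sketch below restates that arithmetic for a single record as a standalone function; the name recode and the sample values are invented for illustration, while the thresholds (pounds below 20, ounces at most 16) are copied from the listing as they stand. agepreg is divided by 100, and the birth weight is converted to total ounces as pounds times 16 plus ounces.

```python
from types import SimpleNamespace


def recode(rec):
    """Derive agepreg in years and total birth weight in ounces."""
    if rec.agepreg != 'NA':
        # The raw field is stored as an integer, e.g. 2575 for 25.75.
        rec.agepreg = rec.agepreg / 100.0

    lb, oz = rec.birthwgt_lb, rec.birthwgt_oz
    # Same validity checks as the listing; 'NA' is tested first so that
    # a string is never compared with a number.
    if lb != 'NA' and lb < 20 and oz != 'NA' and oz <= 16:
        rec.totalwgt_oz = lb * 16 + oz
    else:
        rec.totalwgt_oz = 'NA'
    return rec


# Made-up sample record: 7 lb 8 oz is 7 * 16 + 8 = 120 oz.
rec = recode(SimpleNamespace(agepreg=2575, birthwgt_lb=7, birthwgt_oz=8))
print(rec.agepreg, rec.totalwgt_oz)  # prints: 25.75 120
```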
