Web Scraping with Python Study Notes 7

Chapter 7: Cleaning Your Dirty Data

Cleaning in Code

        First, a brief introduction to n-grams. The n-gram is a language model commonly used in large-vocabulary continuous speech recognition; applied to Chinese it is known as the Chinese Language Model (CLM). In natural language processing, a sentence is usually divided into small segments based on the word combinations it contains: a segment may consist of two words (a 2-gram), three words (a 3-gram), or even more.

from urllib import urlopen   # Python 2; in Python 3 this is urllib.request.urlopen
from bs4 import BeautifulSoup

def ngrams(input, n):
    # split the text on spaces and collect every run of n consecutive tokens
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, 'html.parser')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = ngrams(content, 2)
print(ngrams)
print("2-grams count is: " + str(len(ngrams)))

        Running the code produces the extract shown as Data 1; Data 2 is the extract obtained after adding the whitespace-removal code, as follows:

Data 1:
[u'Specification', u'Promise\nRevolution'], [u'Promise\nRevolution', u'OS\n\n\n\n\n\n\n\n\n\n\n\n'],
[u'OS\n\n\n\n\n\n\n\n\n\n\n\n', u'Book\n'], [u'Book\n', u'Category\n'], 
[u'Category\n', u'Commons\n'], [u'Commons\n', u'Portal\n\n\n\n\n\n\n\n\n\n\n\n']

Data 2:
[u'Python', u'(2'], [u'(2', u'ed.).'], [u'ed.).', u'ISBN\xa0978-0-9821060-1-3.'],
[u'ISBN\xa0978-0-9821060-1-3.', u'Retrieved'], [u'Retrieved', u'20'], 
[u'20', u'February'], [u'February', u'2014.\xa0\n^'],
[u'2014.\xa0\n^', u'van'], [u'van', u'Rossum,'], [u'Rossum,', u'Guido'], [u'Guido', u'(22'],
[u'(22', u'April'], [u'April', u'2009).'], [u'2009).', u'"Tail']

        In Data 1 we find a lot of unnecessary whitespace that has to be removed, and characters that do not fit the target encoding should be filtered out as well. Regular expressions are the natural tool here; add the following to the ngrams function (with import re added at the top of the script):

    content = re.sub('\n+', " ", content)        # collapse runs of newlines into a single space
    content = re.sub(' +', " ", content)         # collapse runs of spaces into a single space
    content = bytes(content, "UTF-8")            # encode as UTF-8 bytes (Python 3 syntax)
    content = content.decode("ascii", "ignore")  # drop any characters that are not ASCII
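        To see what these four lines do in isolation, here is a minimal standalone sketch (Python 3; the sample string is made up for illustration):

import re

sample = "Portal\n\n\n\nBook\n   Category   caf\u00e9"
sample = re.sub('\n+', " ", sample)
sample = re.sub(' +', " ", sample)
sample = bytes(sample, "UTF-8").decode("ascii", "ignore")
print(sample)   # "Portal Book Category caf" (whitespace collapsed, non-ASCII dropped)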

        Running again produces Data 2. There we find many unneeded numbers, and even single letters (other than "i" and "a") and special characters (string.punctuation); in the vast majority of data-cleaning work these should all be removed, and fortunately we can again do so in code. For extensibility, the dirty-data removal code is written separately from the ngrams code (this requires import re and import string at the top of the script):

def cleanInput(input):
    input = re.sub('\n+', " ", input)          # collapse newlines
    input = re.sub('\[[0-9]*\]', "", input)    # strip citation markers such as [1]
    input = re.sub(' +', " ", input)           # collapse repeated spaces
    # The two lines below are commented out because bytes() in Python 2 only accepts
    # one argument; the author clearly wrote this code for Python 3.
    #input = bytes(input, "UTF-8")
    #input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)  # strip leading/trailing punctuation
        # keep multi-character words, plus the single words "a" and "i"
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput


def ngrams(input, n):
    input = cleanInput(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output
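        As a rough check of the cleaning pipeline (assuming import re and import string at the top of the script; the sample text below is made up for illustration):

text = "Guido van Rossum[1] created Python!\n\n  It  is  a  joy."
print(cleanInput(text))
# ['Guido', 'van', 'Rossum', 'created', 'Python', 'It', 'is', 'a', 'joy']
print(ngrams(text, 2))
# [['Guido', 'van'], ['van', 'Rossum'], ['Rossum', 'created'], ...]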

Data Normalization

        Data normalization (also called standardization) is a basic step in data mining. Different evaluation metrics often come with different dimensions and units, which can distort the results of an analysis, so the data must be normalized to remove those dimensional effects and make the metrics comparable. This post does not go that deep; it only performs some simple normalization. For example, a phone number may be written "(555) 123-4567", "555.123.4567", "555-1234567", and so on, while the format we want is 5551234567, so every variant should be converted to that standard (a small sketch follows below). Likewise, many of the results from the "Cleaning in Code" section are duplicates: the 2-gram ['Software', 'Foundation'] alone appears as many as 40 times. Merging such duplicates both reduces storage pressure and makes the data easier to analyze.
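A minimal sketch of the phone-number case (not from the book; the helper name normalizePhone is made up) that simply strips every non-digit character:

import re

def normalizePhone(raw):
    # keep only the digits, e.g. "(555) 123-4567" -> "5551234567"
    return re.sub(r'\D', '', raw)

for raw in ["(555) 123-4567", "555.123.4567", "555-1234567"]:
    print(normalizePhone(raw))   # prints 5551234567 three times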
For the approach from the previous section, and because Python's dictionaries are unordered, we need an OrderedDict from Python's collections library so the n-gram counts can be kept in sorted order:

from collections import OrderedDict
...
ngrams = ngrams(content, 2)
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
print(ngrams)

        Partial results:

("['Software', 'Foundation']", 40), ("['Python', 'Software']", 38), ("['of', 'th
e']", 35), ("['Foundation', 'Retrieved']", 34), ("['of', 'Python']", 28), ("['in
', 'the']", 21), ("['van', 'Rossum']", 18)

        Beyond this there are some special cases to handle: among the 2-grams, "Python 1st" and "Python first" mean the same thing and should be merged, and single words such as "co-ordinated" and "coordinated" are likewise equivalent and should be treated as one word.
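One simple way to handle such merges, sketched here rather than taken from the book (the helper name normalizeToken and the SYNONYMS map are made up), is to normalize each token before the n-grams are counted:

SYNONYMS = {"1st": "first", "2nd": "second"}     # sample mapping, extend as needed

def normalizeToken(token):
    token = token.lower().replace("-", "")       # "co-ordinated" -> "coordinated"
    return SYNONYMS.get(token, token)            # "1st" -> "first"

print(normalizeToken("Co-ordinated"))   # coordinated
print(normalizeToken("1st"))            # first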

Cleaning After the Fact

        This section mainly introduces OpenRefine, an open-source tool for cleaning data; before using it you need to save your data in CSV format. To learn more, visit OpenRefine's GitHub page.

        Note: this post is a set of reading notes; the content comes essentially from Web Scraping with Python by Ryan Mitchell, with some code modified. Copyright belongs to the author; please credit the source when reposting.
