Crawling a school BBS with Python to tally the male/female ratio: data processing (Part 3)

This article covers the data-processing side of the project.

1. Data analysis

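[Image: 20151231162342489.jpg]

The crawl produced text files whose names start with a few fixed prefixes (the ones used later in this article are correct, errTime, notexist, unkownsex and httperror). These are the files we need to process.

[Image: 20151231162357493.jpg]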

2. Rollback: retrying the httperror records

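We need to reprocess the httperror data.

Because of how the code is written (see part 2 of this series), the same id can appear in several consecutive httperror records in a file: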

//httperror265001_266001.txt

265002 httperror

265002 httperror

265002 httperror

265002 httperror

265003 httperror

265003 httperror

265003 httperror

265003 httperror

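The code therefore has to account for this case: instead of processing the id on every line, it first checks whether the line repeats the previous id.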
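The duplicate check described above can be sketched on its own. This is a minimal illustration, not the author's code: the helper name unique_consecutive_ids is hypothetical, and it assumes each record is "id httperror" with the id as the first whitespace-separated field. It is written for Python 3 but avoids Python 3-only syntax.

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each id once per run of consecutive lines that share it."""
    # Skip blank lines, then take the first whitespace-separated field as the id.
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby only merges neighbours, which is exactly the "same id repeated
    # several times in a row" pattern seen in the httperror files.
    for user_id, _run in groupby(ids):
        yield int(user_id)


sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]
print(list(unique_consecutive_ids(sample)))
```

Because only adjacent duplicates are merged, an id that reappears later in the file would be yielded again; that matches the line-by-line comparison used in the article.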

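Java has buffered readers that avoid hitting the disk for every read. Python has an equivalent as well (the original post links to an article on this).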
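As a rough illustration of buffered reading in Python, the sketch below passes an explicit buffer size to the built-in open and then iterates over the file. The function name count_lines_buffered and the 64 KiB buffer size are assumptions chosen for the example, and the code targets Python 3.

```python
import os
import tempfile


def count_lines_buffered(path, buffer_size=64 * 1024):
    """Count lines while the file object reads the disk in large chunks."""
    # `buffering` sets the size of the in-memory buffer; iterating the file
    # then serves lines from that buffer instead of issuing one read per line.
    with open(path, "r", buffering=buffer_size) as handle:
        return sum(1 for _line in handle)


# Demo on a throwaway file so the sketch runs anywhere.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("265002 httperror\n265003 httperror\n")
print(count_lines_buffered(demo_path))
os.remove(demo_path)
```

Plain open(path) already uses a default buffer, so an explicit size is mainly useful for tuning.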

import os
import re
import sys

def main():
    # Python 2 only: make utf-8 the default codec for implicit str/unicode conversions
    reload(sys)
    sys.setdefaultencoding('utf-8')
    global sexRe,timeRe,notexistRe,url1,url2,file1,file2,file3,file4,startNum,endNum,file5
    sexRe = re.compile(u'em>\u6027\u522b(.*?)')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4(.*?)')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = 'ruisi\\correct_re.txt'
    file2 = 'ruisi\\errTime_re.txt'
    file3 = 'ruisi\\notexist_re.txt'
    file4 = 'ruisi\\unkownsex_re.txt'
    file5 = 'ruisi\\httperror_re.txt'
    # Walk every file in the folder whose name starts with "httperror"
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('httperror'):
            count = 0
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            oldLine = '0'
            with open(newName, 'r') as readFile:
                for line in readFile:
                    # newLine is compared with the previous line to detect a repeated id
                    newLine = line
                    if (newLine != oldLine):
                        nu = newLine.split()[0]
                        oldLine = newLine
                        count += 1
                        # searchWeb is not defined in this article; presumably it is
                        # the crawler function from part 2 of this series
                        searchWeb((int(nu),))
            print "%s deal %s lines" %(filename, count)

For simplicity, this code does not classify the retried httperror ids any further; the results are simply stored in the following five files:

file1 = 'ruisi\\correct_re.txt'

file2 = 'ruisi\\errTime_re.txt'

file3 = 'ruisi\\notexist_re.txt'

file4 = 'ruisi\\unkownsex_re.txt'

file5 = 'ruisi\\httperror_re.txt'

The output log shows how many httperror records were processed in total.

"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/reload.py

httperror132001-133001.txt deal 21 lines

httperror2001-3001.txt deal 4 lines

httperror251001-252001.txt deal 5 lines

httperror254001-255001.txt deal 1 lines

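3. Counting the unkownsex data in a single thread

The code is simple: a single thread counts the unkownsex users (those whose gender could not be read because of permission restrictions, or who never filled it in). We also checked and found that users without a gender have no activity time either.

The data format is as follows: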

253042 unkownsex

253087 unkownsex

253102 unkownsex

253118 unkownsex

253125 unkownsex

253136 unkownsex

253161 unkownsex

import os,time

sumCount = 0
startTime = time.clock()
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('unkownsex'):
        count = 0
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        # one record per line, so counting lines counts users
        with open(newName, 'r') as readFile:
            for line in readFile:
                count += 1
                sumCount += 1
        print "%s deal %s lines" %(filename, count)
print '%s unkowns sex' %(sumCount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Processing is fast. The output is:

unkownsex1-1001.txt deal 204 lines

unkownsex100001-101001.txt deal 50 lines

unkownsex10001-11001.txt deal 206 lines

#... intermediate output omitted

unkownsex99001-100001.txt deal 56 lines

unkownsex_re.txt deal 1085 lines

14223 unkowns sex

cost time 0.0813142301261 s

4. Counting the correct data in a single thread

The data format is as follows:

31024 男 2014-11-11 13:20

31283 男 2013-3-25 19:41

31340 保密 2015-2-2 15:17

31427 保密 2014-8-10 09:17

31475 保密 2013-7-2 08:59

31554 保密 2014-10-17 17:02

31621 男 2015-5-16 19:27

31872 保密 2015-1-11 16:49

31915 保密 2014-5-4 11:01

31997 保密 2015-5-16 20:14

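The code is below. The idea is to read the files line by line and use line.split() to extract the gender field. sumCount counts the total number of people; boycount, girlcount and secretcount count the male (男), female (女) and undisclosed (保密) users respectively. As before, the comparison is done against Unicode escapes rather than literal Chinese characters.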

import os,sys,time

# Python 2 only: with utf-8 as the default codec, the byte strings read from the
# file can be compared with the unicode literals below
reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('correct'):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        with open(newName, 'r') as readFile:
            for line in readFile:
                # fields: id, gender, date, time
                sexInfo = line.split()[1]
                sumCount +=1
                if sexInfo == u'\u7537' :          # male
                    boycount += 1
                elif sexInfo == u'\u5973':         # female
                    girlcount +=1
                elif sexInfo == u'\u4fdd\u5bc6':   # undisclosed
                    secretcount +=1
        # running totals up to and including this file
        print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
print "total is %s; %s boys; %s girls; %s secret;" %(sumCount, boycount,girlcount,secretcount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Note that each line of output is the running total up to that file, not the count for that single file. The output is:

until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;

until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;

#... omitted

until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;

until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;

total is 46885; 13937 boys; 4007 girls; 28941 secret;

cost time 3.60047888495 s
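The split-and-compare tally from this section can also be sketched with collections.Counter, which avoids one counter variable per category. This is an illustrative alternative under stated assumptions, not the author's code: the function name tally_gender is hypothetical, the input is a list of already decoded record lines, and malformed lines are skipped rather than raising IndexError.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def tally_gender(lines):
    """Count the second whitespace-separated field of each record line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line: nothing to count
        counts[fields[1]] += 1
    return counts


records = [
    u"31024 \u7537 2014-11-11 13:20",        # male
    u"31340 \u4fdd\u5bc6 2015-2-2 15:17",    # undisclosed
    u"31621 \u7537 2015-5-16 19:27",         # male
]
counts = tally_gender(records)
print(sum(counts.values()))
print(counts[u"\u7537"])
```

A Counter returns 0 for a key it has never seen, so a category that does not occur (here, female) needs no special handling.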

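5. Counting the data with multiple threads

To speed up the counting, we can try multiple threads.

For comparison, keep in mind the time the single-threaded version above needed.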

# encoding: UTF-8
import threading
import time,os,sys

# Global counters shared by all threads
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
NUM =0

# This class inherits from threading.Thread, overrides run(), and is started
# with start(), much like in Java
class StaFileList(threading.Thread):
    # list of text file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        # Adding a delay here makes the interleaving more visible, instead of
        # thread-1,2,3 simply running in order
        #time.sleep(1)
        # acquire takes the lock
        if mutex.acquire(1):
            try:
                self.staFiles(self.fileList)
            finally:
                # release frees the lock, even if staFiles raises
                mutex.release()

    # Process the given list of files and count male/female users.
    # Mind the data synchronization here: the counters are globals.
    def staFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
            with open(newName, 'r') as readFile:
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM +=1
                    if sexInfo == u'\u7537' :          # male
                        BOY += 1
                    elif sexInfo == u'\u5973':         # female
                        GIRL +=1
                    elif sexInfo == u'\u4fdd\u5bc6':   # undisclosed
                        SECRET +=1
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #     " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)

def test():
    # files collects the file names for one thread; the batch size below sets
    # how many files each thread handles
    files = []
    # keep every thread so the main thread can wait for all of them at the end
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('correct'):
            files.append(filename)
            i+=1
            # create a thread for every 20 files
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # the leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # The main thread waits for every child thread to exit. Without this the
    # totals could be printed before the threads have finished.
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" %(SUM, BOY,GIRL,SECRET)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Output:

Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;

cost time 0.132137192794 s

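We find that the running time is about the same as the single-threaded version. Thread synchronization is involved here: acquiring and releasing the lock costs time, and so does switching between threads (saving and restoring context). In addition, every thread holds the mutex for its entire run, so the threads effectively execute one after another.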
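One way to reduce the locking cost described above is to let each thread count into a private Counter and take the lock only for the final merge. The sketch below illustrates that idea under stated assumptions: the names count_chunk, totals and totals_lock are hypothetical, the gender labels "M", "F" and "secret" are ASCII stand-ins for the real values, and the chunks are in-memory lists rather than files.

```python
import threading
from collections import Counter

totals = Counter()
totals_lock = threading.Lock()


def count_chunk(lines):
    """Tally one chunk privately, then merge it into the shared totals."""
    local = Counter(line.split()[1] for line in lines if len(line.split()) > 1)
    # The lock guards only the short merge, not the whole counting loop.
    with totals_lock:
        totals.update(local)


chunks = [
    ["1 M 2014-11-11 13:20", "2 F 2013-3-25 19:41"],
    ["3 M 2015-2-2 15:17"],
    ["4 secret 2014-8-10 09:17", "5 M 2015-5-16 19:27"],
]
threads = [threading.Thread(target=count_chunk, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(totals.items()))
```

In CPython the global interpreter lock still prevents pure-Python counting from running on several cores at once, so a change like this mainly removes lock contention rather than guaranteeing a speedup.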

6. Single thread versus multiple threads on more data

This time we process the correct, errTime and unkownsex files together.

Single-threaded code

# coding=utf-8
import os,sys,time

reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
unkowncount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    # correct: has gender and activity time
    # errTime: has gender but no activity time
    # Both are tallied the same way, so one branch handles both prefixes.
    if filename.startswith(('correct', 'errTime')):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        with open(newName, 'r') as readFile:
            for line in readFile:
                sexInfo =line.split()[1]
                sumCount +=1
                if sexInfo == u'\u7537' :          # male
                    boycount += 1
                elif sexInfo == u'\u5973':         # female
                    girlcount +=1
                elif sexInfo == u'\u4fdd\u5bc6':   # undisclosed
                    secretcount +=1
        # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
    # unkownsex: neither gender nor time, so just count the lines
    elif filename.startswith("unkownsex"):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        # count = len(open(newName,'rU').readlines())
        # For large files use a loop instead. count starts at -1 to handle an
        # empty file: the final +1 then gives 0 lines.
        count = -1
        with open(newName, 'rU') as readFile:
            for count, line in enumerate(readFile):
                pass
        count += 1
        unkowncount += count
        sumCount += count
        # print "until %s, sum is %s unkownsex" %(filename, unkowncount)
print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" %(sumCount, boycount,girlcount,secretcount,unkowncount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

The output is:

Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;

cost time 1.37444645628 s
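The line-counting idiom used for the unkownsex files (start count at -1, loop with enumerate, add 1 afterwards) is easy to get wrong for empty files, so here it is isolated as a small sketch. The function name count_lines and the temporary files are assumptions made for the example, and the code targets Python 3.

```python
import os
import tempfile


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    count = -1  # stays -1 if the loop body never runs (empty file)
    with open(path, "r") as handle:
        for count, _line in enumerate(handle):
            pass
    # enumerate is zero-based, so the last index plus one is the line count.
    return count + 1


def make_file(text):
    """Write text to a throwaway file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as out:
        out.write(text)
    return path


empty_path = make_file("")
full_path = make_file("253042 unkownsex\n253087 unkownsex\n253102 unkownsex\n")
print(count_lines(empty_path))
print(count_lines(full_path))
os.remove(empty_path)
os.remove(full_path)
```

Without the -1 initial value, an empty file would leave count undefined (or carry over the value from the previous file), which is the case the original comment guards against.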

Multithreaded code

# encoding: UTF-8
__author__ = 'admin'

# Multithreaded version
import threading
import time,os,sys

# Global counters shared by all threads
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
UNKOWN = 0

class StaFileList(threading.Thread):
    # list of text file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        if mutex.acquire(1):
            try:
                self.staManyFiles(self.fileList)
            finally:
                mutex.release()

    # Process the given list of files and count the users in each category.
    # Mind the data synchronization here: the counters are globals.
    def staManyFiles(self, files):
        global SUM, BOY, GIRL, SECRET,UNKOWN
        for name in files:
            # correct: has gender and activity time
            # errTime: has gender but no activity time
            if name.startswith(('correct', 'errTime')):
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                with open(newName, 'r') as readFile:
                    for line in readFile:
                        sexInfo = line.split()[1]
                        SUM +=1
                        if sexInfo == u'\u7537' :          # male
                            BOY += 1
                        elif sexInfo == u'\u5973':         # female
                            GIRL +=1
                        elif sexInfo == u'\u4fdd\u5bc6':   # undisclosed
                            SECRET +=1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #     " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
            # unkownsex: neither gender nor time, so just count the lines
            elif name.startswith("unkownsex"):
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                # count = len(open(newName,'rU').readlines())
                # For large files use a loop instead. count starts at -1 to
                # handle an empty file: the final +1 then gives 0 lines.
                count = -1
                with open(newName, 'rU') as readFile:
                    for count, line in enumerate(readFile):
                        pass
                count += 1
                UNKOWN += count
                SUM += count
                # print "thread %s, until %s, total is %s; %s unkownsex" %(self.name, name, SUM, UNKOWN)

def test():
    files = []
    # keep every thread so the main thread can wait for all of them at the end
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
            files.append(filename)
            i+=1
            # create a thread for every 20 files
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # the leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # the main thread waits for every child thread to exit
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" %(SUM, BOY,GIRL,SECRET,UNKOWN)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

The output is:

Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;

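cost time 1.23049112201 s

The multithreaded run is still slightly faster than the single-threaded one (1.23 s versus 1.37 s), and because access to the counters is synchronized, the totals are consistent.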
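Another way to keep the totals consistent is to avoid shared counters altogether: each worker returns its own partial result and only the main thread adds them up. The sketch below uses concurrent.futures from the Python 3 standard library, which is a swapped-in technique rather than what the article's code does. The names chunked, tally and parallel_tally, the "M"/"F" labels, and the generated sample records are all assumptions for illustration; the article groups file names, while this sketch groups record lines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def tally(lines):
    """Return a private Counter for one group of record lines."""
    return Counter(line.split()[1] for line in lines if len(line.split()) > 1)


def parallel_tally(lines, group_size=20, workers=4):
    """Tally groups in a thread pool and merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker returns its own Counter; only this thread merges them,
        # so no lock is needed.
        for partial in pool.map(tally, chunked(lines, group_size)):
            total.update(partial)
    return total


# Synthetic records: every third id is labelled "F", the rest "M".
lines = ["%d %s 2015-1-1 00:00" % (i, "M" if i % 3 else "F") for i in range(100)]
result = parallel_tally(lines)
print(sorted(result.items()))
```

The group size of 20 mirrors the "one thread per 20 files" batching in the article; the worker count is arbitrary.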

Note that inside a Python class you have to write self explicitly all the time, which is quite different from Java.

def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList

def run(self):
    global SUM, BOY, GIRL, SECRET
    if mutex.acquire(1):
        # calling another method of the same class requires self
        self.staFiles(self.fileList)
        mutex.release()

total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;

cost time 1.25413238673 s
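The point about self can be shown with a tiny standalone class. The class name Tally and its methods are hypothetical and unrelated to the crawler; the sketch only demonstrates that both attributes and sibling methods are reached through self.

```python
class Tally(object):
    def __init__(self):
        self.count = 0  # instance attribute, always reached through self

    def add(self, n=1):
        self.count += n

    def add_many(self, values):
        for value in values:
            # Calling another method of the same object also goes through
            # self; a bare add(value) would raise NameError here.
            self.add(value)
        return self.count


t = Tally()
print(t.add_many([1, 2, 3]))
```

In Java the equivalent call could be written without this, because the compiler resolves unqualified names against the enclosing class.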

That is all for this article; I hope it helps with your learning.
