A Summary of Recent Python Usage

Latest recommended article published on 2023-02-21 14:15:30

GitzLiu

Category: Python. Tags: python, data processing

Article link: https://blog.csdn.net/GitzLiu/article/details/86674318

【1】Reading (1)

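The difference between .readline() and .readlines() is that the latter reads the whole file at once, like .read(). .readlines() automatically parses the file content into a list of lines, which can then be processed with Python's for ... in ... construct.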
https://www.cnblogs.com/zywscq/p/5441145.html

with open('./data/threedaysdata/t_xifan_click_detail', 'r') as doc_click:
        list_click = doc_click.readlines()

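list_click looks like this:
["line 1\n", "line 2\n", "line 3\n"]
The outer brackets enclose the whole file; each element is one line, including its trailing newline.

Note: this reads the entire file into memory at once. That is fine for small files, but for large files check that there is enough memory.

Continuing from above: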

    sp_list_click = []
    for i in range(0, len(list_click)):
        sp_list_click.append(list_click[i].split( ))

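sp_list_click looks like this:
[[fields of line 1], [fields of line 2], [fields of line 3]]
The loop can also be written as follows, with the same result: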

    for item in list_click:
        sp_list_click.append( item.split( ))
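
The read-then-split steps above can also be collapsed into a list comprehension. The following is a minimal sketch written for Python 3; the temporary file and its contents are made up so that the example runs on its own and are not the author's data. It also shows that each element returned by readlines() keeps its trailing newline, which split() with no argument discards.

```python
import os
import tempfile

# Create a small sample file; the contents are invented for illustration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("nid1 click 3\nnid2 show 5\n")
    sample_path = tmp.name

with open(sample_path, "r") as doc_click:
    list_click = doc_click.readlines()   # every element still ends with "\n"

# split() with no argument splits on any whitespace and drops the newline
sp_list_click = [line.split() for line in list_click]

os.remove(sample_path)
print(list_click)
print(sp_list_click)
```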

【2】Reading (2)

If the file is too large to load at once (content_xiaochengxu, for example, has 6 million lines), iterate over the file object instead:

for line in open('./content_xiaochengxu'):
        list_item = line.split('\t')

This reads only one line at a time instead of loading the whole file into memory.
line is the content of one line, of type str.
A typical script, get_click_show.py:

#!/bin/env python
#coding=utf-8

################################################################
# File: get_click_show.py
# Author: xxx
# Name: xxx
# Mail: xxx
# Created Time: 2019/01/25 15:47:18
################################################################

if __name__ == '__main__':
    '''
    main
    '''
    dict_click = {}
    dict_show = {}
    set_nids = set()

    for line in open('v4_nidmap_fin_xiaochengxu_xiaohongshu'):
        tmp = line.split('\t')[0]
        set_nids.add(tmp)  # keep the set in sync with the dicts below
        dict_click[tmp] = 0
        dict_show[tmp] = 0


    list_firstlevel = []
    list_secondlevel = []

    count = 0
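    # Iterating over the file object reads one line at a time;
    # content_xiaochengxu has 6 million lines, so loading it all into memory is impractical.
    for line in open('./content_xiaochengxu'):
        list_item = line.split('\t')
        # To look up in the set instead of the dict, every set.add(x) must be paired
        # with dict[x] = ... . Adding a duplicate to a set has no effect; assigning to an
        # existing dict key only overwrites its value.
        if list_item[0] in set_nids: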

            if cmp(list_item[1], "click") == 0:
                dict_click[list_item[0]] = dict_click[list_item[0]] + 1
            elif cmp(list_item[1], "show") == 0:
                dict_show[list_item[0]] = dict_show[list_item[0]] + 1
            else:
                print list_item[0], ' no click and show'
            count = count + 1
            if count % 100000 == 0:
                print count

    print 'finish_1 done'

    count_2 = 0
    f1 = open('log_click_show', 'a')
    for key, value in dict_click.iteritems():
        if dict_show[key] != 0:
            score = value * 1.0 / dict_show[key]
            score = round(score, 4)
        else:
            score = 0
        tmp_str = key + '\t' + str(value) + '\t' + str(dict_show[key]) + '\t' + str(score) + '\n'
        f1.write(tmp_str)


    f1.close()

    print 'finish 2 done'

This script combines reading, writing, and using a dict together with a set (the set is used for lookups).
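
For comparison, here is a compact sketch of the same counting idea in Python 3. It is not the author's script: the in-memory io.StringIO stands in for content_xiaochengxu, the ids and events are invented, and collections.Counter replaces the two pre-filled dicts. One assumption worth flagging is that the event field may be the last column of a line, in which case it still carries the newline, so the sketch strips it before comparing.

```python
import io
from collections import Counter

# Stand-in for content_xiaochengxu: tab-separated "nid<TAB>event" lines (made-up data).
log_file = io.StringIO("n1\tclick\nn1\tshow\nn1\tshow\nn2\tshow\nn3\tclick\n")
set_nids = {"n1", "n2"}          # ids we care about

clicks, shows = Counter(), Counter()
for line in log_file:            # one line at a time, as with a real file object
    fields = line.rstrip("\n").split("\t")
    nid, event = fields[0], fields[1]
    if nid not in set_nids:
        continue
    if event == "click":
        clicks[nid] += 1
    elif event == "show":
        shows[nid] += 1

# click/show ratio per id, 0 when there were no shows
scores = {nid: round(clicks[nid] / shows[nid], 4) if shows[nid] else 0
          for nid in sorted(set_nids)}
print(scores)
```

A Counter returns 0 for keys it has never seen, so the dicts do not need to be initialised for every id up front.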

【3】Writing

nids = ["234234234234", "454546576765", "876876867768"]
f1 = open('./data/dir_nid/all_nid', 'a')
for item in nids:
    f1.write(str(item) + '\n')
f1.close()

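The open() mode 'a' appends: existing content in the file is not overwritten, and the file is created automatically if it does not exist.
f1.write() only accepts a str, so anything that is not a str must be converted with str() first.
Another example: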

f1 = open('xiaochengxu_xiaohongshu_map', 'a')
for key, value in dict_nid.iteritems():
    tmp_str = key + '\t' + value + '\n'
    f1.write(tmp_str)
f1.close()
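
The same append pattern can be written with a with-block, which closes the file even if an exception is raised partway through. This is a sketch for Python 3; the output path under a temporary directory is hypothetical and chosen only so the example is runnable anywhere.

```python
import os
import tempfile

nids = ["234234234234", 454546576765, "876876867768"]   # mixed str/int on purpose

out_path = os.path.join(tempfile.mkdtemp(), "all_nid")   # hypothetical output location

# 'a' appends and creates the file if needed; the with-block closes it for us
with open(out_path, "a") as f1:
    for item in nids:
        f1.write(str(item) + "\n")   # str() makes the int writable

with open(out_path) as f1:
    written = f1.read().splitlines()
print(written)
```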

【4】Deduplication

nids = ["23", "23", "45", "87"]
nids = list(set(nids))
print nids
# output: the unique values, e.g. ['23', '45', '87'] (a set does not preserve the original order)

Use a set to remove duplicates.
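
When the original order matters, one common alternative is dict.fromkeys, which keeps the first occurrence of each value. This is a small sketch for Python 3.7 or later, where dicts preserve insertion order; the sample values are made up.

```python
nids = ["87", "23", "23", "45", "87"]

unordered = set(nids)                  # unique values, arbitrary order
ordered = list(dict.fromkeys(nids))    # unique values, first-seen order (Python 3.7+)
print(ordered)                         # ['87', '23', '45']
```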

【5】Some functions

1. round()

a = 5.026
round(a,2)
# output: 5.03

Rounds to the given number of decimal places.
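
One caveat with round() is that it operates on binary floats, so values that look like exact halves may round down. The sketch below (Python 3) shows this and, as one possible alternative when exact decimal rounding matters, the standard-library decimal module.

```python
from decimal import Decimal, ROUND_HALF_UP

a = round(5.026, 2)      # 5.03
b = round(2.675, 2)      # 2.67, because 2.675 is stored as slightly less than 2.675
c = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)  # Decimal('2.68')
print(a, b, c)
```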

2. sorted()

https://www.cnblogs.com/dylan-wu/p/6041465.html
https://blog.csdn.net/u013193903/article/details/81096367 ！！！
https://yq.aliyun.com/ziliao/4502 ！！！

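(1) Sorting a dict by key with sorted()

sorted(iterable, key, reverse): the three arguments used here are iterable, key, and reverse.
iterable is any iterable object, such as dict.items() or dict.keys(); key is a function that selects the value to compare for each element; reverse sets the direction: reverse=True sorts in descending order, reverse=False in ascending order, and the default is reverse=False.
To sort a dict by key, use the following: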

dic = {'chen': 24, 'alex': 34, 'egon': 37, 'evaJ':'18'}
s_dic = sorted(dic.keys())  # sorts the keys only; the result contains only keys
print(s_dic)
s_dic1 = sorted(dic.items(), key=lambda x: x[0])  # the result contains both keys and values
print(s_dic1)

# Output
# ['alex', 'chen', 'egon', 'evaJ']
# [('alex', 34), ('chen', 24), ('egon', 37), ('evaJ', '18')]

Calling sorted(d.keys()) is enough to sort a dict's keys. This sorts in ascending order; for descending order, just set reverse=True.

dic = {'chen': 24, 'alex': 34, 'egon': 37, 'evaJ':'18'}
s_dic = sorted(dic.keys(), reverse=True)
# Output
# ['evaJ', 'egon', 'chen', 'alex']

(2) Sorting a dict by value with sorted()

Sorting by value requires the key argument. The approach shown here uses a lambda expression:

dict_t[item[2]] = int(count_tmp)
tmp_cmids = sorted(dict_t.items(), key = lambda x:x[1], reverse = True)
# tmp_cmids looks like this:
# [(), (), ()]
# where each element is a (key, value) tuple

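Here d.items() turns d into an iterable whose elements are ('lilee', 25), ('wangyan', 21), ('liqun', 32), and ('lidaming', 19); items() converts each dict entry into a tuple. The lambda passed as key selects the second element of each tuple as the comparison value (writing key=lambda item: item[0] would select the first element, i.e. the dict key; in lambda x: y, x is the input parameter and y is the return value). This is how a dict can be sorted by value. Note that the result is a list, and the key/value pairs of the original dict become tuples in that list.
Access the result like this: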

for tup in tmp_cmids:
    print tup[0]   # tup is one tuple from the tmp_cmids list
    print tup[....]
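
A runnable sketch of sorting by value, in Python 3, using the four name/age pairs mentioned in the explanation. The variable names are illustrative, and operator.itemgetter is shown only as a common equivalent of the lambda.

```python
from operator import itemgetter

d = {'lilee': 25, 'wangyan': 21, 'liqun': 32, 'lidaming': 19}

by_value_desc = sorted(d.items(), key=lambda x: x[1], reverse=True)
by_value_asc = sorted(d.items(), key=itemgetter(1))   # itemgetter(1) is equivalent to lambda x: x[1]

for name, age in by_value_desc:    # each element is a (key, value) tuple
    print(name, age)
```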

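3. count()

str.count() returns the number of non-overlapping occurrences of a substring in a string. The optional arguments give the start and end positions of the search.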

#!/usr/bin/python
 
str = "this is string example....wow!!!";
 
sub = "i";
print "str.count(sub, 4, 40) : ", str.count(sub, 4, 40)
sub = "wow";
print "str.count(sub) : ", str.count(sub)

# Output
str.count(sub, 4, 40) :  2
str.count(sub) :  1
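
A few edge cases of count(), sketched in Python 3 (the variable is named s rather than str to avoid shadowing the built-in type): matches are counted without overlap, and a missing substring gives 0 rather than an error.

```python
s = "this is string example....wow!!!"

in_range = s.count("i", 4, 40)   # only looks at s[4:40]
overlap = "aaaa".count("aa")     # matches do not overlap: "aa" + "aa"
missing = s.count("xyz")         # absent substring gives 0, not an error
print(in_range, overlap, missing)
```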

4. Adding to a list

list_tmp = []
list_a = [1,2,3]
list_tmp.append( list_a ) # nests: list_a is added to list_tmp as a single element
list_tmp.extend(list_a) # the elements of list_a are added to list_tmp one by one
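
To make the difference visible, the sketch below (Python 3) applies append and extend to two separate empty lists and prints the results.

```python
list_a = [1, 2, 3]

nested = []
nested.append(list_a)    # one new element, which is the list itself
flat = []
flat.extend(list_a)      # three new elements
flat += [4]              # += on a list behaves like extend
print(nested, flat)
```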

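5. strip() and split()

(1) strip
strip() removes the specified characters from the beginning and end of a string (by default, whitespace such as spaces and newlines). The argument is treated as a set of characters, not as a prefix or suffix.

Note: it only removes characters at the start or end, never in the middle.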

#!/usr/bin/python
# -*- coding: UTF-8 -*-
 
str = "00000003210Runoob01230000000"; 
print str.strip( '0' );  # strip leading and trailing '0' characters
 
str2 = "   Runoob      ";   # strip leading and trailing whitespace
print str2.strip();
#########
tmp_str="2134324\r\n"
tmp2 = tmp_str.strip('\r\n')
# strip() returns a new string
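
The following sketch (Python 3, made-up strings) illustrates that the argument to strip() is a set of characters, and shows the one-sided variants lstrip() and rstrip().

```python
a = "xxhixyx".strip("xy")        # 'hi': any mix of 'x' and 'y' is removed from both ends
b = "2134324\r\n".strip()        # '2134324': default strips all surrounding whitespace
c = "  Runoob  ".lstrip()        # 'Runoob  ': left side only
d = "  Runoob  ".rstrip()        # '  Runoob': right side only
e = "0a0b0".strip("0")           # 'a0b': characters in the middle are untouched
print(a, b, repr(c), repr(d), e)
```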

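(2) split
str.split(str="", num=string.count(str))

split() slices a string at the given separator; if num is given, the string is split at most num times, yielding at most num+1 substrings.
str: the separator; defaults to any whitespace, including spaces, newlines (\n), and tabs (\t).
num: the maximum number of splits; defaults to -1, meaning split at every occurrence.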

#!/usr/bin/python
# -*- coding: UTF-8 -*-
 
str = "Line1-abcdef \nLine2-abc \nLine4-abcd";
print str.split( );       # split on any whitespace, including \n
print str.split(' ', 1 ); # split on a single space, at most once (two pieces)

# Output
['Line1-abcdef', 'Line2-abc', 'Line4-abcd']
['Line1-abcdef', '\nLine2-abc \nLine4-abcd']

#!/usr/bin/python
# -*- coding: UTF-8 -*-
 
txt = "Google#Runoob#Taobao#Facebook"
 
# With 1 as the second argument, the result is a list of two elements
x = txt.split("#", 1)
 
print x
# Output
['Google', 'Runoob#Taobao#Facebook']
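
A detail that often causes surprises is that split() with no argument and split(' ') behave differently on repeated whitespace. The sketch below (Python 3, with a made-up log line) shows this, along with stripping the newline before splitting on tabs and splitting from the right with rsplit().

```python
line = "n1\tclick  3\n"

a = line.split()                    # ['n1', 'click', '3']
b = line.split(" ")                 # ['n1\tclick', '', '3\n']
c = line.rstrip("\n").split("\t")   # ['n1', 'click  3']
d = "a#b#c".rsplit("#", 1)          # ['a#b', 'c']
print(a, b, c, d)
```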

6. Comparing strings for equality with cmp()

a_str = "hello"
if cmp (a_str , "hello") == 0:
    print "equal"

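cmp(a, b) compares the two strings and returns 0 when they are equal, a negative number when a < b, and a positive number when a > b. cmp() is a Python 2 built-in and was removed in Python 3.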
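
Since Python 3 has no built-in cmp(), equality is normally tested with == directly. If a three-way comparison is still needed, a small helper can stand in for it; the cmp function defined below is a hypothetical replacement written for this sketch, not part of the standard library.

```python
def cmp(a, b):
    """Python 3 stand-in for the removed built-in: returns -1, 0 or 1."""
    return (a > b) - (a < b)

a_str = "hello"
equal = (a_str == "hello")          # the direct way to test equality
print(equal, cmp(a_str, "hello"), cmp("abc", "abd"), cmp("b", "a"))
```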

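7. Checking types

(1) print type(a_str) prints the type of a_str
(2) isinstance
Differences between isinstance() and type():
type() does not treat an instance of a subclass as an instance of the parent class; it ignores inheritance.
isinstance() does treat an instance of a subclass as an instance of the parent class; it takes inheritance into account.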

>>> a = 2
>>> isinstance (a,int)
True
>>> isinstance (a,str)
False
>>> isinstance (a,(str,int,list))    # True if the type matches any one in the tuple
True

         def check_valid(self, res_js):
             """
             check valid for res_js
             """
             flag = True
             if flag:
                 flag =  'id' in res_js and isinstance(res_js['id'], (int, long))
             if flag:
                 flag =  'cs' in res_js and isinstance(res_js['cs'], basestring)
             if flag:
                 ds_ele_list = res_js['cs'].split(' ')
                 ds = 0
                 if len(ds_ele_list) == 2 and res_js['cs'] != '0 0':
                     ds = (int(ds_ele_list[0]) << 32) + int(ds_ele_list[1])
                 flag = (ds != 0) and isinstance(ds, (int, long))
             if flag:
                 flag = 'ext' in res_js and isinstance(res_js['ext'], dict)
             if flag:
                 flag = 'display_strategy' in res_js and isinstance(res_js['display_strategy'], dict)
             if flag:
                 flag = 'channel' in res_js and isinstance(res_js['channel'], int)
    
             return flag
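
The inheritance difference between type() and isinstance() can be seen in a short sketch (Python 3; the Animal and Dog classes are invented for illustration). Note that long and basestring, used in check_valid above, exist only in Python 2; in Python 3 the corresponding checks would use int and str.

```python
class Animal(object):
    pass

class Dog(Animal):
    pass

d = Dog()
exact_match = type(d) is Animal         # False: type() ignores inheritance
inherits = isinstance(d, Animal)        # True: Dog is a subclass of Animal
bool_is_int = isinstance(True, int)     # True: bool subclasses int
bool_exact = type(True) is int          # False
print(exact_match, inherits, bool_is_int, bool_exact)
```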

【6】Lookups in dicts and sets

    set_xiaochengxu = set()  # create an empty set
    for item in sp_list_xiaochengxu:
        dict_xiaochengxu[item[1]] = item[0]
        set_xiaochengxu.add(item[1])

    set_xifan = set()
    for item in sp_list_xifan:
        dict_xifan[item[1]] = item[0]
        set_xifan.add(item[1])

    count = 0
    f1 = open('v4_nidmap_fin_xiaochengxu_xiaohongshu', 'a')
    for key, value in dict_xiaochengxu.iteritems():
        #if key in dict_xifan.keys():
        if key in set_xifan:  # in Python 2, dict.keys() builds a list, so the set lookup is far faster than the line above
            str_tmp = value + '\t' + dict_xifan[key] + '\n'
            f1.write(str_tmp)
            count = count + 1
            if count % 1000 == 0:
                print count

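For lookups, replacing "key in dict_xifan.keys()" with a set lookup gives a very noticeable speedup.
The reason is that in Python 2 dict.keys() builds a list, so the membership test is a linear scan. Sets, like dicts themselves, are implemented as hash tables (not red-black trees) with average constant-time lookup, so "key in dict_xifan" is just as fast as the set.
Lookup efficiency: set and dict (key in d) are about equal, and both are far faster than list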
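
A sketch of the lookup without a separate set, in Python 3 and with made-up data: membership can be tested on the dict directly, and key views support set operations, so the shared keys can be computed in one step.

```python
dict_xifan = {"k%d" % i: i for i in range(1000)}   # made-up data
dict_xiaochengxu = {"k5": "a", "k2000": "b", "k999": "c"}

# Membership directly on the dict is a hash lookup; no separate set is needed.
hits = [key for key in dict_xiaochengxu if key in dict_xifan]

# Python 3 key views support set operations, so the common keys can be taken at once.
common = dict_xiaochengxu.keys() & dict_xifan.keys()
print(sorted(hits), sorted(common))
```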

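【7】Accessing dicts

# iterate over key/value pairs
for key, value in dict_click.iteritems():
    key, value .....
# ===
dict_tmp = {"liu": 1 , "wang": 2}
# a value can be accessed like this
dict_tmp["liu"]
dict_tmp.get("liu", 0)  # returns the default 0 if the key "liu" is missing; use this in production code so a missing key does not crash the program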
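
The difference between the two access styles, sketched in Python 3 (where items() replaces iteritems()): indexing raises KeyError for a missing key, while get() returns a default and leaves the dict unchanged.

```python
dict_tmp = {"liu": 1, "wang": 2}

a = dict_tmp["liu"]             # 1; raises KeyError if the key is missing
b = dict_tmp.get("zhang", 0)    # 0; the dict itself is not modified
c = dict_tmp.get("zhang")       # None when no default is given

try:
    dict_tmp["zhang"]
    raised = False
except KeyError:
    raised = True

pairs = [(k, v) for k, v in dict_tmp.items()]   # items() replaces iteritems() in Python 3
print(a, b, c, raised, pairs)
```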

【8】json

import json

# Write the results out as JSON
nids = ["7823836730143656219", "10431881815279249772"]
f1 = open('content_gcms_20190120', 'a')
for item in nids:
    result = get_full_info_by_nid(item)  # the author's own helper (not shown); returns a dict
    json_result = json.dumps(result, indent=4)
    #print json_result
    f1.write(str(json_result) + '\n')

f1.close()

result is a dict. json.dumps(result, indent=4) returns its JSON representation as a str (json_result), which is then written to the file.
References: https://www.cnblogs.com/sharfir/p/8000127.html
https://www.cnblogs.com/xiaomingzaixian/p/7286793.html
(1) json.dumps() encodes a Python object such as a dict or list as a JSON string (roughly: dict to str).
(2) json.loads() parses a JSON string into a Python object (roughly: str to dict).

import json

# json.loads: convert a JSON string into a dict
json_info = '{"age": "12"}'
dict1 = json.loads(json_info)
print("type of json_info: " + str(type(json_info)))  # str
print("after json.loads():")
print("type of dict1: " + str(type(dict1)))   # dict
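
A round trip through dumps() and loads(), sketched in Python 3 with an invented record. With indent=4 each record spans several lines, so one possible alternative when the file will be read back line by line is to omit indent and write one compact record per line. ensure_ascii=False keeps Chinese text readable instead of escaping it.

```python
import json

result = {"nid": "7823836730143656219", "title": "小红书", "tags": ["a", "b"]}

line = json.dumps(result, ensure_ascii=False)   # one line, non-ASCII kept readable
escaped = json.dumps({"t": "小"})                # default escapes non-ASCII as \uXXXX
back = json.loads(line)
print(line)
print(back == result)
```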

【9】Regex matching

import re

reg = r'id=[0-9A-Za-z]{1,50}'
imgre = re.compile(reg)
imglist = re.findall(imgre, data[nid]['displaytype_exinfo'])
# Explanation:
# imglist = re.findall(imgre, some_string)
# The pattern matches "id=" followed by 1 to 50 digits or upper/lower-case letters.
# findall returns all matches as a list; with 3 matches, for example,
# imglist is ["id=asdq232412wqe", "id=21312wdqwdq", "id=56756dffwef"]
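
A self-contained version of the pattern above, in Python 3, run against a made-up input string. It also shows that when the pattern contains one capturing group, findall() returns only the group contents, which is convenient if just the id values are wanted.

```python
import re

text = "a?id=asdq232412wqe&b=1 c?id=21312wdqwdq"   # made-up input string

pattern = re.compile(r'id=[0-9A-Za-z]{1,50}')
full = pattern.findall(text)                        # whole matches

# With one capturing group, findall returns only the group contents
ids = re.findall(r'id=([0-9A-Za-z]{1,50})', text)
print(full, ids)
```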

Garbled Chinese characters
Temporary fix: set encoding=utf-8

【10】Calling a script from Python

err, res = commands.getstatusoutput("sh ./bin/get_data.sh %s %s %s" % (self.ip, self.port, type))
try:
    if not err:
        js = json.loads(res)

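err is 0 when the script succeeded;
res is the script's output.
commands is a Python 2 standard-library module; Python 3 provides subprocess.getstatusoutput instead.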
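
A sketch of the same idea in Python 3 using subprocess.run (Python 3.7 or later for capture_output). The child command here is a stand-in that prints a small JSON document, chosen so the example runs anywhere; in the author's setting it would be the call to ./bin/get_data.sh. Passing the command as a list avoids shell quoting problems with the arguments.

```python
import json
import subprocess
import sys

# Stand-in for "sh ./bin/get_data.sh ...": a child process that prints JSON.
cmd = [sys.executable, "-c", "print('{\"ok\": 1}')"]

proc = subprocess.run(cmd, capture_output=True, text=True)
js = None
if proc.returncode == 0:            # 0 means the command succeeded
    try:
        js = json.loads(proc.stdout)
    except ValueError:              # output was not valid JSON
        js = None
print(proc.returncode, js)
```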