- Blog (31)
[Original] Counting the most frequent words in a text and saving them to Excel
print(d_train.head())  # d_train is a DataFrame
document = " ".join(d_train.title).split()  # join the titles into one string, then split on spaces; returns a list
print(document)
ss = Counter(document).most_common(100)  # the 100 most frequent words, with their counts
2017-09-29 10:32:59 2717
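The preview above is cut off; here is a minimal runnable sketch of the same `Counter.most_common` idea, using a made-up list of titles in place of `d_train.title`:

```python
from collections import Counter

# Made-up titles standing in for d_train.title
titles = ["window regulator front left", "window regulator front right"]
document = " ".join(titles).split()     # join into one string, split on whitespace
top = Counter(document).most_common(3)  # [(word, count), ...], highest count first
print(top)
```

`most_common(n)` returns the `n` highest-count `(word, count)` pairs, which can then be written to Excel with, for example, `pandas.DataFrame(top).to_excel(...)`.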
[Original] Kaggle Titanic survivor prediction
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # decision tree
from sklearn.model_selection import cross_val_score
df = pd.read_csv("train.csv")
# data cleaning: fill in the missing values
df.Age.fillna(df.Age.mean(), in…
2017-09-28 17:07:30 1254
原创 series.map() 映射值
官方文档:http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.htmlSeries.map(arg, na_action=None)[source]Map values of Series using input correspondence (which can be a dict
2017-09-28 16:55:03 858
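A small sketch of `Series.map` with a dict argument (a common Titanic-style encoding; the column values here are invented). Values with no matching key become NaN:

```python
import pandas as pd

# Map category strings to numbers via a dict
s = pd.Series(["male", "female", "male"])
mapped = s.map({"male": 0, "female": 1})
print(mapped.tolist())
```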
[Original] Cross-validation: sklearn.model_selection.cross_val_score
Official docs: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html
X = data.loc[:, features]
y = data.Survived
dtc = DecisionTreeClassifier()
scores = cross_val_score(dtc…
2017-09-28 16:15:22 2423
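Since the snippet is truncated, here is a complete minimal run of `cross_val_score`, using scikit-learn's bundled iris data as a stand-in for the post's `X`/`y`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dtc = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(dtc, X, y, cv=5)  # one accuracy score per fold
print(scores.mean())
```

`cv=5` splits the data into 5 folds and returns an array of 5 scores; the mean is the usual single summary number.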
[Original] Filling DataFrame missing values with fillna
Official docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
When the data contains NaN missing values, we can replace the NaN with another value, mainly via the DataFrame.fillna() method.
Example: df.Age.fillna(df.Age.mean(), inplace =…
2017-09-28 16:11:53 4791
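A self-contained sketch of the mean-fill shown above, on a tiny invented `Age` column (assigning the result back rather than using `inplace` on a column, which avoids chained-assignment warnings in newer pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan, 30.0]})
df["Age"] = df["Age"].fillna(df["Age"].mean())  # NaN -> column mean (26.0)
print(df["Age"].tolist())
```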
[Original] os.walk
Python's os.walk() method (Python OS file/directory methods). Overview: os.walk() walks a directory tree and yields the file names in each directory, top-down or bottom-up. It works on Unix and Windows.
Syntax:
os.walk(top[, topdown=True[, onerror=None[, followlinks=…
2017-09-28 12:09:41 308
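A runnable sketch of the walk, built on a throwaway temporary tree so it is self-contained:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # build a small tree: root/a.txt and root/sub/b.txt
    os.makedirs(os.path.join(root, "sub"))
    open(os.path.join(root, "a.txt"), "w").close()
    open(os.path.join(root, "sub", "b.txt"), "w").close()

    found = []
    for dirpath, dirnames, filenames in os.walk(root):  # topdown=True by default
        found.extend(filenames)
print(sorted(found))
```

Each iteration yields a `(dirpath, dirnames, filenames)` triple for one directory; with `topdown=True` the parent is visited before its subdirectories.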
[Original] Excel IF function usage
1. Syntax: IF(condition, result1, result2).
2. Purpose: processes data according to a condition; if the condition holds, result1 is returned, otherwise result2. Either result1 or result2 may be omitted, but not both.
3. Condition expressions: two expressions joined by a relational operator (mainly =, >, <, >=, <=, <>…
4. How nested IFs evaluate: for example, when grading by level…
2017-09-27 11:07:37 3509 2
[Original] Optimizing the OE scrape of tech where_used
Appending to a shared CSV caused read/write errors under multithreading. Fix: store each OE's data in a separate file.
import logging
import random
import threading
import urllib.parse
import requests
from queue import Queue
import pymysql
from bs…
2017-09-26 18:48:28 470
原创 spacy 英文模型
import spacynlp = spacy.load('en') #加载英文模型doc = nlp(u"it's word tokenize test for spacy")print(doc)for d in doc: print(d)test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love
2017-09-26 15:21:38 1441
原创 jieba分词
参考链接 https://github.com/fxsjy/jieba#encoding=utf-8from __future__ import print_function, unicode_literalsimport syssys.path.append("../")import jiebajieba.load_userdict("dict.txt") #
2017-09-26 13:38:25 222
[Original] Scraping ebayno, title, and price by keyword
Scraping code:
import random
from http.cookiejar import CookieJar
import requests
from bs4 import BeautifulSoup
import csv
import numpy as np
import re
from queue import Queue
import time
import…
2017-09-22 17:56:37 406
原创 根据oe抓取ebayno title fits
示例:网址:https://www.ebay.com/sch/am-autoparts/m.html?item=371393241499&rt=nc&_trksid=p2047675.l4064oe:171340L这里需要设置美国收货地址,否则搜索条数少代码如下:import loggingimport randomimport threadingimpo
2017-09-22 17:53:59 846
[Original] Shuffling DataFrame samples
import pandas as pd
df = pd.read_excel("window regulator01 _0914新增样本.xlsx")
df = df.sample(frac=1)  # shuffle the rows
2017-09-22 09:54:04 5742
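A self-contained version of the shuffle, on invented data. `frac=1` samples 100% of the rows, i.e. returns all of them in random order; `reset_index(drop=True)` is a common follow-up so the shuffled frame gets a clean 0..n-1 index:

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)})
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled.x.tolist())  # same values, random order
```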
原创 删除重复元素 drop_duplicates()
import pandas as pddf = pd.read_excel("合并fitment.xlsx")print(len(df))skus = df.SKU.drop_duplicates()result = []for sku in skus: df_sub = df[df.SKU == str(sku)] makes = df_sub.Make.drop_du
2017-09-19 17:38:17 5491
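A minimal sketch of the same pattern on invented `SKU`/`Make` columns: deduplicate the SKUs, then collect each SKU's unique makes:

```python
import pandas as pd

df = pd.DataFrame({"SKU": ["A", "A", "B"], "Make": ["Ford", "Ford", "BMW"]})
skus = df.SKU.drop_duplicates()              # unique SKUs, first occurrence kept
print(skus.tolist())
makes = df[df.SKU == "A"].Make.drop_duplicates()  # unique makes for one SKU
print(makes.tolist())
```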
[Original] Items not in another list
import pandas as pd
df_fb = pd.read_excel("fb_title.xlsx", sheetname="Sheet1", encoding="gbk")
df_exist = pd.read_excel("category_data.xlsx", sheetname="category_data", encoding="gbk")
notin = []
for…
2017-09-19 16:24:20 742
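The preview ends before the loop body; a vectorized alternative to the Python-level loop is `~isin`, sketched here on invented title columns:

```python
import pandas as pd

df_fb = pd.DataFrame({"title": ["a", "b", "c"]})
df_exist = pd.DataFrame({"title": ["b"]})
# keep rows of df_fb whose title is NOT in df_exist
notin = df_fb[~df_fb.title.isin(df_exist.title)]
print(notin.title.tolist())
```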
[Original] Counting sales with Counter
import pandas as pd
from collections import Counter
import numpy as np
df = pd.read_csv("fb_viogi.csv", encoding="gbk")
ebaynos = df.ebayno.values
item = Counter(ebaynos)  # returns a dict-like count of each item in the list
result =…
2017-09-19 16:19:19 425
原创 merge
import pandas as pddfl = pd.read_excel("window_regulator原始数据.xlsx",sheetname= "fb_title_sale_price")dfr = pd.read_excel("window_regulator原始数据.xlsx",sheetname="category总")print(len(dfl))print(len(d
2017-09-19 16:09:13 348
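The merge itself is cut out of the preview; a small self-contained sketch with invented frames, joining on an `ebayno` key:

```python
import pandas as pd

dfl = pd.DataFrame({"ebayno": [1, 2, 3], "price": [9.9, 19.9, 29.9]})
dfr = pd.DataFrame({"ebayno": [2, 3, 4], "category": ["x", "y", "z"]})
# inner join: keep only keys present in both frames
merged = pd.merge(dfl, dfr, on="ebayno", how="inner")
print(merged.ebayno.tolist())
```

`how="left"` / `"right"` / `"outer"` keep the left, right, or union of keys instead.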
[Original] Passing an array to a SQL IN query
import pymysql
from pandas import DataFrame
import numpy as np
import pandas as pd
def findFromViogiData(ebaynos):
    db = pymysql.connect(host='****', user='**', passwd='**', db='*', port=3306, char…
2017-09-18 12:52:13 6015
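The usual pymysql-style way to pass a Python list to an IN clause is to build one `%s` placeholder per element and hand the list to `execute` as parameters. A sketch (the table name `viogi` is invented; `cursor.execute` needs a live connection, so it is only shown as a comment):

```python
ebaynos = ["111", "222", "333"]
placeholders = ",".join(["%s"] * len(ebaynos))  # "%s,%s,%s"
sql = f"SELECT * FROM viogi WHERE ebayno IN ({placeholders})"
print(sql)
# then, with a real connection: cursor.execute(sql, ebaynos)
```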
原创 python requests
Requests 是用Python语言编写,基于 urllib,采用 Apache2 Licensed 开源协议的 HTTP 库。它比 urllib 更加方便,可以节约我们大量的工作,完全满足 HTTP 测试需求。Requests 的哲学是以 PEP 20 的习语为中心开发的,所以它比 urllib 更加 Pythoner。更重要的一点是它支持 Python3 哦!发送请求
2017-09-14 15:01:48 355
原创 random.choice
choice() 方法返回一个列表,元组或字符串的随机项。print "choice([1, 2, 3, 5, 9]) : ", random.choice([1, 2, 3, 5, 9]) #输出2print "choice('A String') : ", random.choice('A String') #输出ndef randHeader():
2017-09-14 14:55:04 953
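The same calls in Python 3, with a fixed seed so the sketch is repeatable (no particular output is claimed, since `choice` is random):

```python
import random

random.seed(0)  # fixed seed for repeatability
item = random.choice([1, 2, 3, 5, 9])   # a random element of the list
letter = random.choice("A String")      # a random character of the string
print(item, letter)
```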
原创 bagofwords tf-idf word2vec特征实践
1 bagofwords + bayesimport pandas as pdfrom sklearn.feature_extraction.text import CountVectorizerfrom sklearn.naive_bayes import MultinomialNBfrom sklearn.metrics import accuracy_score , roc_auc
2017-09-13 14:34:48 1140 1
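A tiny end-to-end sketch of the bag-of-words + Naive Bayes pipeline named above, with made-up texts and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["good movie", "bad movie", "good film", "awful film"]
labels = [1, 0, 1, 0]                 # invented sentiment labels
vec = CountVectorizer()
X = vec.fit_transform(texts)          # sparse document-term count matrix
clf = MultinomialNB().fit(X, labels)
pred = clf.predict(vec.transform(["good good movie"]))
print(pred)
```

Note that new text must go through `transform` (not `fit_transform`) so it uses the vocabulary learned from the training texts.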
原创 kaggle 电影评论情感分析 贝叶斯分类
import pandas as pdfrom sklearn.feature_extraction.text import CountVectorizerfrom sklearn.naive_bayes import MultinomialNBfrom sklearn.metrics import accuracy_score, roc_auc_score, roc_curveimpo
2017-09-12 18:10:25 3042 2
原创 python sklearn.metrics roc_curve
对分类结果的度量:参考链接:http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
2017-09-12 17:57:24 2503
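Since the entry is only a link, here is the standard minimal `roc_curve` usage (essentially the scikit-learn docs' example): true binary labels plus predicted scores in, false/true positive rates per threshold out.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]       # e.g. predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)  # area under the ROC curve
print(auc)
```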
原创 2类分类器实践1
import pandas as pdimport nltk# 定义特征提取器def document_features(document, word_features): document_words = set(document) features = {} for word in word_features: features["contai
2017-09-08 18:09:08 382 1
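The extractor's body is cut off above; a plain-Python sketch of the usual nltk-book version of this pattern, one boolean "contains(word)" feature per candidate word:

```python
def document_features(document, word_features):
    # one boolean feature per candidate word: is it in the document?
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in word_features}

feats = document_features(["a", "great", "plot"], ["great", "boring"])
print(feats)
```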
原创 python 两个列表合并
两个方法:1用list的extend方法,L1.extend(L2),该方法将参数L2的全部元素添加到L1的尾部,例如:train_set = featuresets0[:2000]train_set.extend(featuresets1[:2000])print(len(train_set))2 切片 用切片(slice)操作,L1[len(L1):len(L1)] = L
2017-09-08 18:08:23 4621
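Both methods side by side, on small invented lists (each mutates its target in place and produces the same result):

```python
L1 = [1, 2]
L2 = [3, 4]

a = L1[:]          # copy, so the original is untouched
a.extend(L2)       # method 1: extend appends L2's elements in place

b = L1[:]
b[len(b):len(b)] = L2  # method 2: slice assignment at the tail

print(a, b)
```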
[Original] Scraping eBay specifics, handling removed and 0-result listings
def getspecific(self, ebayno):
    print(ebayno)
    out = open("specific.csv", "a", newline="")
    csv_writer = csv.writer(out)
    url = 'http://www.ebay.com/itm/' + ebayno
    r…
2017-09-08 16:17:35 328
[Original] Scraping eBay prices, handling removed, 0-result, and ended listings
import urllib.request
from bs4 import BeautifulSoup
import re
def getPrice(ebayno):
    print(ebayno)
    url = "http://www.ebay.com/itm/" + str(ebayno)
    req = urllib.request.Request(url=url)…
2017-09-08 13:43:42 478
[Original] Automatically opening web pages in batches with Python
import webbrowser
import codecs
import time
with open("test.txt") as fp:
    for ebayno in fp:
        url = 'http://ebay.com/itm/' + ebayno.strip()
        time.sleep(1)  # delay between opens
        webbrowser.open…
2017-09-07 17:53:04 6942 1
[Original] Basic movie review analysis with the nltk movie corpus
from nltk.corpus import movie_reviews
import random
import nltk
# define the feature extractor
def document_features(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_fe…
2017-09-06 17:22:27 1534
[Original] The Gutenberg corpus in Python
import nltk
from nltk.corpus import gutenberg
a = gutenberg.fileids()
print(a)
emma = gutenberg.words("shakespeare-macbeth.txt")
print(emma[1030:1037])
for fileid in gutenberg.fileids():
    num_char…
2017-09-01 14:25:53 3691
[Original] Learning natural language processing with Python, part 1
from urllib.request import *
import nltk
from bs4 import BeautifulSoup
url = "http://www.gutenberg.org/files/2554/2554-h/2554-h.htm"
raw = urlopen(url).read()
print(type(raw))
print(len(raw))
print(r…
2017-09-01 12:06:30 241