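Data preparation
About the data: the value in row 2, column 2 is empty, and the last column is entirely empty.
Two important libraries (plus Dates from the standard library)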
using DataFrames;
using CSV;
using Dates;
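Reading the data: the file has a header row, is comma-delimited, and its dates need parsing
csv_2 = raw"C:\Users\songroom\Desktop\000002.XSHE.csv"  # raw"..." keeps the backslashes literal
# Newer CSV.jl releases require a sink argument: CSV.read(csv_2, DataFrame; header=1, delim=',', dateformat="yyyy/mm/dd")
@time df = CSV.read(csv_2,header=true,delim=',', dateformat="yyyy/mm/dd");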
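Output:
As shown, there are quite a few missing values: besides the single missing in the second column (open), one column consists entirely of missing.
1. How can missing be replaced with something else, e.g. missing => 0.0?
(1) replace! sometimes works, but not here: the column is a CSV.Column, which does not support setindex!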
julia> replace!(df.open,missing =>0.0)
ERROR: setindex! not defined for CSV.Column{Union{Missing, Float64},Union{Missing, Float64}}
Stacktrace:
[1] error(::String, ::Type) at .\error.jl:42
[2] error_if_canonical_setindex(::IndexLinear, ::CSV.Column{Union{Missing, Float64},Union{Missing, Float64}}, ::Int64) at .\abstractarray.jl:1082
[3] setindex! at .\abstractarray.jl:1073 [inlined]
[4] _replace!(::Base.var"#new#252"{Tuple{Pair{Missing,Float64}}}, ::CSV.Column{Union{Missing, Float64},Union{Missing, Float64}}, ::CSV.Column{Union{Missing, Float64},Union{Missing, Float64}}, ::Int64) at .\set.jl:626
[5] replace_pairs! at .\set.jl:455 [inlined]
[6] #replace!#251 at .\set.jl:445 [inlined]
[7] replace!(::CSV.Column{Union{Missing, Float64},Union{Missing, Float64}}, ::Pair{Missing,Float64}) at .\set.jl:445
[8] top-level scope at REPL[75]:1
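(2) coalesce
df.open = coalesce.(df.open, 0.0);  # 0.0 rather than 0 keeps the column's element type Float64
Checking df again shows that the replacement succeeded.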
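The replacement can be tried on a small in-memory table. This is a minimal sketch: the DataFrame `demo` and its values are made up for illustration, and because its columns are ordinary Vectors, replace! would also work on it, unlike on the read-only CSV.Column above.

```julia
using DataFrames

# Hypothetical stand-in for the CSV data: one missing value in `open`.
demo = DataFrame(open = [10.5, missing, 11.0], close = [10.8, 10.9, 11.2])

# coalesce. returns a new vector; a Float64 default keeps the eltype Float64.
filled = coalesce.(demo.open, 0.0)

# Non-mutating replace also allocates a new vector, so it does not need
# setindex! on the source column.
replaced = replace(demo.open, missing => 0.0)

demo.open = filled
println(eltype(demo.open))   # element type after the replacement
```

Both calls build a new vector instead of writing into the existing column, which is why they sidestep the setindex! error.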
(3) Plain assignment, e.g. df[2, 2] = 0.0, does not always work either: it needs a column that supports setindex!.
2. Renaming a column
Rename the first column to datetime
rename!(df,1=>:datetime) # rename column 1
row,col = size(df)
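A short sketch of renaming by position and by name, assuming a made-up two-column table (the names `dt`, `open`, and `open_price` are placeholders, not the real file's schema):

```julia
using DataFrames

# Hypothetical table; the column names are placeholders.
demo = DataFrame(dt = ["2020/01/02", "2020/01/03"], open = [10.5, 10.7])

rename!(demo, 1 => :datetime)        # by position, as in the text
rename!(demo, :open => :open_price)  # by name

nrows, ncols = size(demo)            # (rows, columns)
println(names(demo))
```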
3. Inserting a column: insert a column named dtime at the first position
col = df.open .+ df.close  # element-wise sum of the open and close columns
insertcols!(df,1,:dtime =>col )
4. Deleting a column
select!(df, Not(:dtime))
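Inserting and deleting can be tried together on a toy table. The sketch below assumes made-up values; only insertcols!, select! and Not come from the text.

```julia
using DataFrames

demo = DataFrame(open = [10.5, 10.7], close = [10.8, 10.6])  # made-up values

# Insert a derived column at position 1.
insertcols!(demo, 1, :dtime => demo.open .+ demo.close)
with_dtime = names(demo)

# Drop it again; Not(...) selects every column except the one named.
select!(demo, Not(:dtime))
println(names(demo))
```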
5. names(df) and propertynames(df)
names => String
julia> names(df)
14-element Array{String,1}:
"dt"
"open"
"close"
"low"
"high"
"volume"
"money"
"factor"
"high_limit"
"low_limit"
"avg"
"pre_close"
"paused"
"open_interest"
propertynames => Symbol
julia> propertynames(df)
14-element Array{Symbol,1}:
:dt
:open
:close
:low
:high
:volume
:money
:factor
:high_limit
:low_limit
:avg
:pre_close
:paused
:open_interest
6. Checking whether a column exists
hasproperty(df, :x1) # does df have a column named x1?
columnindex(df, :x2) # position of column x2 in df; 0 if it is not present
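The two checks behave as follows on a toy table (a sketch with made-up column names and values):

```julia
using DataFrames

demo = DataFrame(open = [10.5], close = [10.8])  # made-up values

has_open  = hasproperty(demo, :open)    # true: the column exists
has_x1    = hasproperty(demo, :x1)      # false: no such column
idx_close = columnindex(demo, :close)   # 2: second column
idx_x2    = columnindex(demo, :x2)      # 0: not present

# String names can be checked against names(demo) instead.
has_close_str = "close" in names(demo)
```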
7. Assignment
julia> df[1,2]=1111111
1111111
Note that the assigned value must have the column's element type or be convertible to it.
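The type rule can be seen in a small sketch. The table is made up, and the try/catch is only there to show that an unconvertible value is rejected.

```julia
using DataFrames

demo = DataFrame(open = [10.5, 10.7], close = [10.8, 10.6])  # made-up values

demo[1, 2] = 1111111            # Int is converted to the column's Float64
converted = demo[1, 2]

# A value that cannot be converted is rejected with an error.
string_failed = try
    demo[1, 2] = "abc"
    false
catch
    true
end

# To store missing, widen the column's element type first.
allowmissing!(demo, :open)
demo[2, :open] = missing
```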
8. Block assignment
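A minimal sketch of a few ways to assign to a block of cells, assuming a made-up table; these are common DataFrames.jl indexing forms, not necessarily the ones the author had in mind.

```julia
using DataFrames

demo = DataFrame(open = [1.0, 2.0, 3.0], close = [4.0, 5.0, 6.0])  # made-up values

# Assign a vector to a range of rows in one column.
demo[1:2, :open] = [10.0, 20.0]

# Broadcast a scalar over a block of rows and columns.
demo[2:3, [:open, :close]] .= 0.0

# Replace a whole column with a new vector (the lengths must match).
demo[!, :close] = [7.0, 8.0, 9.0]
```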
9. Taking the first or last few rows
# first few rows of df
first(df,5) # equivalent to df.head(5) in Python pandas
# last few rows of df
last(df,5) # equivalent to df.tail(5) in Python pandas
10. df => array
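One way to do the conversion is the Matrix constructor. The sketch below assumes a made-up, all-Float64 table.

```julia
using DataFrames

demo = DataFrame(open = [10.5, 10.7], close = [10.8, 10.6])  # made-up values

mat = Matrix(demo)             # copies all columns into one matrix
println(size(mat))             # (rows, columns)

# Restrict to chosen columns, or request an element type explicitly.
sub = Matrix{Float64}(demo[:, [:open, :close]])

# A single column is already a vector.
open_vec = demo.open
```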
11. Looping over rows
for row in eachrow(df)
    println(row[:close] - row[:open])
end
12. Looping over columns
for col in eachcol(df)
    println(col[1])
end
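The row and column loops above can also collect results instead of printing them. This is a sketch on a made-up table; `diffs` and `firsts` are illustrative names.

```julia
using DataFrames

demo = DataFrame(open = [10.0, 11.0], close = [10.5, 10.0])  # made-up values

# Row loop: collect close - open instead of printing it.
diffs = Float64[]
for row in eachrow(demo)
    push!(diffs, row.close - row.open)
end

# Column loop with names: pairs(eachcol(df)) yields name => column.
firsts = Dict{Symbol,Float64}()
for (name, col) in pairs(eachcol(demo))
    firsts[name] = col[1]
end
```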
13. Combined loop
function iter_each_value(df) # visit every cell, row by row
    nrows, ncols = size(df)
    for row in eachrow(df)
        for j = 1:ncols
            # isequal returns false for a missing cell, where == would return missing
            if isequal(row[j], 926.73)
                println(row)
            end
        end
    end
end
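DataFrame loop efficiency: test on a relatively large sample
Compare with an Array. Note that this loop does not iterate in column-major order!
function iter_each_value_array(data)
    nrows, ncols = size(data) # dimensions of the array being scanned
    for i = 1:nrows
        for j = 1:ncols
            if data[i,j] == 926.73
                println("aa")
            end
        end
    end
end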
The timings are as follows:
julia> data =rand(590000,14)
julia> @time iter_each_value_array(data)
0.588667 seconds (10.61 M allocations: 305.910 MiB, 6.45% gc time)
julia> @time iter_each_value_array(data)
0.650769 seconds (10.61 M allocations: 305.910 MiB, 8.23% gc time)
At first glance, the two approaches run at about the same speed.
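A like-for-like comparison could be set up as sketched below. Several things here are assumptions rather than the original measurement: the function names `count_matches_df` and `count_matches_array` are hypothetical, the sample is random and smaller than the one timed above, and the functions count matches instead of printing. Each function is called once before timing, because the first call includes compilation.

```julia
using DataFrames

# Count cells equal to `target` by looping over DataFrame rows.
function count_matches_df(df, target)
    n = 0
    for row in eachrow(df), j in 1:ncol(df)
        isequal(row[j], target) && (n += 1)
    end
    return n
end

# Same count on a Matrix; columns in the outer loop for column-major order.
function count_matches_array(data, target)
    n = 0
    for j in axes(data, 2), i in axes(data, 1)
        data[i, j] == target && (n += 1)
    end
    return n
end

data = rand(10_000, 14)           # smaller than the 590000-row sample above
data[1, 1] = 926.73               # plant one known match
df_demo = DataFrame(data, :auto)  # column names are auto-generated

n_df  = count_matches_df(df_demo, 926.73)    # warm-up call: includes compilation
n_arr = count_matches_array(data, 926.73)
t_df  = @elapsed count_matches_df(df_demo, 926.73)
t_arr = @elapsed count_matches_array(data, 926.73)
println("DataFrame: $t_df s, Array: $t_arr s")
```

Passing the data as a function argument, rather than reading a global inside the function, lets the compiler specialize the loop on the argument's type.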