Parsing CSV files with Spark
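I've found myself working with large CSV files quite often, and having realised that my existing toolset doesn't let me explore them quickly, I thought I'd spend some time with Spark to see whether it could help.
I'm using the crime data set published by the City of Chicago: it's 1GB in size and contains details of about 4 million crimes: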
$ ls -alh ~/Downloads/Crimes_-_2001_to_present.csv
-rw-r--r--@ 1 markneedham staff 1.0G 16 Nov 12:14 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
$ wc -l ~/Downloads/Crimes_-_2001_to_present.csv
4193441 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
We can get a rough idea of the file's contents by looking at the header and the first row:
$ head -n 2 ~/Downloads/Crimes_-_2001_to_present.csv
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,"(41.75017626412204, -87.55494559131228)"
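I wanted to count the values in the "Primary Type" column to see how many crimes of each type we have. Using only Unix command line tools, this is how we'd do it: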
$ time tail -n +2 ~/Downloads/Crimes_-_2001_to_present.csv | cut -d, -f6 | sort | uniq -c | sort -rn
859197 THEFT
757530 BATTERY
489528 NARCOTICS
488209 CRIMINAL DAMAGE
257310 BURGLARY
253964 OTHER OFFENSE
247386 ASSAULT
197404 MOTOR VEHICLE THEFT
157706 ROBBERY
137538 DECEPTIVE PRACTICE
124974 CRIMINAL TRESPASS
47245 PROSTITUTION
40361 WEAPONS VIOLATION
31585 PUBLIC PEACE VIOLATION
26524 OFFENSE INVOLVING CHILDREN
14788 CRIM SEXUAL ASSAULT
14283 SEX OFFENSE
10632 GAMBLING
8847 LIQUOR LAW VIOLATION
6443 ARSON
5178 INTERFERE WITH PUBLIC OFFICER
4846 HOMICIDE
3585 KIDNAPPING
3147 INTERFERENCE WITH PUBLIC OFFICER
2471 INTIMIDATION
1985 STALKING
355 OFFENSES INVOLVING CHILDREN
219 OBSCENITY
86 PUBLIC INDECENCY
80 OTHER NARCOTIC VIOLATION
12 RITUALISM
12 NON-CRIMINAL
6 OTHER OFFENSE
2 NON-CRIMINAL (SUBJECT SPECIFIED)
2 NON - CRIMINAL
real 2m37.495s
user 3m0.337s
sys 0m1.471s
This isn't too bad.
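As a point of comparison, the same count can be sketched as a single awk pass that keeps one counter per crime type, so only the final handful of totals needs sorting rather than all four million lines. This is an illustrative variant and has not been timed here. The function name `count_by_column` and its arguments are hypothetical, and like the `cut` pipeline it assumes no field before the chosen column contains a quoted comma.

```shell
# Count rows per value of one comma-separated column in a single pass.
# Hypothetical helper; assumes no quoted commas appear before the column.
count_by_column() {
  file="$1"    # path to the CSV file
  column="$2"  # 1-based column index, e.g. 6 for "Primary Type"
  # NR > 1 skips the header row; counts[] holds one tally per distinct value.
  awk -F, -v col="$column" 'NR > 1 { counts[$col]++ }
    END { for (key in counts) print counts[key], key }' "$file" | sort -rn
}
```

It would be called as `count_by_column ~/Downloads/Crimes_-_2001_to_present.csv 6`, printing one line per crime type with the largest count first.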