Workflow:
Data source:
Dataset preview (the raw data has about 5 million rows, too large for Excel to open, so Notepad++ was used):
Data cleaning:
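The cleaning step itself appears only as screenshots. As a rough illustration, a minimal pandas sketch that would produce the air1998ForPre.csv file consumed by the prediction step below might look like this; the Cancelled filter and the "delayed means ArrDelay > 0" rule are my assumptions, not something the original spells out:

# Hypothetical cleaning sketch (assumptions noted above)
import pandas as pd

df = pd.read_csv("1998.csv")
df = df[df["Cancelled"] == 0]                      # assumption: drop cancelled flights
cols = ["DayofMonth", "DayOfWeek", "FlightNum", "Distance", "ArrDelay"]
df = df[cols].dropna()                             # keep only the columns used later
df["ArrDelay"] = (df["ArrDelay"] > 0).astype(int)  # binarize the target: 1 = arrived late
df.to_csv("air1998ForPre.csv", index=False)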
Storing the data in HDFS:
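The cleaned files are uploaded to the HDFS paths that the PySpark code below reads from, e.g. with hdfs dfs -put 1998.csv /air1998/1998.csv and hdfs dfs -put airports-csv.csv /airports/airports-csv.csv.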
Analyzing the data with PySpark:
# Load the data
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

airportsData = sqlContext.read.format("com.databricks.spark.csv").\
    options(header="true", inferSchema="true").\
    load("hdfs://localhost:9000/airports/airports-csv.csv")
# airportsData.show()
airportsData.registerTempTable("airports")

flightData = sqlContext.read.format("com.databricks.spark.csv").\
    options(header="true", inferSchema="true").\
    load("hdfs://localhost:9000/air1998/1998.csv")
# flightData.show()
flightData.registerTempTable("flights")
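A quick sanity check after loading, since the raw file is about 5 million lines:

# Optional sanity checks
flightData.count()        # should be on the order of 5 million
flightData.printSchema()  # confirm inferSchema produced numeric columns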
# Data analysis
# Count the total number of departures in 1998 within different time-of-day windows
sqlContext.sql("SELECT COUNT(FlightNum) FROM flights WHERE DepTime BETWEEN 0 AND 600").show()
sqlContext.sql("SELECT COUNT(FlightNum) FROM flights WHERE DepTime BETWEEN 601 AND 1000").show()
sqlContext.sql("SELECT COUNT(FlightNum) FROM flights WHERE DepTime BETWEEN 1001 AND 1400").show()
sqlContext.sql("SELECT COUNT(FlightNum) FROM flights WHERE DepTime BETWEEN 1401 AND 1900").show()
sqlContext.sql("SELECT COUNT(FlightNum) FROM flights WHERE DepTime BETWEEN 1901 AND 2359").show()
(One of the outputs is missing here.)
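As an aside, the five counts above can also be computed in a single scan with a CASE expression; a sketch (the window labels are mine, and any DepTime values above 2359 would fall into the last bucket here):

sqlContext.sql("""
    SELECT CASE WHEN DepTime <= 600  THEN '0000-0600'
                WHEN DepTime <= 1000 THEN '0601-1000'
                WHEN DepTime <= 1400 THEN '1001-1400'
                WHEN DepTime <= 1900 THEN '1401-1900'
                ELSE '1901-2359' END AS timeWindow,
           COUNT(FlightNum) AS departures
    FROM flights
    GROUP BY CASE WHEN DepTime <= 600  THEN '0000-0600'
                  WHEN DepTime <= 1000 THEN '0601-1000'
                  WHEN DepTime <= 1400 THEN '1001-1400'
                  WHEN DepTime <= 1900 THEN '1401-1900'
                  ELSE '1901-2359' END
""").show()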
We also computed which airports had the fewest zero-delay flights, the details of the most severe delays, and similar statistics; hypothetical versions of these queries are sketched below.
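The original queries for these are not included; the following are sketches, interpreting "zero delay" as ArrDelay = 0 and assuming the dataset's standard Origin/Dest columns:

# Airports with the fewest zero-delay flights (hypothetical sketch)
sqlContext.sql("SELECT Origin, COUNT(*) AS zeroDelay FROM flights WHERE ArrDelay = 0 GROUP BY Origin ORDER BY zeroDelay ASC LIMIT 10").show()
# The most severely delayed flights (hypothetical sketch)
sqlContext.sql("SELECT FlightNum, Origin, Dest, ArrDelay FROM flights ORDER BY ArrDelay DESC LIMIT 10").show()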
Visualizing the statistics:
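The charts themselves are screenshots; a minimal matplotlib sketch for the departure counts per time window might look like this (re-running the window queries above and collecting the results):

# Hypothetical visualization sketch
import matplotlib.pyplot as plt

windows = [(0, 600), (601, 1000), (1001, 1400), (1401, 1900), (1901, 2359)]
labels = ["%04d-%04d" % w for w in windows]
counts = [sqlContext.sql(
    "SELECT COUNT(FlightNum) FROM flights WHERE DepTime BETWEEN %d AND %d" % w
).collect()[0][0] for w in windows]

plt.bar(range(len(labels)), counts)
plt.xticks(range(len(labels)), labels)
plt.xlabel("DepTime window")
plt.ylabel("Departures")
plt.savefig("departures_by_window.png")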
Predicting whether a flight will be on time:
Logistic regression is used for binary classification, with four extracted feature columns that relate to whether a flight arrives on time:
// Core code (excerpt; imports from Mahout, Guava, and commons-io)
import java.io.File;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.List;
import java.util.Locale;
import org.apache.commons.io.FileUtils;
import org.apache.mahout.classifier.sgd.CsvRecordFactory;
import org.apache.mahout.classifier.sgd.LogisticModelParameters;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import com.google.common.base.Charsets;
import com.google.common.collect.Lists;

// Configure the model parameters
LogisticModelParameters lmp = new LogisticModelParameters();
PrintWriter output = new PrintWriter(new OutputStreamWriter(System.out,
        Charsets.UTF_8), true);
lmp.setLambda(0.001);
lmp.setLearningRate(50);
lmp.setMaxTargetCategories(2); // binary target: on time (0) or delayed (1)
lmp.setNumFeatures(4);         // four feature columns
List<String> targetCategories = Lists.newArrayList("0", "1"); // the two target classes (assumes the cleaned file encodes ArrDelay as 0/1)
lmp.setTargetCategories(targetCategories);
lmp.setTargetVariable("ArrDelay"); // the attribute to predict
List<String> typeList = Lists.newArrayList("numeric", "numeric", "numeric", "numeric");
List<String> predictorList = Lists.newArrayList("DayofMonth", "DayOfWeek", "FlightNum", "Distance"); // feature columns
lmp.setTypeMap(predictorList, typeList);
// Read the data (using commons-io for file reading)
List<String> raw = FileUtils.readLines(new File("/home/hadoop/桌面/大数据三级项目/清洗后的数据/air1998ForPre.csv"), "UTF-8");
String header = raw.get(0);
List<String> content = raw.subList(1, raw.size());
// Parse the CSV
CsvRecordFactory csv = lmp.getCsvRecordFactory();
csv.firstLine(header); // the factory reads the header line to map column names
// Train: a single online (SGD) pass over the data
OnlineLogisticRegression lr = lmp.createRegression();
for (String line : content) {
    Vector input = new RandomAccessSparseVector(lmp.getNumFeatures());
    int targetValue = csv.processLine(line, input);
    lr.train(targetValue, input);
}
// Report accuracy
double correct = 0;
double sampleCount = content.size();
for (String line : content) {
    Vector v = new SequentialAccessSparseVector(lmp.getNumFeatures());
    int target = csv.processLine(line, v);
    int predicted = lr.classifyFull(v).maxValueIndex(); // the core classification call
    if (predicted == target) {
        correct++;
    }
}
output.printf(Locale.ENGLISH, "Rate = %.2f%n", correct / sampleCount);
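Note that this accuracy is measured on the same rows the model was trained on, so the figure below is likely optimistic; holding out a separate test split would give a fairer estimate.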
The output result file:
Accuracy (around 0.9):