Memoir: My Award-Winning Experience in the First IKCEST "Belt and Road" International Big Data Competition (2019)

Chinese Version

Summary

As team leader, from May 20, 2019 to August 1, 2019, a period of about two and a half months, I led a team of several postgraduates from my university in the Baidu big data competition, where we achieved a strong result of 18th out of 2,312 teams (top 0.78%). The main task of the competition was functional classification of urban areas, a multi-modal classification task involving image recognition and text feature mining.

During the competition, our team went through quite a few setbacks but also gained a great deal. We ultimately proposed an image-text fusion recognition network and a voter based on text features, finishing 17th in the preliminary round and 18th in the semi-final.

Implementation Experience

We started with the images, using ResNeXt as the model and feeding the images into it, but found that the accuracy stayed at around 50%, which was not very satisfactory.

We then examined the images carefully and found that about 20% of the remote sensing images were severely hazy, and some even contained large black blocks. So we cleaned the images, removing those containing black blocks from the dataset and dehazing the remaining ones; as a result, the image-only accuracy reached 55%.

Next, we extracted the time-series information from the text and converted it into 128×24 images, which we trained with DPN26. The outputs of the image recognizer and the text recognizer were concatenated and fed into a fully connected (fc) layer; this is the structure of Net1. At this point, the combined recognition accuracy was 64%.

We then began training with stacking-based ensemble learning. By applying various operations such as TTA, scaling, up/down-sampling, and weighting to the images, and feature extraction to the text, we obtained six more network models, Net2~Net7. We also split the training set into 5 folds for cross-training and prediction, took a weighted average of the first-stage network outputs, and trained XGBoost on them as a second stage. After this step, the recognition accuracy rose to 76%.

At this point, we found there was still a gap in accuracy between us and the top-ranked teams. Combining this with information from the competition BBS, we realized that there are correlations between a user's visit records in different regions, that these correlations arise through the user ID, and that such information cannot be mined by a time-series model.

We therefore creatively proposed a voter based on the number of times a single user appears in a given area to further mine the text features, which ultimately raised the accuracy to 81.62%. We later proposed an even stronger voter based on hour counts, but due to limited computing power and time it was never implemented.

After some manual adjustment of the results, our team ended up with 82.18%, that is, 18th place. It is a bit of a pity that we could not achieve a higher rank.

Conclusion

This competition not only improved my hands-on skills, strengthened my ability to learn on my own, and cultivated my interest in machine learning, but also taught me a great deal of theory and practical technique: during this period I gained a fairly deep understanding and command of neural network tuning, common ensemble learning methods, mainstream CNN architectures, and the general workflow of text feature engineering. This laid a solid practical foundation for my future academic research in related fields.

English Version

Summary

My best research experience during my undergraduate studies was a big data competition I took part in last year.

Last year, I led several postgraduates in the first IKCEST "Belt and Road" international big data competition. The competition ran from May to August, about two and a half months, and we finished 18th out of 2,312 teams. The main task of the competition was functional classification of urban areas, a multi-modal classification task involving image recognition and text feature mining.

During the competition, our team experienced many setbacks but also gained a lot. We ultimately proposed an image-text fusion recognition network and a text-feature-based voter, ranking 17th in the preliminary round and 18th in the semi-final.

Implementation Details

Here is the detailed process:

First, we started with the images, using ResNeXt as the model and feeding the images into the network, but found that the accuracy stayed at about 50%, which was far from satisfactory.
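A minimal sketch of this image-only baseline, assuming a PyTorch/torchvision setup; the class count and optimizer settings are illustrative placeholders, not the exact values we used:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 9  # assumed number of area-function categories; adjust to the real label set

# ResNeXt-50 backbone with its final layer replaced for the area-classification task
model = models.resnext50_32x4d(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    """One optimization step on a batch of remote-sensing images."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)            # (batch, NUM_CLASSES)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```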

After examining the images carefully, we found that about 20% of the remote sensing images suffered from serious haze and some even contained large black blocks. We therefore cleaned the images, removing those with black blocks from the data set and dehazing the remaining ones, which brought the image-only accuracy up to 55%.
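A rough sketch of the cleaning step, assuming the black blocks can be detected by counting near-black pixels; the thresholds are illustrative, and the dehazing itself (e.g. a dark-channel-prior method) is not shown:

```python
import numpy as np
from PIL import Image

def has_black_block(path, pixel_thresh=10, area_ratio=0.2):
    """Return True if a large fraction of the image is near-black.

    pixel_thresh: grayscale value below which a pixel counts as black (assumed).
    area_ratio:   fraction of black pixels above which the image is discarded (assumed).
    """
    gray = np.asarray(Image.open(path).convert("L"))
    black_fraction = np.mean(gray < pixel_thresh)
    return black_fraction > area_ratio

# all_image_paths is a hypothetical list of training image file paths;
# keep only images without large black blocks, then dehaze the survivors separately
clean_paths = [p for p in all_image_paths if not has_black_block(p)]
```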

Then we extracted the time-series information from the text, transformed each text file's time-series features into a 128-by-24 matrix, and trained a DPN26 on it. We concatenated the 1D output vectors of the image recognizer and the text recognizer and fed the result into a fully connected layer; that is the structure of Net1. At this stage, the combined recognition accuracy was 64%.
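A sketch of the Net1 fusion idea under these assumptions: the image branch is a ResNeXt-style backbone with its classifier head removed, the text branch is a DPN-style CNN over the 128×24 matrix (its implementation is not shown), and the layer sizes and class count are illustrative:

```python
import torch
import torch.nn as nn

class Net1(nn.Module):
    """Image-text fusion: concatenate the two recognizers' features, then classify."""

    def __init__(self, image_backbone, text_backbone,
                 image_dim=2048, text_dim=512, num_classes=9):
        super().__init__()
        self.image_backbone = image_backbone   # e.g. ResNeXt with its fc layer removed
        self.text_backbone = text_backbone     # e.g. DPN-style CNN over the 128x24 matrix
        self.fc = nn.Linear(image_dim + text_dim, num_classes)

    def forward(self, image, text_matrix):
        img_feat = self.image_backbone(image)           # (batch, image_dim)
        txt_feat = self.text_backbone(text_matrix)      # (batch, text_dim)
        fused = torch.cat([img_feat, txt_feat], dim=1)  # concatenate the 1D feature vectors
        return self.fc(fused)                           # (batch, num_classes)
```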

Afterwards, we began training with stacking and other ensemble learning methods. We constructed six more network models, Net2~Net7, by applying extra operations to the data, including TTA, scaling, up- and down-sampling, weighting, and feature extraction. We also split the training set into 5 folds for stacking, took a weighted average of the first-stage network outputs, and used XGBoost for second-stage training. After these steps, the recognition accuracy increased to 76%.
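A simplified sketch of the second stage, assuming the base networks' out-of-fold class probabilities have already been computed during the 5-fold pass; the way the weighted average and the raw probabilities are combined into meta-features, and the XGBoost hyperparameters, are illustrative:

```python
import numpy as np
from xgboost import XGBClassifier

def stack_with_xgboost(oof_probs, labels, model_weights=None):
    """Second-stage training on the base networks' out-of-fold predictions.

    oof_probs:     list of (n_samples, n_classes) arrays, one per base model (Net1..Net7),
                   predicted on the folds each model did NOT train on (assumed precomputed).
    labels:        (n_samples,) ground-truth classes.
    model_weights: optional per-model weights for the weighted average of first-stage outputs.
    """
    if model_weights is None:
        model_weights = np.ones(len(oof_probs))

    # weighted average of the first-stage outputs ...
    averaged = np.average(np.stack(oof_probs), axis=0, weights=model_weights)
    # ... concatenated with the per-model probabilities as meta-features
    meta_features = np.hstack([averaged] + list(oof_probs))

    meta = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
    meta.fit(meta_features, labels)
    return meta
```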

Unfortunately, we found there was still a gap between ourselves and the top teams. Combining this with information from the competition BBS, we realized that there were associations between a user's access records in different regions, that these associations arise through the user ID, and that such information could not be mined by our previous time-series model.

So we creatively proposed a voter based on the number of times a single user appeared in a certain area to further mine the text features, which finally improved the accuracy to 81.62%. Shortly afterwards, we proposed another, stronger voter based on hour counts, but due to our limited computing power and time, the latter idea was never implemented.
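A rough sketch of the voting idea, under the assumption that the text records can be reduced to (user_id, area_id, count) visit tuples and that the labels of training areas are known; the weighting by visit count and the tie-breaking are illustrative choices:

```python
from collections import Counter, defaultdict

def build_user_votes(train_visits, train_labels):
    """train_visits: iterable of (user_id, area_id, count) records from the text data.
    train_labels:  dict mapping area_id -> functional class of the training areas.
    Returns dict user_id -> Counter of class votes, weighted by how often the user
    appeared in areas of that class (assumed weighting scheme)."""
    votes = defaultdict(Counter)
    for user_id, area_id, count in train_visits:
        if area_id in train_labels:
            votes[user_id][train_labels[area_id]] += count
    return votes

def vote_for_area(test_visits_of_area, user_votes):
    """test_visits_of_area: list of (user_id, count) pairs observed in one test area.
    Returns the class favoured by the users seen in that area, or None if no user
    overlaps with the training set (in which case we fall back to the model output)."""
    tally = Counter()
    for user_id, count in test_visits_of_area:
        for cls, weight in user_votes.get(user_id, {}).items():
            tally[cls] += weight * count
    return tally.most_common(1)[0][0] if tally else None
```

In practice such votes would be blended with the ensemble's class probabilities rather than overriding them outright; how that blending was weighted is not detailed here.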

After some manual adjustment of the results, our team finally reached 82.18%, that is, 18th place on the leaderboard. It is still a pity that we did not achieve a higher rank.

Conclusion

During this competition, I not only improved my self-study and hands-on abilities and cultivated my interest in machine learning and AI research, but also learned a lot of theoretical knowledge and practical skills, including neural network tuning, ensemble learning, customizing CNN architectures, and feature engineering. This experience laid a solid practical foundation for my future academic research in related fields.

But I also found that I was still inexperienced in data science: given a specific machine learning project, I could not quickly find or construct an effective model and had to experiment extensively with existing ones. In addition, I am not yet good at writing papers or familiar with how to conduct academic research in a particular field; these are deficiencies I intend to work on and improve in my later research career.
