My writing is bad. This is my term paper for the CS5312 Deep Learning course.

Section II: List and highlight of papers you have studied. In this section, I separate the papers into three parts: neural networks, algorithms, and hardware designs.

1.NN-networks

Gradient-Based Learning Applied to Document Recognition. Yann LeCun, Yoshua Bengio. (1998) The neural network used in this paper is called LeNet, and it performs well on the MNIST dataset. Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. This paper shows the potential of neural networks in computer vision problems.

ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Geoffrey E. Hinton. (2012) AlexNet is a deep convolutional neural network (deep CNN) created to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest, which is held by Fei-Fei Li (Stanford), into 1000 different classes. Compared with LeNet, AlexNet is deeper and needs GPUs to train. Dropout is also used in AlexNet, which substantially reduces overfitting and therefore the test error. Momentum and data augmentation are also useful in this network. https://github.com/kratzert/finetune_alexnet_with_tensorflow

Very Deep Convolutional Networks for Large-Scale Image Recognition. Karen Simonyan, Andrew Zisserman. (VGGNet, ICLR2015) VGGNet is even deeper than AlexNet, and it achieves strong results on many datasets, such as ILSVRC. https://github.com/machrisaa/tensorflow-vgg

Going Deeper with Convolutions, Christian Szegedy. (GoogLeNet, CVPR2015) With this paper Google succeeded in building one of the deepest neural networks of its time. These networks share some architectural components, such as fully connected layers, pooling layers and softmax. The Inception module is the main innovation of GoogLeNet. I have also studied the differences between these networks, which made it easier for me to grasp these complex and very deep networks. Experiments on these networks are necessary. https://github.com/lim0606/caffe-googlenet-bn

Deep Residual Learning for Image Recognition, Kaiming He. (CVPR 2016) VGGNet and GoogLeNet kept getting deeper, and ResNet, which is like a wonderful piece of art, is even deeper than GoogLeNet. But very deep networks suffer from vanishing or exploding gradients, and when a plain network gets deeper its training error can get worse instead of better (the degradation problem). In short, ResNet adds a shortcut connection from the input of a block to its output to ease this problem. The results on the ILSVRC dataset show that ResNet brings a significant improvement for very deep networks. By the way, Kaiming He, who graduated from Tsinghua (Beijing), is my favorite researcher. https://github.com/ry/tensorflow-resnet
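To make the shortcut idea concrete, here is a minimal sketch in plain Python. It is only an illustration under simplifying assumptions: the helper names (`linear`, `residual_block`) are hypothetical, the layers are tiny fully connected ones instead of the convolutions used in the paper, and there is no batch normalization. The block computes relu(F(x) + x), so when F outputs zero a non-negative input passes through unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is linear -> ReLU -> linear.

    Simplified stand-in for a ResNet block (assumption: no convolutions,
    no batch normalization, input and output have the same size).
    """
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.5, 2.0]

y_zero = residual_block(x, zeros, zeros)        # F(x) = 0, shortcut carries x
y_ident = residual_block(x, identity, identity)  # F(x) = x, output is 2 * x
```

Because the shortcut is a plain addition, the gradient with respect to x always contains an identity term, which is one common explanation of why such blocks are easier to train.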

Long Short-Term Memory, Jürgen Schmidhuber. (1997) RNNs are different from the networks introduced in the papers above, and LSTM, created by Sepp Hochreiter and Jürgen Schmidhuber (Switzerland), is the most useful recurrent neural network. Compared with CNNs, RNNs can use their feedback connections to store representations of recent input events in the form of activations. LSTM uses a special unit, the memory cell, whose content is controlled by gates; the forget gate, which decides how much of the old cell state to keep, was added in later work. These units are then chained over the time steps. Several later papers introduce efficient variations of the unit and the architecture. https://github.com/nicodjimenez/lstm
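The gated unit can be sketched in a few lines. The following is a toy illustration, not the paper's formulation: it assumes a scalar input and a scalar hidden state, uses the now-common variant with input, forget and output gates, and the parameter layout (`p` mapping a gate name to a `(w_x, w_h, bias)` triple) is hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    p maps each gate name to a (w_x, w_h, bias) triple. The names are
    hypothetical and chosen only for this illustration.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # additive cell-state update
    h = o * math.tanh(c)              # new hidden state
    return h, c


gate_names = ("input", "forget", "output", "candidate")
params = {name: (0.5, 0.5, 0.0) for name in gate_names}

# Chain the unit over a short input sequence.
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
```

The additive update of the cell state `c` is what lets information (and gradients) survive over many time steps.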

On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, Yoshua Bengio, University of Montreal. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Yoshua Bengio, University of Montreal. (2014) The first paper introduces a special RNN unit named GRU (Gated Recurrent Unit). The second paper gives an empirical evaluation of GRU on polyphonic music and speech signal modeling tasks. Nothing is more convincing than high accuracy on these "famous datasets". The formulas are not difficult to derive once we have learned the backpropagation algorithm. https://github.com/dennybritz/rnn-tutorial-gru-lstm
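For reference, the GRU update is usually written as follows. This is a sketch with biases omitted; the symbols W, U (weight matrices), z_t (update gate) and r_t (reset gate) follow common usage, and the two papers differ on whether z_t or 1 - z_t multiplies the old state, so the last line shows only one of the two conventions.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh\big(W x_t + U (r_t \odot h_{t-1})\big) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
```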

Bidirectional Recurrent Neural Networks, Mike Schuster. (1997) As the name suggests, the architecture is bidirectional: one RNN reads the sequence forward and another reads it backward, so the framework looks symmetrical. Consequently, we need to run the backward direction over the sequence with the same process as the forward direction.

Speech Recognition with Deep Recurrent Neural Networks, Geoffrey Hinton. (ICASSP 2013) This paper uses deep RNNs for speech recognition. The model is trained and tested on the TIMIT corpus (https://catalog.ldc.upenn.edu/LDC93S1). An end-to-end training strategy is also used in the paper. https://github.com/zzw922cn/Automatic_Speech_Recognition

Sequence-to-Sequence Learning with Neural Networks. This paper uses multilayer LSTMs to map an input sequence to an output sequence and evaluates the approach on machine translation. The special idea the authors present is the sequence-to-sequence strategy: an encoder turns the input sequence into a fixed-dimensional vector, and a decoder generates the output sequence from that vector. https://github.com/pannous/tensorflow-speech-recognition

Generative Adversarial Nets, Ian J. Goodfellow. A short title often signals confidence in the research. GAN may be the most exciting invention of the last five years. I presented this paper last week and have been absorbed in reproducing projects that apply GANs, just like an army equipped with a "GUN". This paper briefly describes a discriminative model and a generative model that are trained against each other. To cite one analogy from the article, the generative model is like "a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles." So this architecture is like a zero-sum game. Goodfellow and I both love drinking, and it is said that he finished this great work when he was drunk. It's a legendary story, ha. http://www.github.com/goodfeli/adversarial
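The game between the two models can be written as a minimax objective. Here D is the discriminator, G the generator, p_data the data distribution and p_z the noise prior that G samples from:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make V large by telling real samples from generated ones, while the generator tries to make it small by fooling the discriminator.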

Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1, Yoshua Bengio. (2016) Bengio and his co-authors introduce a method to train Binarized Neural Networks (BNNs), which are neural networks with binary weights and activations at run-time. Such networks can run very fast on GPUs with a binary matrix multiplication kernel. https://github.com/HirokiNakahara/GUINNESS
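A minimal sketch of the two ingredients, assuming the deterministic sign binarization and a straight-through gradient estimator that cancels the gradient when the real value is outside [-1, 1]. The function names are hypothetical, and a real implementation would apply this to whole tensors inside a training loop.

```python
def binarize(x):
    """Deterministic binarization: +1 if x >= 0, otherwise -1."""
    return 1.0 if x >= 0 else -1.0


def straight_through_grad(x, grad_out):
    """Gradient of the loss w.r.t. the real-valued x behind sign(x).

    sign() has zero gradient almost everywhere, so the incoming gradient
    is passed through unchanged, except where |x| > 1, where it is cancelled.
    """
    return grad_out if abs(x) <= 1.0 else 0.0


real_weights = [0.3, -0.7, 0.0, 1.8]
binary_weights = [binarize(w) for w in real_weights]
grads = [straight_through_grad(w, 1.0) for w in real_weights]
```

The real-valued weights are kept during training and updated with these gradients; only their binarized copies are used in the forward pass.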

SparseNN: A Performance-Efficient Accelerator for Large-Scale Sparse Neural Networks (2017) Sparse data remains a challenging problem, and there are many methods to deal with it. This paper shows how to use neural networks with FPGA accelerators to analyze sparse data.

SqueezeNet: AlexNet-level accuracy with 50x Fewer Parameters and <0.5MB Model Size, Song Han. (2016) Recent research on deep convolutional neural networks (CNNs) has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple CNN architectures that achieve that accuracy level. In this paper, with model compression techniques, the authors are able to compress SqueezeNet to less than 0.5MB (510× smaller than AlexNet). https://github.com/DeepScale/SqueezeNet

2.algorithms

Gradient-Based Learning Applied to Document Recognition, Yann LeCun. (1998) This paper presents one of the first convolutional networks. Gradient descent is useful for the optimization: at each step we compute the partial derivatives of the loss function with respect to the weights and update the weights of the neural network. LeNet is built on these algorithms, and it beat other methods on document recognition with the MNIST dataset. https://github.com/udacity/CarND-LeNet-Lab
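The update rule can be shown on a toy problem. This sketch assumes a one-parameter model y = w * x with a mean squared error loss; the data, the learning rate and the function names are made up for illustration and have nothing to do with LeNet itself.

```python
def loss(w, data):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad(w, data):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2 * x
w = 0.0
learning_rate = 0.05

for _ in range(200):
    w -= learning_rate * grad(w, data)  # step against the gradient
```

In a multilayer network the same update is applied to every weight, and backpropagation is the procedure that computes all the partial derivatives efficiently.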

Handwritten Digit Recognition with a Back-Propagation Network, Yann LeCun. (NIPS1990) This paper was published in 1990. At that time, there were no GPUs, and computers were hardly powerful enough to experiment with the BP algorithm. But this paper still gives an excellent algorithm and derivation of the formulas. I think gradient descent and backpropagation are closely linked: backpropagation computes the gradients that gradient descent then uses to update the weights.

Rectified Linear Units Improve Restricted Boltzmann Machines, Geoffrey Hinton. (ICML2010) ReLU is a simple activation function used in many networks. The paper uses a noisy version, max(0, x + N(0, σ(x))). ReLU gives a large improvement compared with the tanh or sigmoid function. https://blog.csdn.net/zchang81/article/details/70224688
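Both forms are easy to write down. In this sketch σ(x) is taken to be the logistic sigmoid and is used as the variance of the Gaussian noise in max(0, x + N(0, σ(x))); the function names and the fixed random seed are choices made only for this illustration.

```python
import math
import random


def relu(x):
    """Plain rectified linear unit."""
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noisy_relu(x, rng):
    """Sample max(0, x + N(0, sigmoid(x))), with sigmoid(x) as the variance."""
    std = math.sqrt(sigmoid(x))
    return max(0.0, x + rng.gauss(0.0, std))


rng = random.Random(0)  # fixed seed so the samples are reproducible
samples = [noisy_relu(2.0, rng) for _ in range(1000)]
```

Unlike sigmoid or tanh, the rectifier does not saturate for positive inputs, which is one reason it trains well in deep networks.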

Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Geoffrey Hinton. (JMLR2014) Dropout means randomly dropping units (together with their connections) during training, which effectively reduces the overfitting problem. ReLU and dropout are both used in Restricted Boltzmann Machine experiments conducted by Geoffrey Hinton. https://github.com/mdenil/dropout
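A small sketch of dropout on a list of activations. Note the assumption: this is the "inverted" variant that rescales the surviving units during training, which is a common implementation choice, whereas the paper describes scaling the weights at test time instead. The function name and the seed are hypothetical.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout.

    During training each unit is zeroed with probability p_drop and the
    survivors are divided by the keep probability, so the expected value
    of every activation stays the same. At test time nothing is changed.
    """
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
hidden = [1.0] * 10
train_out = dropout(hidden, 0.5, rng)                  # zeros and 2.0s
test_out = dropout(hidden, 0.5, rng, training=False)   # unchanged
```

Because a different random subset of units is dropped at every step, the network cannot rely on any single unit, which is what reduces overfitting.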

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Szegedy. (2015) Tuning parameters is always important in deep learning. Batch normalization also acts as a regularizer, so it is another way to reduce overfitting. Unlike dropout, BN normalizes the inputs of each layer over the mini-batch, which is similar to applying data preprocessing inside the network. https://github.com/vacancy/Synchronized-BatchNorm-PyTorch http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html
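The training-time transform for a single feature can be sketched as below. It assumes plain Python lists for the mini-batch; `gamma` and `beta` are the learnable scale and shift, shown here with fixed default values, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m  # biased mini-batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0])
```

After the transform the feature has roughly zero mean and unit variance within the batch, whatever the scale of the incoming activations was.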

Improving the speed of neural networks on CPUs. The techniques described in this paper extend readily to neural network training and provide an effective alternative to the use of specialized hardware.

Representation Learning: A Review and New Perspectives, Yoshua Bengio. (2013) This is also encouraging work. Representation learning has become a field in itself in the machine learning community.

3.hardware designs

In-Datacenter Performance Analysis of a Tensor Processing Unit, Google Inc. (2017) This paper describes a new product by Google and shows the power of the TPU. Compared with contemporary Intel CPUs and GPUs, the TPU shows a large improvement.

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization, and Huffman Coding, Song Han. (2016) For many mobile-first companies such as Baidu and Facebook, apps are updated via different app stores, and these companies are very sensitive to the size of the binary files. Deep Compression is a useful method to compress neural network models, which can make huge cuts in storage cost. Based on “deep compression”, the EIE hardware accelerator was later proposed; it works on the compressed model and achieves significant speedup and energy efficiency improvement. https://github.com/songhan/Deep-Compression-AlexNet
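The first two stages named in the title, pruning and quantization by weight sharing, can be sketched on a toy weight list. Everything here is an illustrative assumption: the threshold, the weights and the fixed `codebook` are made up (the paper learns the centroids and fine-tunes the network), and the Huffman coding stage is not shown.

```python
def prune(weights, threshold):
    """Magnitude pruning: set weights with |w| below the threshold to zero."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize(weights, centroids):
    """Weight sharing: store the index of the nearest centroid per weight.

    Pruned (zero) weights are marked with None and need no index.
    """
    def nearest(w):
        return min(range(len(centroids)), key=lambda k: abs(w - centroids[k]))

    return [None if w == 0.0 else nearest(w) for w in weights]


weights = [0.02, -0.91, 0.48, -0.05, 1.03, -0.52]
pruned = prune(weights, 0.1)

codebook = [-1.0, -0.5, 0.5, 1.0]  # hypothetical shared values
indices = quantize(pruned, codebook)
restored = [0.0 if i is None else codebook[i] for i in indices]
```

With four centroids each remaining weight needs only a 2-bit index instead of a 32-bit float, which is where most of the storage saving comes from.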

Cambricon-X: An Accelerator for Sparse Neural Networks, Cambricon Inc, China. Cambricon is an AI chip company. In this paper, experimental results over a number of representative sparse networks show that the accelerator achieves, on average, 7.23x speedup and 6.43x energy saving against the state-of-the-art NN accelerator.

DaDianNao: A Machine-Learning Supercomputer, ICT lab, China. This paper shows that it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. The authors implement the node down to the place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs, SenseTime Inc, China. On the Xilinx ZCU102 platform, this paper achieves an average of 1006.4 GOP/s for the convolutional layers and 854.6 GOP/s overall for AlexNet, and an average of 3044.7 GOP/s for the convolutional layers and 2940.7 GOP/s overall for VGG16.

From High-Level Deep Neural Models to FPGAs, GT. Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains. FPGAs are an attractive choice for DNNs since they offer a programmable substrate for acceleration and are becoming available across different market segments.

FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks, PKU&UCLA. Long Short-Term Memory recurrent neural networks (LSTM-RNNs) have been widely used for speech recognition, machine translation, scene analysis, etc. This paper presents an FPGA-based accelerator for LSTM-RNNs that optimizes both computation performance and communication requirements. The peak performance of the accelerator reaches 7.26 GFLOP/s, which significantly outperforms previous approaches.

ShiDianNao: Shifting Vision Processing Closer to the Sensor, ICT lab, China. We can see that Chinese corporations have made great progress in the AI chip area. This paper proposes a CNN accelerator placed next to a CMOS or CCD sensor. It presents a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and a power consumption of only 320 mW, but it is still about 30× faster than high-end GPUs.

4.Additional papers found by myself

Word embeddings: https://github.com/Embedding/Chinese-Word-Vectors

word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method, Yoav Goldberg. (2014) This paper is short. Word2vec is an important algorithm for converting words into vectors that we can process. Word2vec, or word embedding in general, is common in many NLP problems.
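The negative-sampling objective that the paper derives can be written as follows. Here D is the set of observed (word, context) pairs, D' is a set of randomly sampled "negative" pairs, v_w and v_c are the word and context vectors, and θ stands for all of these vectors:

```latex
\arg\max_{\theta}
  \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w)
  + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

The first sum pulls the vectors of words and contexts that occur together closer, and the second sum pushes randomly paired ones apart.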

Adam: A Method for Stochastic Optimization https://arxiv.org/abs/1412.6980 This paper is not in the paper list, but Adam is really useful in most training problems because it needs little hyperparameter tuning. https://www.sohu.com/a/156495506_465975 https://github.com/bigdatagenomics/adam
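The update rule can be sketched for a single scalar parameter. The default values of `beta1`, `beta2` and `eps` below are the ones commonly quoted for Adam; the function name, the quadratic test function and the learning rate of 0.01 are assumptions made only for this example.

```python
import math


def adam_minimize(grad_fn, theta, steps, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient grad_fn."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy problem: minimize (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = adam_minimize(lambda th: 2 * (th - 3.0), theta=0.0, steps=5000, lr=0.01)
```

Dividing by the square root of the second-moment estimate gives every parameter its own effective step size, which is why the method is fairly insensitive to the scale of the gradients.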

YOLO: You Only Look Once: Unified, Real-Time Object Detection. (CVPR2016) https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf This paper presents a new, real-time approach to object detection. It may be the most famous paper among driverless car companies. https://github.com/gliese581gg/YOLO_tensorflow

A Neural Algorithm of Artistic Style https://arxiv.org/pdf/1508.06576.pdf This paper presents an artistic masterpiece. https://github.com/titu1994/Neural-Style-Transfer

Section III: Which invited talks inspired you the most? For me, the best talk was given by Professor Wang Yu. He is a really humorous man. He is also lucky to have a good student, Song Yao, and a good partner, Song Han. Building a company is difficult, but Prof. Wang concentrates on a special area, FPGA-based mobile AI chips. He also told us many stories about venture capital (VC). We only see how successful DeePhi has become, but seldom know how many problems they have overcome. He encouraged me to work harder and start my own business one day.

Section IV: What you have learned from classmates’ presentations? Thanks to my classmates for their brilliant presentations during the whole semester of this course. I found that my weak point is hardware design, which nevertheless attracts me a lot. From the ResNet presentation delivered by my friend Yujie Su, I found a useful neural network to train. For the FPGA papers, other classmates gave a full explanation of the hardware problems. Even though I had not learned any hardware methods before, I enjoyed listening to their detailed presentations.

Section V: Your novel ideas to design a CNN accelerator. Give a conceptual description. My own idea, though far from a novel one, is to run ResNet on an FPGA accelerator. Traditionally, GPUs are used to accelerate deep learning algorithms. But since experiments show that ResNet has good results on sparse data, I think ResNet with an FPGA accelerator could beat GPUs before long. I came up with this idea by myself and then searched for it on Google. Surprisingly, there are already some papers arguing for running ResNet on FPGAs. One paper claims that a low-precision, sparse ternary version of ResNet may achieve good results on an FPGA accelerator. That's all.