There are two ways to start a new data science project. Either you have an idea you want to implement, so you look for datasets that can bring your vision to life, or you come across an exciting dataset that inspires you to start a new project.

As a beginner, you’ll often feel a little lost, looking around for a good place to start a project. For me, a good starting place was always an interesting dataset that triggered my curiosity.

This article is about that second path: looking for datasets to inspire you. Sometimes you stumble across an intriguing dataset with so much potential that it sparks your creativity, and you can’t help but use it to build something great.

Wherever you are on your data science journey, whether you are just starting out or trying to grow your skills and build new ones, there is no better way to improve a skill than to practice it. The more projects you build, the more fluent in data science you will become, and the stronger and more appealing your profile will be.

To strengthen your data science skills and build a strong profile, you need to tackle five aspects of data science:


  1. Deep Learning

  2. Natural Language Processing

  3. Big Data

  4. Machine Learning

  5. Image Processing


In this article, I will present five datasets, one for each of the five aspects of data science. I will talk a little about how each dataset is constructed, its format, and some ideas for how you can use it to build fantastic projects.

Each of these datasets can be used for projects in more than one aspect of data science. So, use your creativity and start building up your profile, or grow the one you already have.

MNIST Datasets

There's no better place to start than with arguably the most famous collection of datasets of them all, the MNIST datasets. Here, we will talk about two of them:

  1. MNIST of handwritten digits.

  2. Fashion-MNIST.


MNIST of handwritten digits

The MNIST dataset is a collection of handwritten digits. The dataset has a training set of 60,000 images and a test set of 10,000 images that can be used for model evaluation. The digits in the dataset have been size-normalized and centered in a fixed-size image.

This dataset is excellent both for beginners who want to try learning techniques and pattern recognition methods and for intermediate data scientists who want to test their models on real-world data while spending minimal effort on preprocessing and formatting.


The original format of this dataset might be a little confusing for absolute beginners. Luckily, the dataset is also available in an easy-to-handle CSV format.
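
As a rough sketch of how the CSV version can be read, the snippet below assumes each row holds the label followed by 784 pixel values (0 to 255) in row-major order. That layout is an assumption about the CSV export, so check the header of the file you download. The helper name `parse_mnist_row` and the synthetic row are made up for illustration.

```python
import csv
import io

IMAGE_SIDE = 28  # MNIST images are 28x28 pixels


def parse_mnist_row(row):
    """Split one CSV row into (label, 28x28 grid of floats in [0, 1]).

    Assumes the row is: label, pixel_0, ..., pixel_783 (row-major order).
    """
    expected = 1 + IMAGE_SIDE * IMAGE_SIDE
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    label = int(row[0])
    pixels = [int(value) / 255.0 for value in row[1:]]
    image = [
        pixels[start:start + IMAGE_SIDE]
        for start in range(0, len(pixels), IMAGE_SIDE)
    ]
    return label, image


# Synthetic stand-in for one line of the CSV file: label 7, an all-black
# image except for one white pixel at row 0, column 1.
fake_line = ",".join(["7", "0", "255"] + ["0"] * 782)
label, image = parse_mnist_row(next(csv.reader(io.StringIO(fake_line))))
print(label, len(image), len(image[0]), image[0][1])
```

Scaling the pixels to the 0 to 1 range is a common preprocessing choice rather than a requirement of the dataset.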

Fashion-MNIST

Fashion-MNIST is a dataset of Zalando's article images. Just like the original MNIST dataset, it consists of a training set of 60,000 examples and a test set of 10,000 examples.

Each example is a 28x28 grayscale image associated with a label from one of 10 classes. The training and test images share the same structure.
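
The labels are stored as integers from 0 to 9, so a small lookup table makes predictions easier to read. The sketch below uses the class names listed in the Fashion-MNIST documentation; the helper name `label_name` is made up for this example.

```python
# Class names in label order (index 0 to 9), as listed for Fashion-MNIST.
FASHION_MNIST_CLASSES = (
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
)


def label_name(label):
    """Translate an integer label (0-9) into its clothing category."""
    if not 0 <= label < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label must be in 0..9, got {label}")
    return FASHION_MNIST_CLASSES[label]


print(label_name(9))
```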

How can you use the MNIST datasets?

  • The MNIST datasets are great for helping beginners understand and learn different machine learning and pattern recognition techniques.

  • Each dataset already contains both training and testing data, so you won't need to split your data.

  • These datasets can be used to build image recognition applications, such as handwritten digit recognition with the original dataset and clothing-item recognition with the fashion dataset.

  • You can use these datasets to learn and practice the different methods and techniques of convolutional neural networks (CNNs). You can even use Keras to build your model.
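
To make the convolution idea concrete before reaching for Keras, here is a minimal pure-Python sketch of the sliding-window operation that a convolutional layer applies (strictly speaking a cross-correlation, with no padding and a stride of 1). It is an illustration only, not how Keras implements it; `convolve2d` and the sample kernel are names and values chosen for this example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 4x4 image with a vertical edge between a dark left half and a bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A simple vertical-edge detector: responds where brightness rises left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the edge sits. A CNN learns kernels like this one from the data instead of having them written by hand.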

Amazon Product Data Dataset

The Amazon product data dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).


Working with this dataset will give you a basic understanding of real business problems and help you identify and extract sales trends over the years.


How can you use the Amazon product data dataset?

  • You can use this dataset for sentiment analysis, which is one of the most popular applications of natural language processing (NLP).

  • Because the reviews are text data, you can use them to build many types of NLP models.

  • You can also use this dataset to build product-trend models and predict future trends from them.
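
A minimal sketch of the first and last ideas: derive coarse sentiment labels from star ratings and count reviews per year. It assumes the reviews come as one JSON object per line with `overall`, `reviewText`, and `unixReviewTime` fields. Treat those field names as assumptions and check them against the files you download. The sample records and the helper `rating_to_sentiment` are made up.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Made-up records in a JSON-lines layout; the field names are assumptions.
SAMPLE_LINES = [
    '{"overall": 5.0, "reviewText": "Works great", "unixReviewTime": 1370000000}',
    '{"overall": 1.0, "reviewText": "Broke fast", "unixReviewTime": 1400000000}',
    '{"overall": 3.0, "reviewText": "It is fine", "unixReviewTime": 1400000000}',
]


def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a coarse sentiment label."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


reviews = [json.loads(line) for line in SAMPLE_LINES]

# Weak labels for a sentiment model: (text, label) pairs derived from ratings.
labeled = [(r["reviewText"], rating_to_sentiment(r["overall"])) for r in reviews]

# Review volume per year, a starting point for trend analysis.
reviews_per_year = Counter(
    datetime.fromtimestamp(r["unixReviewTime"], tz=timezone.utc).year
    for r in reviews
)

print(labeled)
print(dict(reviews_per_year))
```

Using the rating as a stand-in for sentiment is a shortcut; the thresholds here are one reasonable choice, not a standard.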


YouTube Videos Statistics Dataset

YouTube maintains a continuously updated list of the top trending videos on the platform. To decide which videos are trending, YouTube weighs various factors, including user interactions (number of views, shares, comments, and likes).

The YouTube Videos Statistics Dataset is a daily record of the top trending YouTube videos.


How can you use the YouTube Videos Statistics Dataset?

  • You can perform sentiment analysis on different types of videos and look for patterns.

  • You can categorize YouTube videos based on their comments and statistics. Using the results, you can build your own database of which videos tend to engage the audience more.

  • You can train machine learning models such as RNNs to generate YouTube comments.

  • You can use previous years' lists of popular videos to build a machine learning model that predicts a future list of top trending videos.
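
As a small sketch of the engagement idea, the snippet below ranks videos by interactions per view. The column names (`title`, `views`, `likes`, `comment_count`) and the sample rows are assumptions made for illustration, so adapt them to the header of the file you actually download. The metric itself is one simple choice among many.

```python
import csv
import io

# Made-up rows; the column names are assumptions about the CSV export.
SAMPLE_CSV = """title,views,likes,comment_count
Video A,1000,100,50
Video B,5000,100,50
Video C,2000,300,100
"""


def engagement_rate(row):
    """Interactions (likes + comments) per view; 0.0 when there are no views."""
    views = int(row["views"])
    if views == 0:
        return 0.0
    return (int(row["likes"]) + int(row["comment_count"])) / views


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
ranked = sorted(rows, key=engagement_rate, reverse=True)
for row in ranked:
    print(f'{row["title"]}: {engagement_rate(row):.2f}')
```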


SMS Spam Dataset

Nowadays, we are surrounded by spam: spam mail, spam advertising, and spam SMS messages. The SMS spam dataset contains 5,574 SMS messages in English, each tagged as either spam or legitimate.

The dataset stores the messages as rows of a CSV file for easy reading and extraction. The file contains two columns: one for the classification of the message as spam or not, and one for the raw text of the message.
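
A minimal sketch of reading that two-column layout with Python's `csv` module follows. The header names `label` and `text`, the `ham` tag for non-spam messages, and the sample rows are assumptions for illustration; the real file may name its columns differently.

```python
import csv
import io
from collections import Counter

# Made-up rows in a two-column layout: classification first, then raw text.
SAMPLE_CSV = '''label,text
ham,"Are we still meeting at 6, or later?"
spam,"WINNER! Claim your free prize now"
ham,See you tomorrow
'''

messages = [
    (row["label"], row["text"])
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
]

# Checking the class balance first helps when choosing evaluation metrics.
class_counts = Counter(label for label, _ in messages)
print(class_counts)
```

Using a CSV parser rather than splitting on commas matters here, because message text often contains commas of its own.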

How can you use the SMS Spam Dataset?

  • You can use machine learning classification algorithms to build a spam message classifier, then test it on some messages to label them as safe or spam.

  • You can build a model and train it on this dataset to detect spam in new messages.
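
One possible classifier for this task is multinomial naive Bayes. The sketch below implements it in plain Python with add-one smoothing so the mechanics stay visible. The class name `NaiveBayesSpamFilter`, the whitespace tokenizer, and the four training messages are all invented for this example; in practice you would train on the full dataset and likely use a library implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, messages):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of messages
        for label, text in messages:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        return self

    def predict(self, text):
        total_messages = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_messages in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_messages / total_messages)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log(
                    (count + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
training = [
    ("spam", "win a free prize now"),
    ("spam", "free cash claim now"),
    ("ham", "are we meeting for lunch"),
    ("ham", "see you at the meeting"),
]
model = NaiveBayesSpamFilter().fit(training)
print(model.predict("claim your free prize"))
print(model.predict("lunch meeting tomorrow"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps an unseen word from zeroing out a class.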


COCO Dataset

COCO is a large-scale object detection, segmentation, and captioning dataset created by Microsoft and sponsored by many other big companies. The acronym COCO stands for Common Objects in Context. The dataset includes 330K images, 1.5 million object instances across 80 object categories, 91 stuff categories, and 5 captions per image.

COCO is well documented, and you can explore the dataset online using an explorer before you decide to download it.


How can you use the COCO dataset?

  • COCO can be used to train and build machine learning models to detect and classify different objects. For example, you can use it to classify different vehicle types.

  • By nature, working with COCO enables you to build different image processing applications, such as image segmentation and compression.

  • You can also use COCO to train a model that analyzes security camera footage by detecting people and animals.

  • COCO can also be used for projects outside data science, such as object detection in robotics.
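
To give a feel for working with the annotations, the sketch below walks a tiny dictionary laid out like a COCO object-detection annotation file, with `images`, `categories`, and `annotations` lists linked by IDs. The records and file names are made up, and the field selection is trimmed down for illustration. A real annotation file would be loaded with `json.load` and contains more fields.

```python
from collections import Counter

# A tiny made-up annotation dictionary in a COCO-style layout.
coco = {
    "images": [
        {"id": 1, "file_name": "camera_0001.jpg"},
        {"id": 2, "file_name": "camera_0002.jpg"},
    ],
    "categories": [
        {"id": 1, "name": "person", "supercategory": "person"},
        {"id": 3, "name": "car", "supercategory": "vehicle"},
        {"id": 18, "name": "dog", "supercategory": "animal"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels.
        {"image_id": 1, "category_id": 1, "bbox": [10, 20, 50, 120]},
        {"image_id": 1, "category_id": 18, "bbox": [80, 90, 40, 30]},
        {"image_id": 2, "category_id": 3, "bbox": [0, 40, 200, 100]},
    ],
}

category_by_id = {c["id"]: c for c in coco["categories"]}
file_by_image_id = {img["id"]: img["file_name"] for img in coco["images"]}

# How many labeled instances does each category have?
instances = Counter(
    category_by_id[a["category_id"]]["name"] for a in coco["annotations"]
)

# Which frames contain a person or an animal (the security-camera use case)?
WATCHED = {"person", "animal"}
flagged = sorted({
    file_by_image_id[a["image_id"]]
    for a in coco["annotations"]
    if category_by_id[a["category_id"]]["supercategory"] in WATCHED
})

print(instances)
print(flagged)
```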


Takeaway

Starting a new project can be somewhat tricky, especially if you’re new to data science. When I first started, I couldn’t decide on a project simply because I didn’t have enough knowledge to choose a project and a dataset.

One of the things that helped me was browsing dataset websites (e.g., Kaggle) and reading about different datasets and how they can be used. That gave me the inspiration I needed to kick-start new projects. I still do that now.

The only time I didn't need to explore and browse different datasets was when I had a set project with a specific dataset, which was the case when I was working for a company or a client.

As my knowledge base grew, I developed an eye for good datasets and learned to see their potential. Data always tells a story; you just need to listen to it.

So, I hope this article inspired you to build a new project or browse the web for exciting datasets that inspire you to build something awesome.



