What is Data Science

Data science definition

•	Information, especially facts or numbers, collected to be examined, considered, and used to help decision-making
•	The focus is on actions surrounding data: data is collected, examined, and, most importantly, used to inform decisions
•	A set of values of qualitative and quantitative variables
•	Focuses more on what data involves
•	A set of items to measure from (in statistics, this set of items is often called the population), like the information in a country census and its corresponding decision-making
•	Variables are measurements or characteristics of an item
•	Qualitative variables are information about qualities; they are described by words rather than numbers, like country of origin, sex, or treatment group
•	Quantitative variables are information about quantities; they are described by numbers and measured on a continuous scale, like height, weight, and blood pressure (see the sketch below)
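As a minimal sketch of the distinction (hypothetical values; assumes the pandas library), qualitative variables map to word-valued columns and quantitative variables to numeric ones:

```python
import pandas as pd

# Hypothetical measurements for three items drawn from a population
df = pd.DataFrame({
    "country": ["US", "CN", "DE"],             # qualitative: described by words
    "treatment": ["drug", "placebo", "drug"],  # qualitative
    "height_cm": [172.5, 168.0, 181.2],        # quantitative: continuous scale
    "weight_kg": [70.1, 59.8, 85.4],           # quantitative
})

print(df.dtypes)      # object (word) columns vs. float64 (numeric) columns
print(df.describe())  # summary statistics apply to the quantitative columns
```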

[Image: an example of a messy data set]

#A messy data set looks like the one above; you have to extract the information you need to answer your questions.

#These are the data sources you might encounter:

  1. sequencing data
  2. population census data
  3. electronic medical records (EMR), other large databases
  4. geographic information system (GIS) data (mapping)
  5. image analysis and image extrapolation
  6. language and translations
  7. website traffic
  8. personalization/ads (e.g., Facebook, Netflix predictions, etc.)

Sequencing data is generally first encountered in the raw FASTQ file format produced by sequencing machines. These files are often hundreds of millions of lines long, and it is our job to parse them into understandable, interpreted expression data, from which a plot called a volcano plot can be produced.
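As a rough sketch of that parsing step (hypothetical file name `reads.fastq`; the four-lines-per-record layout is the standard FASTQ convention):

```python
def parse_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a FASTQ file.

    Each FASTQ record is four lines: @id, sequence, '+', quality string.
    """
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:      # empty string means end of file
                break
            seq = fh.readline().rstrip()
            fh.readline()       # the '+' separator line, ignored here
            qual = fh.readline().rstrip()
            yield header[1:], seq, qual

# Hypothetical usage: count the reads in a file
reads = list(parse_fastq("reads.fastq"))
print(f"{len(reads)} reads parsed")
```

Downstream steps (computing expression levels and drawing the volcano plot) require alignment and statistics beyond this sketch.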


A good data scientist asks questions first and seeks out relevant data second.

Even so, the available data will often limit, or perhaps even enable, certain questions you are trying to ask.

In these cases, you may have to re-frame your question or answer a related question, but the data itself does not drive the question asking.

In this lesson, we focus on data: defining it, exploring what it can look like, and considering how it can be used.

Finally, we return to the relationship between data and your question, emphasizing the importance of a question-first strategy.

You could have all the data you could ever hope for, but if you don’t have a question to start with, the data is useless.

Question:
What is the most important thing in Data Science?
⁃ The question you are trying to answer.

Coding problems generally fall into two categories:
⁃ Your commands produce no output and spit out an error message.
⁃ Your commands produce an output, but it is not at all what you wanted.

Asking questions:
• The question you are trying to answer
• How you approached the problem, and what steps you took to answer the question
• What steps will reproduce the problem (including sample data for troubleshooters to work from!)
• What the expected output was
• What you saw instead (including any error messages you received)
• What troubleshooting steps you have already tried
• Details of your set-up:
• What operating system you are using
• What version of the product you have installed (e.g., R, R packages)

Overview of the Data Analytics Ecosystem



Data Analytics Ecosystem
A Data Analyst’s ecosystem includes the infrastructure, software, tools, frameworks, and processes used to:

· Gather Data
· Clean Data
· Mine Data
· Visualize Data


Data can be categorized as:
• Structured
• Data that follows a rigid format and can be categorized into rows and columns.
• E.g. data in databases and spreadsheets
• Semi-structured
• Mix of data that has consistent characteristics and data that does not conform to a rigid structure.
• E.g. Email
• Unstructured
• Data that is complex and mostly qualitative information that can’t be structured into rows and columns.
• E.g. photos, videos, text files, PDFs, and social media content

The type of data determines the kind of data repositories the data can be collected and stored in, as well as the tools that can be used to query or process it.


Data can come in a variety of file formats and be collected from a variety of sources, such as:

•	Relational and non-relational databases
•	APIs
•	Web services
•	Data streams
•	Social platforms
•	Sensor devices

Data repositories

•	Databases
•	Data warehouses
•	Data marts
•	Data lakes
•	Big data stores

The type, format, and sources of data influence the types of data repositories that you can use to collect, store, clean, analyze, and mine the data for analysis.

If you’re working with big data, you will need big data stores that allow you to store and process large-volume, high-velocity data, as well as frameworks that allow you to perform complex analytics on big data in real time.


The Data Analytics Ecosystem Languages

•	Query languages
•	SQL for querying and manipulating data
•	Programming languages
•	Python for developing data applications
•	Shell and Scripting languages
•	Shell scripts for repetitive operational tasks

在这里插入图片描述

Types of Data

Data comprises facts, observations, perceptions, numbers, characters, symbols, and images that can be interpreted to derive meaning.

One of the ways in which data can be categorized is by its structure. Data can be structured, semi-structured, or unstructured.

Structured Data:

•	Has a well-defined structure
•	Can be stored in well-defined schemas
•	Can be represented in a tabular manner with rows and columns

Some of the sources of structured data could include:

•	SQL Databases
•	Online Transaction Processing Systems (OLTP)
•	Spreadsheets, such as Excel or Google Spreadsheets
•	Online forms
•	Sensors, such as Global Positioning Systems (GPS)
•	Radio Frequency Identification tags (RFID)
•	Network and Web server logs

You can easily examine structured data with standard data analysis methods and tools. (What is the methodology of these analytical processes?)
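For example, a few lines of pandas are often enough for a first pass over a structured file (hypothetical file and column names):

```python
import pandas as pd

df = pd.read_csv("sales.csv")       # hypothetical structured data file

print(df.head())                    # first rows: eyeball the schema
print(df.dtypes)                    # column types inferred from the data
print(df.describe())                # summary statistics for numeric columns
print(df["region"].value_counts()) # per-category counts (assumed column name)
```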

Semi-structured Data:

•	Has some organizational properties but lacks a fixed or rigid schema
•	Can’t be stored in the form of rows and columns in databases
•	Contains tags and elements, or metadata, which are used to group data and organize it in a hierarchy

Some of the sources of semi-structured data could include:

•	Emails
•	XML and other markup languages
•	Binary executables
•	TCP/IP packets
•	Zipped files
•	Integration of data

For example, XML and JSON allow users to define tags and attributes to store data in a hierarchical form, and both are widely used to store and exchange semi-structured data.
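A small sketch of both (hypothetical data; `json` and `xml.etree.ElementTree` are in the Python standard library):

```python
import json
import xml.etree.ElementTree as ET

# JSON: user-defined keys nest to form a hierarchy
record = json.loads('{"email": {"from": "a@x.com", "subject": "hello"}}')
print(record["email"]["subject"])                   # -> hello

# XML: user-defined tags and attributes form the same kind of hierarchy
root = ET.fromstring('<email from="a@x.com"><subject>hello</subject></email>')
print(root.get("from"), root.find("subject").text)  # -> a@x.com hello
```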


Unstructured Data:

•	Doesn’t have an easily identifiable structure
•	Can’t be organized in mainstream relational databases in the form of rows and columns
•	Doesn’t follow any particular format, sequence, semantics, or rules

Unstructured data comes from heterogeneous sources and has a variety of business intelligence and analytics applications.

Some of the sources of unstructured data could include:

•	Web pages
•	Social media feeds
•	Images in varied file formats, such as JPEG, GIF, and PNG
•	Video and audio files
•	Documents and PDF files
•	PowerPoint presentations
•	Media logs
•	Surveys

Unstructured data can be stored in files and documents for manual analysis or in NoSQL databases that have their own analysis tools for examining this type of data.


Understanding different types of file formats

As you work with a variety of data file types and formats, it’s important to understand the underlying structure of each format, along with its benefits and limitations. This understanding will help you make the right decisions about the formats best suited to your data and performance needs.

Standard file formats (a loading sketch in Python follows this list):
· Delimited text files, or CSV
· Microsoft Excel Open XML Spreadsheet, or XLSX
· Extensible Markup Language, or XML
· Portable Document Format, or PDF
· JavaScript Object Notation, or JSON
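As a rough sketch, most of these formats have ready-made loaders in Python (hypothetical file names; pandas needs the optional `openpyxl` dependency for XLSX and `lxml` for XML):

```python
import json
import pandas as pd

csv_df  = pd.read_csv("data.csv")     # delimited text (CSV)
xlsx_df = pd.read_excel("data.xlsx")  # Excel Open XML Spreadsheet
xml_df  = pd.read_xml("data.xml")     # XML (pandas 1.3+)

with open("data.json") as fh:         # JSON
    payload = json.load(fh)

# PDF is designed for presentation rather than data exchange; extracting
# data from it requires a third-party library (e.g., pypdf) and is often lossy.
```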

Sources of Data (common data sources) could include:
• Relational Databases
• Flat files and XML Datasets
• APIs and Web Services
• Web Scraping
• Data Streams and Feeds

Relational Databases - store structured data that can be leveraged for analysis:
• Typically, organizations have internal applications to support them in managing day-to-day business activities, customer transactions, Human Resources activities, and their workflows.
• These systems use relational databases such as SQL Server, Oracle, MySQL, and IBM DB2, to store data in a structured way.
• Data stored in these databases and data warehouses can be used as a source for analysis.
For example,
• Data from a retail transactions system can be used to analyze sales in different regions
• Data from a customer relationship management system can be used for making sales projections

Flat files, spreadsheet files, and XML datasets:
• External to the organization, there are other publicly and privately available datasets.

For example:
•	Government organizations release demographic and economic datasets on an ongoing basis
•	There are companies that sell specific data, such as point-of-sale, financial, or weather data, which businesses can use to define strategy, predict demand, and make decisions related to distribution or marketing promotions, among other things

#Flat files
• Store data in plain text format
• Each line, or row, is one record
• Each value is separated by a delimiter
• All of the data in a flat file maps to a single table
• The most common flat file format is .CSV

#Spreadsheet files
• A special type of flat file
• Organize data in a tabular format
• Can contain multiple worksheets
• .XLS and .XLSX are the most common spreadsheet formats
• Other formats include Google Sheets, Apple Numbers, and LibreOffice Calc

#XML files
• Contain data values that are identified or marked up using tags
• Can support complex data structures
• Common uses include online surveys, bank statements, and other unstructured data sets

APIs and Web services typically listen for incoming requests, which can be in the form of web requests from users or network requests from applications, and return data in plain text, XML, HTML, JSON, or media files (a request sketch in Python follows the examples below).

#APIs - Popular examples of APIs
• Twitter and Facebook APIs
• For customer sentiment analysis
• Stock Market APIs
• For trading and analysis
• Data Lookup and Validation APIs
• For cleaning and correlating data
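A minimal sketch of calling such an API (the URL and response fields here are hypothetical; assumes the `requests` library):

```python
import requests

# Hypothetical stock-quote endpoint that returns JSON
resp = requests.get(
    "https://api.example.com/v1/quote",
    params={"symbol": "IBM"},
    timeout=10,
)
resp.raise_for_status()   # fail loudly on HTTP errors

quote = resp.json()       # parse the JSON body into a Python dict
print(quote.get("symbol"), quote.get("price"))  # assumed field names
```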

#Web Scraping
• Extract relevant data from unstructured sources
• Also known as Screen scraping, Web harvesting, and Web data extraction
• Downloads specific data based on defined parameters
• Can extract text, contact information, images, videos, product items, and more from a website (a scraping sketch follows the usage list below)

#Web Scraping - Popular usage
• Providing price comparisons by collecting product details from retailer, manufacturer, and eCommerce websites
• Generating sales leads through public data sources
• Extracting data from posts and authors on various forums and communities
• Collecting training and testing datasets for machine learning models
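A minimal scraping sketch (hypothetical URL and CSS classes; assumes the `requests` and `beautifulsoup4` libraries, and that the site’s terms of use permit scraping):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://shop.example.com/laptops", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Assumed markup: each product sits in <div class="product"> with name/price tags
for product in soup.select("div.product"):
    name = product.select_one(".name").get_text(strip=True)
    price = product.select_one(".price").get_text(strip=True)
    print(name, price)
```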

#Data streams and feeds - data from cloud platforms
• Aggregating streams of data flowing from instruments, IoT devices and applications, GPS data from cars, computer programs, websites, and social media posts
• Some of the data streams and ways in which they can be leveraged include:
• Stock and market tickers for financial trading
• Retail transaction streams for predicting demand and supply chain management
• Surveillance and video feeds for threat detection
• Social media feeds for sentiment analysis
• Sensor data feeds for monitoring industrial or farming machinery
• Web click feeds for monitoring web performance and improving design
• Real-time flight events for rebooking and rescheduling


Languages relevant to the work of data professionals

These can be categorized as:

•	Querying languages - Query languages are designed for accessing and manipulating data in a database (e.g., SQL)
•	SQL, or Structured Query Language, is a querying language designed for accessing and manipulating information from, mostly though not exclusively, relational databases

Using SQL, you can (see the sketch after this list):
• Insert, update, and delete records in a database
• Create new databases, tables, and views
• Write stored procedures
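A short sketch of these operations, run here against an in-memory SQLite database so the SQL stays portable (SQLite does not support stored procedures, which are engine-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
cur = conn.cursor()

cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
cur.execute("INSERT INTO employees (name, dept) VALUES (?, ?)", ("Ada", "Analytics"))
cur.execute("UPDATE employees SET dept = ? WHERE name = ?", ("Data Science", "Ada"))

for row in cur.execute("SELECT id, name, dept FROM employees"):
    print(row)

cur.execute("DELETE FROM employees WHERE name = ?", ("Ada",))
conn.commit()
conn.close()
```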

Advantages of using SQL:

•	SQL is portable and platform independent
•	Can be used for querying data in a wide variety of databases and data repositories
•	Has a simple syntax that is similar to the English language
•	Its syntax allows developers to write programs with fewer lines of code using basic keywords

•	Can retrieve large amounts of data quickly and efficiently
•	Runs on an interpreter system
•	Programming languages - Programming languages are designed for developing applications and controlling application behavior (Python, R, Java)
•	Python is a widely used, open-source, general-purpose, high-level programming language

•	Its syntax allows programmers to express their concepts in fewer lines of code
•	An ideal tool for beginning programmers because of its focus on simplicity and readability
•	Great for performing high-computational tasks on large volumes of data
•	Has built-in functions for frequently used concepts

•	Supports multiple programming paradigms: object-oriented, imperative, functional, and procedural
•	Python is one of the fastest-growing programming languages in the world
•	Easy to learn
•	Open source

•	Portable across multiple platforms
•	Has widespread community support
•	Provides open-source libraries for data manipulation, data visualization, statistics, and mathematics

Its libraries and functionalities also include (a short sketch follows this list):
• Pandas for data cleaning and analysis
• NumPy and SciPy for statistical analysis
• BeautifulSoup and Scrapy for web scraping
• Matplotlib and Seaborn to visually represent data in the form of bar graphs, histograms, and pie charts
• OpenCV for image processing
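A tiny example combining two of these libraries (hypothetical figures; assumes pandas and Matplotlib are installed):

```python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "revenue": [120, 95, 143, 80],
})

sales.plot.bar(x="region", y="revenue", legend=False)  # pandas wraps Matplotlib
plt.ylabel("Revenue (k$)")
plt.title("Revenue by region")
plt.show()
```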
• R is an open-source programming language and environment for data analysis, data visualization, machine learning, and statistics.

	Widely-used for:
	•	Developing statistical software
	•	Performing data analytics
	•	Creating compelling visualizations
	
	Key benefits:
	•	Open-source
	•	Platform-independent
	•	Can be paired with many programming languages
	•	Highly extensible
	•	Facilitates the handling of structured and unstructured data
	•	Includes libraries such as ggplot2 and Plotly that offer aesthetic graphical plots to its users
	•	Allows data and scripts to be embedded in reports
	•	Allows creation of interactive web apps
	•	Can be used for developing statistical tools

Java is an object-oriented, class-based, and platform-independent programming language originally developed by Sun Microsystems.

•	One of the top-ranked programming languages used today
•	Used in a number of data analytics processes - cleaning data, importing and exporting data, statistical analysis, data visualization
•	Used in the development of big data frameworks and tools (Hadoop, Hive, Spark)
•	Shell scripting - Shell and other scripting languages are ideal for repetitive and time-consuming operational tasks (Unix/Linux shell, PowerShell)

•	A Unix/Linux Shell script is a computer program written for the UNIX shell (it is a series of UNIX commands written in a plain text file to accomplish a specific task).

•	Typical operations performed by shell scripts include (a Python sketch of one such task follows this list):
•	File manipulation
•	Program execution
•	System administration tasks, such as monitoring servers’ KPIs
•	Executing routine backups
•	Running batch jobs
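A routine backup, for example, can be scripted in a few lines (sketched here in Python rather than shell; hypothetical paths; `shutil` and `pathlib` are in the standard library):

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical locations for a routine nightly backup
source = Path("/var/data/reports")
backup_dir = Path("/var/backups")
backup_dir.mkdir(parents=True, exist_ok=True)

# Creates /var/backups/reports-YYYY-MM-DD.tar.gz from the source tree
archive = shutil.make_archive(
    str(backup_dir / f"reports-{date.today()}"), "gztar", root_dir=source
)
print("wrote", archive)
```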

•	PowerShell is a cross-platform automation tool and configuration framework by Microsoft that is optimized for working with structured data formats, such as JSON, CSV, XML, and REST APIs, websites, and office applications
•	Consists of a command-line shell and a scripting language
•	Used for data mining, building GUIs, and creating charts, dashboards, and interactive reports


Summary and Highlights

In this lesson, you have learned the following information.

A data analyst ecosystem includes the infrastructure, software, tools, frameworks, and processes used to gather, clean, analyze, mine, and visualize data.

Based on how well-defined the structure of the data is, data can be categorized as:
• Structured Data, that is data which is well organized in formats that can be stored in databases.
• Semi-Structured Data, that is data which is partially organized and partially free form.
• Unstructured Data, that is data which cannot be conventionally organized into rows and columns.

Data comes in a wide-ranging variety of file formats, such as delimited text files, spreadsheets, XML, PDF, and JSON, each with its own benefits and limitations of use.

Data is extracted from multiple data sources, ranging from relational and non-relational databases to APIs, web services, data streams, social platforms, and sensor devices.

Once the data is identified and gathered from different sources, it needs to be staged in data repositories, so that it can be prepared for analysis. (The type, format, and sources of data influence the type of data repository that can be used. )

Data professionals need languages that can help them extract, prepare, and analyze data. These can be classified as:
• Querying languages, such as SQL, which is used for accessing and manipulating data from databases.
• Programming languages such as Python, R, and Java, which are for developing applications and controlling applications’ behavior.
• Shell and Scripting languages, such as Unix/Linux Shell and PowerShell, for automating repetitive operational tasks.


Foundation of Big Data

Ernst & Young offers the following definition:

Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.


The V’s of Big Data
Velocity - Velocity is the speed at which data accumulates. Data is being generated extremely fast, in a process that never stops. Near or real-time streaming, local, and cloud-based technologies can process information very quickly.

Volume

  • Volume is the scale of the data, or the increase in the amount of data stored. Drivers of volume are the increase in data sources, higher-resolution sensors, and scalable infrastructure.

Variety

  • Variety is the diversity of the data. Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a predefined way, like tweets, blog posts, pictures, numbers, and video.
  • Variety also reflects that data comes from different sources: machines, people, and processes, both internal and external to organizations. Drivers are mobile technologies, social media, wearable technologies, geo technologies, video, and many, many more.

Big Data Processing Tools

•	Big Data processing technologies provide ways to work with large sets of structured, semi-structured, and unstructured data so that value can be derived from big data (e.g., NoSQL databases, data lakes).
•	Open-source big data technologies -- Apache Hadoop, Apache Hive, and Apache Spark
•	Big Data processing tools -- Hadoop => Hive => Spark

Apache Hadoop

Hadoop is a collection of tools that provides distributed storage and processing of big data:

•	Hadoop is a Java-based, open-source framework allowing distributed storage and processing of large datasets across clusters of computers.
•	In a Hadoop distributed system, a node is a single computer, and a collection of nodes forms a cluster.
•	Hadoop can scale up from a single node to any number of nodes, each offering local storage and computation.
•	Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements.

Benefits include:
• Better real-time, data-driven decisions
• Incorporates emerging data formats, such as streaming audio, video, social media sentiment, and clickstream data, along with structured, semi-structured, and unstructured data not traditionally used in a data warehouse

•	Improved data access and analysis
•	Provides real-time, self-service access to stakeholders

•	Data offload and consolidation
•	Optimizes and streamlines costs by consolidating data, including cold data, across the organization

•	The Hadoop Distributed File System, or HDFS, is the storage system for big data; it runs on multiple commodity hardware machines connected through a network

•	Provides scalable and reliable big data storage by partitioning files over multiple nodes
•	Splits large files across multiple computers, allowing parallel access to them

Duplicates file blocks on different nodes to prevent data loss.

Benefits of using HDFS include:
•	Fast recovery from hardware failures, because HDFS is built to detect faults and automatically recover
•	Access to streaming data, because HDFS supports high data throughput rates
•	Accommodation of large datasets, because HDFS can scale to hundreds of nodes, or computers, in a single cluster
•	Portability, because HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems

Apache Hive – Hive is a data warehouse for data query and analysis built on top of Hadoop.

•	Hive is an open-source data warehouse for reading, writing, and managing large dataset files that are stored directly in HDFS or in other data storage systems, such as Apache HBase
•	Hive queries have high latency, making it less suitable for applications that need fast response times
•	Not suitable for transactional processing that involves a high percentage of write operations
•	Better suited for data warehousing tasks such as reporting and data analysis
•	Better suited for easy access to data via SQL

Spark – Spark is a distributed data analytics framework designed to perform complex data analytics in real time.

Spark is a general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications, including the following (a PySpark sketch follows these lists):

•	Interactive Analysis
•	Stream Processing
•	Machine Learning
•	Data Integration

•	ETL
•	In-memory processing, which significantly increases the speed of computations
•	Provides interfaces for the major programming languages, such as Java, Scala, Python, R, and SQL
•	Can run standalone using its own clustering
•	Can also run on top of other infrastructure, such as Hadoop

•	Can access data in a large variety of data sources, including HDFS and Hive
•	Can process streaming data
•	Can perform complex real-time analytics
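A minimal PySpark sketch of such a workflow (hypothetical file and column names; assumes the `pyspark` package):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (hypothetical) CSV from local disk, HDFS, or another supported source
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Distributed aggregation, evaluated lazily across the cluster's nodes
df.groupBy("region").agg(F.count("*").alias("events")).show()

# The same query expressed through Spark SQL
df.createOrReplaceTempView("events")
spark.sql("SELECT region, COUNT(*) AS events FROM events GROUP BY region").show()

spark.stop()
```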

Summary and Highlights

In this lesson, you learned the following:

A Data Repository is a general term that refers to data that has been collected, organized, and isolated so that it can be used for reporting, analysis, and other business processes.

The different types of Data Repositories include:

•	Databases, which can be relational or non-relational, each following a set of organizational principles, the types of data they can store, and the tools that can be used to query and organize the data.
•	Data Warehouses, which consolidate incoming data into one comprehensive storehouse.
•	Data Marts, which are subsections of a data warehouse, built to isolate data for a particular business function or use case.
•	Data Lakes, which serve as repositories for large amounts of structured, semi-structured, and unstructured data in its native format.
•	Big Data Stores, which provide distributed computational and storage infrastructure to store, scale, and process very large datasets.

ETL, or Extract, Transform, and Load, is an automated process that converts raw data into analysis-ready data by:

•	Extracting data from source locations
•	Transforming raw data by cleaning, standardizing, and validating it
•	Loading the processed data into a destination repository

Data Pipelines, a term sometimes used interchangeably with ETL, encompasses the entire journey of moving data from its source to a destination data lake or application, using the ETL process.

Big Data refers to the vast amounts of data being produced every moment of every day by people, tools, and machines. The velocity, volume, and variety of this data challenge the tools and systems used for traditional, relational data warehouses.

These challenges led to the emergence of processing tools and platforms designed specifically for Big Data, such as Apache Hadoop and Apache Spark.
