Key Points from "Introduction to Data Science"

Week 1  Introduction

----------------------------------------------------------

Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.


Three types of tasks in a data science project:

         Preparing to run a model (80% of the work)

         Running the model

         Communicating the results (the other 80% of the work)


Science is about asking questions

         Traditionally: query the world

         eScience: download the world


Ways to do science

         Empirical (for thousands of years)

         Theoretical (in the last few hundred years, reinforcing empirical methods)

         Computational (in the last 50 years or so, simulating phenomena that cannot be observed directly, when theoretical models become too complex to solve analytically)

         eScience (= Data Science) (in the last ten years or so, exploring massive data)


What's Big Data

       Big data is any data that is expensive to manage and hard to extract value from


Big Data: Three Challenges

       Volume (the size of the data)

       Velocity (the latency of data processing relative to the growing demand for interactivity)

       Variety (the diversity of sources, formats, quality, structures)


Week  2   Relational Databases, Relational Algebra

-----------------------------------------------------------------------------------

What is a Data Model

            three components:

                      1. structures

                      2. constraints

                      3. operations


Database:

          Physical Data Independence (just tables and algebra)

                      select, project, cross-product, join

                      SQL is a declarative language, about "what, not how"

                      Algebraic Optimization

           Logical Data Independence

                        view: a query with a name
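
The operations above (select, project, join) and the idea of a view as a named query can be tried with a minimal sketch like the following. It uses Python's built-in sqlite3 module with an in-memory database; the table names, columns and rows are made up for illustration and are not from the course.

```python
import sqlite3

# In-memory database; tables and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE purchase (customer_id INTEGER, item TEXT, price REAL);
    INSERT INTO customer VALUES (1, 'Ann', 'Seattle'), (2, 'Bob', 'Boston');
    INSERT INTO purchase VALUES (1, 'book', 12.0), (1, 'pen', 2.0), (2, 'lamp', 30.0);
""")

# select (filter rows with WHERE) + project (keep only the name column).
# The query says what is wanted, not how to scan or in which order.
seattle_names = conn.execute(
    "SELECT name FROM customer WHERE city = 'Seattle'"
).fetchall()

# A view is a query with a name. This one joins the two tables
# (a cross-product restricted by a matching condition) and aggregates.
conn.execute("""
    CREATE VIEW customer_total AS
    SELECT c.name AS name, SUM(p.price) AS total
    FROM customer AS c JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.name
""")

# The view is queried like a table; its definition can change
# without changing the queries that use it.
totals = conn.execute(
    "SELECT name, total FROM customer_total ORDER BY name"
).fetchall()

print(seattle_names)
print(totals)
```

Queries written against customer_total do not need to know how the underlying tables are laid out, which is the point of logical data independence.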


A database can exploit indexes, and it is designed to complete an operation no matter how large the data is


Week 3    MapReduce

------------------------------------------------------------------------------------------

Scalable 

          Operationally:

                     Scale up: works even if data doesn't fit in main memory

                     Scale out: can make use of 1000s of cheap computers

           Algorithmically:

                     the complexity should be polynomial, parallelized polynomial, or n log(n)


Parallel Architectures

           Shared nothing

           Shared disk

           Shared memory


Two notions of parallel query processing

           distributed query

                       rewrite the query as a union of subqueries; the results are combined at the end (bottleneck)

           parallel query (Teradata, parallel database)

                       each operator is implemented with a parallel algorithm (in the MapReduce fashion)
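
As a rough illustration of the MapReduce fashion mentioned above, here is a single-process sketch of word count split into map, shuffle and reduce phases. The function names and the sample documents are assumptions for illustration; a real framework such as Hadoop would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = []
    for doc in documents:  # each map call is independent, so it could run anywhere
        pairs.extend(map_phase(doc))
    groups = shuffle(pairs)
    # each reduce call sees only one key, so these are independent too
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data is big", "data science"])
print(counts)
```

Because neither phase shares state across calls, the same code shape scales out: only the shuffle step needs to move data between machines.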


Pig (Yahoo)

          Relational Algebra over Hadoop

Hive (Facebook)

          SQL over Hadoop

Both are declarative query languages and support schemas and algebraic optimization


Hadoop vs. RDBMS

            loading data: Hadoop is faster (Hadoop just needs to partition the data; databases need extra effort)

            execution: RDBMS is faster (because of indexes)


Week 4     NoSQL

----------------------------------------------------------------------


NoSQL is mainly used for building very large, scalable web applications

Social Network application (when to see a friend's status)

            database: see all or nothing

                       two-phase commit

                                prepare: get ready, usually by writing to a log

                                commit: if all subordinates are ready

                       if only one coordinator is used: single point of failure

                        distributed protocol for committing: Paxos   

            MongoDB

                        eventual consistency through vector clocks
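
The two-phase commit steps listed under the database case above can be sketched as follows. This is a simplified, single-process model, not a real protocol implementation: the Participant class, its can_commit flag and the in-memory log are hypothetical, and coordinator failure (the single point of failure noted above) is not handled.

```python
class Participant:
    """A subordinate that votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical switch to simulate a failure
        self.log = []                 # stands in for a durable write-ahead log
        self.state = "init"

    def prepare(self):
        # Phase 1: record readiness in the log before voting yes.
        if self.can_commit:
            self.log.append("prepared")
            self.state = "ready"
            return True
        self.state = "aborted"
        return False

    def commit(self):
        self.log.append("commit")
        self.state = "committed"

    def abort(self):
        self.log.append("abort")
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every subordinate to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all subordinates are ready, otherwise abort all.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


outcome_ok = two_phase_commit([Participant("db1"), Participant("db2")])
failing_group = [Participant("db1"), Participant("db2", can_commit=False)]
outcome_fail = two_phase_commit(failing_group)
print(outcome_ok, outcome_fail)
```

Either every participant ends up committed or every participant ends up aborted, which is the "see all or nothing" behaviour.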


CAP Theorem

        sacrifice consistency or availability to achieve partition tolerance


NoSQL features

          lookup, read, write 1 or few records over many servers  (high scale)

          able to replicate and partition data  (high scale)

          no SQL  (no language)

          weaker concurrency model than ACID (Atomicity, Consistency, Isolation, Durability) transactions  (no transactions)

          dynamically add new attributes to records (no schema)


Category for data models

           document = nested values, extensible records (XML, JSON)

           extensible record (HBase / BigTable)

           key-value object (Memcached)


Consistent hashing (Memcached: no persistence, no replication, no fault-tolerance)

           map server IDs and the key values into the same space
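
A minimal sketch of the idea above: server IDs and keys are hashed into the same space (a ring), and each key belongs to the first server found clockwise from its position. The HashRing class, the use of MD5 and the server names are assumptions for illustration; production clients typically add virtual nodes, which are omitted here.

```python
import bisect
import hashlib


def ring_position(value):
    """Map any string (server ID or key) to a point in the same hash space."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16)


class HashRing:
    def __init__(self, servers):
        # Sort servers by their position on the ring.
        self._ring = sorted((ring_position(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def server_for(self, key):
        # First server at or after the key's position, wrapping around the ring.
        index = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


before = HashRing(["cache-a", "cache-b", "cache-c"])
after = HashRing(["cache-a", "cache-b"])  # cache-c has been removed

keys = [f"user:{i}" for i in range(100)]
moved = [k for k in keys if before.server_for(k) != after.server_for(k)]
print(len(moved), "of", len(keys), "keys moved")
```

Only the keys that lived on the removed server move; with plain `hash(key) % number_of_servers`, most keys would be reassigned when the server count changes.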


Schema-on-read instead of schema-on-write (Pig)
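
To make the contrast concrete, here is a small sketch of schema-on-read in plain Python: the raw lines are stored untouched, and a schema is applied only when the data is read. The read_with_schema helper, the field names and the sample lines are hypothetical.

```python
# Raw data is loaded as-is; nothing is validated at write time.
RAW_LINES = ["alice,30,Seattle", "bob,not-a-number,Boston"]


def read_with_schema(lines, schema):
    """Apply a schema, given as (field_name, converter) pairs, at read time."""
    for line in lines:
        record = {}
        for (name, convert), raw in zip(schema, line.split(",")):
            try:
                record[name] = convert(raw)
            except ValueError:
                # A bad value shows up when reading, not when loading.
                record[name] = None
        yield record


people = list(read_with_schema(RAW_LINES, [("name", str), ("age", int), ("city", str)]))

# A different reader can impose a different schema on the same raw data.
names_only = list(read_with_schema(RAW_LINES, [("name", str)]))
print(people)
print(names_only)
```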


When the data is too big, you cannot bring the data to the computation; you have to bring the computation to the data


Three Special Joins:

           Replicated Join

           Skewed Join

           Merge Join
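
Of the three joins above, the replicated join is the simplest to sketch. The version below assumes the usual meaning of the term: the small relation fits in memory and is copied to every worker, so each partition of the large relation can be joined locally with no shuffle. The function name and the sample data are made up for illustration.

```python
def replicated_join(large_partitions, small_table, key):
    """Join each partition of a large relation against an in-memory small one."""
    # Build the lookup once; in a cluster every worker would hold its own copy.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in large_partitions:  # each partition could sit on a different worker
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
click_partitions = [
    [{"user_id": 1, "url": "/a"}],
    [{"user_id": 2, "url": "/b"}, {"user_id": 3, "url": "/c"}],
]

result = replicated_join(click_partitions, users, "user_id")
print(result)
```

The click for user_id 3 has no matching user, so it is dropped, as in an inner join.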


NoSQL Features:

            No Schema

            No Language

            No Transactions


          

           


                      

                               

