Data Loading for ML Projects

If you want to start an ML project, what is the first and most important thing you need? It is the data, which must be loaded before any ML project can begin. The most common data format for ML projects is CSV (comma-separated values).

CSV is a simple file format that stores tabular data (numbers and text), such as a spreadsheet, in plain text. In Python, we can load CSV data in several ways, but before loading it we need to take a few considerations into account.

Considerations While Loading CSV Data

CSV is the most common format for ML data, but we need to keep the following major considerations in mind when loading it into our ML projects −

File Header

In CSV data files, the header contains the name of each field. The header row must use the same delimiter as the data rows, because the header specifies how the data fields should be interpreted.

The following two cases related to the CSV file header must be considered −

  • Case-I: The data file has a file header − The names of the data columns are assigned automatically from the header.

  • Case-II: The data file does not have a file header − We need to assign the names of the data columns manually.

In both cases, we must specify explicitly whether our CSV file contains a header or not.
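
The two header cases can be sketched with the standard library's csv.DictReader. This is only an illustration: the in-memory files and the column names below are made up for the example, and other loaders express the same choice through their own parameters.

```python
import csv
import io

# Hypothetical in-memory files standing in for real CSV files on disk.
with_header = io.StringIO("sepal_length,sepal_width\n5.1,3.5\n4.9,3.0\n")
without_header = io.StringIO("5.1,3.5\n4.9,3.0\n")

# Case-I: the first row is the header, so DictReader takes the column names from it.
rows_with_header = list(csv.DictReader(with_header))

# Case-II: there is no header row, so the column names are supplied manually.
names = ["sepal_length", "sepal_width"]
rows_without_header = list(csv.DictReader(without_header, fieldnames=names))

print(rows_with_header[0])
print(rows_without_header[0])
```

Both calls should yield the same rows keyed by column name. If fieldnames were passed for a file that does have a header, the header row would be returned as an ordinary data row, which is why the choice has to be made explicitly.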

Comments

Comments in a data file can carry useful information. In CSV data files, a comment is conventionally marked by a hash (#) at the start of the line. We need to consider comments when loading CSV data into ML projects because, depending on the loading method we choose, we may have to indicate whether to expect comments in the file.
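
As a minimal sketch of handling comments with the standard library, the comment lines can be filtered out before the rows reach csv.reader, since the csv module has no comment option of its own. The sample text below is invented for illustration. numpy.loadtxt() and pandas.read_csv() offer a comments and a comment parameter, respectively, for the same purpose.

```python
import csv
import io

# Hypothetical file content with two comment lines.
raw = io.StringIO("# measurements in cm\n5.1,3.5\n# second sample\n4.9,3.0\n")

# Drop lines that start with '#' before the csv module parses them.
data_lines = (line for line in raw if not line.startswith("#"))
rows_without_comments = list(csv.reader(data_lines))

print(rows_without_comments)
```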

Delimiter

In CSV data files, the comma (,) is the standard delimiter. The role of the delimiter is to separate the values in the fields. It is important to consider the delimiter when loading a CSV file into ML projects because a file may use a different one, such as a tab or white space. If the file uses a delimiter other than the standard one, we must specify it explicitly.
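
A short sketch of why the delimiter has to be stated, using csv.reader on a made-up tab-separated sample: with the default comma delimiter each line comes back as a single field, while passing delimiter="\t" splits it as intended.

```python
import csv
import io

# Hypothetical tab-separated content.
tsv_text = "5.1\t3.5\t1.4\n4.9\t3.0\t1.4\n"

# Default delimiter (comma): each line is read as one unsplit field.
rows_default = list(csv.reader(io.StringIO(tsv_text)))

# Explicit tab delimiter: each line is split into three fields.
rows_tab = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(len(rows_default[0]), len(rows_tab[0]))
```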

Quotes

In CSV data files, the double quotation mark (") is the default quote character. It is important to consider quoting when loading a CSV file into ML projects because a file may use a quote character other than the double quotation mark. If the file uses a quote character other than the standard one, we must specify it explicitly.
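
The effect of the quote character can be sketched with csv.reader and its quotechar parameter. The sample rows are invented: each name contains a comma and is wrapped in single quotes, so the field survives intact only when the reader is told about the single-quote convention.

```python
import csv
import io

# Hypothetical content that quotes fields with single quotes.
quoted_text = "'Smith, John',42\n'Doe, Jane',37\n"

# Default quote character ("): the comma inside the name splits the field.
rows_default_quote = list(csv.reader(io.StringIO(quoted_text)))

# Explicit single-quote character: the quoted name stays in one field.
rows_single_quote = list(csv.reader(io.StringIO(quoted_text), quotechar="'"))

print(rows_default_quote[0])
print(rows_single_quote[0])
```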

Methods to Load CSV Data File

When working on ML projects, a crucial task is to load the data properly. The most common data format for ML projects is CSV, which comes in various flavors that differ in how difficult they are to parse. In this section, we discuss three common approaches to loading a CSV data file in Python −

Load CSV with Python Standard Library

The first approach to loading a CSV data file is to use the Python standard library, whose built-in csv module provides the reader() function. The following is an example of loading a CSV data file with it −

Example

In this example, we use the iris flower data set, which can be downloaded into a local directory. After loading the data file, we convert it into a NumPy array so that it can be used in ML projects. The Python script for loading the CSV data file is built up step by step below.

First, we import the csv module provided by the Python standard library as follows −


import csv

Next, we import the NumPy module, which is used to convert the loaded data into a NumPy array.


import numpy as np

Now, provide the full path of the CSV data file stored in our local directory −


path = r"c:\iris.csv"

Next, use the csv.reader() function to read the data from the CSV file −


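# newline='' lets the csv module handle line endings itself
with open(path, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)   # first row holds the column names
    data = list(reader)      # remaining rows, as lists of strings

# convert the string values to floats (assumes every column is numeric)
data = np.array(data).astype(float)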

We can print the names of the headers with the following line of script −


print(headers)

The following line of script prints the shape of the data, i.e. the number of rows and columns in the file −


print(data.shape)

The next line of script prints the first three rows of the data −


print(data[:3])

Output


['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
(150, 4)
[  [5.1  3.5  1.4  0.2]
   [4.9  3.   1.4  0.2]
   [4.7  3.2  1.3  0.2]
]

Load CSV with NumPy

Another approach to loading a CSV data file is to use NumPy and its numpy.loadtxt() function. The following is an example of loading a CSV data file with it −

Example

In this example, we use the Pima Indians Diabetes dataset, which contains patient data related to diabetes. It is a numeric dataset with no header, and it can also be downloaded into a local directory. The loadtxt() function returns the data as a NumPy array, which can be used directly in ML projects. The following is the Python script for loading the CSV data file −


from numpy import loadtxt

path = r"C:\pima-indians-diabetes.csv"
# loadtxt accepts a file path directly and closes the file when done
data = loadtxt(path, delimiter=",")
print(data.shape)
print(data[:3])

Output


(768, 9)
[  [ 6.  148.  72.  35.  0.  33.6  0.627  50. 1.]
   [ 1.  85.   66.  29.  0.  26.6  0.351  31. 0.]
   [ 8.  183.  64.  0.   0.  23.3  0.672  32. 1.]
]

Load CSV with Pandas

The third approach to loading a CSV data file is to use Pandas and its pandas.read_csv() function. It is a very flexible function that returns a pandas.DataFrame, which can be used immediately for plotting. The following is an example of loading a CSV data file with it −

Example

Here, we implement two Python scripts. The first uses the Iris data set, which has headers, and the second uses the Pima Indians Diabetes dataset, which is a numeric dataset with no header. Both datasets can be downloaded into a local directory.

Script-1

The following is the Python script for loading the Iris data set from a CSV file using Pandas −


from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.shape)
print(data[:3])

Output

(150, 4)
   sepal_length   sepal_width  petal_length   petal_width
0         5.1     3.5          1.4            0.2
1         4.9     3.0          1.4            0.2
2         4.7     3.2          1.3            0.2

Script-2

The following is the Python script for loading the Pima Indians Diabetes dataset from a CSV file using Pandas, supplying the header names as well −


from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
print(data.shape)
print(data[:3])

Output


(768, 9)
   preg  plas  pres   skin  test   mass    pedi    age   class
0   6    148    72      35    0     33.6   0.627    50      1
1   1    85     66      29    0     26.6   0.351    31      0
2   8    183    64      0     0     23.3   0.672    32      1

The differences between the three approaches to loading a CSV data file can be seen by comparing the examples above.
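
To make the comparison concrete, the sketch below loads the same small CSV text with all three approaches and prints the type of object each one returns. The two-row sample is made up for illustration and is read from memory rather than from the files used above.

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample with a header line.
csv_text = "sepal_length,sepal_width\n5.1,3.5\n4.9,3.0\n"

# Standard library: rows come back as lists of strings; conversion is manual.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
std_rows = list(reader)

# NumPy: skip the header row and get a float array directly.
np_data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Pandas: the header row becomes the column labels of a DataFrame.
pd_data = pd.read_csv(io.StringIO(csv_text))

print(type(std_rows), type(np_data), type(pd_data))
```

The standard library leaves type conversion and header handling to the caller, NumPy returns a numeric array but keeps no column names, and Pandas keeps both the values and the column labels.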

Translated from: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_data_loading_for_ml_projects.htm
