Kaggle Translation, Day 7: Python 7/7

King Stars

Original link: https://www.kaggle.com/code/colinmorris/working-with-external-libraries

Working with External Libraries: Python 7/7

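Imports, operator overloading, and survival tips for venturing into the world of external libraries
In this lesson you will learn about imports in Python, pick up some tips for working with unfamiliar libraries, and take a closer look at operator overloading.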

Imports

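So far, we have talked about types and functions that are built into the language.
But one of the best things about Python is the vast number of high-quality custom libraries that have already been written for it.
Some of these libraries are in the "standard library", meaning you can find them anywhere you run Python. Others can be added easily, even if they are not always shipped with Python.
Either way, we access this code with imports.
We'll start by importing the math module: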

import math

print("It's math! It has type {}".format(type(math)))

It's math! It has type <class 'module'>

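math is a module. A module is just a collection of variables defined by someone else. We can see all the names that math defines using the built-in function dir().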

print(dir(math))

['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'pi', 'pow', 'radians', 'remainder', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc']
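
The listing above mixes plain values and functions. As a small sketch (the variable names public_names, functions, and constants are illustrative, not part of the original lesson), dir() can be combined with the built-ins getattr() and callable() to separate the two:

```python
import math

# Public names only: skip the double-underscore entries such as __doc__.
public_names = [name for name in dir(math) if not name.startswith("_")]

# getattr(module, name) is the programmatic form of module.name.
functions = [name for name in public_names if callable(getattr(math, name))]
constants = [name for name in public_names if not callable(getattr(math, name))]

print("values:", constants)
print("number of functions:", len(functions))
```

The exact lists depend on the Python version, so no output is shown here.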

We can access these variables using dot syntax. Some of them refer to simple values, like math.pi:

print("pi to 4 significant digits = {:.4}".format(math.pi))

pi to 4 significant digits = 3.142

But most of what we find in the module are functions, like math.log():

math.log(32, 2)

5.0

Of course, if we don't know what math.log does, we can call help() on it:

help(math.log)

Help on built-in function log in module math:

log(...)
    log(x, [base=math.e])
    Return the logarithm of x to the given base.
    
    If the base not specified, returns the natural logarithm (base e) of x.
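
The optional base argument described in the help text can be tried out as follows. This is a minimal sketch with illustrative inputs; math.log2 and math.log10 are the dedicated functions that appear in the dir(math) listing, and they tend to be more accurate than passing a base to math.log:

```python
import math

# One argument: natural logarithm (base e).
natural = math.log(math.e ** 2)

# Two arguments: logarithm of x to the given base.
base_two = math.log(32, 2)
base_ten = math.log(1000, 10)

# Dedicated functions for the two most common bases.
exact_two = math.log2(32)
exact_ten = math.log10(1000)

print(natural, base_two, base_ten, exact_two, exact_ten)
```

Because the two-argument form is computed with floating-point division, comparing its result with math.isclose() is safer than using ==.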

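We can also call help() on the module itself. This gives us the combined documentation for all the functions and values in the module, along with a high-level description of the module.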

help(math)

Help on module math:

NAME
    math

MODULE REFERENCE
    https://docs.python.org/3.7/library/math
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides access to the mathematical functions
    defined by the C standard.

FUNCTIONS
    acos(x, /)
        Return the arc cosine (measured in radians) of x.
    
    acosh(x, /)
        Return the inverse hyperbolic cosine of x.
    
    asin(x, /)
        Return the arc sine (measured in radians) of x.
    
    asinh(x, /)
        Return the inverse hyperbolic sine of x.
    
    atan(x, /)
        Return the arc tangent (measured in radians) of x.
    
    atan2(y, x, /)
        Return the arc tangent (measured in radians) of y/x.
        
        Unlike atan(y/x), the signs of both x and y are considered.
    
    atanh(x, /)
        Return the inverse hyperbolic tangent of x.
    
    ceil(x, /)
        Return the ceiling of x as an Integral.
        
        This is the smallest integer >= x.
    
    copysign(x, y, /)
        Return a float with the magnitude (absolute value) of x but the sign of y.
        
        On platforms that support signed zeros, copysign(1.0, -0.0)
        returns -1.0.
    
    cos(x, /)
        Return the cosine of x (measured in radians).
    
    cosh(x, /)
        Return the hyperbolic cosine of x.
    
    degrees(x, /)
        Convert angle x from radians to degrees.
    
    erf(x, /)
        Error function at x.
    
    erfc(x, /)
        Complementary error function at x.
    
    exp(x, /)
        Return e raised to the power of x.
    
    expm1(x, /)
        Return exp(x)-1.
        
        This function avoids the loss of precision involved in the direct evaluation of exp(x)-1 for small x.
    
    fabs(x, /)
        Return the absolute value of the float x.
    
    factorial(x, /)
        Find x!.
        
        Raise a ValueError if x is negative or non-integral.
    
    floor(x, /)
        Return the floor of x as an Integral.
        
        This is the largest integer <= x.
    
    fmod(x, y, /)
        Return fmod(x, y), according to platform C.
        
        x % y may differ.
    
    frexp(x, /)
        Return the mantissa and exponent of x, as pair (m, e).
        
        m is a float and e is an int, such that x = m * 2.**e.
        If x is 0, m and e are both 0.  Else 0.5 <= abs(m) < 1.0.
    
    fsum(seq, /)
        Return an accurate floating point sum of values in the iterable seq.
        
        Assumes IEEE-754 floating point arithmetic.
    
    gamma(x, /)
        Gamma function at x.
    
    gcd(x, y, /)
        greatest common divisor of x and y
    
    hypot(x, y, /)
        Return the Euclidean distance, sqrt(x*x + y*y).
    
    isclose(a, b, *, rel_tol=1e-09, abs_tol=0.0)
        Determine whether two floating point numbers are close in value.
        
          rel_tol
            maximum difference for being considered "close", relative to the
            magnitude of the input values
          abs_tol
            maximum difference for being considered "close", regardless of the
            magnitude of the input values
        
        Return True if a is close in value to b, and False otherwise.
        
        For the values to be considered close, the difference between them
        must be smaller than at least one of the tolerances.
        
        -inf, inf and NaN behave similarly to the IEEE 754 Standard.  That
        is, NaN is not close to anything, even itself.  inf and -inf are
        only close to themselves.
    
    isfinite(x, /)
        Return True if x is neither an infinity nor a NaN, and False otherwise.
    
    isinf(x, /)
        Return True if x is a positive or negative infinity, and False otherwise.
    
    isnan(x, /)
        Return True if x is a NaN (not a number), and False otherwise.
    
    ldexp(x, i, /)
        Return x * (2**i).
        
        This is essentially the inverse of frexp().
    
    lgamma(x, /)
        Natural logarithm of absolute value of Gamma function at x.
    
    log(...)
        log(x, [base=math.e])
        Return the logarithm of x to the given base.
        
        If the base not specified, returns the natural logarithm (base e) of x.
    
    log10(x, /)
        Return the base 10 logarithm of x.
    
    log1p(x, /)
        Return the natural logarithm of 1+x (base e).
        
        The result is computed in a way which is accurate for x near zero.
    
    log2(x, /)
        Return the base 2 logarithm of x.
    
    modf(x, /)
        Return the fractional and integer parts of x.
        
        Both results carry the sign of x and are floats.
    
    pow(x, y, /)
        Return x**y (x to the power of y).
    
    radians(x, /)
        Convert angle x from degrees to radians.
    
    remainder(x, y, /)
        Difference between x and the closest integer multiple of y.
        
        Return x - n*y where n*y is the closest integer multiple of y.
        In the case where x is exactly halfway between two multiples of
        y, the nearest even value of n is used. The result is always exact.
    
    sin(x, /)
        Return the sine of x (measured in radians).
    
    sinh(x, /)
        Return the hyperbolic sine of x.
    
    sqrt(x, /)
        Return the square root of x.
    
    tan(x, /)
        Return the tangent of x (measured in radians).
    
    tanh(x, /)
        Return the hyperbolic tangent of x.
    
    trunc(x, /)
        Truncates the Real x to the nearest Integral toward 0.
        
        Uses the __trunc__ magic method.

DATA
    e = 2.718281828459045
    inf = inf
    nan = nan
    pi = 3.141592653589793
    tau = 6.283185307179586

FILE
    /opt/conda/lib/python3.7/lib-dynload/math.cpython-37m-x86_64-linux-gnu.so

Other import syntax

If we know we'll be using a module very frequently, we can import it under a shorter alias (even though math is already pretty short).

import math as mt
mt.pi

3.141592653589793

You may have seen this done with popular libraries such as Pandas, Numpy, Tensorflow, or Matplotlib. For example, it is a common convention to write import numpy as np and import pandas as pd.

The as keyword simply renames the imported module. It is equivalent to doing something like this:

import math
mt = math
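
To check that an alias really is just a second name for the same module, the identity operator is can be used. A minimal sketch:

```python
import math
import math as mt

# Both names are bound to the very same module object.
print(mt is math)
print(mt.pi == math.pi)
```

Both lines print True: importing under an alias does not create a copy of the module.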

Wouldn't it be great if we could refer to all the variables in math by their bare names, e.g. pi instead of math.pi or mt.pi? Good news: we can.

from math import *
print(pi, log(32, 2))

3.141592653589793 5.0

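import * makes all the module's variables directly accessible to you, without any dot syntax.
Bad news: some language purists might grumble at you for doing this.
And they have a point: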

from math import *
from numpy import *
print(pi, log(32, 2))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_19/3018510453.py in <module>
      1 from math import *
      2 from numpy import *
----> 3 print(pi, log(32, 2))

TypeError: return arrays must be of ArrayType

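What happened? It worked before!
These kinds of "star imports" can occasionally lead to weird, difficult-to-debug situations.
The problem here is that the math and numpy modules both have a function called log, but the two have different semantics. Because we imported from numpy second, its log overwrote (or "shadowed") the log we had imported from math.
A good compromise is to import only the specific things we need from each module: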

from math import log, pi
from numpy import asarray
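
The shadowing effect can be reproduced with the standard library alone. The sketch below swaps in cmath for numpy (an assumption made so that it runs without third-party packages); math and cmath both define sqrt, with different semantics:

```python
import math
import cmath

# Qualified names keep the two functions apart.
real_root = math.sqrt(4)        # a float
complex_root = cmath.sqrt(-4)   # a complex number

# Importing the same bare name twice: the later import wins.
from math import sqrt
from cmath import sqrt

winner = sqrt.__module__
print(real_root, complex_root, winner)
```

After the second import, the bare name sqrt refers to cmath's version, which is exactly what happened to log in the numpy example.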

Submodules

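We've seen that modules contain variables which can refer to functions or values. Something to be aware of is that they can also contain variables referring to other modules.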

import numpy
print("numpy.random is a", type(numpy.random))
print("it contains names such as...",
      dir(numpy.random)[-15:]
     )

numpy.random is a <class 'module'>
it contains names such as... ['seed', 'set_state', 'shuffle', 'standard_cauchy', 'standard_exponential', 'standard_gamma', 'standard_normal', 'standard_t', 'test', 'triangular', 'uniform', 'vonmises', 'wald', 'weibull', 'zipf']

So if we import numpy as above, calling a function in the random "submodule" requires two dots.

# Roll 10 dice (randint's upper bound is exclusive, so high=7 gives values 1 through 6)
rolls = numpy.random.randint(low=1, high=7, size=10)
rolls

array([5, 5, 3, 4, 5, 1, 2, 2, 1, 1])
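
The same two-dot pattern shows up in the standard library. As a sketch that avoids third-party packages, os.path stands in for numpy.random here (the file name rolls.csv is purely illustrative):

```python
import os
import types

# os.path is itself a module, reachable as an attribute of os.
print(isinstance(os.path, types.ModuleType))

# Calling a function in the submodule takes two dots.
joined = os.path.join("data", "rolls.csv")
print(os.path.basename(joined))
```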

Oh the places you'll go, oh the objects you'll see

After six lessons, you should be a pro with ints, floats, bools, lists, strings, and dicts (right?).

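Even so, the learning doesn't end there. As you work with libraries for specialized tasks, you'll find that they define their own types which you'll have to learn to work with. For example, with the graphing library matplotlib you'll come into contact with objects it defines to represent Subplots, Figures, TickMarks, and Annotations. pandas functions will give you DataFrames and Series.
In this section, I want to share a quick survival guide for working with strange types.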

Three tools for understanding strange objects

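In the cell above, calling a numpy function gave us an "array", a kind of object we haven't met before. Don't worry: three familiar built-in functions will help us here.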

type() (what is this thing?)

type(rolls)

numpy.ndarray

dir() (what can I do with it?)

print(dir(rolls))

['T', '__abs__', '__add__', '__and__', '__array__', '__array_finalize__', '__array_function__', '__array_interface__', '__array_prepare__', '__array_priority__', '__array_struct__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__complex__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__doc__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__ilshift__', '__imatmul__', '__imod__', '__imul__', '__index__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__irshift__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lshift__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rrshift__', '__rshift__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__', 'all', 'any', 'argmax', 'argmin', 'argpartition', 'argsort', 'astype', 'base', 'byteswap', 'choose', 'clip', 'compress', 'conj', 'conjugate', 'copy', 'ctypes', 'cumprod', 'cumsum', 'data', 'diagonal', 'dot', 'dtype', 'dump', 'dumps', 'fill', 'flags', 'flat', 'flatten', 'getfield', 'imag', 'item', 'itemset', 'itemsize', 'max', 'mean', 'min', 'nbytes', 'ndim', 'newbyteorder', 'nonzero', 'partition', 'prod', 'ptp', 'put', 'ravel', 'real', 'repeat', 'reshape', 'resize', 'round', 'searchsorted', 'setfield', 'setflags', 'shape', 'size', 'sort', 'squeeze', 'std', 'strides', 'sum', 'swapaxes', 'take', 'tobytes', 'tofile', 'tolist', 'tostring', 'trace', 'transpose', 'var', 'view']

# If I want the average roll, the "mean" method looks promising...
rolls.mean()

2.9

# Or maybe I just want to turn the array into a list, in which case I can use "tolist"
rolls.tolist()

[5, 5, 3, 4, 5, 1, 2, 2, 1, 1]

help() (tell me more)

# That "ravel" attribute sounds interesting. I'm a big classical music fan.
help(rolls.ravel)

Help on built-in function ravel:

ravel(...) method of numpy.ndarray instance
    a.ravel([order])
    
    Return a flattened array.
    
    Refer to `numpy.ravel` for full documentation.
    
    See Also
    --------
    numpy.ravel : equivalent function
    
    ndarray.flat : a flat iterator on the array.

# Okay, just tell me everything there is to know about numpy.ndarray
# (Click the "output" button to see the novel-length output)
help(rolls)

(The output is too long to reproduce here; see the online documentation for the full description.)
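
The same three steps work on any unfamiliar object. As a sketch, here they are applied to collections.Counter from the standard library (chosen for illustration only); inspect.getdoc() is used in place of help() so that the docstring comes back as a string instead of being printed through a pager:

```python
import inspect
from collections import Counter

obj = Counter("mississippi")

# 1. type(): what is this thing?
print(type(obj))

# 2. dir(): what can I do with it? (public names only)
public = [name for name in dir(obj) if not name.startswith("_")]
print(public)

# 3. help(): tell me more. getdoc() returns the text that help() would show.
doc = inspect.getdoc(obj.most_common)
print(doc.splitlines()[0])

# Now try the method we just read about.
top = obj.most_common(1)
print(top)
```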

Operator overloading

[3, 4, 1, 2, 2, 1] + 10

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_19/2144087748.py in <module>
----> 1 [3, 4, 1, 2, 2, 1] + 10

TypeError: can only concatenate list (not "int") to list

What a silly question. Of course it's an error.
But what about this?

rolls + 10

array([15, 15, 13, 14, 15, 11, 12, 12, 11, 11])

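We might think that Python strictly polices how pieces of its core syntax behave, such as +, <, in, ==, or square brackets for indexing and slicing. But in fact, it takes a very hands-off approach. When you define a new type, you can choose how addition works for it, or what it means for an object of that type to be equal to something else.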
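
To make that concrete, here is a minimal sketch of a type that chooses its own meaning for +, <= and ==. The class Rolls is hypothetical and only loosely imitates an ndarray; Python calls the methods named __add__, __le__, and __eq__ when it meets the corresponding operators:

```python
class Rolls:
    """Hypothetical list wrapper whose operators work element by element."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # rolls + 10 adds 10 to every element and returns a new Rolls.
        return Rolls(v + other for v in self.values)

    def __le__(self, other):
        # rolls <= 3 returns one True/False per element.
        return [v <= other for v in self.values]

    def __eq__(self, other):
        # Two Rolls are equal when they hold the same values.
        return isinstance(other, Rolls) and self.values == other.values

    def __repr__(self):
        return f"Rolls({self.values})"


rolls = Rolls([5, 5, 3, 4, 5, 1, 2, 2, 1, 1])
shifted = rolls + 10
print(shifted)
print(rolls <= 3)
```

A plain list would raise a TypeError for rolls + 10; this type simply made a different design decision.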
Here are a few more examples of how numpy arrays interact with Python operators in surprising ways:

# At which indices are the dice less than or equal to 3?
rolls <= 3

array([False, False,  True, False, False,  True,  True,  True,  True,
        True])

xlist = [[1,2,3],[2,4,6],]
# Create a 2-dimensional array
x = numpy.asarray(xlist)
print("xlist = {}\nx =\n{}".format(xlist, x))

xlist = [[1, 2, 3], [2, 4, 6]]
x =
[[1 2 3]
 [2 4 6]]

# Get the last element of the second row of our numpy array
x[1,-1]

# Get the last element of the second sublist of our nested list?
xlist[1,-1]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_19/3020169379.py in <module>
      1 # Get the last element of the second sublist of our nested list?
----> 2 xlist[1,-1]

TypeError: list indices must be integers or slices, not tuple

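numpy's ndarray type is specialized for working with multi-dimensional data, so it defines its own indexing logic, allowing us to index with a tuple that specifies the index along each dimension.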
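
Tuple indexing is not magic: x[1, -1] hands the tuple (1, -1) to the object's __getitem__ method. The hypothetical Grid class below is a sketch of that idea, not numpy's actual implementation:

```python
class Grid:
    """Hypothetical 2-D container that accepts grid[row, col]."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __getitem__(self, key):
        if isinstance(key, tuple):
            # grid[1, -1] arrives here as key == (1, -1).
            row, col = key
            return self.rows[row][col]
        # grid[0] falls back to returning a whole row.
        return self.rows[key]


x = Grid([[1, 2, 3], [2, 4, 6]])
print(x[1, -1])
print(x[0])
```

The built-in list implements __getitem__ for integers and slices only, which is why xlist[1, -1] raises a TypeError.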

When does 1 + 1 not equal 2?

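Things can get even weirder. You may have heard of (or even used) tensorflow, a popular Python library for deep learning. It takes operator overloading to the extreme: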

import tensorflow as tf
# Create two constants, each with value 1
a = tf.constant(1)
b = tf.constant(1)
# Add them together to get...
a + b

2021-09-13 19:57:05.691148: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
2021-09-13 19:57:05.691269: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-09-13 19:57:10.718479: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-13 19:57:10.721541: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
2021-09-13 19:57:10.721584: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-13 19:57:10.721611: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (afebc7c86ed6): /proc/driver/nvidia/version does not exist
2021-09-13 19:57:10.723864: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 19:57:10.725410: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<tf.Tensor: shape=(), dtype=int32, numpy=2>

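a + b isn't 2. It is (to quote tensorflow's documentation):

a symbolic handle to one of the outputs of an Operation. It does not hold the values of that operation's output, but instead provides a means of computing those values in a TensorFlow tf.Session.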

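It is important to be aware of this: libraries will often use operator overloading in non-obvious or magical-seeming ways.
Understanding how Python's operators work when applied to ints, strings, and lists is no guarantee that you will immediately understand what they do when applied to a tensorflow Tensor, a numpy ndarray, or a pandas DataFrame.
Once you've had a little taste of DataFrames, for example, an expression like the one below starts to look appealingly intuitive: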

# Get the rows with population over 1m in South America
df[(df['population'] > 10**6) & (df['continent'] == 'South America')]

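But why does it work? The example above features something like 5 different overloaded operators. What is each of them doing? It can help to know the answer when things start going wrong.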
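
One way to take the filter expression apart is to rebuild a toy version of it. The classes Mask, Column, and Table below are hypothetical stand-ins (not pandas' real implementation), and the rows are made-up illustrative data. Each overloaded operator is handled by a specially named method: indexing by column name, >, ==, &, and indexing by a mask.

```python
class Mask:
    """Hypothetical row filter: one True/False flag per row."""

    def __init__(self, flags):
        self.flags = list(flags)

    def __and__(self, other):  # mask & mask
        return Mask(a and b for a, b in zip(self.flags, other.flags))


class Column:
    """Hypothetical column whose comparisons return a Mask, not a bool."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, threshold):  # column > number
        return Mask(v > threshold for v in self.values)

    def __eq__(self, target):  # column == value
        return Mask(v == target for v in self.values)


class Table:
    """Hypothetical table stored as a dict of equally long lists."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def __getitem__(self, key):
        if isinstance(key, str):  # table["name"] gives a Column
            return Column(self.columns[key])
        # table[mask] gives a new Table holding only the flagged rows
        return Table({
            name: [v for v, keep in zip(values, key.flags) if keep]
            for name, values in self.columns.items()
        })


# Made-up rows, for illustration only.
df = Table({
    "name": ["Country A", "Country B", "Country C"],
    "population": [5_000_000, 500_000, 9_000_000],
    "continent": ["South America", "South America", "Europe"],
})

big_sa = df[(df["population"] > 10**6) & (df["continent"] == "South America")]
print(big_sa.columns["name"])
```

The parentheses matter: & binds more tightly than > and ==, so without them Python would try to evaluate 10**6 & df['continent'] first.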
Curious how it all works?
Have you ever called help() or dir() on an object and wondered what all those names with double underscores were?

print(dir(list))

['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

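This turns out to be directly related to operator overloading.
When Python programmers want to define how operators behave on their types, they do so by implementing methods with special names that begin and end with two underscores, such as __lt__, __setattr__, or __contains__. Generally, names that follow this double-underscore format have a special meaning to Python.
For example, the expression x in [1, 2, 3] actually calls the list method __contains__ behind the scenes. It is equivalent to the (much uglier) [1, 2, 3].__contains__(x).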
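
The same hook is available to any class. In this sketch, the hypothetical EvenNumbers type decides for itself what the in operator means:

```python
class EvenNumbers:
    """Hypothetical container that claims to hold every even integer."""

    def __contains__(self, item):
        # Called by Python whenever it evaluates: item in evens
        return isinstance(item, int) and item % 2 == 0


evens = EvenNumbers()
print(4 in evens)
print(7 in evens)

# For a built-in list, the operator and the method call agree.
print([1, 2, 3].__contains__(2))
```

The three lines print True, False, and True.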
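If you're curious to learn more, you can check out Python's official documentation, which describes many, many more of these special double-underscore methods.
We won't be defining our own types in these lessons (if only there was time!), but I hope you'll get the chance to define your own wonderful, weird types and their methods later on.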