python versions compatibility_用于Elasticsearch中的DataFrames,大数据,机器学习和ETL的Python客户端和工具包...

Note, this project is still very much a work in progress and in an alpha state; input and contributions welcome!

PyPI

Conda Forge

Package Status

License

Build Status

What is it?

Eland is a Python Elasticsearch client for exploring and analyzing data residing in Elasticsearch with a familiar Pandas-compatible API.

Where possible the package uses existing Python APIs and data structures to make it easy to switch between numpy, pandas, scikit-learn to their Elasticsearch powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.

For example, to explore data in a large Elasticsearch index, simply create an eland DataFrame from an Elasticsearch index pattern, and explore using an API that mirrors a subset of the pandas.DataFrame API:

>>> import eland as ed

>>> # Connect to 'flights' index via localhost Elasticsearch node

>>> df = ed.DataFrame('localhost:9200', 'flights')

>>> df.head()

AvgTicketPrice Cancelled ... dayOfWeek timestamp

0 841.265642 False ... 0 2018-01-01 00:00:00

1 882.982662 False ... 0 2018-01-01 18:27:00

2 190.636904 False ... 0 2018-01-01 17:11:14

3 181.694216 True ... 0 2018-01-01 10:33:28

4 730.041778 False ... 0 2018-01-01 05:13:00

[5 rows x 27 columns]

>>> df.describe()

AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek

count 13059.000000 13059.000000 ... 13059.000000 13059.000000

mean 628.253689 7092.142457 ... 511.127842 2.835975

std 266.386661 4578.263193 ... 334.741135 1.939365

min 100.020531 0.000000 ... 0.000000 0.000000

25% 410.008918 2470.545974 ... 251.739008 1.000000

50% 640.387285 7612.072403 ... 503.148975 3.000000

75% 842.262193 9735.660463 ... 720.505705 4.239865

max 1199.729004 19881.482422 ... 1902.901978 6.000000

[8 rows x 7 columns]

>>> df[['Carrier', 'AvgTicketPrice', 'Cancelled']]

Carrier AvgTicketPrice Cancelled

0 Kibana Airlines 841.265642 False

1 Logstash Airways 882.982662 False

2 Logstash Airways 190.636904 False

3 Kibana Airlines 181.694216 True

4 Kibana Airlines 730.041778 False

... ... ... ...

13054 Logstash Airways 1080.446279 False

13055 Logstash Airways 646.612941 False

13056 Logstash Airways 997.751876 False

13057 JetBeats 1102.814465 False

13058 JetBeats 858.144337 False

[13059 rows x 3 columns]

>>> df[(df.Carrier=="Kibana Airlines") & (df.AvgTicketPrice > 900.0) & (df.Cancelled == True)].head()

AvgTicketPrice Cancelled ... dayOfWeek timestamp

8 960.869736 True ... 0 2018-01-01 12:09:35

26 975.812632 True ... 0 2018-01-01 15:38:32

311 946.358410 True ... 0 2018-01-01 11:51:12

651 975.383864 True ... 2 2018-01-03 21:13:17

950 907.836523 True ... 2 2018-01-03 05:14:51

[5 rows x 27 columns]

>>> df[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])

DistanceKilometers AvgTicketPrice

sum 9.261629e+07 8.204365e+06

min 0.000000e+00 1.000205e+02

std 4.578263e+03 2.663867e+02

>>> df[['Carrier', 'Origin', 'Dest']].nunique()

Carrier 4

Origin 156

Dest 156

dtype: int64

>>> s = df.AvgTicketPrice * 2 + df.DistanceKilometers - df.FlightDelayMin

>>> s

0 18174.857422

1 10589.365723

2 381.273804

3 739.126221

4 14818.327637

...

13054 10219.474121

13055 8381.823975

13056 12661.157104

13057 20819.488281

13058 18315.431274

Length: 13059, dtype: float64

>>> print(s.info_es())

index_pattern: flights

Index:

index_field: _id

is_source_field: False

Mappings:

capabilities:

es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name

NaN script_field_None False double None float64 True True True script_field_None

Operations:

tasks: []

size: None

sort_params: None

_source: ['script_field_None']

body: {'script_fields': {'script_field_None': {'script': {'source': "(((doc['AvgTicketPrice'].value * 2) + doc['DistanceKilometers'].value) - doc['FlightDelayMin'].value)"}}}}

post_processing: []

>>> pd_df = ed.eland_to_pandas(df)

>>> pd_df.head()

AvgTicketPrice Cancelled ... dayOfWeek timestamp

0 841.265642 False ... 0 2018-01-01 00:00:00

1 882.982662 False ... 0 2018-01-01 18:27:00

2 190.636904 False ... 0 2018-01-01 17:11:14

3 181.694216 True ... 0 2018-01-01 10:33:28

4 730.041778 False ... 0 2018-01-01 05:13:00

[5 rows x 27 columns]

See docs and demo_notebook.ipynb for more examples.

Where to get it

Eland can be installed from PyPI via pip:

$ python -m pip install eland

Eland can also be installed from Conda Forge with Conda:

$ conda install -c conda-forge eland

The source code is currently available on GitHub.

Versions and Compatibility

Python Version Support

Officially Python 3.6 and above.

eland depends on pandas version 1.0.0+.

Elasticsearch Versions

eland is versioned like the Elastic stack (eland 7.5.1 is compatible with Elasticsearch 7.x up to 7.5.1)

A major version of the client is compatible with the same major version of Elasticsearch.

No compatibility assurances are given between different major versions of the client and Elasticsearch. Major differences likely exist between major versions of Elasticsearch, particularly around request and response object formats, but also around API urls and behaviour.

Connecting to Elasticsearch

eland uses the Elasticsearch low level client to connect to Elasticsearch. This client supports a range of [connection options and authentication mechanisms] (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch).

Basic Connection Options

>>> import eland as ed

>>> # Connect to flights index via localhost Elasticsearch node

>>> ed.DataFrame('localhost', 'flights')

>>> # Connect to flights index via localhost Elasticsearch node on port 9200

>>> ed.DataFrame('localhost:9200', 'flights')

>>> # Connect to flights index via localhost Elasticsearch node on port 9200 with : credentials

>>> ed.DataFrame('http://:@localhost:9200', 'flights')

>>> # Connect to flights index via ssl

>>> es = Elasticsearch(

'https://:@localhost:443',

use_ssl=True,

verify_certs=True,

ca_certs='/path/to/ca.crt'

)

>>> ed.DataFrame(es, 'flights')

>>> # Connect to flights index via ssl using Urllib3HttpConnection options

>>> es = Elasticsearch(

['localhost:443', 'other_host:443'],

use_ssl=True,

verify_certs=True,

ca_certs='/path/to/CA_certs',

client_cert='/path/to/clientcert.pem',

client_key='/path/to/clientkey.pem'

)

>>> ed.DataFrame(es, 'flights')

Connecting to an Elasticsearch Cloud Cluster

>>> import eland as ed

>>> from elasticsearch import Elasticsearch

>>> es = Elasticsearch(cloud_id="", http_auth=('',''))

>>> es.info()

{'name': 'instance-0000000000', 'cluster_name': 'bf900cfce5684a81bca0be0cce5913bc', 'cluster_uuid': 'xLPvrV3jQNeadA7oM4l1jA', 'version': {'number': '7.4.2', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': '2f90bbf7b93631e52bafb59b3b049cb44ec25e96', 'build_date': '2019-10-28T20:40:44.881551Z', 'build_snapshot': False, 'lucene_version': '8.2.0', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}

>>> df = ed.read_es(es, 'reviews')

Why eland?

Naming is difficult, but as we had to call it something:

eland: elastic and data

eland: 'Elk/Moose' in Dutch (Alces alces)

Elandsgracht: Amsterdam street near Elastic's Amsterdam office

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值