一、tags设定
Python pandas dataframe to Influxdb with column and other tags
AUTHOR manish DATE August 6, 2020
Python pandas is very powerful data processing library. It can take large data from variety of sources (CSVs, databases, excel etc) and process it in memory.
After processing the data you may have to dump it in influxdb. In influxdb each measurement has tags and fields. While writing pandas dataframe to influxdb we can define these tags.
We can use specific columns of data frame as tags along with some default tags. Let us understand this with a simple implementation.
Consider following CSV, sample.csv, having schools data of a City, State, country. Which means city, state and country are fixed tag and additionally we will make “School”, “Class” as dataframe row tags.
TimeStamp,School,Class,Students,Teachers,Absent
5/9/20 5:45:00 PM GMT,USSMS,UKG,30,2,3
5/9/20 5:45:00 PM GMT,USSMS,10,50,3,2
5/9/20 5:45:00 PM GMT,USSMS,8,40,2,0
5/9/20 6:45:00 PM GMT,St Thomas,Nur,30,2,4
5/9/20 6:45:00 PM GMT,St Thomas,6,55,3,4
5/9/20 7:45:00 PM GMT,St Thomas,9,34,2,2
Now let us look at the code, pyinflx.py.
#!/usr/bin/python3
import pandas as pd
import json
from influxdb import DataFrameClient
dbhost = 'localhost'
dbport = 8086
dbuser = 'admin'
dbpasswd = 'xxxxx'
dbname = 'schooldb'
protocol = 'line'
# Use only following fields from CSV.
Fields = ['TimeStamp','School','Class','Students','Absent']
# Define tag fields
datatags = ['School','Class']
# Define fixed tags
fixtags = {"Country": "India", "State": "Haryana", "City": "Kurukshetra"}
# Read data from CSV without index and parse 'TimeStamp' as date.
df = pd.read_csv("sample.csv", sep=',', index_col=False, parse_dates=['TimeStamp'], usecols=Fields)
# Set 'TimeStamp' field as index of dataframe
df.set_index('TimeStamp', inplace = True)
print(df.head())
client = DataFrameClient(dbhost, dbport, dbuser, dbpasswd, dbname)
# Write data to "SchoolData" measurement of "schooldb" database.
client.write_points(df,"SchoolData",tags=fixtags,tag_columns=datatags,protocol=protocol)
Fire the code !
./pyinflx.py
School Class Students Absent
TimeStamp
2020-05-09 17:45:00+00:00 USSMS UKG 30 3
2020-05-09 17:45:00+00:00 USSMS 10 50 2
2020-05-09 17:45:00+00:00 USSMS 8 40 0
2020-05-09 18:45:00+00:00 St Thomas Nur 30 4
2020-05-09 18:45:00+00:00 St Thomas 6 55 4
Check data in influxdb
> select * from SchoolData
name: SchoolData
time Absent City Class Country School State Students
---- ------ ---- ----- ------- ------ ----- --------
2020-05-09T17:45:00Z 2 Kurukshetra 10 India USSMS Haryana 50
2020-05-09T17:45:00Z 3 Kurukshetra UKG India USSMS Haryana 30
2020-05-09T17:45:00Z 0 Kurukshetra 8 India USSMS Haryana 40
2020-05-09T18:45:00Z 4 Kurukshetra 6 India St Thomas Haryana 55
2020-05-09T18:45:00Z 4 Kurukshetra Nur India St Thomas Haryana 30
2020-05-09T19:45:00Z 2 Kurukshetra 9 India St Thomas Haryana 34
Check tags in influxdb
> show tag keys
name: SchoolData
tagKey
------
City
Class
Country
School
State
二、设置dataframe的index为influx的time格式
"""
Import VIX - CBOE Volatility Index - from "vix-daily.csv" file into InfluxDB 2.0
https://datahub.io/core/finance-vix#data
"""
from collections import OrderedDict
from csv import DictReader
import ciso8601
import requests
import rx
from pytz import UTC
from rx import operators as ops
from influxdb_client import InfluxDBClient, WriteOptions
from influxdb_client.client.write.point import EPOCH
_progress = 0
def parse_row(row: OrderedDict):
"""Parse row of CSV file into LineProtocol with structure:
CSV format:
date,symbol,open,close,low,high,volume
2016-01-05,WLTW,123.43,125.839996,122.309998,126.25,2163600.0
2016-01-06,WLTW,125.239998,119.980003,119.940002,125.540001,2386400.0
2016-01-07,WLTW,116.379997,114.949997,114.93,119.739998,2489500.0
2016-01-08,WLTW,115.480003,116.620003,113.5,117.440002,2006300.0
2016-01-11,WLTW,117.010002,114.970001,114.089996,117.330002,1408600.0
2016-01-12,WLTW,115.510002,115.550003,114.5,116.059998,1098000.0
2016-01-13,WLTW,116.459999,112.849998,112.589996,117.07,949600.0
...
:param row: the row of CSV file
:return: Parsed csv row to LineProtocol
"""
global _progress
_progress += 1
if _progress % 10000 == 0:
print(_progress)
time = (UTC.localize(ciso8601.parse_datetime(row["date"])) - EPOCH).total_seconds() * 1e9
return f'financial-analysis,symbol={row["symbol"]} ' \
f'close={row["close"]},high={row["high"]},low={row["low"]},open={row["open"]} ' \
f'{int(time)}'
def main():
parse_row.progress = 0
url = "https://github.com/influxdata/influxdb-client-python/wiki/data/stock-prices-example.csv"
response = requests.get(url, stream=True)
data = rx \
.from_iterable(DictReader(response.iter_lines(decode_unicode=True))) \
.pipe(ops.map(lambda row: parse_row(row)))
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org", debug=False)
write_api = client.write_api(write_options=WriteOptions(batch_size=50_000, flush_interval=10_000))
write_api.write(bucket="my-bucket", record=data)
write_api.close()
query = '''
from(bucket:"my-bucket")
|> range(start: 0, stop: now())
|> filter(fn: (r) => r._measurement == "financial-analysis")
|> filter(fn: (r) => r.symbol == "AAPL")
|> filter(fn: (r) => r._field == "close")
|> drop(columns: ["_start", "_stop", "table", "_field","_measurement"])
'''
result = client.query_api().query_data_frame(query=query)
print(result.head(100))
"""
Close client
"""
client.close()
# %%
if __name__ == '__main__':
main()
三、读写dataframe双向
Getting Started with InfluxDB and Pandas
By Anais Dotis-Georgiou / January 16, 2020 / Community, Developer, Flux, InfluxDB, Tutorial / 3 Comments
3 minutes
InfluxData prides itself on prioritizing developer happiness. A large part of maintaining developer happiness is providing client libraries that allow users to interact with the database through the language and library of their choosing. Data analysis is the task most broadly associated with Python use cases, accounting for 58% of Python tasks, so it makes sense that Pandas is the second most popular library for Python users. The 2.0 InfluxDB Python Client Data supports Pandas DataFrames to invite those data scientists to use InfluxDB with ease.
In this tutorial, we’ll learn how to query our InfluxDB instance and return the data as a DataFrame. We’ll also explore some data science resources that exist as a part of the Client repo. To learn about how to get started with the InfluxDB Python Client Library, please take a look at this blog.
Me eagerly consuming Pandas and InfluxDB Documentation. Photo by Sid Balachandran on Unsplash.
Data science resources
A variety of data science resources have been included in the InfluxDB Python Client repo to help you take advantage of the Pandas functionality of the client. I encourage you to take a look at the example notebooks. They are a collection of Jupyter Notebooks providing examples with a variety of time series data science and analytics solutions, e.g. how to integrate Tensorflow and Keras for predictions.
From InfluxDB to a DataFrame
Import the client and Pandas:
from influxdb_client import InfluxDBClient
import pandas as pd
Copy
Supply auth parameters:
my_token = my-token
my_org = "my-org"
bucket = "system"
Copy
Write your Flux query:
query= '''
from(bucket: "system")
|> range(start:-5m, stop: now())
|> filter(fn: (r) => r._measurement == "cpu")
|> filter(fn: (r) => r._field == "usage_user")
|> filter(fn: (r) => r.cpu == "cpu-total")'''
Copy
Query InfluxDB and return a Dataframe:
client = InfluxDBClient(url="http://localhost:9999", token=my_token, org=my_org, debug=False)
system_stats = client.query_api().query_data_frame(org=my_org, query=query)
display(system_stats.head())
Copy
From DataFrame to InfluxDB
Write DataFrame to InfluxDB::
from influxdb_client import InfluxDBClient, Point, WriteOptions
from influxdb_client.client.write_api import SYNCHRONOUS
# Preparing Dataframe:
system_stats.drop(columns=['result', 'table','start','stop'])
# DataFrame must have the timestamp column as an index for the client.
system_stats.set_index("_time")
_write_client.write(bucket.name, record=system_stats, data_frame_measurement_name='cpu',
data_frame_tag_columns=['cpu'])
Copy
Close client:
_write_client.__del__()
client.__del__()
Copy
Pandas to complement Flux with InfluxDB
Although Flux has many of the data transformation capabilities that Pandas does, InfluxDB values developers’ time. If you are dealing with a smaller dataset, you might not have much incentive to do those transformations on the server side or learn Flux. Hopefully this Pandas capability can help you execute your time series analysis faster. As always, if you run into hurdles, please share them on our community site or Slack channel. We’d love to get your feedback and help you with any problems you run into.