PyJava
This library is an ongoing effort to enable data exchange between Java/Scala and Python.
PyJava uses Apache Arrow as the exchange format, which avoids serialization and
deserialization (ser/de) between Java/Scala and Python and makes communication considerably
faster than the traditional approach.
When you invoke Python code from the Java/Scala side, PyJava automatically starts Python workers,
sends the data to them, and returns the results once processing finishes. The Python workers are reused
by default.
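The worker lifecycle described above (start a worker on demand, send it data, collect the result, keep the worker around for the next call) can be sketched in plain Python. This is only an illustration under stated assumptions, not PyJava's actual implementation: `Worker`, `WorkerPool`, and `add_suffix` are hypothetical names, and the "worker" here is an in-process object rather than a real Python process fed with Arrow batches.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Row = Dict[str, str]


class Worker:
    """Hypothetical stand-in for a long-lived Python worker process."""

    def run(self, func: Callable[[Row], Row], rows: List[Row]) -> List[Row]:
        # Apply the user function to every row that was sent to the worker.
        return [func(row) for row in rows]


class WorkerPool:
    """Hands out an idle worker if one exists; starts a new one otherwise."""

    def __init__(self) -> None:
        self._idle: Deque[Worker] = deque()
        self.started = 0  # number of workers created so far

    def acquire(self) -> Worker:
        if self._idle:
            return self._idle.popleft()  # reuse an existing worker
        self.started += 1
        return Worker()

    def release(self, worker: Worker) -> None:
        # Keep the worker so that the next invocation can reuse it.
        self._idle.append(worker)

    def invoke(self, func: Callable[[Row], Row], rows: List[Row]) -> List[Row]:
        worker = self.acquire()
        try:
            return worker.run(func, rows)
        finally:
            self.release(worker)


def add_suffix(row: Row) -> Row:
    # Example user function: derive a new column from an existing one.
    return {**row, "value1": row["value"] + "_suffix"}


pool = WorkerPool()
first = pool.invoke(add_suffix, [{"value": "a"}, {"value": "b"}])
second = pool.invoke(add_suffix, [{"value": "c"}])
```

Both calls go through the same pool, so the second invocation picks up the worker released by the first instead of starting a new one; `pool.started` stays at 1.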
The initial code in this library comes from Apache Spark.
Install
Set up a Python (>= 3.6) environment (Conda is recommended):
pip uninstall pyjava && pip install pyjava
Set up the Java environment (Maven is recommended):
<dependency>
    <groupId>tech.mlsql</groupId>
    <artifactId>pyjava-2.4_2.12</artifactId>
    <version>0.2.8.0</version>
</dependency>
Using a Python code snippet to process data in Java/Scala
With PyJava, you can run any Python code in your Java/Scala application.
val envs = new util.HashMap[String, String]()
// prepare python environment
envs.put(str(PythonConf.PYTHON_ENV), "source activate dev && export ARROW_PRE_0_15_IPC_FORMAT=1 ")
// describe the data that will be transferred to Python
val sourceSchema = StructType(Seq(StructField("value", StringType)))
val batch = new ArrowPythonRunner(
Seq(ChainedPythonFunctions(Seq(PythonFunction(
"""
|import pandas as pd
|import numpy as np
|
|def process():
|    for item in context.fetch_once_as_rows():
|        item["value1"] = item["value"] + "_suffix"
|        yield item
|
|context.build_result(process())
""".stripMargin, envs, "python", "3.6")))), sourceSc