Installing PyFlink, with a Hello World introductory example

To take full advantage of Python's data-analysis capabilities on Flink, much as PySpark does for Spark, you need to install PyFlink.

First, activate the previously created virtual environment pyspark_env:

conda activate pyspark_env
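
If the environment does not exist yet, it can be created first (a hedged sketch; Python 3.8 is one of the versions PyFlink 1.16 supports):

conda create -n pyspark_env python=3.8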

PyFlink can be installed from PyPI (the download and installation take a while, so be patient):

 pip install apache-flink==1.16.3
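
Once it finishes, a quick sanity check confirms the package is installed and importable:

pip show apache-flink
python3 -c "import pyflink"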

Note: Since Flink 1.11, PyFlink jobs can run on Windows, so you can also develop and debug PyFlink jobs on Windows. See: Environment Installation | Apache Flink.

A Flink program consists of three parts: Source, Transformation, and Sink (a minimal runnable sketch follows below).

Source: the data source, e.g. readTextFile() or socketTextStream(). Built-in sources fall into four categories: local collections, files, network sockets, and streaming sources such as message queues or distributed logs; custom Sources can also be defined.

Transformation: data transformation, via operators such as map() and flatMap().

Sink: data output. Results can be sent to HDFS, text files, MySQL, Elasticsearch, and so on, e.g. with the writeAsText() operator.
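
A minimal sketch of this three-part structure, using an in-memory collection as the source and stdout as the sink so it runs without any external services:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# source: an in-memory collection
ds = env.from_collection(["hello", "flink"])

# transformation: upper-case every element
ds = ds.map(lambda s: s.upper())

# sink: print to stdout
ds.print()

env.execute()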

A simple example: printing the contents of an HDFS file

from pyflink.datastream import StreamExecutionEnvironment
import os


def printdata(input_path):
    # make the Hadoop jars visible to Flink so it can talk to HDFS
    os.environ["HADOOP_CLASSPATH"] = os.popen("hadoop classpath").read()

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    data_stream = env.read_text_file(input_path)   # source

    data_stream.print()   # sink

    env.execute()


if __name__ == '__main__':
    input = "hdfs://master:9000/input/flinkwords.txt"
    printdata(input)

Run it directly in the virtual environment with python3 test_pyflink_wc0.py; the job prints each line of the HDFS file.

To print a local file instead, simply change the protocol in input to file:///.
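
For example (with a hypothetical local path):

input = "file:///home/hadoop/flinkwords.txt"   # hypothetical local path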

Word count, in batch mode

from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.common import Types
import os


def printdata(input_path):
    # make the Hadoop jars visible to Flink so it can talk to HDFS
    os.environ["HADOOP_CLASSPATH"] = os.popen("hadoop classpath").read()

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_runtime_mode(RuntimeExecutionMode.BATCH)
    env.set_parallelism(1)

    ds = env.read_text_file(input_path)   # source

    def split(line):
        yield from line.split()

    # transformation: split each line into words, map every word to
    # (word, 1), group by word and sum the counts
    ds = ds.flat_map(split) \
           .map(lambda i: (i, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \
           .key_by(lambda i: i[0]) \
           .reduce(lambda i, j: (i[0], i[1] + j[1]))

    ds.print()   # sink

    env.execute()


if __name__ == '__main__':
    input = "hdfs://master:9000/input/flinkwords.txt"
    printdata(input)
    

Running it in the virtual environment prints the per-word counts.
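
Because the job runs in BATCH mode, the reduce emits each key's final count exactly once; in STREAMING mode it would print rolling intermediate updates instead. For a hypothetical input file containing the single line hello world hello, the printed output would look something like:

(world,1)
(hello,2)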

The word-count example bundled with PyFlink

Slightly modified:


from pyflink.common import WatermarkStrategy, Encoder, Types
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors.file_system import (FileSource, StreamFormat, FileSink,
                                                       OutputFileConfig, RollingPolicy)
import os

word_count_data = ["To be, or not to be,--that is the question:--",
                   "Whether 'tis nobler in the mind to suffer",
                   "The slings and arrows of outrageous fortune",
                   "Or to take arms against a sea of troubles,",
                   "And by opposing end them?--To die,--to sleep,--",
                   "No more; and by a sleep to say we end",
                   "The heartache, and the thousand natural shocks",
                   "That flesh is heir to,--'tis a consummation",
                   "Devoutly to be wish'd. To die,--to sleep;--",
                   "To sleep! perchance to dream:--ay, there's the rub;",
                   "For in that sleep of death what dreams may come,",
                   "When we have shuffled off this mortal coil,",
                   "Must give us pause: there's the respect",
                   "That makes calamity of so long life;",
                   "For who would bear the whips and scorns of time,",
                   "The oppressor's wrong, the proud man's contumely,",
                   "The pangs of despis'd love, the law's delay,",
                   "The insolence of office, and the spurns",
                   "That patient merit of the unworthy takes,",
                   "When he himself might his quietus make",
                   "With a bare bodkin? who would these fardels bear,",
                   "To grunt and sweat under a weary life,",
                   "But that the dread of something after death,--",
                   "The undiscover'd country, from whose bourn",
                   "No traveller returns,--puzzles the will,",
                   "And makes us rather bear those ills we have",
                   "Than fly to others that we know not of?",
                   "Thus conscience does make cowards of us all;",
                   "And thus the native hue of resolution",
                   "Is sicklied o'er with the pale cast of thought;",
                   "And enterprises of great pith and moment,",
                   "With this regard, their currents turn awry,",
                   "And lose the name of action.--Soft you now!",
                   "The fair Ophelia!--Nymph, in thy orisons",
                   "Be all my sins remember'd."]


def word_count(input_path, output_path):
    # create the execution environment; make the Hadoop jars visible to Flink
    os.environ["HADOOP_CLASSPATH"] = os.popen("hadoop classpath").read()
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_runtime_mode(RuntimeExecutionMode.BATCH)
    # write all the data to one file
    env.set_parallelism(1)
    

    # define the source
    if input_path is not None:
        ds = env.from_source(
            source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),
                                                       input_path)
                             .process_static_file_set().build(),
            watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
            source_name="file_source"
        )
    else:
        print("Executing word_count example with default input data set.")
        print("Use --input to specify file input.")
        ds = env.from_collection(word_count_data)

    def split(line):
        yield from line.split()

    # compute word count
    ds = ds.flat_map(split) \
           .map(lambda i: (i, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \
           .key_by(lambda i: i[0]) \
           .reduce(lambda i, j: (i[0], i[1] + j[1]))

    # define the sink
    if output_path is not None:
        ds.sink_to(
            sink=FileSink.for_row_format(
                base_path=output_path,
                encoder=Encoder.simple_string_encoder())
            .with_output_file_config(
                OutputFileConfig.builder()
                .with_part_prefix("prefix")
                .with_part_suffix(".ext")
                .build())
            .with_rolling_policy(RollingPolicy.default_rolling_policy())
            .build()
        )
    else:
        print("Printing result to stdout. Use --output to specify output path.")
        ds.print()

    # submit for execution
    env.execute()


if __name__ == '__main__':

    input = "hdfs://master:9000/input/flinkwords.txt"
    output = None   # print to stdout; set a path to write files instead

    word_count(input, output)


The result is the same word-count output as before, printed to stdout.
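
To write the results to files instead of stdout, pass an output path as the second argument (a hypothetical HDFS directory here); the FileSink then produces rolling part files named with the configured prefix and suffix:

word_count(input, "hdfs://master:9000/output/wordcount")   # hypothetical output directory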

That wraps up the simple examples. There is, of course, also the Table API approach, which is left for future exploration.
