Use the Apache Spark Connector for SQL Server and Azure SQL

11/04/2019

In this article

The Apache Spark Connector for SQL Server and Azure SQL is a high-performance connector that enables you to use transactional data in big data analytics and persists results for ad-hoc queries or reporting. The connector allows you to use any SQL database, on-premises or in the cloud, as an input data source or output data sink for Spark jobs. The connector uses SQL Server bulk write APIs. Any bulk write parameters can be passed as optional parameters by the user and are passed as-is by the connector to the underlying API. For more information about bulk write operations, see Using bulk copy with the JDBC driver.
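For illustration, here is a minimal sketch of passing bulk-write tuning options through the connector alongside the standard connection options. This sketch is not from the original article: batchsize is the standard Spark JDBC batching option, and tableLock is an option listed in the connector's open source repository; verify both there before relying on them. The connection variables (url, table_name, username, password) are the ones defined in the write examples later in this article.

# Hedged sketch: bulk-write tuning options are passed like any other option.
# "batchsize" is the standard Spark JDBC batching option; "tableLock" is an
# option listed in the connector's open source README (assumed here).
df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("batchsize", "100000") \
    .option("tableLock", "true") \
    .save()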

The connector is included by default in SQL Server Big Data Clusters.

Learn more about the connector at the open source repository. For examples, see samples.

Write to a new SQL Table

Warning

In overwrite mode, the connector by default first drops the table if it already exists in the database. Use this option with due care to avoid unexpected data loss.

When using overwrite mode, if you do not use the truncate option, indexes are lost when the table is re-created. For example, a columnstore table becomes a heap. If you want to maintain existing indexing, also specify the truncate option with value true, for example .option("truncate",true). A variant of the write example using this option follows the example below.

server_name = "jdbc:sqlserver://{SERVER_ADDR}"
database_name = "database_name"
url = server_name + ";" + "databaseName=" + database_name + ";"

table_name = "table_name"
username = "username"
password = "password123!#" # Please specify password here

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
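As noted in the warning above, a variant of this example that preserves existing indexes on overwrite adds the truncate option. This is a minimal sketch based on the option named in the warning; in Python the value is passed as "true".

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", "true") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)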

Append to SQL Table

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)

Specify the isolation level

By default, this connector uses the READ_COMMITTED isolation level when performing the bulk insert into the database. If you wish to override this with another isolation level, use the mssqlIsolationLevel option as shown below.

.option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
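In context, the option slots into the same write pattern used above. For example, combined with the append example (a sketch assembled from the fragments in this article):

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
    .save()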

Read from SQL Table

jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password).load()
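The result is an ordinary Spark DataFrame, so the usual DataFrame API applies. A quick usage sketch (some_column is a hypothetical column name, not from the original article):

jdbcDF.printSchema()   # inspect the schema read from SQL Server
jdbcDF.filter(jdbcDF["some_column"] > 0).show(5)   # "some_column" is hypothetical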

Non-Active Directory mode

In non-Active Directory mode security, each user has a username and password, which must be provided as parameters during connector instantiation to perform reads and/or writes.

An example connector instantiation for non-Active Directory mode is below. Before you run the script, replace the ? with the value for your account.

# Note: '?' is a placeholder for a necessary user-specified value
connector_type = "com.microsoft.sqlserver.jdbc.spark"
url = "jdbc:sqlserver://master-p-svc;databaseName=?;"

writer = df.write \
    .format(connector_type) \
    .mode("overwrite") \
    .option("url", url) \
    .option("user", ?) \
    .option("password", ?)

writer.save()

Active Directory mode

In Active Directory mode security, after a user has generated a keytab file, the user needs to provide the principal and keytab as parameters during connector instantiation.

In this mode, the driver loads the keytab file to the respective executor containers. Then, the executors use the principal name and keytab to generate a token that is used to create a JDBC connector for read/write.

An example connector instantiation for Active Directory mode is below. Before you run the script, replace the ? with the value for your account.

# Note: '?' is a placeholder for a necessary user-specified value
connector_type = "com.microsoft.sqlserver.jdbc.spark"
url = "jdbc:sqlserver://master-p-svc;databaseName=?;integratedSecurity=true;authenticationScheme=JavaKerberos;"

writer = df.write \
    .format(connector_type) \
    .mode("overwrite") \
    .option("url", url) \
    .option("principal", ?) \
    .option("keytab", ?)

writer.save()

Next steps

Have feedback or feature recommendations for SQL Server Big Data Clusters?
