SQL Server Job Performance – Reporting

Description

Once collected, job performance metrics can be used for a variety of reporting needs, from locating jobs that are not performing well to finding optimal release windows, scheduling maintenance, or trending over time. These techniques allow us to maintain insight into parts of SQL Server that often are not monitored closely enough, and to catch job-related problems before they become emergencies.

Introduction

Collecting job performance metrics provides us with the opportunity to then report on that data. In that realm, our imaginations are the only limiting factor.

We can create further fact tables to store aggregated metrics, perform gaps/islands analysis in order to find optimal job times or busy times, and ready data for consumption by reporting tools, such as SSRS or Power BI.

Using these tools & metrics, we can look at past data, in order to observe trends and forecast future job runtimes, allowing us to solve a performance problem before it becomes serious. We can use this data to alert on rogue jobs, or those that are performing well out of their typical boundaries. Data can be compared between servers or environments to hunt for differences that may indicate other unrelated processes that are not running optimally, or compare different hardware configurations on job performance.

The list of applications can continue for quite a while, and will vary depending on how you use SQL Server Agent, and the volume of jobs you create.

SQL Server Agent Job

Previously, we have completed a script that can populate our metrics tables and clean them up as needed. We’ll encapsulate this TSQL into a stored procedure: dbo.usp_get_job_execution_metrics. This stored procedure, as well as all table & index creation scripts, can be downloaded at the end of this article.

To run this regularly, I’ll place the stored procedure execution into its own SQL Server Agent job, with a single step. Deletion of old data can be moved into an independent step if desired, but to keep things simple, I’ve opted to keep it in the main stored procedure.

The advanced tab indicates that the job will complete and report success after the one (and only) step completes.

The schedule for this job is set to run it every 15 minutes. Feel free to adjust as needed, based on the frequency of job executions and the metrics needs of your system.

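For readers following along in script form rather than through the GUI, a rough equivalent using the msdb job procedures might look like the following. This is a sketch: the job and schedule names are illustrative, @database_name is an assumption (use whichever database holds the metrics tables), and all other options are left at their defaults.

USE msdb;
GO
-- Create the collection job with a single TSQL step that calls the stored procedure.
EXEC dbo.sp_add_job
	@job_name = N'Collect Job Execution Metrics';
EXEC dbo.sp_add_jobstep
	@job_name = N'Collect Job Execution Metrics',
	@step_name = N'Collect and clean up metrics',
	@subsystem = N'TSQL',
	@database_name = N'YourMetricsDatabase',   -- assumption: wherever the metrics tables live
	@command = N'EXEC dbo.usp_get_job_execution_metrics;';
-- Schedule: daily, recurring every 15 minutes.
EXEC dbo.sp_add_schedule
	@schedule_name = N'Every 15 Minutes',
	@freq_type = 4,               -- daily
	@freq_interval = 1,           -- every day
	@freq_subday_type = 4,        -- units of minutes
	@freq_subday_interval = 15;
EXEC dbo.sp_attach_schedule
	@job_name = N'Collect Job Execution Metrics',
	@schedule_name = N'Every 15 Minutes';
-- Target the local server so the job actually runs.
EXEC dbo.sp_add_jobserver
	@job_name = N'Collect Job Execution Metrics';
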
At this point, we have job collection tables, a collection stored procedure, and a job that can run regularly to collect and update our data. The last step is to consider how we will report on this data, and build the appropriate solution to present this data to us in a meaningful fashion!

Reporting on Job Performance Metrics

We’re now collecting job metrics, which can be very useful for monitoring and validating job history, but we can do much more with this data. To illustrate this, we will walk through a variety of metrics, showing how they can be calculated and returned to a user, report, or dashboard. The final version of this is included in a stored procedure that is attached to this article.

Now that we are ready to go, let’s consider a handful of metrics to report on:

    • Trending job performance over time, in order to find patterns that require attention.
    • Scheduling new jobs based on existing schedules.
    • Planning downtime and understanding when quiet times occur.
    • Catching problems before they become critical.

Job Runtime Averages

To facilitate the collection of this data, we can create a new table to store these aggregated metrics:

 
CREATE TABLE dbo.fact_daily_job_runtime_metrics
	(	fact_daily_job_runtime_metrics_ID INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_fact_daily_job_runtime_metrics PRIMARY KEY CLUSTERED,
		Sql_Agent_Job_Id UNIQUEIDENTIFIER NOT NULL,
		Job_Run_Date DATE NOT NULL,
		Job_Run_Count INT NOT NULL,
		Job_Run_Failure_Count INT NOT NULL,
		Job_Run_Time_Minimum INT NULL,
		Job_Run_Time_Maximum INT NULL,
		Job_Run_Time_Average INT NULL
	);
 
CREATE NONCLUSTERED INDEX IX_fact_daily_job_runtime_metrics_Sql_Agent_Job_Id ON dbo.fact_daily_job_runtime_metrics (Sql_Agent_Job_Id);
CREATE NONCLUSTERED INDEX IX_fact_daily_job_runtime_metrics_Job_Run_Date ON dbo.fact_daily_job_runtime_metrics (Job_Run_Date);
 

With this table created, we can add some TSQL to our stored procedure to populate it:

 
	DECLARE @Last_Complete_Averages_Collection_Date DATE;
	SELECT
		@Last_Complete_Averages_Collection_Date = MAX(fact_daily_job_runtime_metrics.Job_Run_Date)
	FROM dbo.fact_daily_job_runtime_metrics;
 
	SELECT @Last_Complete_Averages_Collection_Date = ISNULL(@Last_Complete_Averages_Collection_Date, '1/1/1900');
 
	DELETE fact_daily_job_runtime_metrics
	FROM dbo.fact_daily_job_runtime_metrics
	WHERE fact_daily_job_runtime_metrics.Job_Run_Date >= @Last_Complete_Averages_Collection_Date;
 
	INSERT INTO dbo.fact_daily_job_runtime_metrics
		(Sql_Agent_Job_Id, Job_Run_Date, Job_Run_Count, Job_Run_Failure_Count, Job_Run_Time_Minimum, Job_Run_Time_Maximum, Job_Run_Time_Average)
	SELECT
		fact_job_run_time.Sql_Agent_Job_Id,
		CAST(fact_job_run_time.Job_Start_Datetime AS DATE) AS Job_Run_Date,
		COUNT(*) AS Job_Run_Count,
		SUM(CASE WHEN fact_job_run_time.Job_Status = 'Failure' THEN 1 ELSE 0 END) AS Job_Run_Failure_Count,
		MIN(fact_job_run_time.Job_Duration_Seconds) AS Job_Run_Time_Minimum,
		MAX(fact_job_run_time.Job_Duration_Seconds) AS Job_Run_Time_Maximum,
		AVG(fact_job_run_time.Job_Duration_Seconds) AS Job_Run_Time_Average
	FROM dbo.fact_job_run_time
	WHERE fact_job_run_time.Job_Start_Datetime >= @Last_Complete_Averages_Collection_Date
	GROUP BY fact_job_run_time.Sql_Agent_Job_Id, CAST(fact_job_run_time.Job_Start_Datetime AS DATE);  	
 

Note that, before repopulating, we delete data from the most recent populated date onward. This is done as a safeguard against incomplete data, as averages must be recalculated for any day that was still accumulating job runs at the time of the last collection. For reporting convenience, the job_id can be replaced with the job name, if desired.
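
For example, a name-friendly view of the daily metrics can be produced by joining to the dim_sql_agent_job dimension table used throughout this series (a simple sketch):

SELECT
	dim_sql_agent_job.Sql_Agent_Job_Name,
	fact_daily_job_runtime_metrics.Job_Run_Date,
	fact_daily_job_runtime_metrics.Job_Run_Count,
	fact_daily_job_runtime_metrics.Job_Run_Failure_Count,
	fact_daily_job_runtime_metrics.Job_Run_Time_Average
FROM dbo.fact_daily_job_runtime_metrics
INNER JOIN dbo.dim_sql_agent_job
ON fact_daily_job_runtime_metrics.Sql_Agent_Job_Id = dim_sql_agent_job.Sql_Agent_Job_Id
ORDER BY dim_sql_agent_job.Sql_Agent_Job_Name, fact_daily_job_runtime_metrics.Job_Run_Date;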

The result of this script on my local server is a pile of job data that tells me about job executions per day, per job.

This data is aggregated by date, but could easily be updated to compute averages over a given hour, week, month, or any other convenient time period. Similarly, we could create multiple fact tables to track metrics over multiple periods. Once data is aggregated, the granular data may no longer be of use to you, in which case cleanup of older data can be performed more aggressively to improve performance and reduce disk usage.
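
As an illustration, a weekly roll-up could be computed from the daily fact table itself. This sketch uses the standard DATEADD/DATEDIFF idiom to truncate each date to the start of its week, and weights the daily averages by run count:

SELECT
	fact_daily_job_runtime_metrics.Sql_Agent_Job_Id,
	DATEADD(WEEK, DATEDIFF(WEEK, 0, fact_daily_job_runtime_metrics.Job_Run_Date), 0) AS Week_Start_Date,
	SUM(fact_daily_job_runtime_metrics.Job_Run_Count) AS Job_Run_Count,
	SUM(fact_daily_job_runtime_metrics.Job_Run_Failure_Count) AS Job_Run_Failure_Count,
	MIN(fact_daily_job_runtime_metrics.Job_Run_Time_Minimum) AS Job_Run_Time_Minimum,
	MAX(fact_daily_job_runtime_metrics.Job_Run_Time_Maximum) AS Job_Run_Time_Maximum,
	-- Weight each day's average by its run count so busy days are not underrepresented.
	SUM(fact_daily_job_runtime_metrics.Job_Run_Time_Average * fact_daily_job_runtime_metrics.Job_Run_Count)
		/ SUM(fact_daily_job_runtime_metrics.Job_Run_Count) AS Job_Run_Time_Average
FROM dbo.fact_daily_job_runtime_metrics
GROUP BY fact_daily_job_runtime_metrics.Sql_Agent_Job_Id,
	DATEADD(WEEK, DATEDIFF(WEEK, 0, fact_daily_job_runtime_metrics.Job_Run_Date), 0);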

Job Schedule Details

Returning a short list of all job/schedule relationships from our existing tables is almost trivial, now that the data is formatted in a friendly fashion:

 
	SELECT
		dim_sql_agent_job.Sql_Agent_Job_Name,
		dim_sql_agent_schedule.Schedule_Name,
		dim_sql_agent_schedule.Schedule_Occurrence,
		dim_sql_agent_schedule.Schedule_Occurrence_Detail,
		dim_sql_agent_schedule.Schedule_Frequency,
		fact_sql_agent_schedule_assignment.Next_Run_Datetime
	FROM dbo.fact_sql_agent_schedule_assignment
	INNER JOIN dbo.dim_sql_agent_job
	ON fact_sql_agent_schedule_assignment.Sql_Agent_Job_Id = dim_sql_agent_job.Sql_Agent_Job_Id
	INNER JOIN dbo.dim_sql_agent_schedule
	ON fact_sql_agent_schedule_assignment.Schedule_Id = dim_sql_agent_schedule.Schedule_Id
	WHERE dim_sql_agent_job.Is_Enabled = 1
	AND dim_sql_agent_job.Is_Deleted = 0
	AND dim_sql_agent_schedule.Is_Enabled = 1
	AND dim_sql_agent_schedule.Is_Deleted = 0
	AND CURRENT_TIMESTAMP BETWEEN dim_sql_agent_schedule.Schedule_Start_Date AND dim_sql_agent_schedule.Schedule_End_Date
	ORDER BY fact_sql_agent_schedule_assignment.Next_Run_Datetime ASC;
 

The result is a list of all job/schedule pairings, along with the next run time for each job.

We could also use a gaps/islands analysis on the job runtime data in order to determine the longest stretches of time when no jobs are running. First, create a Dim_Time table to hold one row per minute of the day:

 
CREATE TABLE dbo.Dim_Time
(	Dim_Time TIME CONSTRAINT PK_Dim_Time PRIMARY KEY CLUSTERED);
 
-- Seed the first minute of the day (00:00), then advance minute-by-minute.
DECLARE @Current_Time TIME = '00:00';
INSERT INTO dbo.Dim_Time
	(Dim_Time)
SELECT
	@Current_Time;
 
SELECT @Current_Time = DATEADD(MINUTE, 1, @Current_Time);
 
-- TIME wraps at midnight: once 23:59 + 1 minute rolls over to 00:00, the loop exits.
WHILE @Current_Time <= '23:59' AND @Current_Time > '00:00'
BEGIN
	INSERT INTO dbo.Dim_Time
		(Dim_Time)
	SELECT
		@Current_Time;
 
	SELECT @Current_Time = DATEADD(MINUTE, 1, @Current_Time);
END
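
As a side note, the same 1,440 rows could be generated set-based, rather than minute-by-minute in a loop. A sketch (the derived-table alias names are illustrative):

-- Set-based alternative: generate offsets 0-1439 from a catalog cross join
-- and add them to midnight. Use this instead of (not in addition to) the loop above.
INSERT INTO dbo.Dim_Time
	(Dim_Time)
SELECT
	DATEADD(MINUTE, Minute_Sequence.Minute_Offset, CAST('00:00' AS TIME))
FROM (
	SELECT TOP (1440)
		ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS Minute_Offset
	FROM sys.all_objects CROSS JOIN sys.all_objects AS More_Objects
) AS Minute_Sequence;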
 

This data set could be changed to break down times on seconds, hours, or other time parts, if desired, including the addition of many days/dates/years. With this data, we can join duration data for a given day to it and get a picture of job activity over the course of a day:

 
DECLARE @Date_to_Check DATE = '1/10/2017';
 
-- Return every minute of the chosen day during which no job was running.
SELECT
	Dim_Time
FROM dbo.Dim_Time
WHERE NOT EXISTS (SELECT * FROM dbo.fact_job_run_time 
							WHERE Dim_Time.Dim_Time BETWEEN CAST(fact_job_run_time.Job_Start_Datetime AS TIME) AND CAST(fact_job_run_time.Job_End_Datetime AS TIME)
							AND CAST(fact_job_run_time.Job_Start_Datetime AS DATE) = @Date_to_Check)
ORDER BY Dim_Time.Dim_Time;
 

We can take this logic a bit further: analyze our set of times, report back on how many jobs run at any one time, and filter accordingly. This allows for more intelligent scheduling, where we can report on periods with nothing running, 1 job running, 2 jobs running, and so on. Presumably, the more concurrent processes we tolerate, the more windows of availability there will be for scheduling new jobs. To get a data set that shows each time slot (by minute) and the number of concurrent jobs, we can run the following TSQL:

 
DECLARE @Date_to_Check DATE = '1/10/2017';
WITH CTE_JOB_DATA AS (
	SELECT
		CAST(fact_job_run_time.Job_Start_Datetime AS TIME) AS Job_Start_time,
		CAST(fact_job_run_time.Job_End_Datetime AS TIME) AS Job_End_time
	FROM dbo.fact_job_run_time
	WHERE CAST(fact_job_run_time.Job_Start_Datetime AS DATE) = @Date_to_Check),
CTE_DIM_TIME AS (
	SELECT
		Dim_Time,
		COUNT(*) AS Number_Of_Jobs_Running
	FROM dbo.Dim_Time
	INNER JOIN CTE_JOB_DATA
	ON Dim_Time.Dim_Time BETWEEN CTE_JOB_DATA.Job_Start_time AND CTE_JOB_DATA.Job_End_time
	GROUP BY Dim_Time)
SELECT
	*
INTO #Job_Run_Data
FROM CTE_DIM_TIME;
 
SELECT * FROM #Job_Run_Data; 
 

This script takes all the times in the dim_time table and compares them to our job performance data, returning a per-minute count of the jobs running at that time, only for those minutes in which at least one job was running.

From here, we could constrain the results to allow for one job running (or 2, or 3) and perform an islands analysis on it in order to determine the optimal time to run a job. For this example, we’ll allow for a single running job. To facilitate a more efficient query, the results from above will be used, pulling from the temp table, rather than creating one monster query with both aggregation and islands analysis contained within:

 
WITH CTE_JOB_RUN_DATA AS (
	SELECT
		Dim_Time.Dim_Time,
		ISNULL(Job_Run_Data.Number_Of_Jobs_Running, 0) AS Number_Of_Jobs_Running
	FROM Dim_Time
	LEFT JOIN #Job_Run_Data Job_Run_Data
	ON Dim_Time.Dim_Time = Job_Run_Data.Dim_Time
),
ISLAND_START AS
(
	SELECT
		Job_Run_Data.Dim_Time,
		ROW_NUMBER() OVER(ORDER BY Job_Run_Data.Dim_Time ASC) AS Row_Num
	FROM CTE_JOB_RUN_DATA Job_Run_Data
	WHERE Job_Run_Data.Number_Of_Jobs_Running <= 1
	AND EXISTS (
		SELECT *
		FROM CTE_JOB_RUN_DATA Previous_Run
		WHERE Previous_Run.Dim_Time = DATEADD(MINUTE, -1, Job_Run_Data.Dim_Time)
		AND Previous_Run.Number_Of_Jobs_Running > 1)
),
ISLAND_END AS
(
	SELECT
		Job_Run_Data.Dim_Time,
		ROW_NUMBER() OVER(ORDER BY Job_Run_Data.Dim_Time ASC) AS Row_Num
	FROM CTE_JOB_RUN_DATA Job_Run_Data
	WHERE Job_Run_Data.Number_Of_Jobs_Running <= 1
	AND EXISTS (
		SELECT *
		FROM CTE_JOB_RUN_DATA Next_Run
		WHERE Next_Run.Dim_Time = DATEADD(MINUTE, 1, Job_Run_Data.Dim_Time)
		AND Next_Run.Number_Of_Jobs_Running > 1)
)
SELECT
	ISLAND_START.Dim_Time AS Job_Time_Island_Start,
	ISLAND_END.Dim_Time AS Job_Time_Island_End,
	(SELECT COUNT(*) FROM Dim_Time Island_Time_Minutes WHERE Island_Time_Minutes.Dim_Time BETWEEN ISLAND_START.Dim_Time AND ISLAND_END.Dim_Time) AS Island_Time_Minutes
FROM ISLAND_START
INNER JOIN ISLAND_END
ON ISLAND_START.Row_Num = ISLAND_END.Row_Num
ORDER BY ISLAND_START.Dim_Time;
 

The result of this query is a set of acceptable times in which to potentially schedule a new job, the start and end times of each window, and the length of the window (in minutes). If we didn’t have a reliable dim_time table, or a uniform increment of minutes as we do here, an additional ROW_NUMBER could be added to CTE_JOB_RUN_DATA to normalize messy data and allow for easy analysis across it.

This isn’t the most useful view, as we are getting lots of duplication. What we ideally want to see are the longest time windows first. To get this ordering, we can implement another CTE or put the above results into a second temp table, from which we can freely query the small result set. For this example, I’ve put the above data into a temp table called #Job_Runtime_Windows (sketched below) and then run the ordering query that follows.
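
Materializing the island data is just a matter of adding an INTO clause to the final SELECT of the islands query above. A sketch of that amended SELECT (it still requires the CTEs that precede it):

-- Final SELECT of the islands query, amended with INTO so the results land
-- in a temp table for further slicing; the ordering is applied later.
SELECT
	ISLAND_START.Dim_Time AS Job_Time_Island_Start,
	ISLAND_END.Dim_Time AS Job_Time_Island_End,
	(SELECT COUNT(*) FROM Dim_Time Island_Time_Minutes WHERE Island_Time_Minutes.Dim_Time BETWEEN ISLAND_START.Dim_Time AND ISLAND_END.Dim_Time) AS Island_Time_Minutes
INTO #Job_Runtime_Windows
FROM ISLAND_START
INNER JOIN ISLAND_END
ON ISLAND_START.Row_Num = ISLAND_END.Row_Num;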

 
SELECT
	*
FROM #Job_Runtime_Windows
ORDER BY Island_Time_Minutes DESC;
 

The results show a specific time frame that appears ideal for the addition of a new job.

Long and Short Running Jobs

Another area of concern is jobs that run for an abnormal amount of time. To accurately alert or report on these, we need a fairly good idea of what is and is not normal, in terms of both absolute and relative metrics comparisons.

For example, we could create a rule that states, “Any job that runs for 50% longer than its average time should be flagged as long-running”. If a job typically takes 500ms to execute, and one day suddenly takes 1s, we likely won’t want an alert firing, as the difference is still very small. In other words, we would want to consider a threshold for runtime increases to ensure we don’t get false alarms, such as only considering jobs that take 5 minutes or more to execute.

Earlier, we wrote a script to populate fact_daily_job_runtime_metrics, which provides us with a table of daily run stats that can be used as a baseline to compare against. Since that table includes average values and counts, we can compute averages over any time span (weeks, months, etc.). We can then look at all job runs for today and report on any that are taking too long to run:

 
WITH CTE_AVERAGE_JOB_RUNTIME AS (
	SELECT
		dim_sql_agent_job.Sql_Agent_Job_Name,
		dim_sql_agent_job.Sql_Agent_Job_Id,
		SUM(fact_daily_job_runtime_metrics.Job_Run_Count) AS Job_Run_Count,
		SUM(fact_daily_job_runtime_metrics.Job_Run_Time_Average * fact_daily_job_runtime_metrics.Job_Run_Count) / SUM(fact_daily_job_runtime_metrics.Job_Run_Count) AS Job_Run_Time_Average
	FROM dbo.fact_daily_job_runtime_metrics
	INNER JOIN dbo.dim_sql_agent_job
	ON fact_daily_job_runtime_metrics.Sql_Agent_Job_Id = dim_sql_agent_job.Sql_Agent_Job_Id
	GROUP BY dim_sql_agent_job.Sql_Agent_Job_Id, dim_sql_agent_job.Sql_Agent_Job_Name)
SELECT
	CTE_AVERAGE_JOB_RUNTIME.Sql_Agent_Job_Name,
	fact_job_run_time.Job_Start_Datetime,
	fact_job_run_time.Job_End_Datetime,
	fact_job_run_time.Job_Duration_Seconds,
	fact_job_run_time.Job_Status,
	CTE_AVERAGE_JOB_RUNTIME.Job_Run_Count,
	CTE_AVERAGE_JOB_RUNTIME.Job_Run_Time_Average
FROM dbo.fact_job_run_time
INNER JOIN CTE_AVERAGE_JOB_RUNTIME
ON CTE_AVERAGE_JOB_RUNTIME.Sql_Agent_Job_Id = fact_job_run_time.Sql_Agent_Job_Id
WHERE fact_job_run_time.Job_Start_Datetime >= CAST(CURRENT_TIMESTAMP AS DATE)
AND CTE_AVERAGE_JOB_RUNTIME.Job_Run_Time_Average > 15
AND fact_job_run_time.Job_Duration_Seconds > CTE_AVERAGE_JOB_RUNTIME.Job_Run_Time_Average * 2; 
 

This script computes an all-time average of our duration data. If desired, we could constrain it to the past week, month, quarter, or whatever other time frame seems appropriate for “recent” data. We then compare today’s job runtimes to the average, and if any individual job took more than double the average, a row is returned. We intentionally ignore any jobs that average 15 seconds or less, since they would likely cause unnecessary noise. This filter can also be adjusted to be more or less aggressive, to omit specific types of jobs, or to otherwise clean up results such that no false alerts are generated.

On my local server, the results show each instance of a job that took longer than double the average run time (27 seconds, in this case), along with some pertinent details. This reporting can be sent out whenever needed, and could even be alerted on if such a need arose. The metrics that determine a long-running job are completely customizable. We can similarly filter for short-running jobs: those that execute extremely quickly, and which therefore may not be performing the usual amount of work. Adjusting this is as simple as changing the criteria of the WHERE clause above (see the sketch after this list) to:

  • Run for a different time frame: multiple days, a fraction of a day, etc.
  • Set a minimum or maximum threshold to check.
  • Eliminate edge cases, such as jobs that are supposed to be very quick.
  • Create more liberal boundaries for jobs that are known to be erratic. Statistics such as standard deviation can be useful in gauging how inconsistent results are, in order to avoid hard-coding job details.
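
For example, a sketch of the WHERE clause from the query above, flipped to catch suspiciously short runs instead (the thresholds are illustrative):

-- Flag jobs that completed in less than half of their historical average runtime.
WHERE fact_job_run_time.Job_Start_Datetime >= CAST(CURRENT_TIMESTAMP AS DATE)
AND CTE_AVERAGE_JOB_RUNTIME.Job_Run_Time_Average > 15 -- still skip trivially fast jobs
AND fact_job_run_time.Job_Duration_Seconds < CTE_AVERAGE_JOB_RUNTIME.Job_Run_Time_Average / 2;
-- For erratic jobs, STDEV(Job_Duration_Seconds) could be added to the averages CTE
-- to derive per-job boundaries instead of a fixed multiplier.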

Performance

With queries that are full of aggregation, common table expressions, table scans, and tiered queries, it is only natural to inquire about performance.

In general, the processes that write this data are relatively speedy and will do what they need to do quickly, without introducing any latency or resource drain on the system on which they run. The reporting queries generally rely on table scans and have the potential to get slow. While not problematic, this is the primary reason that we separate our reporting data into new, customized tables and generate further reporting tables as we determine a need for more metrics. For example, we create dbo.fact_daily_job_runtime_metrics and store daily averages in this table, rather than run our aggregations directly against our more granular data, or against the MSDB system views.

This provides us with more control, and the ability to design and structure the metrics tables to meet our custom needs: we include only the columns we need, with supporting indexes and helpful reports. Any extraneous data in MSDB can be left out, and we only need to maintain as much data as we wish. Oftentimes, the granular data we store in fact_job_run_time and fact_step_job_run_time can be aggregated into more compact tables, such as the daily runtime metrics table referenced above. Once this data is crunched for the day, we need only keep it for a short while and then delete it. For some use cases, a week or two may be all that is necessary to keep. If all we care about are metrics and we will never review the detail data, then a single day of retention may be sufficient.

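A retention policy along those lines can be a simple scheduled DELETE. This sketch assumes a 7-day window, which should be tuned to your needs:

-- Purge granular run history once it has been aggregated, keeping one week.
DELETE FROM dbo.fact_job_run_time
WHERE fact_job_run_time.Job_Start_Datetime < DATEADD(DAY, -7, CAST(CURRENT_TIMESTAMP AS DATE));
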
By controlling the data size and maintaining only the most useful metrics and relevant data, we can ensure that our reports run quickly. Even the job duration/islands analysis, comprised of 3 cascading CTEs, can be fast, so long as the underlying data is kept simple and streamlined. Consider moving data to temporary tables when crunching more complex metrics, instead of repeatedly accessing a large fact table.

In none of the examples here was performance a significant concern, but knowing how to deal with large reporting tables effectively can help keep things moving along efficiently. We do not want to suffer the irony of a job-monitoring report that itself becomes the resource hog on our server 🙂

Customization

We can easily customize which metrics we collect, as shown previously, but our ability to tailor reports to our own SQL Server environments is even more significant. The data presented here is the tip of the iceberg. With the underlying data present, we could delve into many other areas, such as failed job details, runtime of job steps, automatic or semi-automatic job scheduling, and much more! The techniques to accomplish tasks such as these are the same as presented here.

Be creative and always start with questions prior to building a reporting structure. Decide exactly what you are looking for and build the collection routines and reporting infrastructure to answer those questions. If anything I’ve presented is unnecessary, feel free to remove it.

Conclusion

The techniques above demonstrate some simple ways in which we can collect useful job performance metrics, such as calculating averages over the course of a day. They also show how we can apply more advanced TSQL towards scheduling insight, using an islands analysis over job runtime data in order to determine when the most or fewest jobs are running.

If you come up with any slick ways to use or report on this data, feel free to contact me and let me know! I love seeing the creative ways in which seemingly simple problems can be turned into elegant or brilliant solutions!

Translated from: https://www.sqlshack.com/reporting-job-performance-sql-server/
