Introduction
Core Concepts
At a first level of approximation, a Scope script maps some input to some output via a processing step:
INPUT -> PROCESS -> OUTPUT
Inputs and outputs are Cosmos streams. A stream is a file-like piece of data stored in Cosmos. If you run a script locally, a stream is just a normal file like any other on your machine.
A typical Scope script has at least one Input and at least one Output.
During the processing phase an input stream is transformed into a rowset. Rowsets are how Scope internally passes data during script execution. A rowset may be transformed into other rowsets.
Script walkthrough
Below is an example of the conceptually smallest Scope script.
rs0 = EXTRACT
FirstName : string,
LastName : string,
Age : int
FROM
@"/my/Sample_data/test_input.tsv"
USING DefaultTextExtractor();
OUTPUT rs0
TO @"/my/output.tsv"
USING DefaultTextOutputter();
Script Structure
There are two Scope statements. The first uses the EXTRACT command to build a rowset called rs0 from an input stream; the second uses the OUTPUT command to write that rowset to an output stream.
Reading Data
- The EXTRACT command transforms a stream into a rowset.
- The EXTRACT command uses DefaultTextExtractor to parse the stream. By default DefaultTextExtractor assumes it is reading a TSV (tab-separated values) file.
- DefaultTextExtractor does not know the schema of the file it is reading, so the schema must always be provided in the EXTRACT statement.
Writing Data
- The rows that are read from the EXTRACT statement flow into a rowset we named rs0.
- The OUTPUT command sends the rows from rowset rs0 to an output stream.
- The OUTPUT command uses DefaultTextOutputter to format the rows as TSV.
Scope Keywords Are Uppercase
Calling DefaultTextExtractor without any arguments is equivalent to DefaultTextExtractor( delimiter: '\t' ), which explicitly states that the file is delimited by tabs.
Assume that all stream names are case-sensitive. When running locally it doesn't matter, but when running in the cluster it does. Likewise, stick to the "/" path separator: it works both locally and for remote execution, unlike "\", which only works for local execution.
Streams
Ultimately reading data means reading from a stream. There are two kinds: Structured and Unstructured.
Structured Streams
Structured streams are streams optimized for storage and access. They use a special format that Cosmos understands, and the format itself contains the schema. Because Cosmos knows the schema of a structured stream, there is special syntax to read and write them (EXTRACT and OUTPUT aren't used).
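A sketch of what that special syntax can look like, assuming a structured stream at a hypothetical path (verify the exact syntax for your Scope version):
rs0 = SSTREAM @"/my/Sample_data/input.ss";
OUTPUT rs0
TO SSTREAM @"/my/output.ss";
Note that no schema and no extractor are specified: the stream itself carries the schema.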
Unstructured Streams
These are any stream that isn’t structured. The key characteristic of these streams is that Cosmos does NOT inherently know about their schema or format. Thus, the schema must be defined by the user.
Views
A VIEW simplifies reading from streams. A typical reason to use a VIEW is that it can take on the responsibility of reading multiple rowsets and doing all the joining, filtering, and so on for you.
Extractor and Outputter
As mentioned already, unstructured streams require an extractor for reading and an outputter for writing. For TSV/CSV data, Scope comes with DefaultTextExtractor and DefaultTextOutputter. Users can also create their own custom extractors and outputters.
Data types
Native data types
Scope supports most .NET data types. The full list is shown below.
- Numeric: byte, sbyte, short, ushort, int, uint, long, ulong, float, double, decimal
- Miscellaneous: bool, Guid, DateTime, byte[]
- Text: char, string
- Complex: MAP<K,V>, ARRAY<V>
- Nullable: bool?, Guid?, DateTime?, byte?, sbyte?, int?, and so on
User-Defined data types
Scope also allows you to create your own user-defined types that you can use for columns in rowsets.
Scope expressions
Expressions in SELECT
Specific columns can be picked for the output rowset. In the example below, two columns are retrieved and a third is calculated. Note that when an expression is used to calculate a value, you must assign that column a name via the AS keyword.
rs1 =
SELECT Start,
Region,
Duration + 1.0 AS Duration2
FROM searchlog;
CS blocks and user-defined functions
A calculation can be implemented in C# code and then used in an expression. The code can live in a separate DLL or in a #CS block, as shown below:
rs1 =
SELECT Start,
Region,
MyHelper.SecondsToMinutes(Duration) AS DurationInMinutes
FROM searchlog;
OUTPUT rs1
TO "/my/output.txt";
#CS
public static class MyHelper
{
public static double SecondsToMinutes(int seconds)
{
double minutes = seconds/60.0;
return minutes;
}
}
#ENDCS
Filtering on calculated Columns in SELECT: WHERE versus HAVING
As in SQL, keep in mind that WHERE operates on input rows while HAVING operates on output rows. A column calculated in the SELECT clause does not exist on the input rows, so it must be filtered with HAVING.
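A sketch, reusing the searchlog rowset from the earlier examples (the threshold is illustrative). Duration2 is calculated in the SELECT clause, so the filter on it goes in HAVING rather than WHERE:
rs1 =
SELECT Start,
Region,
Duration + 1.0 AS Duration2
FROM searchlog
HAVING Duration2 > 10.0;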
LINQ and lambdas in expression
In the example below, the Where() LINQ extension method is used to filter for a particular set of URLs. Where(), like most LINQ methods, returns an IEnumerable<T> value. Because Scope does not support interfaces as column types, the ToList() LINQ extension method is used to convert it to a collection type that Scope does support.
rs1 =
SELECT Urls.Split(';').Where(u=> u.StartsWith("http:")).ToList() AS HttpUrls
FROM searchlog;
rs2 =
SELECT string.Join( ";" , HttpUrls) AS HttpUrls
FROM rs1;
NOTE: DefaultTextOutputter does not automatically serialize Collection types such as List. You’ll need to convert this back to a type that it supports such as a string.
Order of evaluation for expressions
There’s a pattern C# developers are used to, as shown below:
if ((QueryString != null) && QueryString.StartsWith("bing"))
{
// do something
}
This pattern depends on a C# behavior (common to many languages) called "short-circuiting". Simply put, in the above example there is no logical reason to check the second condition if the first one returns false. Short-circuiting is useful because evaluating each condition may be expensive, so skipping unnecessary checks improves the performance of the code.
When trying to do the same thing in Scope there are two paths you can pick. Both are valid expressions, but one will cause problems that may not be obvious at first.
The right choice: use && to keep the desired short-circuiting behavior
rs1 =
SELECT *
FROM data
WHERE (Name != null) && Name.StartsWith("bing");
The wrong choice: use AND, which does NOT match the short-circuiting behavior.
rs1 =
SELECT *
FROM data
WHERE (Name != null) AND Name.StartsWith("bing");
The second version, which uses AND, will sometimes fail with a NullReferenceException. (It might work on your local box but fail in the cluster.)
The reason is simple and by design: with AND/OR, Scope will try to perform certain optimizations that result in better performance. For example, it may evaluate the second part of the expression first because it assumes there is no relationship between the two conditions.
This is a standard optimization technique, and the same thing is done in many systems such as SQL. The performance gain is well worth the occasional confusion it causes for new Scope users, so this behavior will not change.
Summary: if you need this short-circuiting behavior use && and ||.
As an alternative you can use the SQL-like ALL/ANY operators which are equivalent to &&/||.
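For example, the earlier filter can be written with ALL instead of &&. A sketch:
rs1 =
SELECT *
FROM data
WHERE ALL(Name != null, Name.StartsWith("bing"));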
You CANNOT circumvent the order of evaluation by using multiple statements
You may then be tempted to write the script by splitting the expression across two statements, as shown below.
rs1 =
SELECT *
FROM data
WHERE Name!=null;
rs2 =
SELECT *
FROM rs1
WHERE Name.StartsWith("bing");
The assumption here is:
First I’ll get the non-null objects and then I can avoid the null reference issue.
This won't work either. Scope is a declarative language, not an imperative one. Just because rs1 is defined earlier than rs2 in the script above does NOT mean that the WHERE condition in rs1 is evaluated before the WHERE in rs2. Scope reserves the right to combine multiple statements and optimize across them. You MUST use the && operator if you want short-circuiting.
Logical Operators in Expressions
Operation | C# | SQL |
---|---|---|
Logical AND with short-circuiting | (a && b && c) | ALL(a, b, c) |
Logical OR with short-circuiting | (a \|\| b \|\| c) | ANY(a, b, c) |
Logical NOT | !a | NOT(a) |
Sorting
You can sort rowsets with the ORDER BY operator. The ASC and DESC keywords control whether the sort is ascending or descending, respectively.
TOP N
rs1 =
SELECT TOP 5
Region,
Duration
FROM searchlog
ORDER BY Duration DESC;
Grouping and aggregation
Grouping, in essence, collapses multiple rows into single rows based on some criteria. Hand in hand with grouping, the remaining fields in the output rowset must be aggregated into some meaningful value (or discarded if no meaningful aggregation is possible).
rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM searchlog
GROUP BY Region;
OUTPUT rs1
TO "/my/Outputs/output.txt";
System-Defined aggregates
Scope contains several common aggregation functions:
- ANY_VALUE
- AVG
- COUNT
- COUNTIF
- FIRST
- LAST
- LIST
- MAX
- MIN
- STDEV
- SUM
Getting a value with ANY_VALUE and FIRST
ANY_VALUE gets a value for that column with no guarantee about where inside the rowset the value came from. It could be the first value, the last value, or any value in between. It is useful when you don't care which value you receive as long as you get one.
rs1 =
SELECT
ANY_VALUE(Start) AS FirstStart,
Region
FROM searchlog
GROUP BY Region;
FIRST is badly named: it does NOT guarantee that you receive the first value in a rowset. Instead it behaves exactly like ANY_VALUE. Avoid FIRST; use ANY_VALUE instead.
COUNTIF
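COUNTIF counts only the rows that satisfy a predicate. A sketch, reusing the searchlog rowset from earlier examples (the threshold is illustrative):
rs1 =
SELECT
Region,
COUNTIF(Duration > 60) AS LongSessions
FROM searchlog
GROUP BY Region;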
ARGMAX
ARGMAX(a, b): find the row with the maximum value for column a; from that row, return the value of column b.
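A sketch, again reusing searchlog: for each region, return the Query of the session with the longest Duration (the column choices are illustrative).
rs1 =
SELECT
Region,
ARGMAX(Duration, Query) AS LongestSessionQuery
FROM searchlog
GROUP BY Region;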
Regular Expressions
Regular expressions provide advanced text-matching capabilities in a terse specification. Regular expressions are supported in the WHERE and HAVING clauses.
Finding simple patterns
// Find all the sessions where the query contained the word pizza (but not pizzeria, for example)
rs1 =
SELECT
Start,
Region,
Duration
FROM searchlog
WHERE REGEX(@"\bpizza.*\b").IsMatch(Query);
OUTPUT rs1
TO "/my/output.txt"
Extracting a REGEX match
Sometimes you'll need to pull out a substring and promote it to a column.
For example, suppose there is a column called Name whose values look like "--------Cosmos01------", "foooCosmos11bar", etc., and we want to pull out the "Cosmos" parts. REGEX and Scope make this easy to do.
rs1 =
SELECT
Name,
REGEX(@"Cosmos[0-9]*").Match(Name).Value AS CosmosCluster
FROM data;
Breaking rows apart with CROSS APPLY
Let’s examine the search log again.
rs1 =
SELECT
Region,
Urls
FROM searchlog;
The query returns something like this:
Region | Urls |
---|---|
en-us | A;B;C |
en-gb | D;E;F |
The Urls column contains strings, but each string is a semicolon-separated list of URLs. What if we want to break apart the Urls field so that each row contains a single URL? Below is what we want to see:
Region | Urls |
---|---|
en-us | A |
en-us | B |
en-us | C |
en-gb | D |
en-gb | E |
en-gb | F |
This is a perfect job for the CROSS APPLY operator.
rs1 =
SELECT
Region,
Urls
FROM searchlog;
rs2 =
SELECT
Region,
SplitUrls AS Url
FROM rs1
CROSS APPLY Urls.Split(';') AS SplitUrls;
The transformation above can also be performed programmatically with processors, but CROSS APPLY is preferred over custom processors.