Scope Introduction

Introduction

Core Concepts

At a first level of approximation, a Scope script maps some input to some output vai a processing step:

INPUT -> PROCESS -> OUTPUT

Inputs and Outputs are Cosmos Streams. A stream is a file-like data stored in Cosmos. If you run a script locally as stream is just a normal file like any other on your machine.
A typical Scope script has at least one Input and at least one Output.
During the processing phrase an input stream is transformed into a rowset. Rowsets are how Scope internally passes data during script execution. The rowse may be transformed to other rowsets.

Script walkthrough

Below is an example of the concepturally smallest Scope script.

rs0 = EXTRACT
        FirstName : string,
        LastName : string,
        Age : int
      FROM 
        @"/my/Sample_data/test_input.tsv"
      USING DefaultTextExtractor();

OUTPUT rs0
    TO @"/my/output.tsv"
    USING DefaultTextOutputter();
Script Structure

There are two Scope statements. The first uses EXCTRACT command to build a RowSet called rs0 from an input stream and the second statement uses the OUTPUT command to write that RowSet to an output stream.

Reading Data
  • The EXTRACT command transforms a stream into a rowset.
  • The EXTRACT command uses DefaultTextExtractor to parse the stream. By default DefaultTxtExtractor assumes it is reading a TSV (tab-separated-value) file.
  • The DefaultTextExtractor command does not know the schema of the file it is reading, so we must always provide the schema.
Writing Data
  • The rows that are read from the EXTRACT statement flow into a rowset we named rs0.
  • The OUTPUT command sends the rows from rowset rs0 to an output stream.
  • The OUTPUT command uses DefaultTextOutputter to format the row into a TSV format

Scope Keywords are UpperCase
Calling DefaultTextExtractor without any arguments is equivalent to DefaultTextExtractor( delimiter: ‘t’ ) which explicitly identifies that the file is delimited by tabs.
Assume that all stream names are case-sensitive. Sure, when running locally it doesn’t matter but when running in the cluster it does. Likewise, stick to the “/” separator. It works both locally and for remote execution unlike “” which only works on local execution.

Streams

Ultimately reading data means reading from a stream. There are two kinds: Structured and Unstructured.

StructuredStreams

Structured streams are streams that are optimized for storage and access. These are special Cosmos format that Cosmos understands. The format contains the schema. Because Cosmos knows about the schema in Structured Streams there is a special syntax to read and write to them (EXTRACT and OUTPUT aren’t used)

UnstructuredStreams

These are any stream that isn’t structured. The key characteristic of these streams is that Cosmos does NOT inherently know about their schema or format. Thus, the schema must be defined by the user.

Views

A VIEW simplifies reading from streams. A great example of why you would use a VIEW is that it can take on the responsibility of reading multiple rowsets, doing all the joining, filtering, etc for you.

Extractor and Outputter

As mentioned already, unstructured streams require an extractor for reading and an outputter for writing. For TSV/CSV data Scope comes with a DefaultTextExtractor and a DefaultTextOutputter. Also users can create their own custom Extractors and Outputters.

Data types

Native data type

Scope supports most .NET data types. The full list is shown below.

  • Numeric: Byte, sbyte, int uint, long, ulong, float, double, decimal, short, ushort
  • Miscellaneous: bool, Guid, datetime, byte[]
  • Text: char, string
  • Complex: MAP<K, V >, ARRAY<V >
  • Nullable: bool?, Guid?, datetime? byte? sbyte? int?..
User-Defined data types

Scope also allows you to create your own user-defined types that you can use for columns in rowsets.

Scope expressions

Expressions in SELECT

Specific columns can be picked for the output rowset. In the example below two columns are retrieved and a third is calculated. Note that when an expression is used to calculate a value then you must assign that column a name via the AS keyword.

rs1 =
SELECT Start,
       Region,
       Duration + 1.0 AS Duration2
FROM searchlog;
CS blocks and user-defined functions

A calculation can be implemented in C# code then later used in an expression. The code can be stored in a separate DLL or as part of a #CS block as shown below:

rs1 = 
    SELECT Start,
           Region,
           MyHelper.SecondsToMinutes(Duration) AS DurationInMinutes
    FROM searchlog;

OUTPUT rs1
TO "/my/output.txt";

#CS
public static class MyHelper
{
    public static double SecondsToMinutes(int seconds)
    {
        double minutes = seconds/60.0;
        return minutes;
    }
}
#ENDCS
Filtering on calculated Columns in SELECT: WHERE versus HAVING

As in SQL, keep in mind that WHERE operates on input rows and HAVING on output rows.

LINQ and lambdas in expression

In the example below, the Where() LINQ extension method is used to filter for a particular set of URLs. Where(), like most LINQ methods, returns an IEnumerable<T> value. Because Scope does not support interfaces as column types, the ToList() LINQ extension method is used to convert it to a collection type that Scope does support.

rs1 =
    SELECT Urls.Split(';').Where(u=> u.StartsWith("http:")).ToList() AS HttpUrls
    FROM searchlog;

rs2 =
    SELECT string.Join( ";" , HttpUrls) AS HttpUrls
    FROM rs1;

NOTE: DefaultTextOutputter does not automatically serialize Collection types such as List. You’ll need to convert this back to a type that it supports such as a string.

Order of evaluation for expressions

There’s a pattern C# developers are used to, as shown below:

if ((QueryString!=null) && (QueryString.StartsWith("bing"))
{
// do something
}

This pattern depends on a C# behavior (common to many languages) called “short-circuiting”. Simply put, in the above example, when the code runs there’s no logical reason to check both conditions if the first one returns false. Short circuiting is useful because evaluating each condition may be expensive. Thus, it is a technique that compilers use to improve the performance of C# code.
When trying to do the same thing in Scope there are two paths you can pick. Both are valid expressions, but one will cause problems that may not be obvious at first.
The right choice: use && to keep the desired short-circuiting behavior

rs1 =
    SELECT *
    FROM data
    WHERE ((Name!=null) && (Name.StartsWith("bing"));

The wrong choice: use AND which does NOT match the short-circuiting behavior.

rs1 =
    SELECT *
    FROM data
    WHERE ((Name!=null) AND (Name.StartsWith("bing"));

The second translation that uses AND will sometimes fail saying that a NullReference has occurred. (sometimes it might work on your local box but might fail in the cluster)
The reason is simple and by-design: with AND/OR Scope will try to perform certain optimizations that result in better performance – for example it may evaluate the second part of the expression first because it assumes that there is no relationship between the two conditions.
This is a standard optimization technique and the same thing is done in many systems such as SQL. The gain this optimization provides in performance is well worth the occasional confusion it causes for new Scope users – so this behavior will never change.

Summary: if you need this short-circuiting behavior use && and ||.

As an alternative you can use the SQL-like ALL/ANY operators which are equivalent to &&/||.
You CANNOT circumvent the order of evaluation by using multiple statements
Of course, then you’ll be tempted to write your script by splitting apart the expression as shown below.

rs1 =
    SELECT *
    FROM data
    WHERE Name!=null;

rs2 =
    SELECT *
    FROM rs1 
    WHERE Name.StartsWith("bing");

The assumption here is:

First I’ll get the non-null objects and then I can avoid the null reference issue.

This won’t work either. Scope is declarative language not an imperative one. Just because rs1 is defined earlier than rs2 in the script above it does NOT imply that the WHERE condition in rs1 is evaluated before the WHERE in rs2. Scope reserves the right to combine multiple statements together and perform optimizations. You MUST use the && operator if you want to perform short-circuiting.

Logical Operators in Expressions
C#SQL
Logical AND with short-circuiting(a && b && c)ALL(a, b, c)
Logical OR with short-circuiting(a || b || c)ANY(a,cb, c)
Logical NOT!aNOT(a)
Sorting

You can sort rowsets by using the ORDER BY operator. Specifying the ASC and DESC keyword controls whether the sort is ascending or descending, respectively.

TOP N
rs1 = 
    SELECT TOP 5
        Region,
        Duration
    FROM searchlog 
    ORDER BY Duration DESC;
Grouping and aggregation

Grouping, in essence, collapses multiple rows into single rows based on some criteria. Hand-in-hand with performing a grouping operation, some fields in the output rowset must be aggregated into some meaningful value (or discarded if no possible or meaningful aggregation can be done).

rs1 =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM searchlog
    GROUP BY Region;

OUTPUT rs1
TO "/my/Outputs/output.txt";

System-Defined aggregates

Scope contains several common aggregation functions:

  • string
  • AVG
  • COUNT
  • COUNTIF
  • ANY_VALUE
  • LAST
  • LIST
  • MAX
  • MIN
  • STDEV*
Getting a value with ANY_VALUE and FIRST

ANY_VALUE gets a value for that column with no implications about where inside that rowset the value came from. It could be the first value, the last value, or any value in between. It is useful in some scenarios where you don’t care which value you receive as long as you get one.

rs1 =
    SELECT
        ANY_VALUE(Start) AS FirstStart,
        Region
    FROM searchlog
    GROUP BY Region;

FIRST is badly named. It does NOT guarantee you will receive the first value in a rowset. Instead it behaves exactly like ANY_VALUE. Avoid using FIRST, instead use ANY_VALUE.

COUNTIF
ARGMAX

ARGMAX(a, b) = Find the row with the maximum value for column a, from that row return the value for b

Regular Expression

Regular expressions provide advanced text matching capabilities in a terse specification. Regular Expressions are supported in the WHERE and HAVING clauses

Finding simple patterns
// Find all the sessions where the query contained the word pizza (but not pizzeria, for example)

rs1 =
    SELECT
        Start,
        Region,
        Duration
    FROM searchlog
    WHERE REGEX(@"\bpizza.*\b").IsMatch(Query);

OUTPUT rs1
TO "/my/output.txt"
Extracting a REGEX match

Sometimes you’ll need to “pull” out a substring and promote that to a column.

For example, if there is a column called Name and its value can look like “--------Cosmos01------”, “foooCosmos11bar,” etc. and we want to pull out the “Cosmos” parts, then REGEX and Scope make this pretty easy to do.

rs1 =
    SELECT
        Name,
        REGEX(@"Cosmos[0-9]*").Match(Name).Value AS CosmosCluster
    FROM data;
Breaking rows apart with CROSS APPLY

Let’s examine the search log again.

rs1 =
    SELECT
        Region,
        Urls
    FROM searchlog;

The query returns something ike this:

regionUrls
en-usA;B;C
en-gbD;E;F

The Urls column contains strings, but each string is a semicolon-separated list of URLs. What happens if we want to break apart the Urls field so that only a URL is present on every row? For example, below is what we want to see:

regionUrls
en-usA
en-usB
en-usC
en-gbD
en-gbE
en-gbF

This is a perfect job for the CROSS APPLY operator.

rs1 =
    SELECT
        Region,
        Urls
    FROM searchlog;

rs2 =
    SELECT
        Region,
        SplitUrls AS Url
    FROM rs1
    CROSS APPLY Urls.Split(';') AS SplitUrls;

The transformation above is possible to perform programmatically with PROCESSORS – but CROSS APPLY is always preferred to custom processors.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值