Introduction
Core Concepts
At a first level of approximation, a Scope script maps some input to some output via a processing step:
INPUT -> PROCESS -> OUTPUT
Inputs and outputs are Cosmos streams. A stream is a file-like piece of data stored in Cosmos. If you run a script locally, a stream is just a normal file like any other on your machine.
A typical Scope script has at least one Input and at least one Output.
During the processing phase an input stream is transformed into a rowset. Rowsets are how Scope internally passes data during script execution. A rowset may be transformed into other rowsets.
Script walkthrough
Below is an example of the conceptually smallest Scope script.
rs0 = EXTRACT
FirstName : string,
LastName : string,
Age : int
FROM
@"/my/Sample_data/test_input.tsv"
USING DefaultTextExtractor();
OUTPUT rs0
TO @"/my/output.tsv"
USING DefaultTextOutputter();
Script Structure
There are two Scope statements. The first uses the EXTRACT command to build a rowset called rs0 from an input stream; the second uses the OUTPUT command to write that rowset to an output stream.
Reading Data
- The EXTRACT command transforms a stream into a rowset.
- The EXTRACT command uses DefaultTextExtractor to parse the stream. By default DefaultTextExtractor assumes it is reading a TSV (tab-separated values) file.
- DefaultTextExtractor does not know the schema of the file it is reading, so the schema must always be provided in the EXTRACT statement.
Writing Data
- The rows that are read from the EXTRACT statement flow into a rowset we named rs0.
- The OUTPUT command sends the rows from rowset rs0 to an output stream.
- The OUTPUT command uses DefaultTextOutputter to format the rows as TSV.
Scope Keywords Are Uppercase
Calling DefaultTextExtractor without any arguments is equivalent to DefaultTextExtractor( delimiter: '\t' ), which explicitly states that the file is delimited by tabs.
Assume that all stream names are case-sensitive. When running locally it doesn't matter, but when running in the cluster it does. Likewise, stick to the "/" path separator: it works both locally and for remote execution, unlike "\", which only works for local execution.
Streams
Ultimately reading data means reading from a stream. There are two kinds: Structured and Unstructured.
Structured Streams
Structured streams are streams optimized for storage and access. They use a special format that Cosmos understands, and the format itself contains the schema. Because Cosmos knows the schema of a structured stream, there is special syntax to read and write them (EXTRACT and OUTPUT aren't used).
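A sketch of what that special syntax can look like, assuming a structured stream at a hypothetical path (verify the exact syntax for your Scope version):
rs0 = SSTREAM @"/my/Sample_data/input.ss";
OUTPUT rs0
TO SSTREAM @"/my/output.ss";
Note that no schema and no extractor are specified: the stream itself carries the schema.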
Unstructured Streams
These are any stream that isn’t structured. The key characteristic of these streams is that Cosmos does NOT inherently know about their schema or format. Thus, the schema must be defined by the user.
Views
A VIEW simplifies reading from streams. A typical reason to use a VIEW is that it can take on the responsibility of reading multiple rowsets and doing all the joining, filtering, and so on for you.
Extractor and Outputter
As mentioned already, unstructured streams require an extractor for reading and an outputter for writing. For TSV/CSV data, Scope comes with DefaultTextExtractor and DefaultTextOutputter. Users can also create their own custom extractors and outputters.
Data types
Native data types
Scope supports most .NET data types. The full list is shown below.
- Numeric: byte, sbyte, short, ushort, int, uint, long, ulong, float, double, decimal
- Miscellaneous: bool, Guid, DateTime, byte[]
- Text: char, string
- Complex: MAP<K,V>, ARRAY<V>
- Nullable: bool?, Guid?, DateTime?, byte?, sbyte?, int?, and so on
User-Defined data types
Scope also allows you to create your own user-defined types that you can use for columns in rowsets.
Scope expressions
Expressions in SELECT
Specific columns can be picked for the output rowset. In the example below, two columns are retrieved and a third is calculated. Note that when an expression is used to calculate a value, you must assign that column a name via the AS keyword.
rs1 =
SELECT Start,
Region,
Duration + 1.0 AS Duration2
FROM searchlog;
CS blocks and user-defined functions
A calculation can be implemented in C# code and then used in an expression. The code can live in a separate DLL or in a #CS block, as shown below:
rs1 =
SELECT Start,
Region,
MyHelper.SecondsToMinutes(Duration) AS DurationInMinutes
FROM searchlog;
OUTPUT rs1
TO "/my/output.txt";
#CS
public static class MyHelper
{
public static double SecondsToMinutes(int seconds)
{
double minutes = seconds/60.0;
return minutes;
}
}
#ENDCS
Filtering on calculated Columns in SELECT: WHERE versus HAVING
As in SQL, keep in mind that WHERE operates on input rows while HAVING operates on output rows. A column calculated in the SELECT clause does not exist on the input rows, so it must be filtered with HAVING.
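A sketch, reusing the searchlog rowset from the earlier examples (the threshold is illustrative). Duration2 is calculated in the SELECT clause, so the filter on it goes in HAVING rather than WHERE:
rs1 =
SELECT Start,
Region,
Duration + 1.0 AS Duration2
FROM searchlog
HAVING Duration2 > 10.0;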
LINQ and lambdas in expression
In the example below, the Where() LINQ extension method is used to filter for a particular set of URLs. Where(), like most LINQ methods, returns an IEnumerable<T> value. Because Scope does not support interfaces as column types, the ToList() LINQ extension method is used to convert it to a collection type that Scope does support.
rs1 =
SELECT Urls.Split(';').Where(u=> u.StartsWith("http:")).ToList() AS HttpUrls
FROM searchlog;
rs2 =
SELECT string.Join( ";" , HttpUrls) AS HttpUrls
FROM rs1;
NOTE: DefaultTextOutputter does not automatically serialize Collection types such as List. You’ll need to convert this back to a type that it supports such as a string.
Order of evaluation for expressions
There’s a pattern C# developers are used to, as shown below:
if ((QueryString != null) && QueryString.StartsWith("bing"))
{
// do something
}
This pattern depends on a C# behavior (common to many languages) called "short-circuiting". Simply put, in the above example there is no logical reason to check the second condition if the first one returns false. Short-circuiting is useful because evaluating each condition may be expensive, so skipping unnecessary checks improves the performance of the code.
When trying to do the same thing in Scope there are two paths you can pick. Both are valid expressions, but one will cause problems that may not be obvious at first.
The right choice: use && to keep the desired short-circuiting behavior
rs1 =
SELECT *
FROM data
WHERE (Name != null) && Name.StartsWith("bing");
The wrong choice: use AND, which does NOT match the short-circuiting behavior.
rs1 =
SELECT *
FROM data
WHERE (Name != null) AND Name.StartsWith("bing");
The second version, which uses AND, will sometimes fail with a NullReferenceException. (It might work on your local box but fail in the cluster.)
The reason is simple and by design: with AND/OR, Scope will try to perform certain optimizations that result in better performance. For example, it may evaluate the second part of the expression first because it assumes there is no relationship between the two conditions.
This is a standard optimization technique, and the same thing is done in many systems such as SQL. The performance gain is well worth the occasional confusion it causes for new Scope users, so this behavior will not change.
Summary: if you need this short-circuiting behavior use && and ||.
As an alternative you can use the SQL-like ALL/ANY operators which are equivalent to &&/||.
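For example, the earlier filter can be written with ALL instead of &&. A sketch:
rs1 =
SELECT *
FROM data
WHERE ALL(Name != null, Name.StartsWith("bing"));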
You CANNOT circumvent the order of evaluation by using multiple statements
You may then be tempted to write the script by splitting the expression across two statements, as shown below.
rs1 =
SELECT *
FROM data
WHERE Name!=null;
rs2 =
SELECT *
FROM rs1
WHERE Name.StartsWith("bing");
The assumption here is:
First I’ll get the non-null objects and then I can avoid the null reference issue.
This won't work either. Scope is a declarative language, not an imperative one. Just because rs1 is defined earlier than rs2 in the script above does NOT mean that the WHERE condition in rs1 is evaluated before the WHERE in rs2. Scope reserves the right to combine multiple statements and optimize across them. You MUST use the && operator if you want short-circuiting.
Logical Operators in Expressions
Operation | C# | SQL |
---|---|---|
Logical AND with short-circuiting | (a && b && c) | ALL(a, b, c) |
Logical OR with short-circuiting | (a \|\| b \|\| c) | ANY(a, b, c) |
Logical NOT | !a | NOT(a) |
Sorting
You can sort rowsets with the ORDER BY operator. The ASC and DESC keywords control whether the sort is ascending or descending, respectively.
TOP N
rs1 =
SELECT TOP 5
Region,
Duration
FROM searchlog
ORDER BY Duration DESC;
Grouping and aggregation
Grouping, in essence, collapses multiple rows into single rows based on some criteria. Hand in hand with grouping, the remaining fields in the output rowset must be aggregated into some meaningful value (or discarded if no meaningful aggregation is possible).
rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM searchlog
GROUP BY Region;
OUTPUT rs1
TO "/my/Outputs/output.txt";
System-Defined aggregates
Scope contains several common aggregation functions:
- ANY_VALUE
- AVG
- COUNT
- COUNTIF
- FIRST
- LAST
- LIST
- MAX
- MIN
- STDEV
- SUM
Getting a value with ANY_VALUE and FIRST
ANY_VALUE gets a value for that column with no guarantee about where inside the rowset the value came from. It could be the first value, the last value, or any value in between. It is useful when you don't care which value you receive as long as you get one.
rs1 =
SELECT
ANY_VALUE(Start) AS FirstStart,
Region
FROM searchlog
GROUP BY Region;
FIRST is badly named: it does NOT guarantee that you receive the first value in a rowset. Instead it behaves exactly like ANY_VALUE. Avoid FIRST; use ANY_VALUE instead.
COUNTIF
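COUNTIF counts only the rows that satisfy a predicate. A sketch, reusing the searchlog rowset from earlier examples (the threshold is illustrative):
rs1 =
SELECT
Region,
COUNTIF(Duration > 60) AS LongSessions
FROM searchlog
GROUP BY Region;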
ARGMAX
ARGMAX(a, b): find the row with the maximum value for column a; from that row, return the value of column b.
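A sketch, again reusing searchlog: for each region, return the Query of the session with the longest Duration (the column choices are illustrative).
rs1 =
SELECT
Region,
ARGMAX(Duration, Query) AS LongestSessionQuery
FROM searchlog
GROUP BY Region;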
Regular Expressions
Regular expressions provide advanced text-matching capabilities in a terse specification. Regular expressions are supported in the WHERE and HAVING clauses.
Finding simple patterns
// Find all the sessions where the query contained the word pizza (but not pizzeria, for example)
rs1 =
SELECT
Start,
Region,
Duration
FROM searchlog
WHERE REGEX(@"\bpizza.*\b").IsMatch(Query);
OUTPUT rs1
TO "/my/output.txt"
Extracting a REGEX match
Sometimes you'll need to pull out a substring and promote it to a column.
For example, suppose there is a column called Name whose values look like "--------Cosmos01------", "foooCosmos11bar", etc., and we want to pull out the "Cosmos" parts. REGEX and Scope make this easy to do.
rs1 =
SELECT
Name,
REGEX(@"Cosmos[0-9]*").Match(Name).Value AS CosmosCluster
FROM data;
Breaking rows apart with CROSS APPLY
Let’s examine the search log again.
rs1 =
SELECT
Region,
Urls
FROM searchlog;
The query returns something like this:
Region | Urls |
---|---|
en-us | A;B;C |
en-gb | D;E;F |
The Urls column contains strings, but each string is a semicolon-separated list of URLs. What if we want to break apart the Urls field so that each row contains a single URL? Below is what we want to see:
Region | Urls |
---|---|
en-us | A |
en-us | B |
en-us | C |
en-gb | D |
en-gb | E |
en-gb | F |
This is a perfect job for the CROSS APPLY operator.
rs1 =
SELECT
Region,
Urls
FROM searchlog;
rs2 =
SELECT
Region,
SplitUrls AS Url
FROM rs1
CROSS APPLY Urls.Split(';') AS SplitUrls;
The transformation above can also be performed programmatically with processors, but CROSS APPLY is preferred over custom processors.