Another Use of LINQ for Data Exchange

Truly impressive, I have to hand it to the author. I had no time to translate it, so please make do with the original.

"Data is king", wrote a famous author to start explaining LINQ. To explain LINQ in simple terms, as MSDN suggests: "LINQ is a general purpose query facility (or feature) to query data. Not just relational data or XML, but all sources of informational data." Over the past years we have heard of several LINQ flavors: LINQ to SQL, LINQ to XML, LINQ to CSV files, LINQ to text files, etc. Each of these targets a specific medium (or form) of data. The goal of LINQ is to provide a common, simple interface to query the data.

Today data comes in many different formats (database, XML, raw, text, binary, RSS feed) through a variety of channels (TCP, UDP, HTTP, FTP, etc.). 'LINQ to www' is LINQ for querying data from complying (REST-like) web sites. For example, suppose your favorite car listing web site displays cars across thousands of pages. Using the LINQ2www tool you can query that web site to get the desired information, such as cars priced above $15,000 but below $30,000. This means you do not need to browse manually through hundreds of consecutive web pages to extract the list of cars that match your interest ($15,000 to $30,000). All you need to do is write an appropriate LINQ query to extract this information. The same principle can be applied to pages of financial data, corporate accounts data, web telephone directories, and so on.

How would you like to read?

It is best to read the article in order. However, not everyone is in the same situation, so here is a brief guide to finding what you are looking for quickly:

1. If you are a patient reader who is curious to see how this is done: welcome, just skip this section and continue reading from the next one.

2. If you are a geek who can understand an article's content just by reading the title, and all you need to know today is how to use the LINQ2www assembly: jump to the 'Using the code' section. You may be interested in the 'Points of Interest' section too.

3. If you are a management person who would like to see a working example, and all you are looking for is a demonstration of this library: jump to the section 'The Web spider application', which describes a WPF-based web spider that models data as graphical 3D information. Download the demo and just press the 'Go' button.

4. If you do not belong to any of the three categories above, just leave me a message using the link at the bottom of this article. Don't forget to describe your interest category.

The Web spider application

Prerequisite: To run the demo, you need .NET Framework 3.5 installed on your machine. If it is not already installed, you may download it for free from here:

http://www.microsoft.com/downloads/details.aspx?FamilyId=333325FD-AE52-4E35-B531-508D977D32A6&displaylang=en

The web spider is a sample WPF application that uses the LINQ2www assembly. This application crawls through the 'Who's who @ codeproject' link, page by page. To see a demo of this application, download the demo project from the first line of this article, run the application, and press the 'Go' button. Give it 2 to 3 minutes to see the data flow from the codeproject server into the 3D bar chart on your computer. You can observe continuous updates in the status bar at the bottom of the WPF application.

 

What is REST and how does it relate to this library?

'REpresentational State Transfer' (REST) is a style of software architecture for systems such as the World Wide Web. All credit goes to the doctoral dissertation of Roy Fielding (Reference 1). LINQ2www is a tool capable of querying REST-based (complying) web sites. For example, consider the address of the first page of 'Who's who @ codeproject':

http://www.codeproject.com/script/Membership/Profiles.aspx?ml_ob=MemberId&mgtid=-1&mgm=False&pgnum=1

Now, to get the second page, all we need to do is replace page number 1 with page number 2, i.e., pgnum=1 becomes pgnum=2 for the next page, pgnum=3 gives page 3, and so on. This is a RESTful web interface.
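As a small illustration of the paging scheme described above, the sketch below builds the addresses of the first few pages by appending a page number to a template. The helper name BuildPageUrls is hypothetical and is not part of LINQ2www; the template is simply the codeproject address shown above with the trailing page number removed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PagingSketch
{
    // Hypothetical helper: appends page numbers 1..pageCount to a template ending in "pgnum=".
    public static List<string> BuildPageUrls(string template, int pageCount)
    {
        return Enumerable.Range(1, pageCount)
                         .Select(page => template + page)
                         .ToList();
    }

    public static void Main()
    {
        string template = "http://www.codeproject.com/script/Membership/Profiles.aspx?ml_ob=MemberId&mgtid=-1&mgm=False&pgnum=";
        foreach (string url in BuildPageUrls(template, 3))
        {
            Console.WriteLine(url);
        }
    }
}
```

Only the last query-string value changes from page to page, which is what makes the site easy to walk programmatically.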

Using the above web link we can browse members from page 1 up to page 1000. Now suppose your requirement is just to get the list of Gold members from these 1000 web pages. Manually browsing (traversing) through all 1000 pages is tiresome and error prone. Writing a small program to perform this operation automatically is a good idea. Generalizing such a program so that it can solve similar problems can be considered the next step. How about generalizing to this extent: all you have to do is write 'two lines of code' to get all Gold members from 1000 web pages? This is the power of LINQ. We are specializing LINQ to achieve our 'two lines of code' goal, and the result is called LINQ2www: LINQ 2 World Wide Web. One more important point: we do not check for, or need, a 100% REST-compliant web site. LINQ2www can work with REST-like web links. The basic requirement for LINQ2www to work is that all the pages are linked to each other through an href link. Even if the site does not comply with REST in other areas, LINQ2www will work.

 

How LINQ works - Basic understanding

Understanding how LINQ works is necessary to specialize it to our needs. Let us start from a very simple example. Suppose we have a pipe that carries some liquid. Green, Red and Blue liquids are mixed and flow through this pipe, and someone sends this mixed liquid through the pipe continuously. All you know from your side is that if you open the faucet (tap), the mixed liquid will flow. Suppose you need just the Green liquid. You prepare a filter that lets only the Green liquid through. When we fit this filter to the faucet (tap), only the Green liquid will flow out of the pipe. This filter is LINQ and the mixed liquid is raw data. LINQ helps you to extract information from a raw flow of data. Depending on the liquid type you need a different filter. Similarly, depending on the variety of data, you need a different LINQ. That is how we have LINQ to SQL, LINQ to XML, etc.

 

How LINQ works - A little LINQ code

Let us do a small code example. Suppose we have a list of names, and all we need is the names that start with 'S'. Let us try to write the one-line LINQ code that gets what we want.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TestLinq4
{
    class Program
    {
        static void Main(string[] args)
        {
            List<string> listOfNames = new List<string>()
            { 
                "Nina", "Kyle", "Steven", "Joe", "Neal", "Sanjay", "Elijah",
                "Steen", "Stech", "Donn", "Thomas", "Peter", "Steinberg"
            };

            var Result = from Name in listOfNames where Name.StartsWith("S") select Name;

            foreach (string NameReturned in Result)
            {
                System.Console.WriteLine(NameReturned);
            }
        }
    }
}

Here, all we do is just write a simple LINQ statement to get a query result. When we enumerate the Result using foreach, the query actually gets executed. The data gets filtered by our 'where' condition. Then, if it passes the 'where' condition, it is selected and returned to the NameReturned variable in the foreach.
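The point that the query only runs when it is enumerated can be checked with a small experiment. The sketch below is not from the original article: it adds a name to the list after the query has been written but before it is enumerated, and the added name still shows up in the result. The class and method names are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionSketch
{
    // Builds the query first, then changes the list, then enumerates.
    public static List<string> Run()
    {
        List<string> listOfNames = new List<string> { "Nina", "Steven", "Joe" };

        // Nothing is filtered yet; Result only describes the query.
        var Result = from Name in listOfNames where Name.StartsWith("S") select Name;

        // Added after the query was written, but before it is enumerated.
        listOfNames.Add("Sanjay");

        // Enumeration (here via ToList) is what actually runs the 'where' filter.
        return Result.ToList();
    }

    public static void Main()
    {
        foreach (string name in Run())
        {
            Console.WriteLine(name); // prints Steven, then Sanjay
        }
    }
}
```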

How LINQ works - Where is the 'where' method?

Let's take a closer look. Several questions may arise on seeing the above code for the first time. An important one may be: where is the 'where' method? And where is the 'select' method, and what is going on? 'Where' is a method for the IEnumerable<T> interface, defined in the static class System.Linq.Enumerable. It is a special kind of method called an 'extension method'. As the name suggests, you can extend a class (even a sealed one) by writing static methods. We will see how to write one soon. 'Where' is an important method as it is our filter.

First let us see what an extension method is and how to write one. Then we will override (specialize) the default 'where' method from the above sample to add an additional feature to it. Strictly speaking, extension methods cannot be overridden; we supply our own 'Where' extension method, which the compiler picks instead of the default one.

Consider a well known .NET class: String. As we know, String is a sealed class. This means we cannot specialize (derive from) the String class. But we can extend the String class by adding an extension method. This extension method will appear as if it were a public method of the String class. Let us write one now.

public static class StringExtension
{
   public static bool CompareCaseInsensitive(this string strSource, string strTarget)
   {
      if (string.Compare(strSource, strTarget, true) == 0)
      {
         return true;
      }

      return false;
   }
} 

Note that the extension method is defined inside a static class, StringExtension. Also note that the first parameter of the static extension method starts with the 'this' keyword, which indicates that it is extending the String class. The picture below shows the IntelliSense when we have the above extension method defined.
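For readers who want to try it, here is a minimal usage sketch. It repeats the extension method (in a slightly shortened form) so that it compiles on its own; the ExtensionUsageSketch class and the sample strings are made up for this illustration.

```csharp
using System;

public static class StringExtension
{
    // Same behavior as the method in the article, repeated so the sketch compiles on its own.
    public static bool CompareCaseInsensitive(this string strSource, string strTarget)
    {
        return string.Compare(strSource, strTarget, true) == 0;
    }
}

public static class ExtensionUsageSketch
{
    public static void Main()
    {
        // Called as if it were an instance method of string.
        Console.WriteLine("Steven".CompareCaseInsensitive("STEVEN"));

        // The compiler rewrites the call above into an ordinary static call like this one.
        Console.WriteLine(StringExtension.CompareCaseInsensitive("Steven", "Kyle"));
    }
}
```

The first call prints True and the second prints False. Both forms compile to the same static method call; the instance-style syntax is only a convenience.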

 

After learning about extension methods, it is not difficult to figure out that 'where' is an extension method for IEnumerable<T>. Now let us see how our LINQ statement is converted, which improves our understanding:

var Result = from Name in listOfNames where Name.StartsWith("S") select Name; 

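is converted by the compiler to the following method call. Because the query has a 'where' clause, the trivial 'select Name' adds nothing and is dropped, so no Select call is generated:

var Result = listOfNames.Where(delegate(string item) { return item.StartsWith("S"); });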

LINQ2www - Challenges:

Now that we have enough background, let us see the challenges in writing LINQ to www (World Wide Web).

1. The data is not readily available to filter. Consider browsing through thousands of web pages: the library first needs to fetch each web page and then filter its data using the filter condition (the 'where' condition in the LINQ statement). This may block the query execution for a long period of time, which is not acceptable.

2. Suppose everything in (1) goes fine and the user is observing a continuous flow of filtered information. Now the user decides she has enough information and would like to stop the updates. This can happen very often, not only because continuous updates from a web site are time consuming, but also because the information obtained so far may already be sufficient to make a decision.

3. We do not have a standard language to query this type of data. For a database, we have SQL-like languages. But for HTML data from a web site there is no standard query language. However, since the pages are REST-like, we can expect them to follow a certain regular format.

4. We should provide the interface in such a way that more than one query can execute simultaneously over one HTTP connection. This improves performance on the client side and conserves server resources.

Using the code

Let us see how to use the code. This will help us to understand how some of the challenges are met.

Let us take codeproject's 'Who's who' web link as our example. Suppose our goal is to fetch the different types of membership status available on codeproject. The code to perform this query is as follows.

 

Linq2www linq2wwwUrl = new Linq2www(
    "http://www.codeproject.com/script/Membership/Profiles.aspx?mgtid=1&%3bmgm=True&ml_ob=MemberId&mgm=False&pgnum=1",
    "http://www.codeproject.com/script/Membership/Profiles.aspx?mgtid=1&%3bmgm=True&ml_ob=MemberId&mgm=False&pgnum=");

int CancelId = from webItem in linq2wwwUrl where webItem.GetMatchDyn(@"class=""Member(?.*?)"">(/k)", this.CallThisMethod) select webItem;

The constructor in the first line takes two arguments. The first argument is the starting web address. The second is an optional parameter that tells the library: when you fetch the first page, search for a next-page link that looks like this parameter.

The next line is a slightly different LINQ statement compared to regular ones. As you might have guessed, we are using a regular expression to extract the desired information from the web site. Another difference is that we pass in a callback method to be called with the updates. This means that while traversing the web pages like a web spider/crawler, whenever the library finds anything that matches this regular expression, it calls the supplied callback method (CallThisMethod). This method will be called continuously until all the pages and their links have been visited. However, the user can cancel the query at any time using the integer return value, CancelId. The code for doing this is:

linq2wwwUrl.CancelUpdate(CancelId);

So now we know how challenges (1), (2) and (3) from the previous section are resolved. We use a callback method to update the query caller, so the filtered information is received asynchronously and the query does not block. The returned Id lets the user cancel the updates whenever necessary. And since we do not have any standard language to query HTML data, we use regular expressions, which are a powerful tool for querying any raw, text-like data.

LINQ2www - Override of 'where' method

Why is this LINQ a different flavor? LINQ2www is a different flavor because we do two unconventional things in it, which are explained in this section.

The 'where' method is the only method we override in LINQ2www. Its purpose here is a little unconventional: the 'where' method just sets the condition and the callback. The unconventional part is that it does not enumerate through the data. This is because, as you know, the complete data is not available at the time the 'where' method is invoked. However, it sets the necessary information in the class so that the callback will start receiving the filtered information. The implementation of the 'where' method is as follows:

public static class LinqExtnsnOvrride
{
   public static int Where(this IEnumerable<Linq2www> enumLinq2www, Func<Linq2www, int> predicate)
   {
      enumLinq2www.GetEnumerator().Reset();
      Linq2www Item = enumLinq2www.First();
      return predicate(Item);   // To set the Condition and Callback
   }
}

The next unconventional thing the 'where' method does is return an integer. This is the Id that needs to be passed in to stop getting updates in the callback method. The first and second lines reset the enumerator and fetch the first item in the collection. The next line calls the predicate (the delegate built from the 'where' clause), thereby setting the regular expression to filter with and the callback function.
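To see why such a query compiles at all, here is a self-contained sketch of the same trick with stand-in types. FakeSource and Register are hypothetical names invented for this illustration (they play the roles of Linq2www and GetMatchDyn); nothing here is the actual LINQ2www source. The query works because the compiler turns the 'where' clause into a call to whichever Where method fits, and drops the trivial 'select item', so the value of the whole query is whatever Where returns.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Linq2www: a source that enumerates only itself.
public class FakeSource : IEnumerable<FakeSource>
{
    public string Pattern;          // filter recorded by the query
    public Action<string> Callback; // callback recorded by the query

    // Stand-in for GetMatchDyn: records the filter and callback, returns an id.
    public int Register(string pattern, Action<string> callback)
    {
        Pattern = pattern;
        Callback = callback;
        return 42; // arbitrary id for this sketch
    }

    public IEnumerator<FakeSource> GetEnumerator()
    {
        yield return this;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

public static class FakeSourceExtensions
{
    // Picked by the compiler because the lambda in the 'where' clause returns int, not bool.
    public static int Where(this IEnumerable<FakeSource> source, Func<FakeSource, int> predicate)
    {
        FakeSource item = source.First();
        return predicate(item); // runs Register once; no filtering happens here
    }
}

public static class WhereSketch
{
    public static int RunQuery(FakeSource source)
    {
        // Compiles to source.Where(item => item.Register(...)).
        return from item in source
               where item.Register("Gold", text => Console.WriteLine(text))
               select item;
    }

    public static void Main()
    {
        FakeSource source = new FakeSource();
        int id = RunQuery(source);
        Console.WriteLine(id + " " + source.Pattern); // prints "42 Gold"
    }
}
```

The standard Enumerable.Where overloads expect a predicate that returns bool, so they do not apply to a lambda returning int, and the custom method is the only match.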

LINQ2www - Putting it all together

Let us put together all that we have walked through to explain the LINQ2www class. Let us follow a top-down approach.

Linq2www linq2wwwUrl = new Linq2www(WebLink, webLinkTemplate);

When the LINQ2www constructor is called, we create a background thread. The job of this thread is to get the content of webLink and store it in a Multimap (Reference 3). Multimap is a sophisticated dictionary-like collection class that can store key-value pairs. First we store the web link and its content in the Multimap. Next we parse (traverse through the content of) the page to find the connecting next page. If the optional second parameter is provided, we use it as a template to find the next page. Otherwise we use the web link itself to create a connecting template. Once we find the next page, we do the same thing again: store the link and content in the Multimap, then parse the content to find the next link. We do this until there are no more *new* links to browse.
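The crawl loop described above can be sketched roughly as follows. This is not the LINQ2www implementation: it uses a plain Dictionary in place of the Multimap, runs on the calling thread rather than a background thread, and takes a hypothetical 'fetch' delegate in place of the real HTTP download so that it can run against a tiny in-memory site (the example.test addresses are made up).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlSketch
{
    // Follows links that start with the template until no new link is found.
    // 'fetch' is a stand-in for the HTTP download; it returns null for unknown pages.
    public static Dictionary<string, string> Crawl(string startLink, string linkTemplate, Func<string, string> fetch)
    {
        var store = new Dictionary<string, string>(); // link -> page content
        var pending = new Queue<string>();
        pending.Enqueue(startLink);

        // Matches href="<template><digits>" inside the page content.
        var nextLink = new Regex("href=\"(" + Regex.Escape(linkTemplate) + "\\d+)\"");

        while (pending.Count > 0)
        {
            string link = pending.Dequeue();
            if (store.ContainsKey(link)) continue;   // already visited

            string content = fetch(link);
            if (content == null) continue;           // page could not be fetched
            store[link] = content;

            foreach (Match m in nextLink.Matches(content))
            {
                string found = m.Groups[1].Value;
                if (!store.ContainsKey(found)) pending.Enqueue(found);
            }
        }
        return store;
    }

    public static void Main()
    {
        // A tiny in-memory "web site" of three linked pages.
        var site = new Dictionary<string, string>
        {
            { "http://example.test/list?pgnum=1", "<a href=\"http://example.test/list?pgnum=2\">next</a>" },
            { "http://example.test/list?pgnum=2", "<a href=\"http://example.test/list?pgnum=1\">prev</a> <a href=\"http://example.test/list?pgnum=3\">next</a>" },
            { "http://example.test/list?pgnum=3", "<a href=\"http://example.test/list?pgnum=2\">prev</a>" }
        };

        var store = Crawl("http://example.test/list?pgnum=1", "http://example.test/list?pgnum=",
                          link => site.ContainsKey(link) ? site[link] : null);
        Console.WriteLine(store.Count); // prints 3
    }
}
```

Checking the store before fetching is what stops the loop once only already-visited links remain.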

Below is the second line of the sample explained above. This line actually fetches the useful, filtered information for us from the whole data. As you will notice, we pass two parameters to GetMatchDyn. The first parameter is the regular expression, i.e., the filter. The second parameter is the callback, which receives the filtered information continuously.

int CancelId = from webItem in linq2wwwUrl where webItem.GetMatchDyn(MyRegularExpression, this.CallThisMethod) select webItem;

Let us see how we do it. In the previous section we saw the 'where' method override. Our overridden 'where' is called when the above line is executed. As we saw there, the 'where' method calls the delegate, which in turn calls GetMatchDyn. GetMatchDyn creates a thread. This thread reads the data stored in the Multimap, moving (enumerating) item by item to read each web link's data. It filters the data using the regular expression passed to GetMatchDyn by the caller. Whenever the regular expression matches, it calls the callback method passed in by the user. Remember, this is the second parameter to the GetMatchDyn method.
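The filtering thread described above might look roughly like this. Again this is only a sketch under assumptions: a plain Dictionary stands in for the Multimap, the StartFilter name and the sample HTML are made up, and the real GetMatchDyn is not shown in the article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;

public static class FilterSketch
{
    // Starts a thread that scans every stored page and reports each regex match to 'callback'.
    public static Thread StartFilter(IDictionary<string, string> store, string pattern, Action<string> callback)
    {
        Regex filter = new Regex(pattern);
        Thread worker = new Thread(() =>
        {
            foreach (KeyValuePair<string, string> page in store)
            {
                foreach (Match m in filter.Matches(page.Value))
                {
                    callback(m.Groups[1].Value); // first capture group is the extracted value
                }
            }
        });
        worker.IsBackground = true;
        worker.Start();
        return worker;
    }

    public static void Main()
    {
        // Made-up page content for illustration.
        var store = new Dictionary<string, string>
        {
            { "page1", "<td class=\"MemberGold\">a</td><td class=\"MemberSilver\">b</td>" },
            { "page2", "<td class=\"MemberGold\">c</td>" }
        };

        Thread worker = StartFilter(store, "class=\"Member(\\w+)\"", status => Console.WriteLine(status));
        worker.Join(); // wait only so the output appears before the program exits
    }
}
```

Note that the callback runs on the worker thread, so a UI application such as the WPF demo would have to marshal updates back to its UI thread.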

 

The above picture explains what we described in this section.

Last but not least, we should provide a method to cancel the LINQ call. As we saw before, since this is a continuous update over HTTP, it can be really time consuming. The user should be able to cancel the update at any time. This can be done easily by using the (integer) return value from the LINQ call we made. The line below does this:

linq2wwwUrl.CancelUpdate(CancelId);

This simple method is defined as follows:

public bool CancelUpdate(int ThreadId)
{
   bool retVal = false;

   Regex regDet = threadDetails.GetFirstItem(ThreadId);
   if (regDet != null)
   {
      retVal = threadDetails.Remove(ThreadId);
   }

   lock (this) Monitor.PulseAll(this);

   return retVal;

}

All we do is remove the regular expression that we stored when the GetMatchDyn method was called previously. Then we wake up the thread created by the GetMatchDyn method. When this thread tries to read the regular expression that it is tied to, it will get a null value, because we removed it just before waking the thread. The thread will then close down gracefully, and hence the callback will stop receiving any more updates.
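The remove-then-wake pattern can be tried in isolation with the sketch below. It is a simplified model, not the library's code: the worker only idles instead of filtering pages, a plain Dictionary replaces the threadDetails collection, and a private lock object is used instead of lock (this). All names other than CancelUpdate are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class CancelSketch
{
    private readonly Dictionary<int, string> filters = new Dictionary<int, string>(); // id -> filter
    private readonly object gate = new object();

    // Registers a filter and starts a worker that idles until the filter is removed.
    public Thread Start(int id, string filter)
    {
        lock (gate) filters[id] = filter;
        Thread worker = new Thread(() =>
        {
            lock (gate)
            {
                // The loop ends as soon as the filter for this id is gone.
                while (filters.ContainsKey(id))
                {
                    Monitor.Wait(gate); // sleeps until CancelUpdate pulses
                }
            }
        });
        worker.IsBackground = true;
        worker.Start();
        return worker;
    }

    // Removes the filter, then wakes the worker so it notices the removal and exits.
    public bool CancelUpdate(int id)
    {
        lock (gate)
        {
            bool removed = filters.Remove(id);
            Monitor.PulseAll(gate);
            return removed;
        }
    }

    public static void Main()
    {
        CancelSketch sketch = new CancelSketch();
        Thread worker = sketch.Start(1, "Gold");
        Console.WriteLine(sketch.CancelUpdate(1)); // prints True
        worker.Join();
        Console.WriteLine(worker.IsAlive);         // prints False
    }
}
```

Because the worker re-checks the dictionary every time it wakes, it exits correctly whether the cancel call arrives before or after it starts waiting.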

Points of Interest

1. This project can easily be extended to search any type of web page, not just REST-like ones.

2. If someone creative can come up with an easy language (easier than regular expressions) to query HTML data, this project can be extended to support that language by implementing the IQueryable interface.

3. LINQ2www is a kind of web robot, so we need to comply with certain standards (Reference 5).

4. Unfortunately, we use regular expressions as the tool to filter data, and not everyone is familiar with them. If you are new to regular expressions, Reference 6 is a good introductory read. After reading it, you may consider reading Reference 7 or 8. I found Reference 7 to be a useful quick reference.

5. The WPF 3D bar chart used in the sample application source code is also available on codeproject (Reference 4).

6. The Multimap source code is available in Reference 3.
