html Extracting the Text

Parsing the HTML Document

Now that we have a way to populate the class, let’s take a look at making sense of what’s there. If you’re not familiar with regular expressions, you will need to look them up to fully understand what follows. Numerous books have been written on regular expressions, and even a quick look at the basics is beyond what space permits here. However, we can take a look at a very simple example that will at least whet your appetite for more. We will use several regular expressions to parse the HTML document, but the simplest is the following:
<[^>]*>
This tells the regular expression engine to find a pattern that begins with a less-than sign, ends with a greater-than sign and includes everything in between. This regular expression will match any tag. Now let’s look at matching an entire block, such as a <HEAD></HEAD> entry. We can match everything in the HEAD block with the following regular expression:
<HEAD[^>]*>.*</HEAD\s*>
A special note is warranted here. You need to be aware of the conventions of the language in question when assigning strings such as the one above. C# normally treats a backslash in a string literal as the start of an escape sequence, so a literal containing “\s” will not compile as written. Embedded quotes are also a potential issue with either VB.Net or C#. With C#, it is possible to use an extra backslash (“\\s”), but it is generally easier to prefix the string with an at-sign (“@”), which tells C# to ignore escape sequences and is similar to how a quoted string is encoded by VB.Net. The following table illustrates the point:
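As a quick check of the tag pattern and the HEAD pattern, here is a minimal sketch. The class and method names are hypothetical and chosen only for illustration. It assumes the IgnoreCase and Singleline options, so that case is ignored and the dot also matches line breaks.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TagPatternDemo
{
    // Counts everything that looks like a tag:
    // "<", any run of characters other than ">", then ">".
    public static int CountTags(string html)
    {
        return Regex.Matches(html, @"<[^>]*>").Count;
    }

    // Returns the whole HEAD block, or an empty string if there is none.
    public static string MatchHead(string html)
    {
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        return Regex.Match(html, @"<HEAD[^>]*>.*</HEAD\s*>", opts).Value;
    }
}
```

Note that the dot-star in the HEAD pattern is greedy, so with more than one closing HEAD tag in the input it would run to the last one.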

Desired Regular Expression:  \s"\n
C# (Method 1):               "\\s\"\\n"
C# (Method 2):               @"\s""\n"
VB.Net:                      "\s""\n"
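The two C# spellings can be compared directly. The sketch below uses a hypothetical class name and an arbitrary sample input; it shows that the escaped literal and the verbatim literal hold the same text and behave the same as a pattern.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class EscapingDemo
{
    // Regular literal: backslashes and quotes are escaped with a backslash.
    public const string Escaped = "\\s\"\\n";

    // Verbatim literal: backslashes are literal, a quote is written twice.
    public const string Verbatim = @"\s""\n";

    // True when both literals contain exactly the same characters.
    public static bool SameText()
    {
        return Escaped == Verbatim;
    }

    // True when the input contains whitespace, a double quote, then a newline.
    public static bool MatchesSample(string input)
    {
        return Regex.IsMatch(input, Verbatim);
    }
}
```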

Using the power of regular expressions, we can extract the raw text from our document using only a few lines of code. This type of regular expression for the HEAD tag will work for most, but not all, HTML tags. Some tags do not need an explicit closing tag – such as <BR>. Other tags, such as a comment tag, use a closing tag which is not in the standard form – a slash followed by the tag name. We will deal with the problem of non-closed blocks later, but for now we will write a function that will accept a tag name, returning the proper regular expression that will match that tag’s contents, assuming there is a corresponding closing tag:

private string GetExpressionForTagContents(string strTagName)
{
	string strPatternTag;
	if (strTagName == "!")
		// HTML comment: <!-- ... -->
		strPatternTag = "<!--.*?-->";
	else if (string.Compare(strTagName, "!doctype", true) == 0)
		strPatternTag = "<!doctype.*?>";
	else if (string.Compare(strTagName, "br", true) == 0)
		// Matches <br>, <br/> and <br />
		strPatternTag = @"<br\s*/?\s*>";
	else
		// Opening tag with optional attributes, lazy contents,
		// then a closing tag that repeats the captured name (\1).
		strPatternTag = @"<(" + strTagName
			+ @")(>|\s+[^>]*>).*?</\1\s*>";
	return strPatternTag;
}
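To see the general-case pattern at work outside the class, here is a minimal standalone sketch. The class and method names are hypothetical, and the call to Regex.Escape is an addition that the function above does not make; it only guards against tag names that contain regular expression metacharacters.

```csharp
using System.Text.RegularExpressions;

// Hypothetical standalone demo, not part of the article's class.
public static class TagContentsDemo
{
    // Same shape as the general-case pattern: group 1 captures the tag
    // name, and \1 in the closing tag must repeat it.
    public static string BuildPattern(string tagName)
    {
        // Regex.Escape is an assumption added for safety.
        return @"<(" + Regex.Escape(tagName) + @")(>|\s+[^>]*>).*?</\1\s*>";
    }

    // Returns the first complete block for the tag, or an empty string.
    public static string FirstBlock(string html, string tagName)
    {
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        return Regex.Match(html, BuildPattern(tagName), opts).Value;
    }
}
```

Because the contents are matched lazily, the match stops at the first closing tag rather than the last. The alternation after the name requires either a greater-than sign or whitespace, so asking for "t" does not accidentally match a TITLE tag.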
If no closing tag is present, this match will fail. In that case, we will take everything that lies between the tag in question and the next less-than sign, or the end of the file. Let’s put all of this together and write a routine that will accept any tag name and return the contents irrespective of the existence of a closing tag.
private string GetTagByName(string strTagName, string strSource)
{
	string strPatternTag =
		GetExpressionForTagContents(strTagName);
	// Fallback for tags with no closing tag: take everything up to
	// the next less-than sign or the end of the input.
	string strPatternTagNoClose
		= "<" + strTagName + @"(>|\s+[^>]*>)[^<]*";
	RegexOptions opts
		= RegexOptions.IgnoreCase | RegexOptions.Singleline;
	Match m = Regex.Match(strSource, strPatternTag, opts);
	if (!m.Success)
		m = Regex.Match(strSource, strPatternTagNoClose, opts);
	// Regex.Match never returns null; when neither pattern matches,
	// m.Value is the empty string.
	return m.Value;
}
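The fallback can be tried on its own with a small sketch. The names below are hypothetical, and Regex.Escape is an added safeguard rather than something the routine above does.

```csharp
using System.Text.RegularExpressions;

// Hypothetical standalone demo of the closed-then-unclosed lookup.
public static class UnclosedTagDemo
{
    public static string TagOrFragment(string html, string tagName)
    {
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        string name = Regex.Escape(tagName); // assumption: added for safety
        string closed = @"<(" + name + @")(>|\s+[^>]*>).*?</\1\s*>";
        string unclosed = @"<" + name + @"(>|\s+[^>]*>)[^<]*";

        // Try the properly closed form first.
        Match m = Regex.Match(html, closed, opts);
        if (!m.Success)
        {
            // No closing tag: take text up to the next "<" or end of input.
            m = Regex.Match(html, unclosed, opts);
        }
        return m.Value; // empty string when neither pattern matched
    }
}
```

For a list whose items are never closed, such as "<ul><li>first<li>second</ul>", asking for "li" yields "<li>first".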
We can now pull out any single tag, or the first of multiple tags, given its name. As you may have noticed, the above function was declared as private, so it is for in-class use only. Soon we will write a public function called GetTagsByName that will return an array of tag contents, suitable for returning the data for any tag whether it is found one time or many times. First, let’s add properties to return the HEAD, TITLE and BODY. For the HEAD and TITLE properties, we can just return the results of GetTagByName, passing the appropriate tag name. For the BODY, we need to be a bit more careful. It is possible that we could encounter a document with no HEAD or BODY tag at all. In such a case, GetTagByName will return an empty string, and the Body property will simply return the entire document:
public string Head
{
	get
	{
		return(GetTagByName(
			"Head", m_strSource));
	}
}

public string Title
{
	get
	{
		return(GetTagByName(
			"Title", m_strSource));
	}
}

public string Body
{
	get
	{
		string strBody
			= GetTagByName("Body",
				m_strSource);
		if (strBody == "")
			strBody = this.Source;
		return(strBody);
	}
}

Extracting the Text

We now have everything we need to pull out the raw text from the document. The text will be found in the BODY portion of the document, so we will start with this. Then any comments, plus anything found inside SCRIPT blocks, will be stripped. Next we will remove all remaining tags. Everything left over should be the textual content of the page.

Having extracted the text from the surrounding tags, we still have the possibility of encoded ISO character entities like the famous &nbsp;. We will write a routine (not listed) to search for each of these and replace the entity with its corresponding ASCII equivalent. Lastly, we use a while loop to remove any repeated blanks. The code listing for the Text property, as well as a helper function to strip the comments, appears below.

public string Text
{
	get
	{
		RegexOptions opts
			= RegexOptions.IgnoreCase |
				RegexOptions.Singleline;
		string strText = this.Body;
		// Remove comments, then SCRIPT blocks.
		strText =
			StripComments(strText);
		string strPattern =
			GetExpressionForTagContents(
			"SCRIPT");
		strText =
			Regex.Replace(strText,
			strPattern, "", opts);
		// Replace every remaining tag with a blank.
		strPattern = @"<[^>]*>";
		strText =
			Regex.Replace(strText,
			strPattern, " ", opts);
		strText = ISOtoASCII(strText);
		strText = Regex.Replace(
			strText, "&amp;", "&", opts);
		// Collapse repeated whitespace to a single blank.
		while (Regex.IsMatch(strText, @"\s\s"))
		{
			strText =
				Regex.Replace(strText,
					@"\s\s", " ");
		}
		return(strText.Trim());
	}
}

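string StripComments(string strText)
{
	// Singleline lets the dot match line breaks, so comments
	// that span several lines are removed as well.
	Regex r = new Regex(
		GetExpressionForTagContents(
			"!"), RegexOptions.Singleline);
	return( r.Replace(strText,
		""));
}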
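The ISOtoASCII routine is not listed, so the following is only a sketch of one way the final cleanup could be done, not the author's implementation. It assumes that the framework's System.Net.WebUtility.HtmlDecode is an acceptable substitute for a hand-written entity table; the class and method names are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the unlisted entity and whitespace cleanup.
public static class TextCleanupDemo
{
    public static string DecodeAndCollapse(string text)
    {
        // Turn entities such as &amp; and &nbsp; into characters.
        string decoded = WebUtility.HtmlDecode(text);

        // &nbsp; decodes to U+00A0; map it to an ordinary blank.
        decoded = decoded.Replace('\u00A0', ' ');

        // Collapse each run of whitespace to a single blank in one pass.
        string collapsed = Regex.Replace(decoded, @"\s+", " ");
        return collapsed.Trim();
    }
}
```

Matching a whole run with \s+ does in one pass what the loop over pairs of whitespace characters does in several.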