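Principle: use a regular expression to match the image paths in an article, then analyze each path. Some images belong to the site design (styling artwork) and some belong to the article itself, so they are treated differently, and only the images that meet the criteria are deleted.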
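The "analyze each path" step is not spelled out in this post. As a rough sketch only, it could be a filter like the one below. The `/upload/` folder and the names `ImageCleanupSketch` and `SelectArticleImages` are assumptions made for illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class ImageCleanupSketch
{
    // Hypothetical convention: images uploaded with an article live under this folder.
    private const string ArticleImageRoot = "/upload/";

    // Returns the paths judged to be article images, i.e. candidates for deletion.
    public static List<string> SelectArticleImages(IEnumerable<string> paths)
    {
        var result = new List<string>();
        foreach (string path in paths)
        {
            if (string.IsNullOrEmpty(path)) continue;
            // Reject ".." segments: they could point outside the folder.
            if (path.Contains("..")) continue;
            // Absolute URLs and design images (anything outside the folder) are kept.
            if (path.StartsWith(ArticleImageRoot, StringComparison.OrdinalIgnoreCase))
                result.Add(path);
        }
        return result;
    }

    public static void Main()
    {
        string[] found =
        {
            "/upload/photo1.jpg",
            "/images/logo.gif",
            "http://www.qhwins.com/upload/a.jpg",
            "/upload/../images/logo.gif"
        };
        foreach (string p in SelectArticleImages(found))
            Console.WriteLine(p); // only /upload/photo1.jpg passes the filter
    }
}
```

Whatever rule is used, resolving the surviving paths against the site root and checking that they stay inside the intended folder before deleting anything is a sensible extra guard.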
Create a new ASPX page:
<%@ Page Language="C#" AutoEventWireup="true" CodeFile="Default2.aspx.cs" Inherits="Default2" ValidateRequest="false" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title>Untitled Page</title>
</head>
<body>
    <form id="form1" runat="server">
    <div align="center">
        <asp:TextBox ID="TextBox1" runat="server" Height="283px" TextMode="MultiLine"
            Width="800px"></asp:TextBox>
        <br />
        <asp:Button ID="Button1" runat="server" onclick="Button1_Click" Text="Button" />
    </div>
    </form>
</body>
</html>
The code-behind file (Default2.aspx.cs):
using System;
using System.Text.RegularExpressions;

public partial class Default2 : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
    }

    /// <summary>
    /// Gets the image paths.
    /// </summary>
    /// <param name="str">The content to search</param>
    /// <returns>string result: the matched paths, each followed by "|"</returns>
    private string getPicUrl(string str)
    {
        string content = "", regstr = "", url = "";
        content = str + ""; // turns a null argument into an empty string
        //regstr = @"<img.*src=([\""\']?)(.\S+)\1.*>"; // matches the whole <img xxxx />
        // Matches src= followed by a whitespace-free run ending in .jpg, .bmp or .gif.
        regstr = @"src=([\""\']?)(.\S+)\1\.(?:jpg|bmp|gif)";
        content = Regex_Execute(regstr, content);
        // Strip the quotes and the src= prefix so that only the paths remain.
        content = content.Replace("'", "");
        content = content.Replace("\"", "");
        url = content.Replace("src=", "");
        return url;
    }

    /// <summary>
    /// Runs a regular expression and collects all matches.
    /// </summary>
    /// <param name="patrn">The regular expression</param>
    /// <param name="str">The content to search</param>
    /// <returns>string result: every match, each followed by "|"</returns>
    private string Regex_Execute(string patrn, string str)
    {
        string values = "";
        Regex rx = new Regex(patrn);
        MatchCollection mc = rx.Matches(str);
        foreach (Match match in mc)
        {
            values = values + match.Value + "|";
        }
        return values;
    }

    protected void Button1_Click(object sender, EventArgs e)
    {
        string htmlText = this.TextBox1.Text;
        string strPaths = getPicUrl(htmlText);
        // Encode before echoing: ValidateRequest is off and the input is raw HTML.
        Response.Write(Server.HtmlEncode(strPaths));
    }
}
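getPicUrl matches the whole `src=...` text and then strips the quotes and the `src=` prefix with string replacements. One possible alternative, shown here as a sketch and not as the method used in this post, is to capture only the path in a group. The class name `SrcCapture`, the method name `ExtractImagePaths` and the use of `RegexOptions.IgnoreCase` are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrcCapture
{
    // Optional opening quote, then a greedy whitespace-free run that ends in a
    // known extension. Group 1 holds the path without "src=" or the quote.
    private static readonly Regex SrcPattern = new Regex(
        @"src=[""']?(\S+\.(?:jpg|bmp|gif))", RegexOptions.IgnoreCase);

    public static List<string> ExtractImagePaths(string html)
    {
        var paths = new List<string>();
        foreach (Match m in SrcPattern.Matches(html ?? ""))
            paths.Add(m.Groups[1].Value);
        return paths;
    }

    public static void Main()
    {
        string html = "<img id=img src=\"/images/reallydo1.jpg\">\n"
                    + "<img id=jpg src=\"/images.gif/reallydo7.jpg\" class=go />\n"
                    + "<IMG src=\"http://qhwins.com/img/sign.asp\">";
        // Prints: /images/reallydo1.jpg|/images.gif/reallydo7.jpg
        Console.WriteLine(string.Join("|", ExtractImagePaths(html).ToArray()));
    }
}
```

Returning a list keeps the paths separate, so no "|" delimiter has to be split apart again later. Like the original pattern, this sketch skips images without one of the three extensions (for example sign.asp).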
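Once the page is running, paste the following mock HTML into the text box to test the matching.
It deliberately contains a range of common obstacles (single quotes, double quotes, scripts, styles, ...):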
<img id=img src="/images/reallydo1.jpg">
<img id=img src=/images/reallydo2.jpg/> This later turned out to be invalid HTML; the image does not display correctly.
<img id=img src=/images/reallydo3.jpg />
<img id=gif src=http://jorkin.reallydo.com/images/reallydo4.gif />
<img id=img /src="/imagesreallydo5.bmp" class=go>
<img id=img src="/images/reallydo6.jpg" class=go/>
<img id=jpg src="/images.gif/reallydo7.jpg" class=go />
<img id=img src="http://www.qhwins.com/images/reallydo8.jpg" class=go />
<IMG id=png src=/reallydo.jpg/reallydo9.jpg onclick='' class=go>
<img id=img src=/images/reallydo10.jpg onclick='>' class=go/>
<img id=bmp src=/images/reallydo11.jpg onclick='<' class='go' />
<img id=img src=http://www.qhwins.com/images/reallydo12.jpg onclick='<' class='go' />
<img onclick="" id=img src='/images/reallydo13.jpg' class=go>
<img id=img src='/images/reallydo14.jpg' onblur=">" class='go'/>
<img id=img onfocus="<" src='/images/reallydo15.jpg' class=go />
<img id=img onclick=">" src='http://www.qhwins.com/images/reallydo16.jpg' class=go />
<IMG id=img src='http://www.qhwins.com/images/reallydo17.jpg' onclick="<" class=go />
<img border=0 onclick="if(this.width>=690) window.open('http://qhwins.com/images/jorkin18.gif');" onload="if(this.width>'29')this.width='25';if(this.height>'28')this.height='88';" src='http://qhwins.com/images/reallydo19.gif'>
<img src='../reallydo21.gif' onclick="if(this.width>=14) window.open('../jorkin.jpg20.gif');" onload="if(this.width>'82')this.width='222';if(this.height>'1024')this.height='1024';" border=0>
<IMG src="http://qhwins.com/img/sign.asp"> This is a dynamically generated image with a non-standard image extension.
<IMG src="http://qhwins.com/img/sign.asp" style="solid 1px #820222;">
Regexes proven wrong:
<img.*src=([\""\']?)(.\1\S+).*>
<img.+?src=[\'|\"](.+?)[\'|\"].+?>
<img(.+?)src=('|\")?([^\s]+?)('|\"|\/|'\/|\"\/)?(\s|>)
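To illustrate why such a pattern gets rejected, the sketch below runs the second pattern in the list against three of the test lines. The class and method names are made up for this example, and the post does not say which inputs actually broke each pattern, so the two failures shown are simply ones that can be reproduced.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RejectedPatternDemo
{
    // Second rejected pattern from the list, written as a C# verbatim string.
    public const string QuotedOnly = @"<img.+?src=[\'|\""](.+?)[\'|\""].+?>";

    public static bool Matches(string tag)
    {
        return Regex.IsMatch(tag, QuotedOnly);
    }

    public static void Main()
    {
        // Unquoted src: the pattern demands a quote (or '|') right after "src=".
        Console.WriteLine(Matches("<img id=img src=/images/reallydo3.jpg />"));            // False
        // Quoted src directly followed by '>': the trailing ".+?" has nothing to consume.
        Console.WriteLine(Matches("<img id=img src=\"/images/reallydo1.jpg\">"));          // False
        // Quoted src followed by another attribute: this one is matched.
        Console.WriteLine(Matches("<img id=img src=\"/images/reallydo6.jpg\" class=go/>")); // True
    }
}
```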
The following regex has not shown any errors in testing so far:
<img.*src=([\""\']?)(.\S+)\1.*>
$2 is the SRC address of the IMG.
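In .NET, the `$2` group can be read from `Match.Groups[2]`. The sketch below applies the pattern to one line at a time, which is how it is meant to be used. The names `ImgSrcGroupDemo` and `GetSrc` are illustrative, and `RegexOptions.IgnoreCase` is an assumption added so that `<IMG` is matched as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcGroupDemo
{
    // \1 must repeat whatever quote (if any) opened the src value.
    private static readonly Regex ImgPattern = new Regex(
        @"<img.*src=([""']?)(.\S+)\1.*>", RegexOptions.IgnoreCase);

    // Returns group 2 (the SRC) for a single line of HTML, or null when nothing matches.
    public static string GetSrc(string line)
    {
        Match m = ImgPattern.Match(line);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        string[] lines =
        {
            "<img id=img src=\"/images/reallydo1.jpg\">",
            "<img id=img src=/images/reallydo3.jpg />",
            "<img id=img onclick=\">\" src='http://www.qhwins.com/images/reallydo16.jpg' class=go />"
        };
        // Prints the three src values, one per line, without quotes.
        foreach (string line in lines)
            Console.WriteLine(GetSrc(line));
    }
}
```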
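Known bugs:
BUG-001: If the SRC contains a space, the match also goes wrong. This normally does not happen, though, because spaces are encoded as %20.
BUG-002: Minimal (non-greedy) matching is not possible; the regex only works when a line contains exactly one <img> tag.
Changing it to <img.*src=([\""\']?)(.\S+)\1.*?> makes some of the obstacle cases fail.
If you find any other bugs, please report them in the comments below.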
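Both bugs can be reproduced with a few lines of C#. This is a sketch with made-up inputs and illustrative names (`KnownBugsDemo`, `FirstSrc`, `CountMatches`); it uses the same pattern as above with default regex options.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KnownBugsDemo
{
    private static readonly Regex ImgPattern = new Regex(@"<img.*src=([""']?)(.\S+)\1.*>");

    // Group 2 of the first match, or null when nothing matches.
    public static string FirstSrc(string line)
    {
        Match m = ImgPattern.Match(line);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static int CountMatches(string line)
    {
        return ImgPattern.Matches(line).Count;
    }

    public static void Main()
    {
        // BUG-001: \S+ stops at the space, so the quote group falls back to empty
        // and the opening quote ends up inside group 2.
        Console.WriteLine(FirstSrc("<img src=\"/my pic.jpg\">")); // prints "/my

        // BUG-002: the greedy ".*" runs to the last "src=" on the line, so two
        // tags on one line yield a single match that reports only the second src.
        string twoTags = "<img src=\"/a.jpg\"><img src=\"/b.jpg\">";
        Console.WriteLine(CountMatches(twoTags)); // prints 1
        Console.WriteLine(FirstSrc(twoTags));     // prints /b.jpg
    }
}
```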
Currently being tested:
PatternStr = "\s[on].+?=([\""|\'])(.+?)\1"
RepStr = ""
PatternStr=">"
RepStr=">"& vbNewLine
PatternStr = "<img.*src=([\""\']?)(.\1\S+).*?>"
RepStr = "<img src=$2 border=0>"
PatternStr=">"& vbNewLine
RepStr=">"
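The four replacements above appear to form a pipeline: strip event-handler attributes, put every tag on its own line, rewrite each <img> tag down to its src, then join the lines again. A C# sketch of that idea follows. It is not a port of the VBScript: the first pattern is tightened to `on\w+` and the third uses the `(.\S+)\1` form from the candidate regex, both of which are assumptions, as are the class and method names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgNormalizeSketch
{
    public static string NormalizeImgTags(string html)
    {
        // Step 1: drop quoted on* event handlers, which may contain '<' or '>'.
        string s = Regex.Replace(html ?? "", @"\son\w+\s*=\s*([""'])(.*?)\1", "",
                                 RegexOptions.IgnoreCase);
        // Step 2: break the line after every '>' so each tag ends its own line.
        s = s.Replace(">", ">\n");
        // Step 3: rewrite each <img ...> to a minimal tag that keeps only the src ($2).
        s = Regex.Replace(s, @"<img.*src=([""']?)(.\S+)\1.*>", "<img src=$2 border=0>",
                          RegexOptions.IgnoreCase);
        // Step 4: undo the line breaks added in step 2.
        return s.Replace(">\n", ">");
    }

    public static void Main()
    {
        string html = "<img id=img onclick=\">\" src='/images/reallydo15.jpg' class=go />"
                    + "<img src=\"/b.jpg\">";
        // Prints: <img src=/images/reallydo15.jpg border=0><img src=/b.jpg border=0>
        Console.WriteLine(NormalizeImgTags(html));
    }
}
```

Step 4 also removes any line break that already followed a '>' in the input, so the sketch does not preserve the original line layout. A '>' inside an attribute that is not an on* handler would still split a tag in step 2.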
Output:
/images/reallydo1.jpg|/images/reallydo2.jpg|/images/reallydo3.jpg|http://jorkin.reallydo.com/images/reallydo4.gif|/imagesreallydo5.bmp|/images/reallydo6.jpg|/images.gif/reallydo7.jpg|http://www.qhwins.com/images/reallydo8.jpg|/reallydo.jpg/reallydo9.jpg|/images/reallydo10.jpg|/images/reallydo11.jpg|http://www.qhwins.com/images/reallydo12.jpg|/images/reallydo13.jpg|/images/reallydo14.jpg|/images/reallydo15.jpg|http://www.qhwins.com/images/reallydo16.jpg|http://www.qhwins.com/images/reallydo17.jpg|http://qhwins.com/images/reallydo19.gif|../reallydo21.gif|