
Notes on Regular Expressions [Original]

November 12, 2010
Every special character (metacharacter) in a regular expression is interpreted and processed first.
 
1.1 Regular expression matching order
       Matching this or that
 
       Sometimes we would like our regexp to be able to match different possible words or
       character strings.  This is accomplished by using the alternation metacharacter "|".
       To match "dog" or "cat", we form the regexp "dog|cat".  As before, Perl will try to
       match the regexp at the earliest possible point in the string.  At each character
       position, Perl will first try to match the first alternative, "dog".  If "dog" doesn't
       match, Perl will then try the next alternative, "cat".  If "cat" doesn't match either,
       then the match fails and Perl moves to the next position in the string.  Some examples:
 
           "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
           "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
 
       Even though "dog" is the first alternative in the second regexp, "cat" is able to match
       earlier in the string.
 
           "cats"          =~ /c|ca|cat|cats/; # matches "c"
           "cats"          =~ /cats|cat|ca|c/; # matches "cats"
 
       Here, all the alternatives match at the first string position, so the first alternative
       is the one that matches.  If some of the alternatives are truncations of the others,
       put the longest ones first to give them a chance to match.
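
       The two rules above (earliest position first, then leftmost alternative) can be checked
       with a small script.  This is only a sketch: the helper name matched_text is made up for
       this illustration and is not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the substring matched by $re in $str,
# or undef if there is no match. @- and @+ hold the match offsets.
sub matched_text {
    my ($str, $re) = @_;
    return $str =~ $re ? substr($str, $-[0], $+[0] - $-[0]) : undef;
}

print matched_text("cats and dogs", qr/dog|cat|bird/), "\n";  # earliest position wins over order
print matched_text("cats", qr/c|ca|cat|cats/), "\n";          # same position: first alternative wins
print matched_text("cats", qr/cats|cat|ca|c/), "\n";          # longest alternative listed first
```

       Run as is, the three lines printed should be "cat", "c" and "cats".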
 
           "cab" =~ /a|b|c/ # matches "c"
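                            # /a|b|c/ == /[abc]/

       Trying one alternative, checking whether it matches, and moving on to the next one
       if it doesn't is called backtracking.  Here is a step-by-step trace of what Perl
       does with this match:

           "abcde" =~ /(abd|abc)(df|d|de)/;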
 
       0   Start with the first letter in the string 'a'.
 
       1   Try the first alternative in the first group 'abd'.
 
       2   Match 'a' followed by 'b'. So far so good.
 
       3   'd' in the regexp doesn't match 'c' in the string – a dead end.  So backtrack two
           characters and pick the second alternative in the first group 'abc'.
 
       4   Match 'a' followed by 'b' followed by 'c'.  We are on a roll and have satisfied the
           first group. Set $1 to 'abc'.
 
       5   Move on to the second group and pick the first alternative 'df'.
 
       6   Match the 'd'.
 
       7   'f' in the regexp doesn't match 'e' in the string, so a dead end.  Backtrack one
           character and pick the second alternative in the second group 'd'.
 
       8   'd' matches. The second grouping is satisfied, so set $2 to 'd'.
 
       9   We are at the end of the regexp, so we are done! We have matched 'abcd' out of the
           string "abcde".
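
       The end result of the trace above can be confirmed directly.  A minimal sketch; the
       variable names are chosen for this illustration only.

```perl
use strict;
use warnings;

my $str = "abcde";
my ($whole, $first, $second);

# Same pattern as in the walkthrough: two groups of alternatives.
if ($str =~ /(abd|abc)(df|d|de)/) {
    ($whole, $first, $second) = ($&, $1, $2);
    print "matched '$whole': \$1='$first', \$2='$second'\n";
}
```

       The script reports the overall match 'abcd' with $1 set to 'abc' and $2 set to 'd', as in
       steps 4, 8 and 9.  To watch the engine's own attempts, the core "re" pragma can be enabled
       with "use re 'debug';", although its output is far more verbose than the trace above.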
 
 
 
       "ST"
           Consider two possible matches, "AB" and "A'B'", where "A" and "A'" are substrings that can be
           matched by "S", and "B" and "B'" are substrings that can be matched by "T".
 
           If "A" is a better match for "S" than "A'", then "AB" is a better match than "A'B'".
 
           If "A" and "A'" coincide, "AB" is a better match than "AB'" if "B" is a better match for "T" than "B'".
 
       "S|T"
           When "S" can match, it is a better match than when only "T" can match.
 
           The ordering of two matches for "S" is the same as it is for "S" on its own.  The same holds
           for two matches for "T".
 
       "S{REPEAT_COUNT}"
           Matches as "SSS…S" (repeated as many times as necessary).
 
       "S{min,max}"
           Matches as "S{max}|S{max-1}|…|S{min+1}|S{min}".
 
       "S{min,max}?"
           Matches as "S{min}|S{min+1}|…|S{max-1}|S{max}".
 
       "S?", "S*", "S+"
           Same as "S{0,1}", "S{0,BIG_NUMBER}", "S{1,BIG_NUMBER}" respectively.
 
       "S??", "S*?", "S+?"
           Same as "S{0,1}?", "S{0,BIG_NUMBER}?", "S{1,BIG_NUMBER}?" respectively.
 
       "(?>S)"
           Matches the best match for "S" and only that.
 
       "(?=S)", "(?<=S)"
           Only the best match for "S" is considered.  (This is important only if "S" has capturing parentheses,
           and backreferences are used somewhere else in the whole regular expression.)
 
       "(?!S)", "(?<!S)"
           For this grouping operator there is no need to describe the ordering, since only whether or not "S"
           can match is important.
 
       "(??{ EXPR })"
           The ordering is the same as for the regular expression which is the result of EXPR.
 
       "(?(condition)yes-pattern|no-pattern)"
           Recall that which of "yes-pattern" or "no-pattern" actually matches is already determined.
           The ordering of the matches is the same as for the chosen subexpression.
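
       A few of the ordering rules above can be observed on short strings.  This is a sketch under
       the assumption that a tiny helper (best_match, a name invented here) returns whatever text
       the engine settles on first.

```perl
use strict;
use warnings;

# Hypothetical helper: the text matched by $re in $str, or undef when nothing matches.
sub best_match {
    my ($str, $re) = @_;
    return $str =~ $re ? substr($str, $-[0], $+[0] - $-[0]) : undef;
}

my $greedy = best_match("aaaa", qr/a{1,3}/);    # S{min,max}: prefers 3, then 2, then 1 repetitions
my $lazy   = best_match("aaaa", qr/a{1,3}?/);   # S{min,max}?: prefers 1, then 2, then 3 repetitions
my $alt    = best_match("ab",   qr/a|ab/);      # S|T: a match for S is preferred over one for T
my $plain  = best_match("aaa",  qr/a*a/);       # a* gives back one "a" so the final "a" can match
my $atomic = best_match("aaa",  qr/(?>a*)a/);   # (?>a*) keeps only its best match, nothing is given back

print "greedy=$greedy lazy=$lazy alt=$alt plain=$plain atomic=",
      defined $atomic ? $atomic : "no match", "\n";
```

       The atomic case is the one to look at: because "(?>a*)" only ever offers its best match,
       the trailing "a" has nothing left to match and the whole pattern fails.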
 
 
 
 
 
1.2 Regular expression range operators
 
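I'm having trouble matching over more than one line. What's wrong?

Either the string you are looking at doesn't contain a newline, or you are using the wrong modifier on your pattern.

There are many ways to get multi-line data into a single string. If you want this to happen automatically while reading input, set the $/ variable (to '' for paragraphs, or to undef to read the whole file into one string) so that you can read more than one line at a time.

See perlre for help choosing between /s and /m (or both): /s lets the wildcard "." match a newline (normally "." does not match a newline), and /m lets "^" and "$" match right after and right before any newline inside the string, not only at the start and end of the string. What you do need to make sure of is that you really have a multi-line string.

For example, the following program detects repeated words within a paragraph, even when they are separated by a line break (but not by a paragraph break). In this example we don't need /s, because we aren't using "." in a pattern that has to cross line boundaries. We don't need /m either, because we don't want "^" or "$" to match next to the newlines inside the string. The key point is that $/ must be set to something other than its default, otherwise we never actually read in a multi-line record.

    $/ = '';                                 # read a whole paragraph, not just one line
    while ( <> ) {
        while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
            print "Found repeated word $1 in paragraph $.\n";
        }
    }

The following program finds lines that begin with "From " (many mail-handling programs rely on this):

    $/ = '';                                 # read a whole paragraph, not just one line
    while ( <> ) {
        while ( /^From /gm ) {               # /m lets ^ match right after a \n as well
            print "Paragraph $. has a line starting with From\n";
        }
    }

The following program extracts everything that lies between START and END. Note the /g: without it the inner loop would match the same text forever.

    undef $/;                                # read the whole file, not just a line or a paragraph
    while ( <> ) {
        while ( /START(.*?)END/sgm ) {       # /s lets . cross line boundaries
            print "$1\n";
        }
    }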
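
The effect of /s and /m can also be checked without reading any file. The sketch below works on a string that already contains newlines; the sample text and variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $text = "first line\nFrom here\nSTART one\ntwo END tail START three END\n";

my $dot_plain = $text =~ /line.From/  ? 1 : 0;   # "." does not match "\n" by default
my $dot_s     = $text =~ /line.From/s ? 1 : 0;   # /s lets "." match "\n"
my $hat_plain = $text =~ /^From /     ? 1 : 0;   # "^" only matches at the start of the string
my $hat_m     = $text =~ /^From /m    ? 1 : 0;   # /m lets "^" match right after a newline

# In list context, /g returns every captured group; /s lets ".*?" span the newline.
my @between = $text =~ /START(.*?)END/sg;

print "dot: $dot_plain/$dot_s  caret: $hat_plain/$hat_m  blocks: ", scalar(@between), "\n";
```

Each pair shows the same pattern failing without the modifier and succeeding with it, and the last match collects both START...END blocks, the first of which spans a line break.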
 
 
1.3 Zero-width assertions
       "(?=pattern)"
                 A zero-width positive look-ahead assertion.  For example, "/\w+(?=\t)/" matches a word followed
                 by a tab, without including the tab in $&.
 
       "(?!pattern)"
                 A zero-width negative look-ahead assertion.  For example "/foo(?!bar)/" matches any occurrence
                 of "foo" that isn't followed by "bar".  Note however that look-ahead and look-behind are NOT the
                 same thing.  You cannot use this for look-behind.
 
                 If you are looking for a "bar" that isn't preceded by a "foo", "/(?!foo)bar/" will not do what
                 you want.  That's because the "(?!foo)" is just saying that the next thing cannot be "foo"–and
                 it's not, it's a "bar", so "foobar" will match.  You would have to do something like
                 "/(?!foo)...bar/" for that.   We say "like" because there's the case of your "bar" not having
                 three characters before it.  You could cover that this way: "/(?:(?!foo)...|^.{0,2})bar/".
                 Sometimes it's still easier just to say:
 
                     if (/bar/ && $` !~ /foo$/)
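
       The difference between looking ahead and looking behind can be seen on a few short
       strings.  This is a sketch: all_matches is a helper invented for this illustration, and
       the last line uses the negative look-behind "(?<!foo)", which the excerpt above does
       not discuss but which is listed in section 1.1.

```perl
use strict;
use warnings;

# Hypothetical helper: every match of $re in $str as [start_offset, text] pairs.
sub all_matches {
    my ($str, $re) = @_;
    my @found;
    while ($str =~ /$re/g) {
        push @found, [ $-[0], substr($str, $-[0], $+[0] - $-[0]) ];
    }
    return @found;
}

my @word   = all_matches("key\tvalue",    qr/\w+(?=\t)/);    # the tab is checked but not consumed
my @foo    = all_matches("foobar foobaz", qr/foo(?!bar)/);   # only the "foo" of "foobaz"
my @ahead  = all_matches("foobar",        qr/(?!foo)bar/);   # look-ahead cannot look back
my @behind = all_matches("foobar bar",    qr/(?<!foo)bar/);  # look-behind can

printf "%s at offset %d\n", $_->[1], $_->[0] for @word, @foo, @ahead, @behind;
```

       The third case is the trap described above: "(?!foo)bar" still finds the "bar" inside
       "foobar", while the look-behind version only accepts the standalone "bar".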
 
 
 
1.4 Conditionals in regular expressions
 
 
 
1.5 Code in regular expressions
 
 
 
 
1.6 Debugging regular expressions
 
 
 
1.7 Removing duplicates from an array
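    # Remove repeated characters from a string: as long as some character (\w) occurs
    # again later, delete the later copy.  Inside the pattern, \1 refers back to the
    # first group; in the replacement, the groups are written $1 and $2.
    $aa = "asdfjljodiangadlianef";
    1 while ( $aa =~ s/(\w)(.*)\1/$1$2/g );

    # Remove repeated elements from a list: keep an element only the first time it is seen.
    @unique = grep { ++$count{$_} < 2 } qw(a b a c d d e f g f h h);
    print "@unique\n";

The same two approaches as one-liners:

    perl -e '$aa="asdfjljodiangadlianef";1 while($aa=~ s/(\w)(.*)\1/$1$2/g);print $aa;'
    perl -e '$aa="asdfjljodiangadlianef";@aa= split //,$aa;@unique = grep { ++$count{$_} < 2 } @aa;print "@unique\n";'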
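
The grep approach can be wrapped in a small subroutine so that the counting hash does not linger between calls. This is only a sketch that runs under "use strict"; the name unique is chosen here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the arguments with later duplicates dropped,
# keeping the order in which each value was first seen.
sub unique {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @items = unique(qw(a b a c d d e f g f h h));
my $chars = join '', unique(split //, "asdfjljodiangadlianef");

print "@items\n";
print "$chars\n";
```

Because %seen is declared inside the subroutine, every call starts with an empty hash, whereas the global %count above keeps its counts from one use to the next.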
 
"\Q" and "\E"
 
 
 
1.8 Nested use of regular expressions
 
 
       "(?#text)"
                 A comment.  The text is ignored.  If the "/x" modifier enables whitespace formatting, a simple
                 "#" will suffice.  Note that Perl closes the comment as soon as it sees a ")", so there is no
                 way to put a literal ")" in the comment.
 
       "(?imsx-imsx)"
                 One or more embedded pattern-match modifiers, to be turned on (or turned off, if preceded by
                 "-") for the remainder of the pattern or the remainder of the enclosing pattern group (if any).
                 This is particularly useful for dynamic patterns, such as those read in from a configuration
                 file, taken from an argument, or specified in a table somewhere.  Consider the case where
                 some patterns want to be case sensitive and some do not.  The case insensitive ones merely
                 need to include "(?i)" at the front of the pattern.  For example:
 
                     $pattern = "foobar";
                     if ( /$pattern/i ) { }
 
                     # more flexible:
 
                     $pattern = "(?i)foobar";
                     if ( /$pattern/ ) { }
 
                 These modifiers are restored at the end of the enclosing group. For example,
 
                     ( (?i) blah ) \s+ \1
 
                 will match a repeated (including the case!) word "blah" in any case, assuming "x" modifier, and
                 no "i" modifier outside this group.
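                 A quick check that "(?i)" stays inside its group:

                     my $str = 'adDIlldD';
                     if ( $str =~ /((?i)addi)llD/ ) {
                         print "matched: $1\n";
                     } else {
                         # This branch runs: "(?i)addi" matches "adDI", but "llD" outside
                         # the group is still case sensitive and the string has "lld".
                         print "no match\n";
                     }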
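
                 The "dynamic patterns" case can be sketched as follows.  The pattern table and
                 the helper name matching_patterns are hypothetical; in practice the strings
                 might come from a configuration file.

```perl
use strict;
use warnings;

# Hypothetical pattern table. Each pattern decides its own case sensitivity.
my @patterns = ('(?i)foobar', 'FooBar', '((?i)blah)\s+\1');

# Return the patterns (as strings) that match $text. No /i is applied from outside.
sub matching_patterns {
    my ($text) = @_;
    return grep { $text =~ /$_/ } @patterns;
}

my @hits_lower = matching_patterns("some foobar here");   # only the (?i) pattern
my @hits_mixed = matching_patterns("some FooBar here");   # both foobar patterns
my @hits_blah  = matching_patterns("BLaH BLaH");          # repeated word, same case
my @hits_case  = matching_patterns("blah BLAH");          # \1 is outside the (?i) group

print scalar(@hits_lower), " ", scalar(@hits_mixed), " ",
      scalar(@hits_blah), " ", scalar(@hits_case), "\n";
```

                 The last call finds nothing because the backreference sits outside the group,
                 so the repeated word must have exactly the same case as the first one.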
 
 
 
 
    "(?:pattern)"
       "(?imsx-imsx:pattern)"
                 This is for clustering, not capturing; it groups subexpressions like "()", but doesn't make
                 backreferences as "()" does.  So
 
                     @fields = split(/\b(?:a|b|c)\b/)
 
                 is like
 
                     @fields = split(/\b(a|b|c)\b/)
 
                 but doesn't spit out extra fields.  It's also cheaper not to capture characters if you don't
                 need to.
 
                 Any letters between "?" and ":" act as flags modifiers as with "(?imsx-imsx)".  For example,
 
                     /(?s-i:more.*than).*million/i
 
                 is equivalent to the more verbose
 
                     /(?:(?s-i)more.*than).*million/i
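
                 Both points can be checked on small inputs.  A minimal sketch; the sample
                 strings are invented, and "\s*" is added around the split pattern only to
                 keep the resulting fields free of spaces.

```perl
use strict;
use warnings;

my $line = "1 a 2 b 3";

# Clustering only: the separators are discarded.
my @plain    = split /\s*\b(?:a|b|c)\b\s*/, $line;
# Capturing: each separator captured by () is returned as an extra field.
my @captured = split /\s*\b(a|b|c)\b\s*/, $line;

print scalar(@plain), " fields: @plain\n";
print scalar(@captured), " fields: @captured\n";

# Flags between "?" and ":" apply inside the group only.
my $re     = qr/(?s-i:more.*than).*million/i;
my $spans  = "more\nthan a MILLION" =~ $re ? 1 : 0;   # "." crosses the newline inside the group
my $upcase = "MORE than a million"  =~ $re ? 1 : 0;   # "more" must be lower case inside the group

print "spans=$spans upcase=$upcase\n";
```

                 The non-capturing split yields three fields and the capturing one five, since
                 "a" and "b" come back as fields of their own.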
1.9 Regular expressions and the use of namespaces
\p{…}  and [[:…:]]
http://bbs.chinaunix.net/thread-1827506-1-1.html
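       Suppose you want to find a sequence of non-digits that is not followed by "123", and you try
       the pattern "/^\D*(?!123)/" on the string "ABC123".
       But that isn't going to match; at least, not the way you're hoping.  It claims that there is no 123 in the
       string.  Here's a clearer picture of why that pattern matches, contrary to popular expectations: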
 
           $x = 'ABC123';
           $y = 'ABC445';
 
           print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
           print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
 
           print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
           print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
 
       This prints
 
           2: got ABC
           3: got AB
           4: got ABC
 
       You might have expected test 3 to fail because it seems to be a more general-purpose version of test 1.  The
       important difference between them is that test 3 contains a quantifier ("\D*") and so can use backtracking,
       whereas test 1 will not.  What's happening is that you've asked "Is it true that at the start of $x,
       following 0 or more non-digits, you have something that's not 123?"  If the pattern matcher had let "\D*"
       expand to "ABC", this would have caused the whole pattern to fail.
 
       The search engine will initially match "\D*" with "ABC".  Then it will try to match "(?!123)" with "123",
       which fails.  But because a quantifier ("\D*") has been used in the regular expression, the search engine
       can backtrack and retry the match differently in the hope of matching the complete regular expression.
 
       The pattern really, really wants to succeed, so it uses the standard pattern back-off-and-retry and lets
       "\D*" expand to just "AB" this time.  Now there's indeed something following "AB" that is not "123".  It's
       "C123", which suffices.
 
       We can deal with this by using both an assertion and a negation.  We'll say that the first part in $1 must
       be followed both by a digit and by something that's not "123".  Remember that the look-aheads are
       zero-width expressions: they only look, but don't consume any of the string in their match.  So rewriting
       this way produces what you'd expect; that is, case 5 will fail, but case 6 succeeds:
 
           print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
           print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
 
           6: got ABC
 
       In other words, the two zero-width assertions next to each other work as though they're ANDed together,
       just as you'd use any built-in assertions:  "/^$/" matches only if you're at the beginning of the line AND
       the end of the line simultaneously.  The deeper underlying truth is that juxtaposition in regular
       expressions always means AND, except when you write an explicit OR using the vertical bar.  "/ab/" means
       match "a" AND (then) match "b", although the attempted matches are made at different positions because "a"
       is not a zero-width assertion, but a one-width assertion.
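
       All six tests from this section can be run together.  This is a sketch; the helper name
       capture is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: what $1 captured when $re matches $str, or undef on failure.
sub capture {
    my ($str, $re) = @_;
    return undef unless $str =~ $re;
    my $captured = $1;
    return $captured;
}

my ($x, $y) = ('ABC123', 'ABC445');

my @results = (
    capture($x, qr/^(ABC)(?!123)/),         # 1: no quantifier, nothing to backtrack into
    capture($y, qr/^(ABC)(?!123)/),         # 2
    capture($x, qr/^(\D*)(?!123)/),         # 3: \D* backs off to "AB"
    capture($y, qr/^(\D*)(?!123)/),         # 4
    capture($x, qr/^(\D*)(?=\d)(?!123)/),   # 5: both assertions must hold at the same spot
    capture($y, qr/^(\D*)(?=\d)(?!123)/),   # 6
);

for my $i (0 .. $#results) {
    printf "%d: %s\n", $i + 1,
        defined $results[$i] ? "got $results[$i]" : "no match";
}
```

       Tests 1 and 5 report no match, test 3 reports "AB", and the remaining tests report "ABC",
       in line with the output shown above.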
 

Original article. When reposting, please credit: reposted from 肚腩照明月'blog
