Skip to content

python中的正则表达式处理类——re

本文梳理了 python 中的正则表达式处理类——re,介绍了其基础惯用法。

re#

re

help#

>>> import re
>>> help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.

    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matches the string 'last'.

howto#

regex-howto

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

flags#

  • re.I(re.IGNORECASE): Perform case-insensitive matching;
  • re.M(re.MULTILINE): When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline).
  • re.S(re.DOTALL): Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).

re.Ire.M 分别对应 javascript RegExp 中的 /i/m 标志。

正则表达式可以包含一些可选标志修饰符来控制匹配的模式。
修饰符被指定为一个可选的标志。多个标志可以通过按位 OR(|) 来指定。
re.I | re.M 被设置成 I 和 M 标志。

APIs#

re.search(pattern, string, flags=0)

re.search 扫描整个字符串并返回 第一个 成功的匹配。

匹配成功将返回一个匹配的对象 match object,否则返回 None。
我们可以使用 group(num) 或 groups() 匹配对象函数来获取匹配表达式。

match#

re.match(pattern, string, flags=0)
re.fullmatch(pattern, string, flags=0)
  • re.match 尝试从字符串的起始位置匹配一个模式,如果不是起始位置匹配成功的话就返回 None。
  • re.fullmatch 匹配整个字符串。

re.match 只匹配字符串的开始,如果字符串开始不符合正则表达式,则匹配失败,函数返回None;
而 re.search 则匹配整个字符串,直到找到一个匹配。

split#

re.split(pattern, string, maxsplit=0, flags=0)

split 方法按照能够匹配的子串将字符串分割后返回列表。

findall#

re.findall(pattern, string, flags=0)

在字符串中找到正则表达式所匹配的所有子串,并返回一个列表,如果没有找到匹配的,则返回空列表。

注意: match 和 search 是匹配 一次,findall 则是匹配 所有

sub#

re.sub(pattern, repl, string, count=0, flags=0)
re.subn(pattern, repl, string, count=0, flags=0)

log_line_extractor#

传入日志第一行,正则匹配猜测出平台类型。

log_line_ios = '2019-09-29 14:46:49.301 Debug|1031|23627|:96|IMPDT_MBR_Engine||start: role = 1'
log_line_android = '19-09-22 15:09:09|1569136148648[4600]4794|W|VasQuickUpdateEngine_Native|[2019-09-22 15:09:08][1569136148648][debug   ][unnamed thread:4794][MBR_Engine:96]: start: role = 1'
log_line_windows = r'''"36098","2","2019/09/29 17:12:44:884","183323671","XP.MSGBAK.MBR_Engine","start: role = 1","10400","18892","95"'''

暂不考虑日期及时间的合法性,从行首匹配各个平台的日志格式。

test cases#

log_line_i_m = 'test 2019-09-29 14:46:49.301 mid'
log_line_i_e = 'test end 2019-09-29 14:46:49.301'
log_line_i = '2019-09-29 14:46:49.301 Debug|1031|23627|:96|IMPDT_MBR_Engine||start: role = 1'

log_line_a_m = 'test 19-09-22 15:09:09| mid'
log_line_a_e = 'test end 19-09-22 15:09:09|'
log_line_a = '19-09-22 15:09:09|1569136148648[4600]4794|W|VasQuickUpdateEngine_Native|[2019-09-22 15:09:08][1569136148648][debug   ][unnamed thread:4794][MBR_Engine:96]: start: role = 1'

log_line_w_m = r'''test ,"2019/09/29 17:12:44:884", mid'''
log_line_w_e = r'''test end ,"2019/09/29 17:12:44:884",'''
log_line_w = r'''"36098","2","2019/09/29 17:12:44:884","183323671","XP.MSGBAK.MBR_Engine","start: role = 1","10400","18892","95"'''

RegExp#

以下为 javascript 对应的 literal expression,可在 vscode 中利用 Regex Previewer 插件进行 Test Regex:

let irem = /(?:^)\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} /m
let iren = /(?:^|\n)\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} /
let arem = /(?:^)\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\|/m
let aren = /(?:^|\n)\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\|/
let wre = /,"\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}\:\d{3}",/ // 可匹配中部
let wrem = /(?:^)"\d+","\d","\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}\:\d{3}",/m

re#

从行首 match 匹配各个平台的日志格式。

由于 re 中的 match 本身就是从行首开始匹配,故无需指定 re.M 匹配行头 (?:^)

#moi = re.match(r'(?:^)\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ', log_line_i, re.M)
#roi = re.compile(r'(?:^|\n)\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ')
roi = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ')
moi = roi.match(log_line_i)     # not None, non-group, moi.group(0)
moi_m = roi.match(log_line_i_m) # None
moi_e = roi.match(log_line_i_e) # None

roa = re.compile(r'\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\|')
moa = roa.match(log_line_a)     # not None, non-group, moi.group(0)
moa_m = roa.match(log_line_a_m) # None
moa_e = roa.match(log_line_a_e) # None

row = re.compile(r'"\d+","\d","\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}\:\d{3}",')
mow = row.match(log_line_w)     # not None, non-group, moi.group(0)
mow_m = row.match(log_line_w_m) # None
mow_e = row.match(log_line_w_e) # None

提取为 try_to_guess_platform 函数:

import enum
import re


class PLATFORM(enum.Enum):
    ios = 0
    android = 1
    windows = 2


def try_to_guess_platform(log_line: str) -> int:
    roi = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ')
    roa = re.compile(r'\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\|')
    row = re.compile(r'"\d+","\d","\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}\:\d{3}",')
    if (roi.match(log_line)):
        return PLATFORM.ios.value
    if (roa.match(log_line)):
        return PLATFORM.android.value
    if (row.match(log_line)):
        return PLATFORM.windows.value
    return -1

以下为测试代码:

try_to_guess_platform(log_line_i)   # 0: ios
try_to_guess_platform(log_line_i_m) # -1
try_to_guess_platform(log_line_i_e) # -1

try_to_guess_platform(log_line_a)   # 1: android
try_to_guess_platform(log_line_a_m) # -1
try_to_guess_platform(log_line_a_e) # -1

try_to_guess_platform(log_line_w)   # 2: windows
try_to_guess_platform(log_line_w_m) # -1
try_to_guess_platform(log_line_w_e) # -1

refs#

用python正则表达式提取字符串
通用正则表达式与python中的正则匹配

Comments