Python Regular Expressions (RegEx): The Ultimate Weapon for Text Processing (Chapter 15)

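A regular expression (RegEx) is a small language for describing string patterns. It can look like gibberish (for example, ^[\w.-]+@([\w-]+\.)+[\w-]{2,4}$), but once you have a handle on it, you can carry out complex text searches and replacements in very little code.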

Python supports RegEx through the built-in re module.

1. Basic Syntax Cheat Sheet

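Symbol	Meaning	Example
`.`	Any character (except newline)	`a.c` matches "abc", "a@c"
`^`	Start of string	`^Hello`
`$`	End of string	`World$`
`*`	Zero or more repetitions	`ab*` matches "a", "ab", "abb"
`+`	One or more repetitions	`ab+` matches "ab", "abb" (but not "a")
`?`	Zero or one repetition	`https?` matches "http", "https"
`\d`	Digit	`\d{3}` matches "123"
`\w`	Letter, digit, or underscore	`\w+` matches a variable name such as "user_id"
`[]`	Character set	`[a-z]` matches one lowercase letter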
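
The rows in the table can be checked quickly in an interpreter. The following is a minimal sketch, not part of the original chapter: the helper name `matches` and the sample strings are assumptions chosen for illustration, and `re.fullmatch` is used so that each pattern has to cover the whole sample.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the pattern covers the whole text."""
    return re.fullmatch(pattern, text) is not None

# `.` matches any single character except a newline
dot_results = [matches(r"a.c", s) for s in ["abc", "a@c", "a\nc"]]

# `*` allows zero repetitions, `+` requires at least one
star_results = [matches(r"ab*", s) for s in ["a", "ab", "abb"]]
plus_results = [matches(r"ab+", s) for s in ["a", "ab", "abb"]]

# `?` makes the preceding character optional
optional_results = [matches(r"https?", s) for s in ["http", "https", "httpss"]]

# `\d{3}` means exactly three digits
digit_results = [matches(r"\d{3}", s) for s in ["123", "12", "1234"]]

print(dot_results)       # [True, True, False]
print(star_results)      # [True, True, True]
print(plus_results)      # [False, True, True]
print(optional_results)  # [True, True, False]
print(digit_results)     # [True, False, False]
```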

2. The Big Four of Python's `re` Module

2.1 `re.match()`: Matching from the Start

Checks only whether the beginning of the string matches the pattern.

import re

pattern = r"Hello"
text = "Hello World"

result = re.match(pattern, text)
if result:
    print("Match succeeded:", result.group())  # Hello
else:
    print("Match failed")

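Note: without an r"" (raw string), every backslash \ in the pattern has to be escaped (written as \\), which is tedious and error-prone. Always add the r prefix when writing RegEx.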
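
To make the difference concrete, here is a small sketch comparing the two spellings. The variable names and the sample path are assumptions made up for this illustration.

```python
import re

# The same digit pattern written with and without the r prefix
digits_escaped = "\\d+"
digits_raw = r"\d+"
same_digits = digits_escaped == digits_raw

# Matching a literal backslash is where escaping gets painful:
# the regex engine needs \\ , which takes four backslashes in a normal string
text = r"C:\new\table"
backslash_escaped = "\\\\new"
backslash_raw = r"\\new"
same_backslash = backslash_escaped == backslash_raw

found = re.search(backslash_raw, text)

print(same_digits)     # True
print(same_backslash)  # True
print(found.group())   # \new
```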

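2.2 `re.search()`: Searching the Whole String

Scans the entire string and returns a match object for the first place the pattern matches (or None if there is no match).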

import re

pattern = r"\d+"  # one or more digits
text = "Order ID: 12345, Price: $99"

match = re.search(pattern, text)
if match:
    # Only the first number, "12345", is returned
    print(f"Found number: {match.group()}")
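
The difference between `re.match()` and `re.search()` is easiest to see when both are run on the same input. This is an illustrative sketch that reuses the sample string from above; the variable names are assumptions.

```python
import re

text = "Order ID: 12345, Price: $99"

# match() only looks at the start of the string, which is "O", not a digit
at_start = re.match(r"\d+", text)

# search() scans forward until it finds the first run of digits
anywhere = re.search(r"\d+", text)

print(at_start)          # None
print(anywhere.group())  # 12345
print(anywhere.span())   # (10, 15)
```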

2.3 `re.findall()`: Finding Every Match

Returns a list containing every non-overlapping substring that matches the pattern.

import re

pattern = r"\d+"
text = "Order 123, Qty 5, Price 99"

matches = re.findall(pattern, text)
print(matches)  # ['123', '5', '99']
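
One detail worth knowing: when the pattern contains capturing groups, `re.findall()` returns the groups rather than the whole match, and `re.finditer()` yields match objects when positions are needed as well. The sketch below shows both on the same sample string; the variable names are assumptions.

```python
import re

text = "Order 123, Qty 5, Price 99"

# No groups: a list of whole matches
whole = re.findall(r"\d+", text)

# Two capturing groups: a list of (label, number) tuples
pairs = re.findall(r"(\w+) (\d+)", text)

# finditer yields Match objects, which also expose positions
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]

print(whole)      # ['123', '5', '99']
print(pairs)      # [('Order', '123'), ('Qty', '5'), ('Price', '99')]
print(positions)  # [('123', 6), ('5', 15), ('99', 24)]
```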

2.4 `re.sub()`: Search and Replace

More powerful than the string method replace(), because it supports pattern matching.

import re

# Convert a date from YYYY-MM-DD to MM/DD/YYYY
text = "Today is 2023-12-25"
# Use groups () to capture the year, month, and day
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# \2, \3, \1 refer to the text captured by the 2nd, 3rd, and 1st parentheses
new_text = re.sub(pattern, r"\2/\3/\1", text)
print(new_text)  # Today is 12/25/2023
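
Numbered references get hard to read as patterns grow. As an optional variation, the same rewrite can be sketched with named groups, or with a function as the replacement. The group names and the helper `to_short_year` are assumptions introduced for this example.

```python
import re

text = "Today is 2023-12-25"
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# \g<name> refers to a named group inside the replacement string
by_name = re.sub(pattern, r"\g<month>/\g<day>/\g<year>", text)

def to_short_year(match):
    """Hypothetical helper: build the replacement text from the Match object."""
    year = match.group('year')
    return f"{match.group('month')}/{match.group('day')}/{year[2:]}"

# re.sub also accepts a function that is called once per match
by_function = re.sub(pattern, to_short_year, text)

print(by_name)      # Today is 12/25/2023
print(by_function)  # Today is 12/25/23
```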

3. Hands-On Practice: Validating an Email Address

This is a classic interview question.

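import re

def is_valid_email(email):
    # Simplified email RegEx
    # ^        : start of string
    # [\w\.-]+ : user name (letters, digits, underscores, dots, hyphens)
    # @        : the @ sign
    # [\w\.-]+ : domain name
    # \.       : a literal dot
    # \w+      : top-level domain (com, org, tw)
    # $        : end of string
    pattern = r"^[\w\.-]+@[\w\.-]+\.\w+$"

    # fullmatch rejects a trailing newline, which match() with $ would accept
    return re.fullmatch(pattern, email) is not None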

emails = ["test@example.com", "user.name@site.org", "invalid-email@", "no_at_sign.com"]

for e in emails:
    print(f"{e}: {is_valid_email(e)}")
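
One subtlety when validating with an anchored pattern: `$` also matches just before a trailing newline, so `re.match()` accepts input that ends in "\n", while `re.fullmatch()` requires the pattern to consume the entire string. The sketch below compares the two; the function names and the sample list are assumptions made for this illustration.

```python
import re

PATTERN = r"^[\w\.-]+@[\w\.-]+\.\w+$"

def is_valid_with_match(email):
    # `$` can match just before a final newline, so "...com\n" slips through
    return re.match(PATTERN, email) is not None

def is_valid_with_fullmatch(email):
    # fullmatch requires the match to end at the very end of the string
    return re.fullmatch(PATTERN, email) is not None

samples = ["test@example.com", "test@example.com\n", "invalid-email@", "no_at_sign.com"]
match_results = [is_valid_with_match(s) for s in samples]
fullmatch_results = [is_valid_with_fullmatch(s) for s in samples]

print(match_results)      # [True, True, False, False]
print(fullmatch_results)  # [True, False, False, False]
```

Either way, this simplified pattern is only a rough filter and does not cover every address that the email standards allow.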

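4. Greedy vs Non-Greedy

By default, * and + are greedy: they match as many characters as possible. Appending ? makes them non-greedy, so they match as few characters as possible.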

HTML example: <div>Hello</div><div>World</div>

r"<div>.*</div>" (greedy): matches the whole string, from the first <div> to the last </div>.
r"<div>.*?</div>" (non-greedy): matches <div>Hello</div> and <div>World</div> separately.
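
The two behaviors can be reproduced with `re.findall()`. This is a minimal sketch using the sample string above; the variable names are assumptions, and the third pattern adds a capturing group to pull out only the inner text.

```python
import re

html = "<div>Hello</div><div>World</div>"

greedy = re.findall(r"<div>.*</div>", html)    # .* runs to the last </div>
lazy = re.findall(r"<div>.*?</div>", html)     # .*? stops at the first </div>
inner = re.findall(r"<div>(.*?)</div>", html)  # capture only the contents

print(greedy)  # ['<div>Hello</div><div>World</div>']
print(lazy)    # ['<div>Hello</div>', '<div>World</div>']
print(inner)   # ['Hello', 'World']
```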

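5. Summary

RegEx is the Swiss Army knife of text processing. The learning curve is steep, but the skill pays off for the rest of your career.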

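Key points from this chapter:

Raw String: write RegEx patterns as r"pattern".
re.search vs re.match: match checks only the start of the string; search scans all of it.
re.findall: collects every match.
Groups: use () parentheses to extract substrings.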

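This brings Batch 3 (Intermediate Core Topics) to a close.

Batch 3 recap:

Ch 11: Advanced data structures (collections)
Ch 12: Iterators and generators (yield)
Ch 13: Decorators (@decorator)
Ch 14: Functional programming (lambda, map)
Ch 15: Regular expressions (re)

Next up is Batch 4 (OOP & Advanced Topics), where we dig into the deeper magic of object-oriented programming.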

Further reading:

Python official documentation: re module
Regex101 (an excellent online tool for testing RegEx)
