第六章字符串及正则表达式

字符串的常用方法

字符串是Python中的不可变数据类型，在Python中一切皆对象，字符串对象本身就有一些常用的方法。

字符串的常用操作：

方法名	描述说明
str.lower()	将str字符串全部转成小写字母，结果为一个新的字符串
str.upper()	将str字符串全部转成大写字母，结果为一个新的字符串
str.split(sep=None)	把str按照指定的分隔符sep进行分隔，结果为列表类型
str.count(sub)	结果为sub这个字符串在str中出现的次数
str.find(sub)	查询sub这个字符串在str中是否存在，如果不存在结果为-1，如果存在，结果为sub首次出现的索引
str.index(sub)	功能与find()相同，区别在于要查询的子串sub不存在时，程序报错
str.startswith(s)	查询字符串str是否以子串s开头
str.endswith(s)	查询字符串str是否以子串s结尾

练习：

# 大小写转换。lower()、upper()都会返回新的字符串
s1 = 'HelloWorld'
new_s2 = s1.lower()
print(s1, new_s2)

new_s3 = s1.upper()
print(s1, new_s3)

# 字符串的分隔。split()返回列表类型
e_mail = 'lxl@163.com'
lst = e_mail.split('@')
print('邮箱名：', lst[0], '邮件服务器名：', lst[1])

# 统计子串出现的次数
print(s1.count('o'))

# 检索操作
print(s1.find('o'))  # o在字符串s1中首次出现的位置
print(s1.find('p'))  # -1,表示没有找到

print(s1.index('o'))
# print(s1.index('p'))  # ValueError: substring not found

# 判断前缀和后缀
print(s1.startswith('H'))  # True
print(s1.startswith('P'))  # False

print('demo.py'.endswith('.py'))  # True
print('text.txt'.endswith('.txt'))  # True

在这里插入图片描述

字符串是Python中很重要的数据类型，其方法还有很多：

方法名	描述说明
str.replace(old,news)	使用news替换字符串s中所有的old字符串，结果是一个新的字符串
str.cemter(width,fillchar)	字符串str在指定的宽度范围内居中，可以使用fillchar进行填充
str.join(iter)	在iter中的每个元素的后面都增加一个新的字符串str
str.strip(chars)	从字符串中去掉左侧和右侧chars中列出的字符串
str.lstrip(chars)	从字符串中去掉左侧chars中列出的字符串
str.rstrip(chars)	从字符串中去掉右侧chars中列出的字符串

练习：

s = 'HelloWorld'

# 字符串的替换
new_s = s.replace('o', '你好')  # 第三个参数是替换次数，如果没写则默认全部替换
print(new_s)

# 字符串在指定的宽度范围内居中
print(s.center(20))
print(s.center(20, '*'))

# 去掉字符串左右的空格。返回新的字符串对象
s = '     Hello     World     '
print(s.strip())  # 去除字符串左侧和右侧的空格
print(s.lstrip())  # 去除字符串左侧的空格
print(s.rstrip())  # 去除字符串右侧的空格

# 去掉指定的字符。与顺序无关
s3 = 'dl-HelloWorld'
print(s3.strip('ld'))  # 在去除字符时与顺序无关。写ld、dl都一样，只要包含ld中的字符，都会去掉
print(s3.lstrip('ld'))
print(s3.rstrip('ld'))

在这里插入图片描述

格式化字符串

前面提到使用 + 号可以进行字符串的连接，但仅限于字符串与字符串之间的连接，不能将字符串与其他数据类型进行连接。而引入了格式化字符串后就可以将字符串与其他数据类型进行连接。

格式化字符串的三种方式：

1、占位符 (类似C语言，这里只列出三种，实际上有很多)

%s：字符串格式
%d：十进制整数格式
%f：浮点数格式

2、f-string

Python 3.6 引入的格式化字符串的方式，以 { } 标明被替换的字符

3、str.format()方法

模版字符串.format(逗号分隔的参数)

练习：

# 1、使用占位符进行格式化
name = '马冬梅'
age = 18
score = 98.5
print('姓名：%s,年龄：%d,成绩：%f' % (name, age, score))  # 元组
print('姓名：%s,年龄：%d,成绩：%.1f' % (name, age, score))  # %.1f表示保留一位小数

# f-string
print(f'姓名：{name},年龄：{age},成绩：{score}')

# 字符串的format方法
# 0、1、2对应的是format当中参数的索引位置
print('姓名：{0},年龄：{1},成绩：{2}'.format(name, age, score))
print('姓名：{2},年龄：{0},成绩：{1}'.format(age, score, name))

在这里插入图片描述

format详细格式控制

使用字符串 format( ) 方法有更加精细的控制输出格式。

格式化字符串的详细格式：

名	描述说明
:	引导符号
填充	用于填充单个字符
对齐方式	< 左对齐 > 右对齐 ^ 居中对齐
宽度	字符串的输出宽度
，	数字的千位分隔符 (3位一个逗号)
.精度	浮点数小数部分的精度或字符串的最大输出长度
类型	整数类型:b\d\o\x\X 浮点数类型:e\E\f%

其中"类型"中 e\E 表示科学计数法；f 表示精度；%表示百分数。

练习：

s = 'helloworld'
print('{0:*<20}'.format(s))  # 字符串的显示宽度为20，左对齐，空白部分使用*填充
print('{0:*>20}'.format(s))  # 右对齐
print('{0:*^20}'.format(s))  # 居中对齐

# 居中对齐还有center()方法
print(s.center(20, '*'))

# 千位分隔符(只适用于整数和浮点数)
print('{0:,}'.format(987654321))
print('{0:,}'.format(987654321.4321))

# 浮点数小数部分的精度
print('{0:.2f}'.format(3.1415926))
# 字符串类型，表示是最大的显示长度
print('{0:.5}'.format('helloworld'))

# 整数类型
a = 425
print('二进制：{0:b},十进制：{0:d},八进制：{0:o},十六进制：{0:x},十六进制{0:X}'.format(a))

# 浮点数类型
b = 3.1415926
print('{0:.2f},{0:.2e},{0:.2E},{0:.2%}'.format(b))

在这里插入图片描述

字符串的编码和解码

在这里插入图片描述

字符串的编码就是将字符串转换成bytes；字符串的解码就是将bytes转换成字符串。

字符串的编码：
将str类型转换成bytes类型，需要使用到字符串的encode()方法。
语法格式：

str.encode( encoding = ‘utf-8’, errors = ‘strict/ignore/replace’ )

字符串的解码：
将bytes类型转换成str类型，需要使用到bytes类型的decode()方法。
语法格式：

bytes.decode(encoding = ‘utf-8’, errors = ‘strict/ignore/replace’ )

在进行编码的过程中出错的解决方案：

ignore：忽略。
strict：严格的。遇到转不了的字符程序直接报错。
replace：替换。遇到转不了的字符程序会使用 ? 替换无法转换的字符。

需要说明的是编码格式与解码格式必须是一样的，使用utf-8编码，则必须使用utf-8解码。

s = '伟大的中国梦'
# 编码 str-->bytes
scode = s.encode(errors='replace')  # 默认的编码格式是 utf-8。在utf-8中文占3个字节
print(scode)

scode_gbk = s.encode('gbk', errors='replace')  # gbk中文占2个字节
print(scode_gbk)

# 解码 bytes-->str
print(scode_gbk.decode('gbk', errors='replace'))
print(bytes.decode(scode_gbk, 'gbk'))
print(bytes.decode(scode, 'utf-8'))

在这里插入图片描述

数据验证的方法

数据的验证是指程序对用户输入的数据进行合法性验证。

数据验证的方法：

方法名	描述说明
str.isdigit()	所有字符都是数字(阿拉伯数字)
str.isnumeric()	所有字符都是数字
str.isalpha()	所有字符都是字母(包含中文字符)
str.isalnum()	所有字符都是数字或字母(包含中文字符)
str.islower()	所有字符都是小写
str.isupper()	所有字符都是大写
str.istitle()	所有字符都是首字母大写
str.isspace()	所有字符都是空白字符(\n、\t等)

注意：
isdigit()只能识别十进制阿拉伯数字；isnumeric()能识别阿拉伯数字、罗马数字、中文数字如一二三四。

# isdigit()只能识别十进制的阿拉伯数字
print('123'.isdigit())  # True
print('0b1010'.isdigit())  # 二进制 False
print('一二三四'.isdigit())  # 中文数字 False
print('IIIIII'.isdigit())  # 罗马数字 False
print('-' * 50)

# 所有字符都是数字。isnumeric()能识别阿拉伯数字、罗马数字、中文数字
print('123'.isnumeric())  # True
print('一二三四'.isnumeric())  # True
print('Ⅰ'.isnumeric())  # True
print('0b1010'.isdigit())  # False
print('壹贰叁'.isnumeric())  # True
print('-' * 50)

# 所有字符都是字母(包含中文字符)
print('hello你好'.isalpha())  # True
print('hello你好123'.isalpha())  # False
print('hello你好一二三'.isalpha())  # True
print('hello你好ⅠⅡⅢ'.isalpha())  # False
print('hello你好壹贰叁'.isalpha())  # True
print('-' * 50)

# 所有字符都是数字或字母(包含中文字符)
print('hello你好'.isalnum())  # True
print('hello你好123'.isalnum())  # True
print('hello你好一二三'.isalnum())  # True
print('hello你好ⅠⅡⅢ'.isalnum())  # True
print('hello你好壹贰叁'.isalnum())  # True
print('-' * 50)

# 判断字符的大小写
print('HelloWorld'.islower())  # False
print('helloworld'.islower())  # True
print('hello你好'.islower())  # True
print('-' * 50)

print('HelloWorld'.isupper())  # False
print('HELLOWORLD'.isupper())  # True
print('HELLO你好'.isupper())  # True
print('-' * 50)

# 所有字符都是首字母大写。当且仅当首字母是大写时返回True，如存在非首字母大写则会返回False
print('HelloWorld'.istitle())  # False 。存在非首字母大写，故返回False
print('Helloworld'.istitle())  # True
print('Hello World'.istitle())  # True 。这里是两个单词
print('Hello world'.istitle())  # False 。这里是两个单词，第二个单词的首字母没有大写
print('-' * 50)

# 判断是否都是空白字符
print('\t'.isspace())  # True
print(' '.isspace())  # True
print('\n'.isspace())  # True

字符串的拼接操作

前面提到可以使用 + 号拼接字符串。还有以下三种方式拼接字符串：
1、使用str.join()方法进行拼接字符串
2、直接拼接
3、使用格式化字符串进行拼接

s1 = 'hello'
s2 = 'world'
# 使用+进行拼接
print(s1 + s2)

# 使用字符串join()方法
# 把s1、s2放在一个列表中，join()会对列表当中的元素进行一个拼接
print(''.join([s1, s2]))  # 使用空字符串进行拼接

print('*'.join(['hello', 'world', 'python', 'java', 'php']))
print('你好'.join(['hello', 'world', 'python', 'java', 'php']))

# 直接拼接
print('hello''world')

# 使用格式化字符串进行拼接
print('%s%s' % (s1, s2))
print(f'{s1}{s2}')
print('{0}{1}'.format(s1, s2))

在这里插入图片描述

字符串的去重操作

# 字符串去重操作
s = 'helloworldhelloworlddlrowolleh'
# 使用字符串拼接及not in
new_s = ''
for item in s:
    if item not in new_s:
        new_s += item  # 拼接操作
print(new_s)

# 使用索引+ not in
new_s2 = ''
for i in range(len(s)):
    if s[i] not in new_s2:  # 索引
        new_s2 += s[i]  # 拼接操作
print(new_s2)

# 通过集合去重+列表排序
# 集合的性质：唯一性、无序性。因此转换成集合后需要将其按照原来的顺序进行排序
new_s3 = set(s)
lst = list(new_s3)
lst.sort(key=s.index)  # 作为参数不能有()，调用时才有()
print(''.join(lst))  # 使用join()方法进行字符串拼接

在这里插入图片描述

正则表达式

正则表达式是一个特殊的字符序列，它能够帮助用户非常便捷的检查一个字符串是否符合某种模式。(比如用户名是否由数字和字母组成，密码是否是6位等都可以用正则表达式验证)

元字符：
具有特殊意义的专用字符。例如 ^ 和 $ 分别表示匹配的开始和结束。

以下表格列举了一些其他的元字符，这里只列举出了一部分，还有其他的元字符未列举出来。

元字符	描述说明	举例	结果
.	匹配任意字符(除\n)	‘p\nytho\tn’	p、y、t、h、o、\t、n
\w	匹配字母、数字、下划线	‘python\n123’	p、y、t、h、o、n、1、2、3
\W	匹配非字母、非数字、非下划线	‘python\n123’	\n
\s	匹配任意空白字符	‘python\t123’	\t
\S	匹配任意非空白字符	‘python\n123’	p、y、t、h、o、n、1、2、3
\d	匹配任意十进制数	‘python\n123’	1、2、3

限定符：
用于限定匹配的次数。

限定符	描述说明	举例	结果
?	匹配前面的字符0次或1次	colou?r	可以匹配color或colour
+	匹配前面的字符1次或多次	colou+r	可以匹配colour或colouu…r
*	匹配前面的字符0次或多次	colou*r	可以匹配color或colouu…r
{n}	匹配前面的字符n次	colou{2}r	可以匹配colouur
{n,}	匹配前面的字符最少n次	colou{2,}r	可以匹配colouur或colouuu…r
{n,m}	匹配前面的字符最少n次，最多m次	colou{2,4}r	可以匹配colouur或colouuur或colouuuur

其他字符：

其他字符	描述说明	举例	结果
区间字符 [ ]	匹配 [ ] 中所指定的字符	[.?!] [0-9]	匹配标点符号点、问号、感叹号匹配0、1、2、3、4、5、6、7、8、9
排除字符 ^	匹配不在 [ ] 中指定的字符	[^0-9]	匹配除0、1、2、3、4、5、6、7、8、9之外的字符
选择字符 \|	用于匹配 \| 左右的任意字符	\d{15} \| \d{18}	匹配15位身份证或18位身份证
转义字符	同Python中的转义字符	.	将 . 作为普通字符使用(原本 . 是元字符，匹配任意字符。除\n)
[\u4e00-\u9fa5]	匹配任意一个汉字
分组( )	改变限定符的作用	six\|fourth (six\|four)th	匹配six或fourth 匹配sixth或fourth

re模块

re模块：
Python中的内置模块，用于实现Python中的正则表达式操作。

函数	功能描述
re.match(pattern,string,flags=0)	用于从字符串的开始位置进行匹配，如果起始位置匹配成功，结果为Match对象，否则结果为None
re.search(pattern,string,flags=0)	用于在整个字符串中国搜索第一个匹配的值，如果匹配成功，结果为Match对象，否则结果为None
re.findall(pattern,string,flags=0)	用于在整个字符串搜索所有符合正则表达式的值，结果是一个列表类型
re.sub(pattern,repl,string,count,flags=0)	用于实现对字符串中指定子串的替换，返回字符串类型
re.split(pattern,string,maxsplit,flags=0)	字符串中的split()方法功能相同，都是分隔字符串，返回列表类型

match函数

import re  # 导入re模块

# match()用于从字符串的开始位置进行匹配，如果起始位置匹配成功，结果为Match对象，否则结果为None
pattern = '\d\.\d+'  # + 限定符，\d 0-9 数字出现1次或多次
s = 'I study Python 3.11 every day'  # 待匹配字符串
match = re.match(pattern, s, re.I)  # re.I表示忽略大小写，Ignore
print(match)  # None

s2 = '3.11 Python I study every day'
match2 = re.match(pattern, s2)
print(match2)  # <re.Match object; span=(0, 4), match='3.11'> ,指[0,4)

print('匹配值的起始位置：', match2.start())
print('匹配值得结束位置：', match2.end())
print('匹配区间的位置元素：', match2.span())
print('待匹配的字符串：', match2.string)
print('匹配的数据：', match2.group())

在这里插入图片描述

search函数和findall函数

search()函数的使用：

import re  # 导入re模块

# search()用于在整个字符串中国搜索第一个匹配的值，如果匹配成功，结果为Match对象，否则结果为None
pattern = '\d\.\d+'  # + 限定符，\d 0-9 数字出现1次或多次
s = 'I study Python3.11 every day Python2.7 I love you'  # 待匹配字符串
match = re.search(pattern, s)
print(match)

s2 = '3.10Python I love you'
match2 = re.search(pattern, s2)
print(match2)

s3 = 'I study Python every day'
match3 = re.search(pattern, s3)
print(match3)  # None

print(match.group())
print(match2.group())

在这里插入图片描述

findall()函数的使用：

import re  # 导入re模块

# findall()用于在整个字符串搜索所有符合正则表达式的值，结果是一个列表类型
pattern = '\d\.\d+'  # + 限定符，\d 0-9 数字出现1次或多次
s = 'I study Python3.11 every day Python2.7 I love you'  # 待匹配字符串
s2 = 'I study Python3.10 every day'
s3 = 'I study Python'
lst = re.findall(pattern, s)
lst2 = re.findall(pattern, s2)
lst3 = re.findall(pattern, s3)
print(lst)
print(lst2)
print(lst3)

在这里插入图片描述

sub函数和split函数

import re  # 导入re模块

# sub()用于实现对字符串中指定子串的替换,返回字符串类型
pattern = '黑客|破解|反爬'  # 选择字符 | ，用于匹配 | 左右的任意字符
s = '我想学习Python，想破解一些VIP视频，Python可以实现无底线反爬吗？'
new_s = re.sub(pattern, '***', s)
print(new_s)

# split()字符串中的split()方法功能相同，都是分隔字符串，返回列表类型
s2 = 'https://www.bilibili.com/?p=74&spm_id'
pattern2 = '[?|&]'  # 区间字符[],匹配[]中所指定的字符；选择字符 | ，用于匹配 | 左右的任意字符
lst = re.split(pattern2, s2)
print(lst)

在这里插入图片描述

章节习题

在这里插入图片描述

本题关键在于搞清楚模式串的含义：

pattern = r’13[4-9]\d{8}’ 表示以13开头，第三位是[4-9]，再加上由0-9的8位整数

import re

pattern = r'13[4-9]\d{8}'  # 以13开头，第三位是[4-9]，再加上由0-9的8位整数
lst = ['13809876543', '15109876543', '13278965439', '13198765432']
for item in lst:
    match = re.search(pattern, item)
    if match != None:
        print(match.group())
    else:
        print(match)

在这里插入图片描述

以下代码的运行结果：

import re

pattern = r'\s*@'
s = '@中国 @美国 @俄罗斯'
lst = re.split(pattern, s)
print(lst)

在这里插入图片描述
限定符*会匹配0个或多个(包括1个)。

练习

练习一

判断车牌归属地。

需求：
使用列表存储N个车牌号码，通过遍历列表及字符串的切片操作判断车牌的归属地。

lst = ['京A888', '津A666', '冀A999']
for item in lst:
    area = item[0:1]  # 字符串切片
    print(item, '归属地为：', area)

在这里插入图片描述

练习二

统计字符串中出现指定字符的次数。

需求：
声明一个字符串，内容为‘HelloPython，HelloJava，hellophp’，用户从键盘录入要查询的字符（不区分大小写），要求统计出要查找的字符在字符串中出现的次数。

分析：不区分大小写，只需要将其都转成小写或都转成大写即可。

s = 'HelloPython,HelloJava,hellophp'
word = input('请输入要统计的字符：')
print('{0}在{1}一共出现了{2}次'.format(word, s, s.upper().count(word)))

在这里插入图片描述

练习三

格式化输出商品的名称和单价。

需求：
使用列表存储一些商品数据，使用循环遍历输出商品信息，要求对商品的编号进行格式化为6位，单价保留2位小数，并在前面添加人民币符号输出。

lst = [
    ['01', '电风扇', '美的', 500],
    ['02', '洗衣机', 'TCL', 1000],
    ['03', '微波炉', '老板', 400]
]
print('编号\t\t名称\t\t\t品牌\t\t单价')
for item in lst:
    for i in item:
        print(i, end='\t\t')
    print()  # 换行

# 格式化
for item in lst:
    item[0] = '0000' + item[0]
    item[3] = '¥{0:.2f}'.format(item[3])

# 格式化后输出
print('编号\t\t\t名称\t\t\t品牌\t\t单价')
for item in lst:
    for i in item:
        print(i, end='\t\t')
    print()  # 换行

在这里插入图片描述

练习四

提取文本中所有图片的链接地址。

需求：
从给定的文本中使用正则表达式提取出所有的图片链接地址。

import re

s = ''
pattern = ''
lst = re.findall(pattern, s)
for item in lst:
    print(item)

第六章 字符串及正则表达式