Python Web Scraping: The urllib Library

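1. Introduction to the urllib library

The urllib library sends HTTP requests; all we have to do is specify the request URL, request headers, request body, and similar information.

The urllib library contains the following four modules:

request: the basic HTTP request module, used to simulate sending requests.
error: the exception handling module.
parse: a utility module for working with URLs, e.g. splitting, parsing, and joining them.
robotparser: mainly used to parse a site's robots.txt file (used less often).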

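2. Sending requests

2.1. urlopen

The urllib.request module provides the basic way to construct HTTP requests. It can simulate the process by which a browser issues a request, and it also handles authentication, redirection, browser cookies, and a few other features.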

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(response.read().decode('utf-8'))
print(type(response))	# type of the response: http.client.HTTPResponse
print(response.status)	# status code of the response, e.g. 200
print(response.getheaders())	# all response headers
print(response.getheader('Server'))		# value of the Server response header

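2.1.1. The data parameter

The data parameter is optional. When it is supplied, its value must first be converted to bytes, for example with bytes(). Once this parameter is passed, the request method is no longer GET but POST.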

import urllib.parse
import urllib.request

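# www.httpbin.org provides endpoints for testing HTTP requests
data = bytes(urllib.parse.urlencode({'name':'germey'}),encoding='utf-8')
response = urllib.request.urlopen('https://www.httpbin.org/post',data=data)
print(response.read().decode('utf-8'))

# The {..."form":{"name":"germey"...}} entry in the output shows that this simulates a form submission.
# Here we pass one parameter, name, with the value germey, and it has to be converted to bytes. The conversion uses bytes(), whose first argument must be a str, so the urlencode function from urllib.parse first turns the dict into a string; the second argument specifies the encoding.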
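
The effect of data on the request method can be checked without sending anything. The following is a minimal sketch: it only builds Request objects for the httpbin URLs used above and inspects them with get_method(); nothing is sent over the network.

```python
import urllib.parse
import urllib.request

# Encode the form fields as a URL-encoded string, then as UTF-8 bytes.
form = {'name': 'germey'}
body = urllib.parse.urlencode(form).encode('utf-8')

# Building a Request only describes the request; it does not send it.
with_data = urllib.request.Request('https://www.httpbin.org/post', data=body)
without_data = urllib.request.Request('https://www.httpbin.org/get')

print(body)                        # b'name=germey'
print(with_data.get_method())      # POST
print(without_data.get_method())   # GET
```

Calling .encode('utf-8') on the string is equivalent to the bytes(..., encoding='utf-8') call used above.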

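2.1.2. The timeout parameter

The timeout parameter sets a timeout in seconds: if no response has arrived within this time, an exception is raised. If it is not specified, the global default timeout is used. It works for HTTP, HTTPS, and FTP requests.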

import urllib.request

response = urllib.request.urlopen('https://www.httpbin.org/get',timeout=0.1)
print(response.read())

# urllib.error.URLError: <urlopen error timed out>  (the request timed out)
# Setting a timeout lets the crawler skip a page that takes too long to respond.
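
Skipping a slow page, as described above, might look like the sketch below. The helper name fetch_or_skip and its default timeout are assumptions made for illustration. A timeout while connecting arrives wrapped in URLError, while a timeout while waiting for the response can be raised directly as socket.timeout, so both are caught.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=1.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Failures while connecting, including timeouts, are wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            print('timed out, skipping:', url)
        else:
            print('request failed, skipping:', url, e.reason)
    except socket.timeout:
        # A timeout while waiting for or reading the response is raised directly.
        print('timed out, skipping:', url)
    return None
```

A crawler loop can then call fetch_or_skip(url) for each URL and simply continue when the result is None.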

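2.1.3. Other parameters

context: must be an ssl.SSLContext object; it specifies the SSL settings.
cafile and capath: specify a CA certificate file and a directory of CA certificates, respectively, for HTTPS requests. Both have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.
cadefault: deprecated and ignored; its default value is False.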
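
As a rough illustration of the context parameter, the sketch below creates a default SSLContext and passes it to urlopen. The helper name fetch_https is an assumption for this example; the function is only defined here, not called.

```python
import ssl
import urllib.request

# A default context verifies server certificates against the system CA store.
context = ssl.create_default_context()


def fetch_https(url, timeout=10):
    """Fetch url over HTTPS using the explicit SSL context above."""
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()


print(type(context))            # <class 'ssl.SSLContext'>
print(context.check_hostname)   # True
```

To trust a specific CA file instead of using cafile, ssl.create_default_context(cafile=...) accepts the path to that file.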

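2.2. Request

Constructing a Request object turns the request into a standalone object and, in addition, lets us configure its parameters in a richer and more flexible way.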

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

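Constructor:

class urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)

The first parameter, url, is the URL to request. It is the only required parameter; all the others are optional.
The second parameter, data, must be of type bytes if it is passed. If the data is a dict, encode it first with the urlencode function from urllib.parse.
The third parameter, headers, is a dict of request headers. Headers can be set through this parameter when constructing the request, or added afterwards by calling the request object's add_header method.
The most common use of request headers is to change User-Agent so that the request looks like it comes from a browser. The default value is Python-urllib followed by the Python version. To masquerade as Chrome, for example, set User-Agent to: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
The fourth parameter, origin_req_host, is the host name or IP address of the party making the request.
The fifth parameter, unverifiable, says whether the request is unverifiable; the default is False. An unverifiable request is one the user had no opportunity to approve, for example the automatic fetching of an image embedded in an HTML document.
The sixth parameter, method, is a string naming the HTTP method to use, e.g. GET, POST, or PUT.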
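
Putting several of these parameters together, a Request could be built as in the sketch below. The httpbin URL and the extra Accept header are assumptions chosen for illustration; the request is only constructed and inspected, not sent.

```python
from urllib import parse, request

url = 'https://www.httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
}
data = parse.urlencode({'name': 'germey'}).encode('utf-8')

req = request.Request(url, data=data, headers=headers, method='POST')
# Headers can also be added after construction (hypothetical extra header).
req.add_header('Accept', 'application/json')

# Request stores header names capitalized, e.g. 'User-agent'.
print(req.get_method())               # POST
print(req.get_header('User-agent'))
print(req.has_header('Accept'))       # True

# To actually send it: response = request.urlopen(req)
```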

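2.3. Advanced usage

More advanced operations are done with handlers. The various Handler subclasses all inherit from the BaseHandler class:

HTTPDefaultErrorHandler handles HTTP response errors; every error is raised as an HTTPError exception.
HTTPRedirectHandler handles redirects.
HTTPCookieProcessor handles cookies.
ProxyHandler sets proxies; when no proxies are passed, it reads the proxy settings from the environment.
HTTPPasswordMgr manages passwords; it maintains a table of user names and passwords.
HTTPBasicAuthHandler manages authentication, e.g. when opening a link requires authentication.

Another important class is OpenerDirector, called an opener for short. An opener provides the open method, and openers are built from handlers.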
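
The relationship between handlers and an opener can be sketched offline as follows. The handlers attribute inspected here belongs to CPython's OpenerDirector and is used only for illustration; the choice of HTTPCookieProcessor as the custom handler is an assumption.

```python
from urllib.request import HTTPCookieProcessor, OpenerDirector, build_opener

# Pass handler instances; build_opener adds the default handlers around them.
opener = build_opener(HTTPCookieProcessor())

# NOTE: `handlers` is inspected here purely to show what the opener contains.
handler_names = [type(h).__name__ for h in opener.handlers]
print(isinstance(opener, OpenerDirector))   # True
print(handler_names)

# opener.open(url) would then send a request through this handler chain.
```

The printed list contains the custom HTTPCookieProcessor alongside defaults such as HTTPRedirectHandler and HTTPDefaultErrorHandler.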

2.3.1. Authentication*

# Handling pages that require a username and password to access.
from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username = 'admin'
password = 'admin'
url = 'https://ssr3.scrape.center/'

p = HTTPPasswordMgrWithDefaultRealm()	
p.add_password(None,url,username,password)
auth_handler = HTTPBasicAuthHandler(p)	
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
    
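# First an HTTPPasswordMgrWithDefaultRealm object is created, and the username and password are registered with its add_password method. An HTTPBasicAuthHandler object, auth_handler, is then instantiated with that password manager as its argument. This gives us a handler that takes care of authentication.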
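
What the password manager stores can be checked without any network access. The sketch below reuses the URL and credentials from the example above; the realm string and the second URL are arbitrary values chosen for illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

url = 'https://ssr3.scrape.center/'
p = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for any realm under url.
p.add_password(None, url, 'admin', 'admin')

# URLs under the registered prefix resolve to the stored credentials ...
inside = p.find_user_password('some realm', 'https://ssr3.scrape.center/page/1')
# ... while other hosts resolve to (None, None).
outside = p.find_user_password('some realm', 'https://www.httpbin.org/')

print(inside)    # ('admin', 'admin')
print(outside)   # (None, None)
```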

2.3.2. Proxy*

# Add a proxy (this assumes a proxy is listening locally on port 8080)
from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler = ProxyHandler({
    'http':'http://127.0.0.1:8080',
    'https':'https://127.0.0.1:8080'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

2.3.3. Cookie*

# 1. Get the cookies of a website
import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()	# create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)	# build a handler from it
opener = urllib.request.build_opener(handler)	# build an opener from the handler
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
    
# 2. Save the cookies to a file
import urllib.request,http.cookiejar

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
# To save in LWP format instead, replace the line above with:
# cookie = http.cookiejar.LWPCookieJar(filename)

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

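# 3. Load cookies from a file, using the LWP format as an example (cookie.txt must have been saved with LWPCookieJar):
import urllib.request,http.cookiejar

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True) # read the cookies from the file
handler = urllib.request.HTTPCookieProcessor(cookie) # build the handler and the opener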
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))

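3. Exception handling

The error module of urllib defines the exceptions raised by the request module. When something goes wrong, request raises one of the exceptions defined in error.

3.1. URLError

The URLError class comes from urllib's error module and inherits from OSError. It is the base class of the exceptions in the error module, so any exception raised by the request module can be handled by catching it.

It has one attribute, reason, which returns the cause of the error.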

from urllib import request,error
try:
    response = request.urlopen('https://cuiqingcai.com/404')
except error.URLError as e:
    print(e.reason)		# Not Found
# Catching the exception keeps the program from terminating

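3.2. HTTPError

HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication. It has three attributes:

code: returns the HTTP status code.
reason: returns the cause of the error.
headers: returns the response headers.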

from urllib import request,error
try:
    response = request.urlopen('https://cuiqingcai.com/404')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n')
    
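# Here the HTTPError exception is caught and its reason, code, and headers attributes are printed.
# Sometimes the reason attribute is not a string but an object, e.g. a socket.timeout instance on a URLError caused by a timeout.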
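
Because HTTPError is a subclass of URLError, a common pattern is to catch HTTPError first and URLError second. The sketch below shows one way to do that; the helper name fetch_status and the message formats are assumptions made for illustration.

```python
import socket
from urllib import error, request


def fetch_status(url, timeout=10):
    """Return (status, None) on success or (None, message) on failure."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.status, None
    except error.HTTPError as e:
        # Must come before URLError, because HTTPError is its subclass.
        return None, 'HTTP error %s: %s' % (e.code, e.reason)
    except error.URLError as e:
        # reason may be an exception object rather than a string.
        if isinstance(e.reason, socket.timeout):
            return None, 'timed out'
        return None, 'URL error: %s' % (e.reason,)
```

Called with the 404 URL from the examples above, this would be expected to return None together with a message starting with "HTTP error 404".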

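4. Parsing links: the parse module

urllib also provides the parse module, which defines a standard interface for handling URLs, e.g. extracting, combining, and converting the parts of a URL.

4.1. urlparse*

This function identifies a URL and splits it into its parts.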

from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

# <class 'urllib.parse.ParseResult'>
# ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
# The result is a ParseResult object with six parts: scheme, netloc, path, params, query, and fragment.

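API of urlparse:

urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

urlstring: the URL to parse.
scheme: the default scheme (e.g. http or https); it is used only when the URL itself does not specify one.
allow_fragments: whether to recognize the fragment. If set to False, the fragment is not split off and stays part of the preceding component (path, params, or query).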
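
The effect of the scheme and allow_fragments parameters can be seen in the short sketch below; the URLs are variations on the one used above.

```python
from urllib.parse import urlparse

# scheme= is only a fallback for URLs that carry no scheme of their own.
# Without a leading '//' the host name is treated as part of the path.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
own_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')

# With allow_fragments=False the fragment stays inside the previous component.
in_query = urlparse('https://www.baidu.com/index.html;user?id=5#comment',
                    allow_fragments=False)
in_path = urlparse('https://www.baidu.com/index.html#comment',
                   allow_fragments=False)

print(no_scheme)
print(own_scheme.scheme)   # http
print(in_query.query)      # id=5#comment
print(in_path.path)        # /index.html#comment
```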

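4.2. urlunparse*

The counterpart, urlunparse, constructs a URL. It takes an iterable whose length must be exactly 6; otherwise an error about too few or too many values is raised.

from urllib.parse import urlunparse

data = ['https','www.baidu.com','index.html','user','a=6','comment'] # a list here; a tuple or other iterable also works
print(urlunparse(data))  # https://www.baidu.com/index.html;user?a=6#comment 
# The URL has been constructed successfully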
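
urlparse and urlunparse are inverses of each other, which makes it easy to change one part of a URL. A minimal sketch, using the named-tuple method _replace and the geturl method of the parse result:

```python
from urllib.parse import urlparse, urlunparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Splitting and re-assembling gives back the original URL.
round_trip = urlunparse(parts)

# ParseResult is a named tuple, so _replace returns a modified copy.
changed = parts._replace(query='id=6', fragment='').geturl()

print(round_trip)   # https://www.baidu.com/index.html;user?id=5#comment
print(changed)      # https://www.baidu.com/index.html;user?id=6
```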

4.3. urlsplit*

This function is similar to urlparse, except that it does not parse params separately (params stay part of path), so it returns 5 parts.

from urllib.parse import urlsplit

result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)    # SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

# The result is also a named tuple, so its parts can be accessed by attribute name or by index
print(result.scheme,result[0])  # https https

4.5. urlunsplit*

Combines the parts of a URL into a complete URL. The argument is an iterable, e.g. a list or tuple, of length 5.

from urllib.parse import urlunsplit

data = ['https','www.baidu.com','index.html','a=6','comment']
print(urlunsplit(data))  # https://www.baidu.com/index.html?a=6#comment

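4.6. urljoin

Builds a URL. The first argument is base_url (the base link) and the second is the new link. urljoin analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever the new link is missing.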

from urllib.parse import urljoin

print(urljoin('https://www.baidu.com','FAQ.html'))
print(urljoin('https://www.baidu.com','https://cuiqingcai.com/FAQ.html'))
print(urljoin('https://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html'))
print(urljoin('https://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('https://www.baidu.com','https://cuiqingcai.com/FAQ.html/index.php'))
print(urljoin('https://www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com?#comment','?category=2'))

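# If the new link lacks any of these three parts, they are filled in from base_url; parts that the new link does specify are taken from the new link, and base_url has no effect on them.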
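
The rule above can be made concrete with a base URL that has a path. In the sketch below the base URL with a /docs/ directory is a made-up example; each relative link is shown with the result urljoin produces.

```python
from urllib.parse import urljoin

base = 'https://www.baidu.com/docs/about.html'   # hypothetical base URL

joined = {
    # relative file name: replaces the last path segment of the base
    'FAQ.html': urljoin(base, 'FAQ.html'),
    # absolute path: keeps scheme and netloc, replaces the whole path
    '/FAQ.html': urljoin(base, '/FAQ.html'),
    # parent directory reference
    '../FAQ.html': urljoin(base, '../FAQ.html'),
    # query and fragment only: keeps the base path
    '?category=2#comment': urljoin(base, '?category=2#comment'),
    # full URL: the base has no effect
    'https://cuiqingcai.com/FAQ.html': urljoin(base, 'https://cuiqingcai.com/FAQ.html'),
}

for relative, result in joined.items():
    print(relative, '->', result)
```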

4.7. urlencode

urlencode serializes a dict into GET request parameters.

from urllib.parse import urlencode

params = {
    'name':'germey',
    'age':25
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)   

# https://www.baidu.com?name=germey&age=25  (the dict has been converted into GET request parameters)
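
urlencode also takes care of escaping. The sketch below shows a few cases beyond the plain dict above; the parameter names are arbitrary examples.

```python
from urllib.parse import urlencode

# Non-ASCII values are percent-encoded as UTF-8.
chinese = urlencode({'wd': '壁纸'})
# Spaces become '+', as in HTML form submissions.
spaces = urlencode({'q': 'python urllib'})
# With doseq=True each element of a sequence becomes its own key=value pair.
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)
# A sequence of pairs works too and keeps the given order.
pairs = urlencode([('name', 'germey'), ('age', 25)])

print(chinese)    # wd=%E5%A3%81%E7%BA%B8
print(spaces)     # q=python+urllib
print(repeated)   # tag=a&tag=b
print(pairs)      # name=germey&age=25
```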

4.8. parse_qs

Deserialization: converts a string of GET request parameters back into a dict.

from urllib.parse import parse_qs

query = 'name=germy&age=25'
print(parse_qs(query))

#{'name': ['germy'], 'age': ['25']}

4.9. parse_qsl

Converts the parameters into a list of tuples.

from urllib.parse import parse_qsl

query = 'name=germy&age=25'
print(parse_qsl(query))

# [('name', 'germy'), ('age', '25')]
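
In practice the query string usually comes from a full URL. The sketch below combines urlparse with parse_qs and parse_qsl; the URL, including its repeated tag parameter, is a made-up example.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlparse

url = 'https://www.baidu.com/s?name=germey&age=25&tag=a&tag=b'
query = urlparse(url).query

# parse_qs groups repeated keys into one list; every value is a str.
as_lists = parse_qs(query)
# parse_qsl keeps one tuple per key=value pair, in order.
as_pairs = parse_qsl(query)
# The list of pairs can be serialized back with urlencode.
rebuilt = urlencode(as_pairs)

print(as_lists)   # {'name': ['germey'], 'age': ['25'], 'tag': ['a', 'b']}
print(as_pairs)   # [('name', 'germey'), ('age', '25'), ('tag', 'a'), ('tag', 'b')]
print(rebuilt)    # name=germey&age=25&tag=a&tag=b
```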

4.10. quote

Converts content into URL-encoded form. Non-ASCII characters such as Chinese in a URL parameter can end up garbled; quote converts them into URL encoding.

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/?wd=' + quote(keyword)
print(url)

# https://www.baidu.com/?wd=%E5%A3%81%E7%BA%B8

4.11. unquote

Decodes a URL-encoded string.

from urllib.parse import unquote

url = 'https://www.baidu.com/?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))
# https://www.baidu.com/?wd=壁纸
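
A few variations on quote and unquote, sketched with made-up input strings: the safe argument controls which characters are left alone, and quote_plus / unquote_plus use '+' for spaces as in form data.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# '/' is kept by default; spaces and '&' are percent-encoded.
path = quote('/path/壁纸 a&b')
# safe='' encodes '/' as well, useful for a single path segment or value.
segment = quote('a/b', safe='')
# quote_plus encodes spaces as '+'.
form_value = quote_plus('壁纸 hd')

print(path)                       # /path/%E5%A3%81%E7%BA%B8%20a%26b
print(segment)                    # a%2Fb
print(form_value)                 # %E5%A3%81%E7%BA%B8+hd
print(unquote(path))              # /path/壁纸 a&b
print(unquote_plus(form_value))   # 壁纸 hd
```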

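5. Analyzing the Robots protocol

5.1. The Robots protocol

The Robots protocol, also called the crawler protocol or robots protocol, is formally known as the Robots Exclusion Standard. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a robots.txt file in the root directory of the site.

When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls within the scope defined there; if not, it visits every page it can reach.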

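# Sample robots.txt files:
# Restrict all crawlers to the public directory only:
User-agent:*	# name of the crawler; * means the rules apply to all crawlers
Disallow:/		# directory that crawlers may not crawl; / means no page may be crawled
Allow:/public/	# pages that may be crawled; used together with Disallow to carve out exceptions. Here the public directory may be crawled

# Forbid all crawlers from accessing any directory:
User-agent:*
Disallow:/

# Allow all crawlers to access all directories (leaving robots.txt empty has the same effect):
User-agent:*
Disallow:

# Forbid all crawlers from accessing certain directories of the site:
User-agent:*
Disallow:/private/
Disallow:/tmp/

# Allow only one particular crawler to access all directories:
User-agent:WebCrawler
Disallow:
User-agent:*
Disallow:/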

5.2. Crawler names

Crawlers have fixed names. For example:

Crawler name	Site
Baiduspider	Baidu
Googlebot	Google
360Spider	360 Search
YodaoBot	Youdao
ia_archiver	Alexa
Scooter	altavista
Bingbot	Bing

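5.3. The robotparser module

The robotparser module parses robots.txt files. It provides a class, RobotFileParser, which decides from a site's robots.txt file whether a crawler is allowed to fetch a page.

# Usage: pass the URL of the robots.txt file to the constructor
urllib.robotparser.RobotFileParser(url='')

The most commonly used methods of RobotFileParser are:

set_url: sets the URL of the robots.txt file.
read: fetches the robots.txt file and parses it. Remember to call it; otherwise every subsequent check returns False.
parse: parses the contents of a robots.txt file, passed in as a list of lines.
can_fetch: takes two arguments, a User-agent and the URL to fetch, and returns True or False depending on whether the crawler identified by that User-agent may fetch the URL.
mtime: returns the time the robots.txt file was last fetched and parsed; useful for long-running crawlers that need to re-check robots.txt periodically.
modified: also useful for long-running crawlers; sets the time the robots.txt file was last fetched and parsed to the current time.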

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('Baiduspider','https://www.baidu.com'))				# True
print(rp.can_fetch('Baiduspider','https://www.baidu.com/homepage/')) 	# True
print(rp.can_fetch('Googlebot','https://www.baidu.com/homepage/')) 		# False
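
The parse method makes it possible to try out rules without fetching anything. The sketch below feeds a small, made-up robots.txt to RobotFileParser; the rules, crawler names, and example.com URLs are assumptions for illustration. Python's parser applies the first rule that matches, so Allow is listed before Disallow here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: WebCrawler may fetch everything,
# all other crawlers may only fetch /public/.
rules = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Allow: /public/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

webcrawler_private = rp.can_fetch('WebCrawler', 'https://example.com/private/page.html')
other_public = rp.can_fetch('OtherBot', 'https://example.com/public/index.html')
other_private = rp.can_fetch('OtherBot', 'https://example.com/private/page.html')

print(webcrawler_private)   # True
print(other_public)         # True
print(other_private)        # False
```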