引言:
1.语言:python
2.预备知识——python:爬虫学习前记----Python-CSDN博客
3.学习资源:【Python+爬虫】
html:
<!DOCTYPE html>
<html>
<head>
<title>czy_demo</title>
<meta charset="UTF-8"> <!-- 指定字符编码 -->
</head>
<body>
<h1>一级标题(h1~h6)</h1>
<p>普通文本<b>加粗</b><i>斜体</i><u>下划线</u></p>
<img src="1.jpg" width="500px">
<br><a href="http://t.csdnimg.cn/DvHJ6" target="_blank">CSDN链接</a>
<p>这是多个span展示:<span style="background-color: bisque">span1</span><span style="background-color: aquamarine">span2</span></p>
<ol>
<li>有序列表</li>
<li>有序列表</li>
<li>有序列表</li>
</ol>
<ul>
<li>无序列表</li>
<li>无序列表</li>
<li>无序列表</li>
</ul>
<table border="1">
<thead>
<tr>头部有几个就写几行tr</tr>
<tr>第二行头部标签</tr>
</thead>
<tbody>
<tr>
<td>第一行*单元格1</td>
<td>第一行*单元格2</td>
<td>第一行*单元格3</td>
</tr>
<tr>
<td>第二行*单元格1</td>
<td>第二行*单元格2</td>
<td>第二行*单元格3</td>
</tr>
</tbody>
</table>
</body>
</html>
爬虫代码
1.两个需要的包
from bs4 import BeautifulSoup
import requests
2.爬原代码
response = requests.get('http:.......')
print(response) # 响应
print(response.status_code) # 状态码---200[ok]
print(response.text) # 打印源码
3.爬指定的内容
response = requests.get('http:........')
content =response.text
soup = BeautifulSoup(content,"html.parser") # 解析器html
all_p=soup.findAll("p",attrs={"class":""})
for p in all_p:
print(p.string)
all_p=soup.findAll("h3")
for p in all_p:
p1=p.findAll("a")
for p2 in p1:
print(p2.string)
3.下载图片
from bs4 import BeautifulSoup
import requests
headers={
'User-Agent': 【替换成目标网页的User-Agent】
}
response = requests.get('http://data.shouxi.com/item.php?id=1239786',headers=headers)
response.encoding = 'GBK'
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text,"html.parser") # 解析器html
# print(response.text)
i=soup.findAll("img")
num=1;
for Img in i:
img_url=Img.get("src")
if not img_url.startswith('http:'):
img_url="http:....【替换成网页地址】"+img_url # 将相对地址转换成绝对地址
# 发送请求下载图片
img_response = requests.get(img_url, headers=headers)
with open(f'image.{num}.jpg', mode='wb') as f:
f.write(img_response.content)
print(f'图片已保存: images.{num}')
num = num + 1