爬虫学习日记

2024-07-13 22:30:03
开发
18

引言：

1.语言：python

html：

<!DOCTYPE html>
<html>
    <head>
        <title>czy_demo</title>
        <meta charset="UTF-8"> <!-- 指定字符编码 -->
    </head>
    <body>
        <h1>一级标题（h1~h6）</h1>
        <p>普通文本<b>加粗</b><i>斜体</i><u>下划线</u></p>
        <img src="1.jpg" width="500px">
        <br><a href="http://t.csdnimg.cn/DvHJ6" target="_blank">CSDN链接</a>
        <p>这是多个span展示:<span style="background-color: bisque">span1</span><span style="background-color: aquamarine">span2</span></p>
        <ol>
            <li>有序列表</li>
            <li>有序列表</li>
            <li>有序列表</li>
        </ol>
        <ul>
            <li>无序列表</li>
            <li>无序列表</li>
            <li>无序列表</li>
        </ul>

    <table border="1">
        <thead>
            <tr>头部有几个就写几行tr</tr>
            <tr>第二行头部标签</tr>
        </thead>
        <tbody>
            <tr>
                <td>第一行*单元格1</td>
                <td>第一行*单元格2</td>
                <td>第一行*单元格3</td>
            </tr>
            <tr>
                <td>第二行*单元格1</td>
                <td>第二行*单元格2</td>
                <td>第二行*单元格3</td>
            </tr>
        </tbody>

    </table>

    </body>
</html>

爬虫代码

1.两个需要的包

from bs4 import BeautifulSoup
import requests

2.爬原代码

response = requests.get('http:.......')
print(response) #  响应
print(response.status_code) #  状态码---200[ok]
print(response.text) #  打印源码

3.爬指定的内容

response = requests.get('http:........')
content =response.text
soup = BeautifulSoup(content,"html.parser") # 解析器html

all_p=soup.findAll("p",attrs={"class":""})
for p in all_p:
    print(p.string)

all_p=soup.findAll("h3")
for p in all_p:
    p1=p.findAll("a")
    for p2 in p1:
        print(p2.string)

3.下载图片

from bs4 import BeautifulSoup
import requests

headers={
    'User-Agent': 【替换成目标网页的User-Agent】
}
response = requests.get('http://data.shouxi.com/item.php?id=1239786',headers=headers)
response.encoding = 'GBK'
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text,"html.parser") # 解析器html

# print(response.text)

i=soup.findAll("img")

num=1;
for Img in i:
    img_url=Img.get("src")
    if not img_url.startswith('http:'):
        img_url="http:....【替换成网页地址】"+img_url # 将相对地址转换成绝对地址
    # 发送请求下载图片
    img_response = requests.get(img_url, headers=headers)
    with open(f'image.{num}.jpg', mode='wb') as f:
        f.write(img_response.content)
        print(f'图片已保存: images.{num}')
    num = num + 1

原文地址:https://blog.csdn.net/wuhu_czy/article/details/140360967 本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：https://www.suanlizi.com/kf/1812132384544526336.html 如若内容造成侵权/违法违规/事实不符，请联系《酸梨子》网邮箱：1419361763@qq.com进行投诉反馈，一经查实，立即删除！

阅读全部

爬虫学习日记

引言：

html：

爬虫代码

相关推荐

最近更新

热门阅读