Python Crawler Principles and Hands-On Programming: Scraping a CSDN Blogger's Account Information

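🧑 About the author: an embedded-technology expert at Alibaba who works at the intersection of embedded systems and artificial intelligence, with many years of experience managing embedded hardware product R&D.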

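📒 About this blog: shares knowledge, experience, reflections, and insights on embedded development; follows are welcome. Study guidance for embedded development, résumé and interview coaching, technical architecture design and optimization, and outsourced development are also offered. Send a private message if you are interested.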

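1. Crawler Basics
- 1.1 What Is a Crawler
- 1.2 Why We Need Crawlers
2. The Python Crawler Workflow
3. Python Crawler Tools
4. Use Cases for Python Crawlers
5. Python Crawler in Practice: Scraping a CSDN Blogger's Account Information
6. Common Crawler Problems
7. Summary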

1. Crawler Basics

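1.1 What Is a Crawler

A crawler, also called a web crawler or web spider, is a program or script that collects information from the internet automatically. It can imitate what a person does: browse pages, extract data, and even perform simple actions. It works by sending HTTP requests, fetching the page content, and then parsing and processing the data in that content.

1.2 Why We Need Crawlers

The internet is full of information of every kind: news, product listings, stock data, weather forecasts, and more. With a crawler we can collect this information automatically instead of opening each page by hand, which saves time and cost. Crawlers are also used for data analysis, search engine optimization, public-opinion monitoring, and similar tasks, so they matter to individuals and companies alike.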

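2. The Python Crawler Workflow

2.1 Sending the Request

A crawler starts by sending a request to the target site, normally over HTTP or HTTPS. The request often has to carry extra parameters such as request headers, a request body, or cookies, so that it looks like the browsing behavior of a real user.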
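
To make the role of headers and cookies concrete, here is a minimal sketch that builds a request without sending it. It uses the standard library's `urllib.request.Request` rather than requests so that it runs anywhere; the helper name `build_request`, the example URL, and the cookie value are illustrative assumptions, not part of the original program.

```python
from urllib.request import Request


def build_request(url, user_agent, cookies=None):
    """Build (but do not send) a GET request with browser-like headers."""
    headers = {"User-Agent": user_agent, "Accept": "text/html"}
    if cookies:
        # Cookies travel in one "Cookie" header as "name=value; name=value".
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return Request(url, headers=headers)


# Hypothetical URL and cookie, used only for illustration.
req = build_request(
    "https://example.com/profile",
    "Mozilla/5.0 (X11; Linux x86_64)",
    cookies={"session": "abc123"},
)
print(req.get_method(), req.full_url)
print(req.header_items())
```

With requests, the same information would typically be passed through the `headers` and `cookies` arguments of `requests.get`.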

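2.2 Receiving the Response

After the server receives the request, it returns a response. The response carries all the data the server sends back, usually encoded as HTML, JSON, or some other format.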
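
Since the body may be HTML or JSON, a crawler usually looks at the `Content-Type` header before deciding how to read it. The following is a small sketch of that decision; `decode_body` is a hypothetical helper, and the fallback to UTF-8 is an assumption of this example.

```python
import json


def decode_body(content_type, body, default_charset="utf-8"):
    """Return parsed JSON for JSON responses, otherwise the decoded text."""
    # Split e.g. "text/html; charset=gbk" into media type and parameters.
    media_type, _, params = content_type.partition(";")
    charset = default_charset
    for param in params.split(";"):
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            charset = value.strip('"')
    text = body.decode(charset, errors="replace")
    if media_type.strip().lower() == "application/json":
        return json.loads(text)
    return text


print(decode_body("application/json; charset=utf-8", b'{"fans": 6948}'))
print(decode_body("text/html; charset=gbk", "码龄".encode("gbk")))
```

When requests is used, `response.json()` and `response.text` cover the same two cases.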

2.3 Parsing the Data

Once we have the response data, we parse it to extract the information we care about. In practice this means parsing HTML or JSON.
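
As a small illustration of HTML parsing, the sketch below collects every link on a page using only the standard library's `html.parser`. The class and function names are made up for this example; the program later in this article uses BeautifulSoup for the same step.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the href values of all links in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p><a href="/a">A</a> <a>no href</a> <a href="/b">B</a></p>'
print(extract_links(sample))  # ['/a', '/b']
```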

2.4 Storing the Data

After the data has been parsed, we usually store it so that it can be analyzed or processed later.
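
One simple storage option is CSV. The sketch below writes parsed records with `csv.DictWriter`; it writes to an in-memory buffer so it can be run as is, and the field names and the `write_rows` helper are assumptions chosen for this example.

```python
import csv
import io


def write_rows(rows, fieldnames, stream):
    """Write a header line followed by one CSV line per record."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_rows([{"user_id": "g310773517", "fans": 6948}], ["user_id", "fans"], buffer)
print(buffer.getvalue())
```

To write a real file, pass `open(path, "w", newline="", encoding="utf-8")` as the stream; the `csv` module expects `newline=""` so that it controls line endings itself.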

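2.5 Handling Errors and Anti-Crawler Measures

A running crawler can hit many kinds of errors, such as network problems and server errors. In addition, many sites defend against crawlers with measures such as CAPTCHAs and request-rate limits. A crawler has to handle both the errors and these defenses properly if it is to stay stable and reliable.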
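
A common way to cope with transient network errors and rate limits is to retry with an increasing delay. The sketch below shows the idea with exponential backoff; `fetch_with_retry` and its parameters are hypothetical, and catching `OSError` is an assumption that happens to cover `urllib.error.URLError` and `requests.exceptions.RequestException`, both of which derive from it.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after a failure wait base_delay * 2**n, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except OSError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake fetch function that fails twice, then succeeds.
calls = []
delays = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = fetch_with_retry(flaky_fetch, "https://example.com", sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the demonstration instant; in real use the default `time.sleep` spaces the requests out, which also lowers the chance of tripping a rate limit.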

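3. Python Crawler Tools

3.1 requests

requests is one of the most popular HTTP libraries for Python. It offers a concise, easy-to-use API for sending HTTP requests and handling responses.

3.2 BeautifulSoup

BeautifulSoup is an HTML and XML parsing library with a simple, clear API and a rich feature set for extracting information from web pages.

3.3 lxml

lxml is a high-performance XML and HTML parsing library built on libxml2 and libxslt. It handles large documents efficiently and suits extraction tasks with complex requirements.

3.4 PyQuery

PyQuery is a parsing library modeled on jQuery. It uses jQuery-style syntax to parse HTML documents and extract the data you need.

3.5 Scrapy

Scrapy is a powerful, flexible crawler framework for building crawlers quickly. It covers fetching, processing, and storing data.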

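4. Use Cases for Python Crawlers

Python crawlers are used in many areas. Some common ones are listed below.

4.1 Data Collection and Analysis

Crawlers are widely used for data collection and analysis. A company can use them to gather data from the internet, such as market prices, competitor activity, and product information, for business analytics and market research. Crawlers also serve scientific research and public-opinion analysis by providing additional data sources.

4.2 SEO

Search engine optimization (SEO) is an important online marketing technique. Crawlers can collect data about a site, such as its indexing status and keyword rankings, to help improve its search ranking. With this data a site owner can see how search engines are crawling the pages, then adjust the site's structure and content to rank higher in search results.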

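4.3 Public-Opinion Monitoring

Public-opinion monitoring is a market research technique that companies use often: they watch and analyze sources such as social media and news sites to learn how the public talks about the company, its products, or its services. Crawlers help a company collect this information promptly and summarize it, so that it can see how it is perceived, respond to negative coverage in time, and set a suitable brand marketing strategy.

4.4 Price Monitoring

In e-commerce, price is one of the most important factors in a purchase decision. A company can use crawlers to track competitors' price changes and adjust its own prices and promotions in response to the market. Consumers can use crawlers too, tracking price changes to find the best time to buy.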
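
The core of price monitoring, once the prices have been scraped, is comparing two snapshots. The following sketch shows one possible comparison; the function name, the item names, and the convention of storing prices as integers in the smallest currency unit are all assumptions made for this example.

```python
def price_changes(previous, current):
    """Return {item: (old_price, new_price)} for items whose price changed."""
    changes = {}
    for item, new_price in current.items():
        old_price = previous.get(item)
        # Items that are new in this snapshot have no old price to compare.
        if old_price is not None and old_price != new_price:
            changes[item] = (old_price, new_price)
    return changes


# Hypothetical snapshots; prices are integers to avoid float rounding.
yesterday = {"keyboard": 29900, "mouse": 9900}
today = {"keyboard": 25900, "mouse": 9900, "monitor": 99900}
print(price_changes(yesterday, today))  # {'keyboard': (29900, 25900)}
```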

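5. Python Crawler in Practice: Scraping a CSDN Blogger's Account Information

The program below uses the requests and BeautifulSoup libraries to fetch a CSDN blogger's account information.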

import requests
from bs4 import BeautifulSoup
import sys
from datetime import datetime
from urllib.parse import urlparse

def make_request(url):
    """Send an HTTP GET request to the given URL and return the response, or None on failure."""
    try:
        headers = {
            'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/88.0.4324.150 Safari/537.36')
        }
        # Without a timeout, requests can wait indefinitely for a slow server
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response
    except requests.exceptions.HTTPError as errh:
        print("Http Error:", errh)
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting:", errc)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
    except requests.exceptions.RequestException as err:
        print("Oops: Something Else", err)
    return None

def parse_html(html_content):
    """Parse the HTML content and return a BeautifulSoup object."""
    return BeautifulSoup(html_content, 'html.parser')

def get_user_id_from_url(url):
    """Extract the user ID from the URL."""
    parse_result = urlparse(url)
    # Assume the user ID is the last segment of the URL path
    user_id = parse_result.path.rstrip('/').split('/')[-1]
    return user_id

def get_user_nickname(soup):
    """Extract and return the user's nickname from the given BeautifulSoup object."""
    nickname_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-t > div > div.user-profile-head-info-rr > div.user-profile-head-info-r-t > div.user-profile-head-name > div:nth-child(1)"
    nickname_element = soup.select_one(nickname_selector)
    return nickname_element.get_text(strip=True) if nickname_element else "未知"

def get_fans_count(soup):
    """Extract and return the follower count from the given BeautifulSoup object."""
    fans_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-t > div > div.user-profile-head-info-rr > div.user-profile-head-info-r-c > ul > li:nth-child(4) > a > div.user-profile-statistics-num"
    fans_element = soup.select_one(fans_selector)
    fans_text = fans_element.get_text(strip=True) if fans_element else "0"
    # Remove the thousands separators (commas) from the number
    return fans_text.replace(",", "")

def get_user_rank(soup):
    """Extract and return the user's rank from the given BeautifulSoup object."""
    rank_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-t > div > div.user-profile-head-info-rr > div.user-profile-head-info-r-c > ul > li:nth-child(3) > a > div.user-profile-statistics-num"
    rank_element = soup.select_one(rank_selector)
    rank_text = rank_element.get_text(strip=True) if rank_element else "0"
    # Remove the thousands separators (commas) from the number
    return rank_text.replace(",", "")

def get_personal_introduction(soup):
    """Extract and return the personal introduction from the given BeautifulSoup object."""
    intro_selector = "p.introduction-fold.default"
    intro_element = soup.select_one(intro_selector)
    if intro_element:
        # Get all the text under the <p> tag
        intro_text = intro_element.get_text(strip=True)
        # Remove the "个人简介：" ("Personal introduction:") label that comes from the <span> tag
        intro_text = intro_text.replace('个人简介：', '', 1).strip()
        return intro_text
    return "未知"

def get_user_coding_age(soup):
    """Extract and return the user's coding age (years of programming), without the '码龄' label."""
    coding_age_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-t > div > div.user-profile-head-info-rr > div.user-profile-head-info-r-t > div.user-profile-head-name > div.person-code-age > span"
    coding_age_element = soup.select_one(coding_age_selector)
    if coding_age_element:
        coding_age_text = coding_age_element.get_text(strip=True)
        # Remove the "码龄" ("coding age") label
        coding_age_text = coding_age_text.replace("码龄", "").strip()
        return coding_age_text
    return "未知"

def get_blog_level(soup):
    """Extract and return the user's blog level from the given BeautifulSoup object."""
    blog_level_selector = ".user-profile-icon img[src*='blog']"
    blog_level_element = soup.select_one(blog_level_selector)
    if blog_level_element and 'src' in blog_level_element.attrs:
        src = blog_level_element['src']
        # Take the digits between "blog" and ".png" in src as the level
        level = src.split('blog')[-1].split('.png')[0]
        return level
    return "未知"

def get_yuanli_level(soup):
    """Extract and return the user's Yuanli (原力) level from the given BeautifulSoup object."""
    yuanli_level_selector = "#userSkin > div.user-profile-body > div > div.user-profile-body-left > div > div.user-influence-list.user-profile-aside-common-box > ul > li > div.influence-bottom-box > div > div > dl:nth-child(1) > dt"
    yuanli_level_element = soup.select_one(yuanli_level_selector)
    return yuanli_level_element.get_text(strip=True) if yuanli_level_element else "未知"

def get_registration_date(soup):
    """Extract and return the user's registration date from the given BeautifulSoup object."""
    registration_date_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-b > div.user-profile-head-info-b-r > div > ul > li.user-general-info-join-csdn > span.user-general-info-key-word"
    registration_date_element = soup.select_one(registration_date_selector)
    return registration_date_element.get_text(strip=True) if registration_date_element else "未知"

def get_graduate_school(soup):
    """Extract and return the user's school from the given BeautifulSoup object."""
    graduate_school_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-b > div.user-profile-head-info-b-r > div > ul > li.user-general-info-edu > div > span.user-general-info-key-word"
    graduate_school_element = soup.select_one(graduate_school_selector)
    return graduate_school_element.get_text(strip=True) if graduate_school_element else "未知"

def get_registration_days(soup):
    """Return the number of days from the registration date to today, or "未知" (unknown)."""
    registration_date_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-b > div.user-profile-head-info-b-r > div > ul > li.user-general-info-join-csdn > span.user-general-info-key-word"
    registration_date_element = soup.select_one(registration_date_selector)
    if registration_date_element:
        date_text = registration_date_element.get_text(strip=True)
        try:
            registration_date = datetime.strptime(date_text, '%Y-%m-%d')
        except ValueError:
            # The page text is not in YYYY-MM-DD form
            return "未知"
        today = datetime.now()
        days_registered = (today - registration_date).days
        return days_registered
    return "未知"

def convert_days_to_ymd(days):
    """Convert a day count to a years/months/days string (approximation: 365-day years, 30-day months)."""
    years = days // 365
    days -= years * 365
    months = days // 30
    days -= months * 30
    result = []
    if years > 0:
        result.append(f"{years}年")
    if months > 0:
        result.append(f"{months}月")
    if days > 0:
        result.append(f"{days}天")
    return "".join(result) if result else "0天"

def get_blog_description(soup):
    """Extract and return the blog description from the given BeautifulSoup object."""
    blog_description_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-b > div.user-profile-head-info-b-r > div > div > div:nth-child(2) > div"
    blog_description_element = soup.select_one(blog_description_selector)
    return blog_description_element.get_text(strip=True) if blog_description_element else "未知"

def get_published_articles_count(soup):
    """Extract and return the number of published articles from the given BeautifulSoup object."""
    articles_count_selector = "#userSkin > div.user-profile-head > div.user-profile-head-info > div.user-profile-head-info-t > div > div.user-profile-head-info-rr > div.user-profile-head-info-r-c > ul > li:nth-child(2) > a > div.user-profile-statistics-num"
    articles_count_element = soup.select_one(articles_count_selector)
    return articles_count_element.get_text(strip=True) if articles_count_element else "未知"

# Main function: runs the script logic
def main(url):
    response = make_request(url)
    if response and response.text:

        # Parse the page
        soup = parse_html(response.text)

        # User ID
        user_id = get_user_id_from_url(url)
        print(f"用户ID：{user_id}")

        # User nickname
        user_nickname = get_user_nickname(soup)
        print(f"用户昵称：{user_nickname}")

        # Personal introduction
        personal_introduction = get_personal_introduction(soup)
        print(f"个人简介：{personal_introduction}")

        # Blog description
        blog_description = get_blog_description(soup)
        print(f"博客描述：{blog_description}")

        # School
        graduate_school = get_graduate_school(soup)
        print(f"毕业院校：{graduate_school}")

        # Coding age, registration date, and days since registration
        coding_age = get_user_coding_age(soup)
        registration_date = get_registration_date(soup)
        days_registered = get_registration_days(soup)
        # Convert the day count to a readable form. get_registration_days
        # returns the string "未知" (unknown) when no date is available,
        # and that value must not be passed to convert_days_to_ymd.
        if isinstance(days_registered, int):
            ymd_registered = convert_days_to_ymd(days_registered)
        else:
            ymd_registered = "未知"
        print(f"注册时间：{registration_date}, 已注册{days_registered}天({ymd_registered}), 码龄：{coding_age}")

        # Blog level and Yuanli level
        blog_level = get_blog_level(soup)
        yuanli_level = get_yuanli_level(soup)
        print(f"博客等级：{blog_level}级，原力等级：{yuanli_level}级")

        # Number of published articles
        articles_count = get_published_articles_count(soup)
        print(f"已发表文章数：{articles_count}")

        # Follower count
        fans_count = get_fans_count(soup)
        print(f"粉丝数量：{fans_count}")

        # User rank
        user_rank = get_user_rank(soup)
        print(f"用户排名：{user_rank}")
    else:
        print("请求失败，请检查网页URL地址。")


if __name__ == "__main__":
    # Read the command-line argument
    if len(sys.argv) < 2:
        print("Usage: python script.py <URL>")
        sys.exit(1)

    input_url = sys.argv[1]
    main(input_url)

Demo (the script prints its labels in Chinese, and the scraped profile text is Chinese as well):

$ python3 spider_csdn_userinfo.py https://blog.csdn.net/g310773517
用户ID：g310773517
用户昵称：I'mAlex
个人简介：深耕嵌入式+人工智能领域，阿里巴巴嵌入式技术专家。分享嵌入式开发领域的知识、工作过程中的思考、人生的感悟。提供嵌入式方向的学习指导和简历面试辅导，有需要可私信联系。
博客描述：科技改变人类，技术成就未来
毕业院校：未知
注册时间：2010-03-17, 已注册5145天(14年1月5天), 码龄：14年
博客等级：6级，原力等级：5级
已发表文章数：51
粉丝数量：6948
用户排名：3061
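
The "已注册5145天(14年1月5天)" line in the output comes from two steps: a date subtraction and the approximate conversion in `convert_days_to_ymd`. The sketch below reproduces both. The run date of 2024-04-17 is an assumption, chosen because it yields the 5145 days shown above; the function body is a compact restatement of the one in the listing, not a replacement for it.

```python
from datetime import date


def convert_days_to_ymd(days):
    """Approximate a day count as years/months/days (365-day years, 30-day months)."""
    years, days = divmod(days, 365)
    months, days = divmod(days, 30)
    parts = []
    if years > 0:
        parts.append(f"{years}年")
    if months > 0:
        parts.append(f"{months}月")
    if days > 0:
        parts.append(f"{days}天")
    return "".join(parts) if parts else "0天"


# Assumed run date: 2024-04-17; registration date taken from the demo output.
days_registered = (date(2024, 4, 17) - date(2010, 3, 17)).days
print(days_registered)                       # 5145
print(convert_days_to_ymd(days_registered))  # 14年1月5天
```

By the calendar, the two dates are exactly 14 years and 1 month apart. The extra 5 days appear because the conversion ignores leap years and treats every month as 30 days, so the readable string should be taken as an estimate.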