Why Django and Scrapy Projects Need the @sync_to_async Decorator

2024-06-06 23:04:05

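Asynchronous programming matters more and more in modern web development, especially for applications that handle large volumes of I/O. Scrapy is an asynchronous web-scraping framework, while Django is a popular web framework built mainly around a synchronous programming model. Combining the two in one project raises some challenges, particularly around database operations. This is where the @sync_to_async decorator becomes important.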

Asynchronous vs. Synchronous Programming

Scrapy's Asynchronous Nature

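Scrapy is an efficient web-crawling framework that uses Twisted for asynchronous I/O. While it waits for a network response, Scrapy does not block; it keeps processing other requests and callbacks in the meantime. This non-blocking I/O is what lets Scrapy crawl large numbers of pages efficiently.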
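The idea of overlapping waits can be sketched with the standard library. The example below uses asyncio rather than Twisted, and fake_fetch, crawl and run_crawl are hypothetical names made up for illustration; it is an analogy for how non-blocking I/O behaves, not Scrapy's actual machinery.

```python
import asyncio


async def fake_fetch(url: str, delay: float, log: list) -> str:
    """Pretend to download a page; asyncio.sleep stands in for network latency."""
    log.append(f"start {url}")
    await asyncio.sleep(delay)  # control returns to the event loop while "waiting"
    log.append(f"done {url}")
    return url


async def crawl(urls) -> list:
    """Start every fetch at once and wait for all of them to finish."""
    log = []
    await asyncio.gather(*(fake_fetch(u, 0.05, log) for u in urls))
    return log


def run_crawl(urls) -> list:
    """Synchronous entry point that returns the ordered event log."""
    return asyncio.run(crawl(urls))
```

Calling run_crawl with a few URLs should record every "start" entry before any "done" entry, because each fetch hands control back to the loop as soon as it begins to wait.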

Django's Synchronous Model

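Django is a powerful web framework whose ORM (object-relational mapper) simplifies database work. However, these ORM calls are synchronous: each query blocks the current thread until it completes. That is rarely a problem in a synchronous web application, but in an asynchronous environment such as Scrapy, a blocking call of this kind can seriously hurt performance.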

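Why @sync_to_async Is Needed

Calling a synchronous Django ORM operation directly inside Scrapy can cause the following problems:

Blocking the event loop: the synchronous database call holds up Scrapy's event loop, so no other in-flight request can be processed until the query returns.
Reduced concurrency: every blocking call cuts into Scrapy's ability to work on many requests at once, so the crawler cannot make full use of its asynchronous I/O.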
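The first problem can be reproduced in a few lines. This is a minimal sketch using asyncio and time.sleep as a stand-in for a synchronous ORM query; blocking_query, heartbeat and count_ticks_while_blocked are hypothetical names, and no Django or Scrapy code is involved.

```python
import asyncio
import time


def blocking_query(seconds: float) -> str:
    """Stand-in for a synchronous ORM call: time.sleep blocks the calling thread."""
    time.sleep(seconds)
    return "row"


async def heartbeat(ticks: list, interval: float) -> None:
    """Record a timestamp every `interval` seconds, as long as the loop is free to run."""
    while True:
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def run_blocking_demo(block_for: float = 0.2) -> int:
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks, 0.01))
    await asyncio.sleep(0)     # give the heartbeat task a chance to start
    blocking_query(block_for)  # called directly in a coroutine: the loop is frozen
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return len(ticks)


def count_ticks_while_blocked() -> int:
    """Return how many heartbeat ticks were recorded during the blocking call."""
    return asyncio.run(run_blocking_demo())
```

Although the heartbeat is scheduled every 10 ms, it records nothing during the 200 ms blocking call, because the event loop never gets a chance to run it.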

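The @sync_to_async decorator comes from the asgiref.sync module and turns a synchronous function into an awaitable one. The wrapped function runs in a worker thread, so the event loop is not blocked while it executes and the rest of the asynchronous code can keep running.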
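To make the mechanism concrete, here is a simplified stand-in written with the standard library only. It is not asgiref's implementation: to_async, slow_lookup and run_lookups are hypothetical names, and the real decorator handles details this sketch ignores, such as its thread_sensitive option.

```python
import asyncio
import functools
import threading
import time


def to_async(func):
    """Simplified stand-in for sync_to_async: run `func` in a worker thread."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        call = functools.partial(func, *args, **kwargs)
        # None selects the loop's default thread pool executor.
        return await loop.run_in_executor(None, call)
    return wrapper


@to_async
def slow_lookup(key: str):
    """Pretend database lookup; also reports whether it ran off the main thread."""
    time.sleep(0.05)
    off_main = threading.current_thread() is not threading.main_thread()
    return key.upper(), off_main


async def lookup_many(keys):
    return await asyncio.gather(*(slow_lookup(k) for k in keys))


def run_lookups(keys):
    """Synchronous entry point returning a list of (value, off_main_thread) pairs."""
    return asyncio.run(lookup_many(keys))
```

Because each call is handed to a worker thread, the coroutine that awaits it is suspended instead of freezing the loop, which is the property the pipeline in this article relies on.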

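Example: Using @sync_to_async

The following example shows how to use the @sync_to_async decorator in a Scrapy pipeline to call synchronous Django ORM operations:

Defining the Django Models

Suppose we have two Django models: SpiderProductDetail and SpiderProductList.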

# models.py

from django.db import models

class SpiderProductList(models.Model):
    product_id = models.CharField(max_length=255, primary_key=True)

class SpiderProductDetail(models.Model):
    product_id = models.OneToOneField(SpiderProductList, on_delete=models.CASCADE)
    seller_location = models.CharField(max_length=100, blank=True, null=True)
    seller_details = models.TextField(blank=True, null=True)

The Scrapy Pipeline

In pipelines.py, define a pipeline that processes the scraped data and updates the seller_location and seller_details fields in the SpiderProductDetail table:

# pipelines.py

import logging
from asgiref.sync import sync_to_async
from django.db import IntegrityError
from myapp.models import SpiderProductDetail, SpiderProductList

class UpdateProductDetailPipeline:
    async def process_item(self, item, spider):
        try:
            await self.save_to_database(item)
        except IntegrityError as e:
            logging.error(f"Error updating SpiderProductDetail: {e}")
        return item

    @sync_to_async
    def save_to_database(self, item):
        try:
            # Pull the required values out of the item
            product_id_str = item.get('product_id')
            seller_location = item.get('seller_location')
            seller_details = item.get('seller_details')

            # Fetch the SpiderProductList instance matching this product_id
            product_id = SpiderProductList.objects.get(product_id=product_id_str)

            # Update the existing record, or create it if it does not exist yet
            SpiderProductDetail.objects.update_or_create(
                product_id=product_id,
                defaults={
                    'seller_location': seller_location,
                    'seller_details': seller_details
                }
            )
        except SpiderProductList.DoesNotExist:
            logging.error(f"SpiderProductList with product_id {product_id_str} does not exist")
        except IntegrityError as e:
            logging.error(f"Error updating SpiderProductDetail: {e}")

Enabling the Pipeline

Make sure the pipeline defined above is enabled in the Scrapy project's settings.py:

# settings.py

ITEM_PIPELINES = {
    'myproject.pipelines.UpdateProductDetailPipeline': 300,
}
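The pipeline imports Django models and awaits an asyncio-based call, so the Scrapy settings probably need two more things that the listing above does not show. The fragment below is a sketch under assumptions: "mysite.settings" is a hypothetical Django settings module, and the Django project is assumed to be importable from where Scrapy runs.

```python
# settings.py (Scrapy project): illustrative additions, adjust names to your project

import os

import django

# "mysite.settings" is a hypothetical path to the Django settings module.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
django.setup()  # load Django's app registry so myapp.models can be imported

# sync_to_async relies on a running asyncio event loop, so Scrapy is assumed
# here to run on Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Newer Scrapy project templates may already set TWISTED_REACTOR to this value, in which case only the Django bootstrap lines are needed.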

Example Spider

Make sure your spider yields items containing the product_id, seller_location and seller_details fields. For example:

# my_spider.py

import scrapy

class AmazonSellerSpider(scrapy.Spider):
    name = "amazon_seller_spider"
    
 
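    def start_requests(self):
        # Build and yield the Request objects to crawl here
        return []  # placeholder: an empty iterable, so nothing is requested yet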


    def parse(self, response):
        item = {}
        item['product_id'] = self.get_product_id(response)
        item['seller_location'] = self.get_seller_location(response)
        item['seller_details'] = self.get_seller_details(response)
        yield item

    def get_product_id(self, response):
        # Logic for extracting product_id from the response
        pass

    def get_seller_location(self, response):
        # Logic for extracting seller_location from the response
        pass

    def get_seller_details(self, response):
        # Logic for extracting seller_details from the response
        pass

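Summary

With the @sync_to_async decorator, synchronous Django ORM operations can be called from Scrapy's asynchronous environment without blocking the event loop. Scrapy keeps the benefit of its asynchronous I/O, which improves the crawler's throughput and concurrency. When building a project on Django and Scrapy, understanding and correctly using @sync_to_async goes a long way toward an efficient, robust application.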

Author: pycode
Link: https://juejin.cn/post/7376894518329262115

Original article: https://blog.csdn.net/weixin_48612224/article/details/139503808
