Scraping the Douban Top 250 Movie Ranking with the Scrapy Framework (Part 2)

2024-07-18 13:36:01

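(3) Store the data in pipelines.py. The program first writes the items to a local text file (maoer1.json in the code below) simply to check that the spider extracts the data correctly. It uses the json library with ensure_ascii=False, so that non-ASCII characters such as Chinese are written to the file as-is instead of being escaped as \uXXXX sequences.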

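import json

class DoubanPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.f = open('maoer1.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line; keep Chinese text unescaped.
        json_str = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(json_str)
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.f.close()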
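
To see what ensure_ascii=False changes, here is a minimal sketch that serializes the same dictionary both ways. The item below is a made-up sample, not output from the spider; only the field names follow the ones used later in this post.

```python
import json

# Hypothetical sample item (illustration only, not scraped output).
item = {"movie_name": "肖申克的救赎", "score": "9.7"}

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # Chinese characters kept as-is

print(escaped)   # Chinese characters appear as \uXXXX escape sequences
print(readable)  # Chinese characters appear directly
```

Both strings decode back to the same dictionary; the only difference is how readable the file is when you open it in an editor.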

(4) Enable the pipeline and set its priority in settings.py.
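
The original post does not show this setting, so the following is only a sketch of what the entry might look like. The package name "douban" is an assumption based on the class name DoubanPipeline; replace it with your own project's package name.

```python
# settings.py (fragment)
# "douban" is an assumed project package name, not taken from the original post.
ITEM_PIPELINES = {
    # The integer is the priority: pipelines with lower numbers run first.
    # Values are conventionally chosen in the 0-1000 range.
    "douban.pipelines.DoubanPipeline": 300,
}
```

With only one pipeline the exact number does not matter; it only decides the order when several pipelines are enabled.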

In addition, while debugging I found that the crawler has to work around the site's anti-scraping measures.
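
The post does not say which measures were actually used. As a hedged sketch, these are some standard Scrapy settings that are commonly adjusted for this purpose; the User-Agent string and the delay values are illustrative assumptions, not the author's configuration.

```python
# settings.py (fragment) - illustrative values, not the author's actual configuration.

# Send a browser-like User-Agent instead of Scrapy's default one.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Wait between requests so the site is not hit too quickly.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the wait around DOWNLOAD_DELAY

# Let Scrapy slow down automatically when the server responds slowly.
AUTOTHROTTLE_ENABLED = True
```

A slower, more polite crawl is usually enough for a 250-item list, since the whole ranking is only a handful of pages.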

(5) Create a main.py file in the project directory for debugging.

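import os.path
import sys
from scrapy.cmdline import execute

# Add the directory containing this file (the project root) to the module search path.
currentFile = os.path.abspath(__file__)
currentPath = os.path.dirname(currentFile)
# print(currentPath)
sys.path.append(currentPath)
# Equivalent to running "scrapy crawl db" on the command line; "db" is the spider name.
execute(["scrapy", "crawl", "db"])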

(6) The resulting data (in the JSON file) is as follows:

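(7) Transfer the data to MySQL. After connecting to the database (the code below uses mysql.connector, from the mysql-connector-python package), each record is saved with an SQL statement of the form INSERT INTO table_name VALUES (values).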

import mysql.connector
import json

conn = mysql.connector.connect(
    host="127.0.0.1",
    user="root",
    password="010208",
    database="spider",
    port = 3306,
    charset = "utf8"
)

cursor = conn.cursor()

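sql = "INSERT INTO spider10 (description,movie_name,director,score) VALUES (%s,%s,%s,%s)"

# The pipeline writes one JSON object per line, so parse the file line by line;
# json.load() on the whole file fails once it holds more than one object.
with open('maoer1.json', 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Fall back to an empty string when a field is missing.
        description = entry.get('description', '')
        movie_name = entry.get('movie_name', '')
        director = entry.get('director', '')
        score = entry.get('score', '')
        cursor.execute(sql, (description, movie_name, director, score))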
conn.commit()

cursor.close()
conn.close()
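
As an alternative to calling execute once per record, the rows can be collected first and sent in one batch. The sketch below only covers the parsing half so that it runs without a database; the helper name rows_from_json_lines and the sample lines are made up for illustration.

```python
import json

FIELDS = ("description", "movie_name", "director", "score")
SQL = (
    "INSERT INTO spider10 (description, movie_name, director, score) "
    "VALUES (%s, %s, %s, %s)"
)

def rows_from_json_lines(lines):
    """Turn JSON Lines text into parameter tuples matching SQL above."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Missing fields become empty strings, as in the script above.
        rows.append(tuple(entry.get(field, "") for field in FIELDS))
    return rows

# Hypothetical sample lines standing in for the contents of maoer1.json.
sample = [
    '{"movie_name": "Example A", "director": "Someone", "score": "9.0", "description": "..."}',
    '',
    '{"movie_name": "Example B", "score": "8.5"}',
]
rows = rows_from_json_lines(sample)
print(rows)
```

With an open connection, the batch would then be written with cursor.executemany(SQL, rows) followed by conn.commit(). Passing the values as parameters, rather than formatting them into the SQL string, also lets the driver handle quoting.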

(8) Results

III. Data Visualization

For this assignment I made a bar chart and a word cloud from the collected data. (They don't look great, admittedly.)
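
The plotting code is not included in the post, so the following is only a rough sketch of how a score bar chart could be produced, assuming matplotlib is installed. The function names and the sample scores are placeholders, not the author's code or the scraped results.

```python
from collections import Counter

def score_counts(scores):
    """Count how many movies share each score, sorted by score."""
    counts = Counter(round(float(s), 1) for s in scores)
    return sorted(counts.items())

def plot_score_bar(pairs, path="score_bar.png"):
    """Draw the counts as a bar chart and save it (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    labels = [str(score) for score, _ in pairs]
    values = [count for _, count in pairs]
    plt.bar(labels, values)
    plt.xlabel("score")
    plt.ylabel("number of movies")
    plt.savefig(path)

# Placeholder values for illustration only, not scraped results.
pairs = score_counts(["9.7", "9.6", "9.6", "9.5"])
print(pairs)
```

Calling plot_score_bar(pairs) writes the chart to score_bar.png. In practice the scores would come from the score field of each record in maoer1.json.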

IV. Application Scenarios

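Scraping the Douban site and visualizing the data shows that the films people enjoy today span many genres and that the works on this list are highly rated and of high quality. I hope the site keeps promoting excellent works and enriching people's leisure time.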

OK, that's the complete write-up of the program. And the key point: I wrote it myself!

Original post: https://blog.csdn.net/m0_73523976/article/details/140505457
