Web Scraping Example: Collecting Lyric-to-Song-Title Data from a Lyrics Search Site

Lyrics search site URL: https://www.91ge.cn/lxyyplay/find/

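Goal: scrape every lyric query listed on the site, together with its song title and related details, and save each entry as a txt file.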

The listing spans 46 pages in total.
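The 46 index pages follow a numbered URL pattern, which the source code later in this post relies on. As a minimal sketch, the full list of index-page URLs could be generated up front. The names `BASE_URL`, `TOTAL_PAGES`, and `build_page_urls` are illustrative and do not appear in the original script.

```python
# Sketch: build the URL of every index page before any request is made.
BASE_URL = "https://www.91ge.cn/lxyyplay/find/"
TOTAL_PAGES = 46  # page count stated in this post; adjust if the site changes


def build_page_urls(total_pages: int = TOTAL_PAGES) -> list[str]:
    """Return the index-page URLs, numbered from 1 to total_pages."""
    return [f"{BASE_URL}list_16_{n}.html" for n in range(1, total_pages + 1)]


if __name__ == "__main__":
    urls = build_page_urls()
    print(len(urls))
    print(urls[0])
```

Keeping the page count in one constant makes it easy to update if the site adds or removes pages.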

A screenshot of the site:

Screenshot 2024-01-21 at 20.03.39

The complete lyric data to be scraped looks like this:

Screenshot 2024-01-21 at 20.04.26

The source code is as follows:

import asyncio
import os
import time

import aiohttp
from aiohttp import TCPConnector  # lets us disable SSL verification to avoid certificate errors
from lxml import etree

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

OUTPUT_DIR = 'songs'  # one txt file per song is written here


# Return the href of every song listed on one index page
async def get_song_url(session, page_url):
    async with session.get(page_url) as res:
        html = await res.text()
    tree = etree.HTML(html)
    if tree is None:
        return []
    return tree.xpath('//div[@class="des"]/a/@href')


# Fetch one song page and save its details to a txt file
async def get_song_word(session, song_url):
    async with session.get(song_url) as res:
        html = await res.text()
    tree = etree.HTML(html)
    if tree is None:
        return
    song_question = tree.xpath('//div[@class="logbox"]')
    div_word = tree.xpath('//div[@class="logcon"]')
    # Skip pages without the expected layout, so song_q is never used undefined
    if not song_question or not div_word:
        return
    song_q = song_question[0].xpath('./h1/text()')[0]
    where_song = div_word[0].xpath('./h2[1]/text()')[0]
    question_song = div_word[0].xpath('./p[1]/text()')[0]
    answer_song = div_word[0].xpath('./p[2]/text()')[0]
    song_words = div_word[0].xpath('./p[position()>2]//text()')
    # song_name = div_word[0].xpath('./h2[2]/text()')[0].strip('\r\n\t')
    song_words = ''.join(song_words[:-1]).strip('\r\n\t')

    with open(f'{OUTPUT_DIR}/{song_q}.txt', 'a', encoding='utf-8') as f:
        f.write(where_song + '\n' + question_song + '\n' + answer_song + '\n\n' + song_words)


async def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)  # open() fails if the folder is missing
    # Reuse one session for all requests instead of opening a new one each time
    async with aiohttp.ClientSession(headers=headers, connector=TCPConnector(ssl=False)) as session:
        for n in range(1, 47):  # index pages 1 to 46
            page_url = f'https://www.91ge.cn/lxyyplay/find/list_16_{n}.html'
            urls = await get_song_url(session, page_url)
            # Fetch all song pages linked from this index page concurrently
            await asyncio.gather(*(get_song_word(session, url) for url in urls))


if __name__ == '__main__':
    t1 = time.time()
    asyncio.run(main())
    print(f'Elapsed: {time.time() - t1:.2f} s')
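One caveat with the code above: the text of the page's `h1` is used directly as a file name, so a title containing characters such as `/` or `?` could make `open()` fail or write to an unexpected path. The following is a minimal sketch of one way to guard against that. The helper `safe_filename` and its replacement rules are assumptions for illustration, not part of the original script.

```python
import re

# Characters that are invalid or awkward in file names on common systems.
_UNSAFE = re.compile(r'[\\/:*?"<>|\r\n\t]+')


def safe_filename(title: str, max_len: int = 100) -> str:
    """Turn a page title into a string that is safe to use as a file name."""
    # Replace each run of unsafe characters with one underscore,
    # then trim spaces, dots, and underscores from both ends.
    cleaned = _UNSAFE.sub("_", title).strip(" ._")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_len] or "untitled"


if __name__ == "__main__":
    print(safe_filename("a/b?c"))
```

With such a helper, the file path in the script could be built from `safe_filename(song_q)` rather than from `song_q` itself.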

The output of running the scraper is shown below:

Screenshot 2024-01-21 at 20.08.09

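Because the song pages linked from each index page are fetched concurrently with coroutines, the scraper spends little time idly waiting on the network, which makes it efficient.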
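To illustrate why coroutines help here, the sketch below compares awaiting simulated requests one at a time against running them together with `asyncio.gather`. It makes no network calls: `fake_fetch` is a hypothetical stand-in that only sleeps, and the `limit` parameter with its semaphore is an added assumption showing how the number of in-flight requests could be capped to go easier on the target site.

```python
import asyncio
import time


async def fake_fetch(i: int, delay: float = 0.1) -> int:
    """Stand-in for a network request: wait, then return a value."""
    await asyncio.sleep(delay)
    return i


async def fetch_sequentially(n: int) -> list[int]:
    # Each request starts only after the previous one has finished.
    return [await fake_fetch(i) for i in range(n)]


async def fetch_concurrently(n: int, limit: int = 10) -> list[int]:
    sem = asyncio.Semaphore(limit)  # cap the number of requests in flight

    async def bounded(i: int) -> int:
        async with sem:
            return await fake_fetch(i)

    # gather runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(bounded(i) for i in range(n)))


def timed(coro) -> tuple[list[int], float]:
    """Run a coroutine to completion and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, t_seq = timed(fetch_sequentially(5))
    _, t_con = timed(fetch_concurrently(5))
    print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

The sequential version takes roughly the sum of all delays, while the concurrent version takes about as long as the slowest single request, provided the limit is not reached.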
