SkyPile-150B Data Download URLs

https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0011.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0012.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0013.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0014.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0015.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0016.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_head_0017.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0011.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0012.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0013.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0014.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0015.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0016.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0017.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0018.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0019.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0020.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-40_zh_middle_0021.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_head_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0011.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0012.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-45_zh_middle_0013.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_head_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0000.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0001.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0002.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0003.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0004.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0005.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0006.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0007.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0008.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0009.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0010.jsonl
https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data/2020-50_zh_middle_0011.jsonl
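
The URLs above follow a regular pattern (snapshot, `head` or `middle` bucket, four-digit shard index), so the list can be regenerated and fetched programmatically. The following is a minimal sketch, not an official downloader: the names `SHARD_COUNTS`, `shard_url`, `all_urls`, and `download` are illustrative, and the shard counts are simply read off the list above.

```python
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://hf-mirror.com/datasets/Skywork/SkyPile-150B/resolve/main/data"

# Number of shards per (snapshot, bucket), counted from the URL list above.
SHARD_COUNTS = {
    ("2020-40", "head"): 18,
    ("2020-40", "middle"): 22,
    ("2020-45", "head"): 11,
    ("2020-45", "middle"): 14,
    ("2020-50", "head"): 10,
    ("2020-50", "middle"): 12,
}


def shard_url(snapshot: str, bucket: str, index: int) -> str:
    """Build the URL of a single shard, e.g. ('2020-40', 'head', 0)."""
    return f"{BASE_URL}/{snapshot}_zh_{bucket}_{index:04d}.jsonl"


def all_urls() -> list:
    """Return every URL in the list above, in the same order."""
    return [
        shard_url(snapshot, bucket, i)
        for (snapshot, bucket), count in SHARD_COUNTS.items()
        for i in range(count)
    ]


def download(url: str, dest_dir: Path) -> Path:
    """Stream one shard to dest_dir, skipping files that already exist."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        return target
    # Write to a temporary ".part" file so an interrupted download
    # is never mistaken for a complete shard.
    tmp = target.with_suffix(target.suffix + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

Calling `download(url, Path("skypile"))` for each entry of `all_urls()` would fetch the shards into a local `skypile` directory (a hypothetical path). The shards are large, so downloading a single file first is a reasonable way to check connectivity and disk space.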



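Dataset Summary
SkyPile-150B is a comprehensive, large-scale Chinese-language dataset built specifically for pre-training large language models. It is drawn from a broad range of publicly accessible Chinese web pages. To ensure quality, we applied rigorous filtering, extensive deduplication, and thorough sensitive-data filtering. We also used tools such as fastText and BERT to filter out low-quality data.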

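The publicly released portion of SkyPile-150B contains approximately 233 million unique web pages, each averaging more than 1,000 Chinese characters. In total, the dataset comprises roughly 150 billion tokens and 620 GB of plain-text data.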

Language
The SkyPile-150B dataset consists exclusively of Chinese-language data.

Data Field Description
text: the processed and cleaned text extracted from each page.
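
Since each shard is a `.jsonl` file, a downloaded shard can be read one record at a time. The sketch below assumes that every line is a JSON object and that the field described above is stored under the key `text`; the function name `iter_texts` is illustrative.

```python
import json
from typing import Iterator


def iter_texts(path: str) -> Iterator[str]:
    """Yield the text of each record in a JSON Lines file, one at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: the cleaned page text lives under the "text" key.
            yield record["text"]
```

Reading lazily like this keeps memory use flat, which matters because a single shard can be far larger than available RAM.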
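
Dataset Safety
We used more than 2 million rules and a BERT-based model to identify sensitive data in the dataset, and then removed every harmful entry we detected.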

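Sensitive Information and Bias
Despite our best efforts, SkyPile-150B is built from publicly available web pages and may therefore contain sensitive information such as email addresses, phone numbers, or IP addresses. We have tried to minimize such content through deduplication and low-quality filtering, but users of SkyPile-150B should remain vigilant.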
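
As one way of staying vigilant, a user could run a simple pattern scan over each record before using it. The sketch below is not the authors' filtering pipeline; it is a rough illustration with hand-written regular expressions, and the names `find_sensitive`, `EMAIL_RE`, `IPV4_RE`, and `CN_MOBILE_RE` are hypothetical. The phone pattern only assumes 11-digit mainland China mobile numbers.

```python
import re

# Rough patterns for illustration only; they will miss some cases
# and flag some false positives.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
# Digit lookarounds are used instead of \b because Chinese characters
# count as word characters, so \b would not fire next to them.
IPV4_RE = re.compile(rf"(?<!\d)(?:{_OCTET}\.){{3}}{_OCTET}(?!\d)")
# Assumption: 11-digit mainland China mobile numbers starting with 13-19.
CN_MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def find_sensitive(text: str) -> dict:
    """Return candidate email addresses, IPv4 addresses, and mobile numbers."""
    return {
        "email": EMAIL_RE.findall(text),
        "ipv4": IPV4_RE.findall(text),
        "cn_mobile": CN_MOBILE_RE.findall(text),
    }
```

Records with non-empty results could then be reviewed, masked, or dropped, depending on the downstream use.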
