LLM Fine-tuning: The Data Part

Data Loading

Users in mainland China are advised to download datasets from https://modelscope.cn/datasets. After downloading, however, the result does not plug straight into the Hugging Face datasets library; instead it raises an error:

  • AttributeError: 'MsDataset' object has no attribute 'column_names'

So we can keep downloading from ModelScope, but convert the result into the form the datasets library expects, and in the process get a better feel for the whole data pipeline.

The simplest fix, though, is:

 dataset = MsDataset.load()  # downloaded from the ModelScope hub
 train_dataset = dataset.to_hf_dataset()  # convert to a Hugging Face Dataset
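If `to_hf_dataset()` is not available in your modelscope version, the conversion can also be done by hand: iterate over the records and rebuild the columns yourself, which is essentially what the method does. A minimal sketch (the record contents below are made-up placeholders, and the final `datasets` call is left commented so the snippet runs standalone):

```python
# Convert row-oriented records (as yielded by iterating an MsDataset)
# into the column-oriented dict that datasets.Dataset.from_dict expects.
def rows_to_columns(rows):
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

rows = [
    {"sentence": "q1", "positive": "p1", "negative": "n1"},
    {"sentence": "q2", "positive": "p2", "negative": "n2"},
]
columns = rows_to_columns(rows)
# columns == {"sentence": ["q1", "q2"], "positive": ["p1", "p2"], "negative": ["n1", "n2"]}

# With the datasets library installed:
# from datasets import Dataset
# train_dataset = Dataset.from_dict(columns)
```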

For reference, these projects handle the same loading step:

  • https://github.com/modelscope/modelscope/blob/a903ec7a898f5dfb44349e2ce15971ec5f08e528/examples/pytorch/llm/utils/dataset.py#L34
  • https://github.com/hiyouga/LLaMA-Factory/blob/6c94305e4746c9a735ff62a6428e295d1a67da52/src/llmtuner/data/loader.py#L83

Several approaches

train_dataset = load_dataset(args.dataset_name, split="train[:1024]")  # note: load_from_disk takes no split argument; load_dataset supports split slicing

def preprocess_function(examples):
    # Tokenize the queries, reserving one position for the EOS token appended below
    queries = examples["sentence"]
    queries = get_detailed_instruct(task, queries)
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

    result = {f"sentence_{k}": v for k, v in batch_dict.items()}

    # Same treatment for the positive passages
    queries = examples["positive"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

    for k, v in batch_dict.items():
        result[f"positive_{k}"] = v

    # ... and for the negative passages
    queries = examples["negative"]
    batch_dict = tokenizer(queries, max_length=args.max_length - 1, return_attention_mask=False, padding=False, truncation=True)
    batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
    batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

    for k, v in batch_dict.items():
        result[f"negative_{k}"] = v

    # One placeholder label per example
    result["labels"] = [0] * len(examples["sentence"])
    return result

processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
    desc="Running tokenizer on dataset",
)
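The tokenizer calls above follow one pattern: truncate to max_length - 1 so the EOS token appended afterwards never pushes a sequence past max_length, then pad the batch to a uniform length. A toy illustration of that pattern, with made-up token IDs standing in for a real tokenizer:

```python
# Toy stand-in for the tokenize -> append EOS -> pad sequence above.
# The token IDs, EOS id, and pad id are invented for the example.
MAX_LENGTH = 5
EOS_ID = 2
PAD_ID = 0

def encode(seqs):
    # stand-in for tokenizer(..., max_length=MAX_LENGTH - 1, truncation=True)
    truncated = [ids[: MAX_LENGTH - 1] for ids in seqs]
    # mirror the list comprehension that appends tokenizer.eos_token_id
    with_eos = [ids + [EOS_ID] for ids in truncated]
    # stand-in for tokenizer.pad(..., padding=True)
    longest = max(len(ids) for ids in with_eos)
    return [ids + [PAD_ID] * (longest - len(ids)) for ids in with_eos]

batch = encode([[7, 8, 9, 10, 11, 12], [5]])
# first sequence is truncated to 4 ids before EOS; second is padded:
# batch == [[7, 8, 9, 10, 2], [5, 2, 0, 0, 0]]
```

No padded sequence ever exceeds MAX_LENGTH, which is exactly why the real code passes `max_length=args.max_length - 1` rather than `args.max_length`.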
    

Data Construction

Data Cleaning
