Word Frequency Counting with AlgorithmStar and Spark

2024-01-04 13:44:01

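Data analysis case study

A case study in word-frequency analysis, implemented once with AS (AlgorithmStar, a machine learning library) and once with Spark (a distributed computing framework).

Word frequency counting with AlgorithmStar

Compared with Spark, AlgorithmStar is the simpler framework to use. Its API is very concise, and calling it from Java is recommended. The implementation follows.

Code

Every step in the code below is commented, and the end of the program demonstrates two ways of printing the result.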

package com.zhao;

import zhao.algorithmMagic.algorithm.featureExtraction.WordFrequency;
import zhao.algorithmMagic.core.AlgorithmStar;
import zhao.algorithmMagic.operands.matrix.ColumnIntegerMatrix;

/**
 * @author 赵凌宇
 */
public class MAIN {
    public static void main(String[] args) {
        // Prepare a sentence
        final String data = "The next step is to demonstrate some operations related to word frequency statistics. You will learn about the relevant operations of word frequency statistics in different frameworks";
        // Get the AlgorithmStar portal class. The second type parameter is the result type of the computation component.
        // Word-frequency extraction returns a ColumnIntegerMatrix,
        // so the second type parameter must be ColumnIntegerMatrix.
        final AlgorithmStar<Object, ColumnIntegerMatrix> instance = AlgorithmStar.getInstance();
        // Convert the sentence into a word-frequency vector
        final ColumnIntegerMatrix wordCount = instance.extract(WordFrequency.getInstance("wordCount"), data);
        // Inspect the word-frequency vector; this prints the matrix in its default format
        System.out.println(wordCount);
        // Read the values out. The result is a ColumnIntegerMatrix with row and column names and a single column of values.
        // Each row name is a counted word.
        // The column name is the sentence that was counted.
        // First fetch the row names
        final String[] rowFieldNames = wordCount.getRowFieldNames();
        for (int i = 0; i < wordCount.getRowCount(); i++) {
            // Then, inside the loop, use the index to pair the first element of row i (the count) with its row name (the word).
            // The printed labels mean "occurrences" (出现次数) and "counted word" (被统计单词).
            System.out.println("出现次数：" + wordCount.get(i, 0) + "\t被统计单词：" + rowFieldNames[i]);
        }
    }
}

Code output

The result is printed in two ways, and both are easy to read.

------------IntegerMatrixStart-----------
The next step is to demonstrate some operations related to word frequency statistics. You will learn about the relevant operations of word frequency statistics in different frameworks	rowColName
[1]	next
[1]	some
[1]	will
[1]	learn
[1]	in
[1]	frameworks
[1]	about
[1]	is
[2]	frequency
[1]	The
[1]	the
[1]	relevant
[2]	operations
[1]	related
[1]	of
[1]	step
[2]	to
[1]	demonstrate
[1]	different
[2]	word
[1]	You
[2]	statistics
------------IntegerMatrixEnd------------

出现次数：1	被统计单词：next
出现次数：1	被统计单词：some
出现次数：1	被统计单词：will
出现次数：1	被统计单词：learn
出现次数：1	被统计单词：in
出现次数：1	被统计单词：frameworks
出现次数：1	被统计单词：about
出现次数：1	被统计单词：is
出现次数：2	被统计单词：frequency
出现次数：1	被统计单词：The
出现次数：1	被统计单词：the
出现次数：1	被统计单词：relevant
出现次数：2	被统计单词：operations
出现次数：1	被统计单词：related
出现次数：1	被统计单词：of
出现次数：1	被统计单词：step
出现次数：2	被统计单词：to
出现次数：1	被统计单词：demonstrate
出现次数：1	被统计单词：different
出现次数：2	被统计单词：word
出现次数：1	被统计单词：You
出现次数：2	被统计单词：statistics

Process finished with exit code 0
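
To cross-check the counts above without the library, here is a minimal plain-Java sketch. It is not how AlgorithmStar computes the result internally; the class name `PlainWordCount` and the choice to split on any run of non-letter characters are assumptions made for illustration. For this sentence that tokenization happens to give the same case-sensitive counts as the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of AlgorithmStar.
class PlainWordCount {

    // Counts case-sensitive word occurrences.
    // Assumption: any run of non-letter characters separates two words.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue; // split yields a leading empty string if the text starts with a separator
            }
            // Insert 1 for a new word, otherwise add 1 to the existing count
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String data = "The next step is to demonstrate some operations related to word frequency statistics. "
                + "You will learn about the relevant operations of word frequency statistics in different frameworks";
        // Print one "count<TAB>word" line per distinct word, in order of first appearance
        count(data).forEach((word, n) -> System.out.println(n + "\t" + word));
    }
}
```

Because the comparison is case-sensitive, `The` and `the` stay separate keys, just as they do in the matrix printed above.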

Word frequency counting with Spark

Code

package run

import org.apache.spark.{SparkConf, SparkContext}

object MAIN3 {

  def main(args: Array[String]): Unit = {
    // Prepare a sentence
    val data = "The next step is to demonstrate some operations related to word frequency statistics. You will learn about the relevant operations of word frequency statistics in different frameworks"
    // Build the Spark entry point
    val sparkContext = new SparkContext(
      new SparkConf()
        // Run the Spark computation locally
        .setMaster("local[*]")
        // Name the Spark job
        .setAppName("wordCount")
    )
    // Load the sentence into an RDD
    val value = sparkContext.makeRDD(
      // Split the sentence on spaces
      data.split(" ")
    )
    // Start counting the split words
    val tempRes = value.mapPartitions(iter => {
      // iter holds the words in the current partition
      for (word <- iter) yield {
        // Iterate over each word in iter
        // and tag it with an occurrence count of 1
        (word, 1)
      }
    }
    ).groupByKey()
    // At this point the words are grouped: each word is a key, and each key maps to a container holding one 1 per occurrence of that word.
    // For example, the sentence contains operations twice, so the key operations maps to a container with two 1s.
    // The sentence contains is once, so the key is maps to a container with a single 1.
    // The size of each container is therefore the number of times its key occurs.
    tempRes.map(wc => wc._1 -> wc._2.size).foreach(
      // Print each result; the labels mean "occurrences" (出现次数) and "counted word" (被统计单词)
      wc => println("出现次数：" + wc._2 + "\t被统计单词：" + wc._1)
    )
    // Release Spark resources once the job is done
    sparkContext.stop()
  }
}
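
The grouping idea described in the comments above (collect one 1 per occurrence, then take the size of each group) can be mimicked in plain Java without Spark. The sketch below is only an analogy using Java streams; the class and method names are hypothetical. It contrasts the group-then-size approach with summing the 1s as they arrive, which is the pattern Spark's `reduceByKey` follows and which is generally preferred for aggregations on large data because values are combined before the shuffle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration class; it does not use Spark.
class GroupingSketch {

    // Mirrors the tutorial's approach: keep a list of 1s per word, then use the list size as the count.
    static Map<String, Integer> groupThenSize(String[] words) {
        Map<String, List<Integer>> grouped = Arrays.stream(words)
                .collect(Collectors.groupingBy(w -> w, Collectors.mapping(w -> 1, Collectors.toList())));
        return grouped.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().size()));
    }

    // Alternative: add the 1s together as they arrive, so no per-word list is ever built.
    static Map<String, Integer> sumAsYouGo(String[] words) {
        return Arrays.stream(words)
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        String[] words = "to be or not to be".split(" ");
        System.out.println(groupThenSize(words));
        System.out.println(sumAsYouGo(words));
    }
}
```

Both methods produce the same counts; they differ only in how the counts are accumulated.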

Code output

出现次数：1	被统计单词：related
出现次数：1	被统计单词：statistics.
出现次数：2	被统计单词：operations
出现次数：1	被统计单词：relevant
出现次数：1	被统计单词：demonstrate
出现次数：1	被统计单词：next
出现次数：1	被统计单词：The
出现次数：1	被统计单词：learn
出现次数：1	被统计单词：in
出现次数：2	被统计单词：word
出现次数：1	被统计单词：the
出现次数：1	被统计单词：is
出现次数：1	被统计单词：different
出现次数：1	被统计单词：You
出现次数：1	被统计单词：some
出现次数：1	被统计单词：step
出现次数：1	被统计单词：of
出现次数：2	被统计单词：frequency
出现次数：2	被统计单词：to
出现次数：1	被统计单词：about
出现次数：1	被统计单词：frameworks
出现次数：1	被统计单词：will
出现次数：1	被统计单词：statistics
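
Note that the output above lists `statistics.` and `statistics` as separate keys: splitting on spaces leaves the trailing period attached to the first occurrence. One possible fix is to trim punctuation from each token before counting. The plain-Java sketch below illustrates that normalization step; the class name, method names, and the `\p{Punct}` trimming rule are assumptions for illustration, and the same idea would need to be ported into the Spark job's mapping step.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it does not use Spark.
class PunctuationTrim {

    // Removes punctuation from both ends of a token, e.g. "statistics." becomes "statistics".
    static String trimPunctuation(String token) {
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    // Splits on single spaces (as the Spark job does) and optionally normalizes each token before counting.
    static Map<String, Integer> count(String text, boolean normalize) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split(" ")) {
            String word = normalize ? trimPunctuation(token) : token;
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String data = "The next step is to demonstrate some operations related to word frequency statistics. "
                + "You will learn about the relevant operations of word frequency statistics in different frameworks";
        System.out.println("distinct keys without trimming: " + count(data, false).size());
        System.out.println("distinct keys with trimming: " + count(data, true).size());
    }
}
```

Without trimming, the sentence yields 23 distinct keys, matching the 23 lines of Spark output; with trimming it yields 22, and `statistics` is counted twice, as in the AlgorithmStar result.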

Reference article: http://www.lingyuzhao.top/?/linkController=/articleController&link=-22966947

Original article: https://blog.csdn.net/Liming07/article/details/135360459