Startup Analysis of the TaskManager, the Worker Node of a Flink Cluster

1. Overview

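  The TaskManager is the worker process of a Flink cluster. It performs the actual computation of a dataflow and is therefore also called the "Worker". A Flink cluster must have at least one TaskManager, and each TaskManager contains a certain number of task slots. A slot is the smallest unit of resource scheduling, and the number of slots limits how many tasks a TaskManager can process in parallel.

  After startup, the TaskManager registers its slots with the ResourceManager. On instruction from the ResourceManager, the TaskManager offers one or more slots to a JobMaster, which can then assign tasks to them for execution.

  While a job is running, a TaskManager can buffer data and exchange data with other TaskManagers that run the same application.

  The TaskManager is a logical abstraction that represents one server. Starting that server necessarily brings up a number of services. In addition, the TaskManager contains a TaskExecutor, which does the real core work on its behalf: submitting tasks for execution and requesting and releasing slots.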
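
  To make the role of slots concrete, here is a minimal sketch of slot bookkeeping. The class SimpleSlotTable and its methods are hypothetical names invented for this illustration; this is not Flink's TaskSlotTable, only a toy model of "a fixed number of slots limits how many tasks can run at once".

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified slot bookkeeping; not Flink's TaskSlotTable. */
class SimpleSlotTable {
    private final int numberOfSlots;
    // slot index -> id of the job that currently owns the slot
    private final Map<Integer, String> allocated = new HashMap<>();

    SimpleSlotTable(int numberOfSlots) {
        if (numberOfSlots <= 0) {
            throw new IllegalArgumentException("The number of slots has to be larger than 0.");
        }
        this.numberOfSlots = numberOfSlots;
    }

    /** Allocates the first free slot for the job, or returns -1 if every slot is taken. */
    int allocate(String jobId) {
        for (int i = 0; i < numberOfSlots; i++) {
            if (!allocated.containsKey(i)) {
                allocated.put(i, jobId);
                return i;
            }
        }
        return -1;
    }

    /** Frees a slot; returns true if the slot had been allocated. */
    boolean free(int slotIndex) {
        return allocated.remove(slotIndex) != null;
    }

    int freeSlots() {
        return numberOfSlots - allocated.size();
    }

    public static void main(String[] args) {
        SimpleSlotTable table = new SimpleSlotTable(2);
        System.out.println(table.allocate("job-a")); // 0
        System.out.println(table.allocate("job-a")); // 1
        System.out.println(table.allocate("job-b")); // -1: both slots are taken
        table.free(0);
        System.out.println(table.allocate("job-b")); // 0
    }
}
```

  With two slots, the third allocation fails until a slot is freed, which is the sense in which the slot count caps parallelism.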

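2. TaskManager Startup

  The TaskManager is mainly responsible for managing the slot resources of its own machine and for executing the actual tasks. According to the cluster startup scripts, the main class that starts a TaskManager is TaskManagerRunner.

2.1 Startup entry point (main)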

public static void main(String[] args) throws Exception {
		// startup checks and logging: information printed when the worker node starts
		EnvironmentInformation.logEnvironmentInfo(LOG, "TaskManager", args);
		SignalHandler.register(LOG);
		JvmShutdownSafeguard.installAsShutdownHook(LOG);
		long maxOpenFileHandles = EnvironmentInformation.getOpenFileHandlesLimit();
		if(maxOpenFileHandles != -1L) {
			LOG.info("Maximum number of open file descriptors is {}.", maxOpenFileHandles);
		} else {
			LOG.info("Cannot determine the maximum number of open file descriptors");
		}
		// Note: startup entry point
		runTaskManagerSecurely(args, ResourceID.generate());
	}

Note: ResourceID: when a Flink cluster starts, the master node and every worker node each generate a globally unique ID.
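
As a rough illustration of what "generate a unique ID per process" can look like, here is a small sketch. SimpleResourceId is a hypothetical class, and the use of java.util.UUID is an assumption made for this example; it is not a claim about how Flink's ResourceID.generate() is implemented.

```java
import java.util.UUID;

/** Hypothetical stand-in for a per-process resource identifier. */
final class SimpleResourceId {
    private final String id;

    private SimpleResourceId(String id) {
        this.id = id;
    }

    /** Generates a random identifier; using a UUID is an assumption of this sketch. */
    static SimpleResourceId generate() {
        return new SimpleResourceId(UUID.randomUUID().toString().replace("-", ""));
    }

    String asString() {
        return id;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof SimpleResourceId && ((SimpleResourceId) other).id.equals(id);
    }

    @Override
    public int hashCode() {
        return id.hashCode();
    }

    public static void main(String[] args) {
        // Two processes calling generate() independently get different ids
        // (with overwhelming probability, since the ids are random).
        System.out.println(SimpleResourceId.generate().asString());
        System.out.println(SimpleResourceId.generate().asString());
    }
}
```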

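2.2 runTaskManagerSecurely (entry point)

  • 1. Load the parameters (parse the arguments of main plus the configuration file)
  • 2. Start the TaskManager
    • 2.1 Initialize the plugin service and the file system service (basic services)
    • 2.2 Start the TaskManager through a Callable that runs inside the installed security context
      • 2.2.1 Build the TaskManagerRunner instance
        • ① Initialize a number of services (offered to other components)
        • ② Initialize the Executor
      • 2.2.2 Call start(), which sends a START message to the TaskExecutor endpoint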
public static void runTaskManagerSecurely(String[] args, ResourceID resourceID) {
			try {
				// Note: load the configuration (arguments passed in by the shell script + the flink-conf.yaml file)
				Configuration configuration = loadConfiguration(args);

				// Note: start the TaskManager
				runTaskManagerSecurely(configuration, resourceID);
	
			} catch(Throwable t) {
				final Throwable strippedThrowable = ExceptionUtils.stripException(t, UndeclaredThrowableException.class);
				LOG.error("TaskManager initialization failed.", strippedThrowable);
				System.exit(STARTUP_FAILURE_RETURN_CODE);
			}
		}

// --> runTaskManagerSecurely(configuration, resourceID);
// 1. Initialize the plugin service and the file system
// 2. Start the TaskManager through a Callable that runs inside the installed security context
public static void runTaskManagerSecurely(Configuration configuration, ResourceID resourceID) throws Exception {
		replaceGracefulExitWithHaltIfConfigured(configuration);
		/*************************************************
		 *  Note: initialize the plugins
		 */
		final PluginManager pluginManager = PluginUtils.createPluginManagerFromRootFolder(configuration);

		// Note: initialize the file system
		FileSystem.initialize(configuration, pluginManager);
		SecurityUtils.install(new SecurityConfiguration(configuration));

		/*************************************************
		 *  Note: wrap the startup in the security context
		 */
		SecurityUtils.getInstalledContext().runSecured(

			// Note: the TaskManager is started from inside this Callable
			() -> {
				runTaskManager(configuration, resourceID, pluginManager);
				return null;
			});
	}

// --> runTaskManager(configuration, resourceID, pluginManager);
public static void runTaskManager(Configuration configuration, ResourceID resourceId, PluginManager pluginManager) throws Exception {
		/*************************************************
		 *  Note: build the TaskManagerRunner instance.
		 *  TaskManagerRunner is the executable entry point of the TaskManager in standalone mode.
		 *  It constructs the related components (network, I/O manager, memory manager, RPC service, HA service) and starts them.
		 */
		final TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, resourceId, pluginManager);

		/*************************************************
		 *  Note: start() sends a START message to the TaskExecutor RPC endpoint, which triggers its onStart()
		 */
		taskManagerRunner.start();
	}
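
The runSecured(...) call above takes a Callable and lets the security context decide how to run it. The sketch below illustrates that pattern only. SimpleSecurityContext, NoOpContext and SecuredStartupDemo are hypothetical names, not Flink classes, and the no-op behaviour (call the Callable directly on the calling thread) is a property of this sketch.

```java
import java.util.concurrent.Callable;

class SecuredStartupDemo {
    /** Runs the (pretend) worker startup inside the given context and reports where it ran. */
    static String startWorker(SimpleSecurityContext context, String resourceId) {
        try {
            return context.runSecured(() -> {
                // In the real code, this is the place where runTaskManager(...) is invoked.
                return "started " + resourceId + " on " + Thread.currentThread().getName();
            });
        } catch (Exception e) {
            throw new RuntimeException("Worker startup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startWorker(new NoOpContext(), "tm-1"));
    }
}

/** Hypothetical security context: runs an action with whatever credentials it manages. */
interface SimpleSecurityContext {
    <T> T runSecured(Callable<T> action) throws Exception;
}

/** A context with nothing to secure: it simply invokes the callable on the calling thread. */
class NoOpContext implements SimpleSecurityContext {
    @Override
    public <T> T runSecured(Callable<T> action) throws Exception {
        return action.call();
    }
}
```

Passing the startup as a Callable keeps the startup logic independent of how (and under which identity) the context chooses to execute it.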

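2.3 Instantiating the TaskManagerRunner object

1. Initialize the services:

  • Thread pool: handles asynchronous callbacks (asynchronous programming: future.xxx(() -> xxxxx(), executor))
  • HA service: ZooKeeperHaServices (when the HA parameter in flink-conf.yaml is set to ZooKeeper)
  • RPC service: the RpcServer is created as a proxy object
  • Heartbeat service: heartbeats between ResourceManager and TaskManager, governed by two key parameters: a 10 s interval and a 50 s timeout
  • Blob service: internally consists of two scheduled tasks that periodically check for and delete the resource files of expired jobs. Reference counting is used to decide whether a file has expired. The two caches are PermanentBlobCache and TransientBlobCache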
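
The reference-counting idea behind the Blob cleanup can be sketched as follows. RefCountedJobCache is a hypothetical class written for this article, not Flink's PermanentBlobCache. The retention period and the explicit "now" parameter are assumptions that keep the example deterministic; in a real service, cleanup(...) would be called by a scheduled task.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical reference-counted cache of per-job files; not Flink's PermanentBlobCache. */
class RefCountedJobCache {
    private static final class RefCount {
        int references;
        long keepUntilMillis = Long.MAX_VALUE; // only meaningful once references == 0
    }

    private final long retentionMillis;
    private final Map<String, RefCount> jobs = new HashMap<>();

    RefCountedJobCache(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** A task of the job starts using the cached files. */
    void registerJob(String jobId) {
        RefCount ref = jobs.computeIfAbsent(jobId, k -> new RefCount());
        ref.references++;
        ref.keepUntilMillis = Long.MAX_VALUE;
    }

    /** A task of the job stops using the files; the last release starts the retention clock. */
    void releaseJob(String jobId, long nowMillis) {
        RefCount ref = jobs.get(jobId);
        if (ref == null || ref.references == 0) {
            return;
        }
        ref.references--;
        if (ref.references == 0) {
            ref.keepUntilMillis = nowMillis + retentionMillis;
        }
    }

    /** What the periodic cleanup task would run; returns the number of jobs removed. */
    int cleanup(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, RefCount>> it = jobs.entrySet().iterator();
        while (it.hasNext()) {
            RefCount ref = it.next().getValue();
            if (ref.references == 0 && nowMillis >= ref.keepUntilMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean isCached(String jobId) {
        return jobs.containsKey(jobId);
    }

    public static void main(String[] args) {
        RefCountedJobCache cache = new RefCountedJobCache(1_000);
        cache.registerJob("job-1");
        cache.releaseJob("job-1", 0);
        System.out.println(cache.cleanup(500));   // 0: still inside the retention period
        System.out.println(cache.cleanup(1_000)); // 1: unreferenced and expired
    }
}
```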

2. Start the TaskManager

  • Starts the TaskExecutor, which is responsible for running multiple tasks
public TaskManagerRunner(Configuration configuration, ResourceID resourceId, PluginManager pluginManager) throws Exception {
		this.configuration = checkNotNull(configuration);
		this.resourceId = checkNotNull(resourceId);
		timeout = AkkaUtils.getTimeoutAsTime(configuration);

		// Note: initialize the thread pool that handles callbacks
		this.executor = java.util.concurrent.Executors
			.newScheduledThreadPool(Hardware.getNumberCPUCores(), new ExecutorThreadFactory("taskmanager-future"));

		/*************************************************
		 *  Note: HA service: ZooKeeperHaServices
		 *  Gives access to all services needed for high availability: registration, distributed counters and leader election
		 */
		highAvailabilityServices = HighAvailabilityServicesUtils
			.createHighAvailabilityServices(configuration, executor, HighAvailabilityServicesUtils.AddressResolution.NO_ADDRESS_RESOLUTION);

		// Note: initialize the RpcService
		rpcService = createRpcService(configuration, highAvailabilityServices);

		// Note: initialize the HeartbeatServices
		HeartbeatServices heartbeatServices = HeartbeatServices.fromConfiguration(configuration);

		metricRegistry = new MetricRegistryImpl(MetricRegistryConfiguration.fromConfiguration(configuration),
			ReporterSetup.fromConfiguration(configuration, pluginManager));
		final RpcService metricQueryServiceRpcService = MetricUtils.startRemoteMetricsRpcService(configuration, rpcService.getAddress());
		metricRegistry.startQueryService(metricQueryServiceRpcService, resourceId);

		// Note: initialize the BlobCacheService
		blobCacheService = new BlobCacheService(configuration, highAvailabilityServices.createBlobStore(), null);

		// Note: provides information about external resources
		final ExternalResourceInfoProvider externalResourceInfoProvider = ExternalResourceUtils
			.createStaticExternalResourceInfoProvider(ExternalResourceUtils.getExternalResourceAmountMap(configuration),
				ExternalResourceUtils.externalResourceDriversFromConfig(configuration, pluginManager));

		/*************************************************
		 *  Note: start the TaskManager.
		 *  Creates the TaskExecutor, which is responsible for running multiple tasks.
		 */
		taskManager = startTaskManager(this.configuration, this.resourceId, rpcService, highAvailabilityServices, heartbeatServices, metricRegistry,
			blobCacheService, false, externalResourceInfoProvider, this);

		this.terminationFuture = new CompletableFuture<>();
		this.shutdown = false;

		MemoryLogger.startIfConfigured(LOG, configuration, terminationFuture);
	}
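
The "future.xxx(() -> xxxxx(), executor)" style mentioned for the callback thread pool can be shown with plain JDK classes. The sketch below is an assumption-laden illustration: FutureCallbackDemo and the "4 slots" value are made up, and only the standard CompletableFuture and Executors APIs are used.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

class FutureCallbackDemo {
    /** Runs an async stage and its callback on the given executor and waits for the result. */
    static String registerAndDescribe(ScheduledExecutorService executor) {
        // Pretend that some asynchronous registration reports 4 slots.
        CompletableFuture<Integer> registration = CompletableFuture.supplyAsync(() -> 4, executor);
        // The callback is explicitly handed to the same pool instead of the common pool.
        CompletableFuture<String> described =
            registration.thenApplyAsync(slots -> "registered " + slots + " slots", executor);
        return described.join();
    }

    /** Creates a pool sized to the number of CPU cores, runs the demo once and shuts the pool down. */
    static String runOnce() {
        ScheduledExecutorService executor =
            Executors.newScheduledThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            return registerAndDescribe(executor);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnce()); // registered 4 slots
    }
}
```

Handing the executor to every async stage makes it explicit which threads run the callbacks, which is the purpose of the dedicated callback pool created in the constructor above.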

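2.4 startTaskManager (starting the TaskManager)

  • 1. Obtain the resource specification: the resources of one physical node (CPU, memory, network)
  • 2. taskExecutorResourceSpec --> TaskManagerServicesConfiguration (the configuration is wrapped in a TaskManagerServicesConfiguration object)
  • 3. Initialize ioExecutor (the I/O thread pool)
  • 4. Build the TaskManagerServices object, which bundles the service components the TaskManager has to offer at runtime
    • 1. Initialize TaskEventDispatcher (dispatching)
    • 2. Initialize IOManagerAsync (moves data asynchronously)
    • 3. shuffleEnvironment = NettyShuffleEnvironment (for shuffles between upstream and downstream tasks)
    • 4. Initialize KvStateService (state service)
    • 5. Initialize BroadcastVariableManager (broadcast service)
    • 6. Initialize TaskSlotTable [interface --> TaskSlotTableImpl] (maintains the relationships among all TaskSlots, Tasks and Jobs on the TaskManager)
    • 7. Initialize the DefaultJobTable service (job information)
    • 8. Initialize the JobLeaderService (provides services for JobMaster startup)
  • 5. Return the TaskExecutor object (startTaskManager --> TaskExecutor), which internally builds two important heartbeat managers
    • JobManagerHeartbeatManager
    • ResourceManagerHeartbeatManager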
public static TaskExecutor startTaskManager(Configuration configuration, ResourceID resourceID, RpcService rpcService,
		HighAvailabilityServices highAvailabilityServices, HeartbeatServices heartbeatServices, MetricRegistry metricRegistry,
		BlobCacheService blobCacheService, boolean localCommunicationOnly, ExternalResourceInfoProvider externalResourceInfoProvider,
		FatalErrorHandler fatalErrorHandler) throws Exception {

		checkNotNull(configuration);
		checkNotNull(resourceID);
		checkNotNull(rpcService);
		checkNotNull(highAvailabilityServices);

		LOG.info("Starting TaskManager with ResourceID: {}", resourceID);

		String externalAddress = rpcService.getAddress();

		final TaskExecutorResourceSpec taskExecutorResourceSpec = TaskExecutorResourceUtils.resourceSpecFromConfig(configuration);

		// Note: TaskManagerServicesConfiguration
		TaskManagerServicesConfiguration taskManagerServicesConfiguration = TaskManagerServicesConfiguration.fromConfiguration(configuration, resourceID, externalAddress, localCommunicationOnly, taskExecutorResourceSpec);

		Tuple2<TaskManagerMetricGroup, MetricGroup> taskManagerMetricGroup = MetricUtils.instantiateTaskManagerMetricGroup(metricRegistry, externalAddress, resourceID,
			taskManagerServicesConfiguration.getSystemResourceMetricsProbingInterval());

		// Note: initialize ioExecutor
		final ExecutorService ioExecutor =
			newFixedThreadPool(taskManagerServicesConfiguration.getNumIoThreads(), new ExecutorThreadFactory("flink-taskexecutor-io"));

		// Note: taskManagerServices = TaskManagerServices
		TaskManagerServices taskManagerServices = TaskManagerServices
			.fromConfiguration(taskManagerServicesConfiguration, blobCacheService.getPermanentBlobService(), taskManagerMetricGroup.f1, ioExecutor,
				fatalErrorHandler);

		// Note: TaskManagerConfiguration
		TaskManagerConfiguration taskManagerConfiguration = TaskManagerConfiguration
			.fromConfiguration(configuration, taskExecutorResourceSpec, externalAddress);

		String metricQueryServiceAddress = metricRegistry.getMetricQueryServiceGatewayRpcAddress();

		/*************************************************
		 *  Note: create the TaskExecutor instance.
		 *  Two important heartbeat managers are created internally:
		 *  1. JobManagerHeartbeatManager
		 *  2. ResourceManagerHeartbeatManager
		 */
		return new TaskExecutor(rpcService, taskManagerConfiguration, highAvailabilityServices, taskManagerServices, externalResourceInfoProvider,
			heartbeatServices, taskManagerMetricGroup.f0, metricQueryServiceAddress, blobCacheService, fatalErrorHandler,
			new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
			createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));
	}
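
The I/O pool above is a fixed-size pool whose threads carry a recognizable name prefix. A minimal sketch of that idea with JDK classes only follows. NamedThreadFactory and IoExecutorDemo are hypothetical names, and the "-thread-N" suffix is this sketch's own convention, not necessarily the naming Flink's ExecutorThreadFactory produces.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

class IoExecutorDemo {
    /** Creates a fixed pool with named threads, runs one task and returns the worker's thread name. */
    static String nameOfWorkerThread(int numIoThreads) {
        ExecutorService io =
            Executors.newFixedThreadPool(numIoThreads, new NamedThreadFactory("flink-taskexecutor-io"));
        try {
            return io.submit(() -> Thread.currentThread().getName()).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(nameOfWorkerThread(2)); // flink-taskexecutor-io-thread-1
    }
}

/** Hypothetical thread factory that names its threads with a common prefix. */
class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(1);

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable runnable) {
        Thread thread = new Thread(runnable, prefix + "-thread-" + counter.getAndIncrement());
        thread.setDaemon(true);
        return thread;
    }
}
```

Named threads make it much easier to tell I/O work apart from callback or RPC threads in a thread dump.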

//--> TaskManagerServices fromConfiguration
public static TaskManagerServices fromConfiguration(TaskManagerServicesConfiguration taskManagerServicesConfiguration,
		PermanentBlobService permanentBlobService, MetricGroup taskManagerMetricGroup, ExecutorService ioExecutor,
		FatalErrorHandler fatalErrorHandler) throws Exception {

		// pre-start checks: verify the working (temp) directories
		checkTempDirs(taskManagerServicesConfiguration.getTmpDirPaths());

		// Note: initialize TaskEventDispatcher
		final TaskEventDispatcher taskEventDispatcher = new TaskEventDispatcher();

		// Note: initialize IOManagerAsync
		// start the I/O manager, it will create some temp directories.
		final IOManager ioManager = new IOManagerAsync(taskManagerServicesConfiguration.getTmpDirPaths());

		// Note: shuffleEnvironment = NettyShuffleEnvironment
		final ShuffleEnvironment<?, ?> shuffleEnvironment = createShuffleEnvironment(taskManagerServicesConfiguration, taskEventDispatcher,
			taskManagerMetricGroup, ioExecutor);
		final int listeningDataPort = shuffleEnvironment.start();

		// Note: initialize KvStateService
		final KvStateService kvStateService = KvStateService.fromConfiguration(taskManagerServicesConfiguration);
		kvStateService.start();

		final UnresolvedTaskManagerLocation unresolvedTaskManagerLocation = new UnresolvedTaskManagerLocation(
			taskManagerServicesConfiguration.getResourceID(), taskManagerServicesConfiguration.getExternalAddress(),
			// we expose the task manager location with the listening port
			// iff the external data port is not explicitly defined
			taskManagerServicesConfiguration.getExternalDataPort() > 0 ? taskManagerServicesConfiguration.getExternalDataPort() : listeningDataPort);

		// Note: initialize BroadcastVariableManager
		final BroadcastVariableManager broadcastVariableManager = new BroadcastVariableManager();

		// Note: initialize TaskSlotTable
		final TaskSlotTable<Task> taskSlotTable = createTaskSlotTable(taskManagerServicesConfiguration.getNumberOfSlots(),
			taskManagerServicesConfiguration.getTaskExecutorResourceSpec(), taskManagerServicesConfiguration.getTimerServiceShutdownTimeout(),
			taskManagerServicesConfiguration.getPageSize(), ioExecutor);

		// Note: initialize DefaultJobTable
		final JobTable jobTable = DefaultJobTable.create();

		// Note: initialize JobLeaderService
		final JobLeaderService jobLeaderService = new DefaultJobLeaderService(unresolvedTaskManagerLocation,
			taskManagerServicesConfiguration.getRetryingRegistrationConfiguration());

		final String[] stateRootDirectoryStrings = taskManagerServicesConfiguration.getLocalRecoveryStateRootDirectories();

		final File[] stateRootDirectoryFiles = new File[stateRootDirectoryStrings.length];

		for(int i = 0; i < stateRootDirectoryStrings.length; ++i) {
			stateRootDirectoryFiles[i] = new File(stateRootDirectoryStrings[i], LOCAL_STATE_SUB_DIRECTORY_ROOT);
		}

		// Note: initialize TaskExecutorLocalStateStoresManager
		final TaskExecutorLocalStateStoresManager taskStateManager = new TaskExecutorLocalStateStoresManager(
			taskManagerServicesConfiguration.isLocalRecoveryEnabled(), stateRootDirectoryFiles, ioExecutor);

		final boolean failOnJvmMetaspaceOomError = taskManagerServicesConfiguration.getConfiguration()
			.getBoolean(CoreOptions.FAIL_ON_USER_CLASS_LOADING_METASPACE_OOM);

		// Note: initialize LibraryCacheManager
		final LibraryCacheManager libraryCacheManager = new BlobLibraryCacheManager(permanentBlobService,
			BlobLibraryCacheManager.defaultClassLoaderFactory(taskManagerServicesConfiguration.getClassLoaderResolveOrder(),
				taskManagerServicesConfiguration.getAlwaysParentFirstLoaderPatterns(), failOnJvmMetaspaceOomError ? fatalErrorHandler : null));

		// Note: return the TaskManagerServices
		return new TaskManagerServices(unresolvedTaskManagerLocation, taskManagerServicesConfiguration.getManagedMemorySize().getBytes(), ioManager,
			shuffleEnvironment, kvStateService, broadcastVariableManager, taskSlotTable, jobTable, jobLeaderService, taskStateManager,
			taskEventDispatcher, ioExecutor, libraryCacheManager);
	}

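2.5 startTaskManager --> TaskExecutor (returning the TaskExecutor)

  A worker node is started by instantiating a TaskManagerRunner object, and the work then falls into two parts: 1. initializing the basic services (thread pool, HA, RPC, heartbeat service and Blob service); 2. starting the TaskManager. For the second part, the hardware configuration is wrapped in a TaskManagerServicesConfiguration object, the I/O thread pool is initialized, a TaskManagerServices object is built (it bundles the service components the TaskManager offers at runtime), and finally a TaskExecutor object is returned (it holds two heartbeat managers: JobManagerHeartbeatManager and ResourceManagerHeartbeatManager).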

//--> returning the TaskExecutor object
return new TaskExecutor(rpcService, taskManagerConfiguration, highAvailabilityServices, taskManagerServices, externalResourceInfoProvider,
	heartbeatServices, taskManagerMetricGroup.f0, metricQueryServiceAddress, blobCacheService, fatalErrorHandler,
	new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
	createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));

// TaskExecutor is an RpcEndpoint, so once this constructor has finished and the endpoint has been started, its onStart() method is executed
    public TaskExecutor(RpcService rpcService, TaskManagerConfiguration taskManagerConfiguration, HighAvailabilityServices haServices,
		TaskManagerServices taskExecutorServices, ExternalResourceInfoProvider externalResourceInfoProvider, HeartbeatServices heartbeatServices,
		TaskManagerMetricGroup taskManagerMetricGroup, @Nullable String metricQueryServiceAddress, BlobCacheService blobCacheService,
		FatalErrorHandler fatalErrorHandler, TaskExecutorPartitionTracker partitionTracker, BackPressureSampleService backPressureSampleService) {
		// creates a random name of the form prefix_X, where X is an increasing number
		super(rpcService, AkkaRpcServiceUtils.createRandomName(TASK_MANAGER_NAME));
		checkArgument(taskManagerConfiguration.getNumberSlots() > 0, "The number of slots has to be larger than 0.");
		this.taskManagerConfiguration = checkNotNull(taskManagerConfiguration);
		this.taskExecutorServices = checkNotNull(taskExecutorServices);
		this.haServices = checkNotNull(haServices);
		this.fatalErrorHandler = checkNotNull(fatalErrorHandler);
		this.partitionTracker = partitionTracker;
		this.taskManagerMetricGroup = checkNotNull(taskManagerMetricGroup);
		this.blobCacheService = checkNotNull(blobCacheService);
		this.metricQueryServiceAddress = metricQueryServiceAddress;
		this.backPressureSampleService = checkNotNull(backPressureSampleService);
		this.externalResourceInfoProvider = checkNotNull(externalResourceInfoProvider);
		this.libraryCacheManager = taskExecutorServices.getLibraryCacheManager();
		this.taskSlotTable = taskExecutorServices.getTaskSlotTable();
		this.jobTable = taskExecutorServices.getJobTable();
		this.jobLeaderService = taskExecutorServices.getJobLeaderService();
		this.unresolvedTaskManagerLocation = taskExecutorServices.getUnresolvedTaskManagerLocation();
		this.localStateStoresManager = taskExecutorServices.getTaskManagerStateStore();
		this.shuffleEnvironment = taskExecutorServices.getShuffleEnvironment();
		this.kvStateService = taskExecutorServices.getKvStateService();
		this.ioExecutor = taskExecutorServices.getIOExecutor();
		this.resourceManagerLeaderRetriever = haServices.getResourceManagerLeaderRetriever();
		// hardware description object
		this.hardwareDescription = HardwareDescription.extractFromSystem(taskExecutorServices.getManagedMemorySize());
		this.resourceManagerAddress = null;
		this.resourceManagerConnection = null;
		this.currentRegistrationTimeoutId = null;
		final ResourceID resourceId = taskExecutorServices.getUnresolvedTaskManagerLocation().getResourceID();
		// Note: HeartbeatManagerImpl jobManagerHeartbeatManager
		this.jobManagerHeartbeatManager = createJobManagerHeartbeatManager(heartbeatServices, resourceId);
		// Note: HeartbeatManagerImpl resourceManagerHeartbeatManager
		this.resourceManagerHeartbeatManager = createResourceManagerHeartbeatManager(heartbeatServices, resourceId);
	}
//--> execution then reaches the onStart() method of TaskExecutor
public void onStart() throws Exception {
		try {

			/*************************************************
			 *  Note: start the services.
			 *  Important services:
			 *  1. Monitor the ResourceManager
			 *  2. Start the TaskSlotTable service
			 *  3. Monitor the JobMaster
			 *  4. Start the FileCache service
			 */
			startTaskExecutorServices();

		} catch(Exception e) {
			final TaskManagerException exception = new TaskManagerException(String.format("Could not start the TaskExecutor %s", getAddress()), e);
			onFatalError(exception);
			throw exception;
		}

		// Note: start registering (this starts the registration timeout check)
		startRegistrationTimeout();
	}
//-->startTaskExecutorServices();
private void startTaskExecutorServices() throws Exception {
		try {

			// Note: start the ResourceManagerLeaderListener. A TaskManager registers with the ResourceManager through
			// the ResourceManagerLeaderListener, which watches for changes of the ResourceManager leader. When a new
			// leader is elected, notifyLeaderAddress() is called to trigger a reconnect to the ResourceManager.
			// start by connecting to the ResourceManager
			resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());

			// Understanding the code above:
			/* 1. ResourceManagerLeaderListener implements LeaderRetrievalListener. What runs through this listener is
			 *    notifyLeaderAddress(), which connects to the ResourceManager:
			 *    - a TaskExecutorRegistration object and a TaskExecutorToResourceManagerConnection object (the
			 *      connection between TaskExecutor and ResourceManager) are built, and the connection is started;
			 *    - the connection creates a registration object (newRegistration) and registers with the ResourceManager;
			 *    - the ResourceManager takes the TaskManager's globally unique ID, turns the taskExecutorRegistration
			 *      into a WorkerRegistration object and stores both in a map:
			 *      taskExecutors.put(taskExecutorResourceId, registration);
			 *    - finally a TaskExecutorRegistrationSuccess object is returned.
			 *    After that, the heartbeat between ResourceManager and TaskExecutor has to be maintained.
			 *    taskExecutorGateway stands for the TaskExecutor that has just registered: the ResourceManager calls
			 *    taskExecutorGateway.heartbeatFromResourceManager(resourceID), the TaskExecutor receives this
			 *    heartbeat request and reports its heartbeat back to the ResourceManager.
			 * 2. In the standalone scenario (with ZooKeeper HA configured), resourceManagerLeaderRetriever is implemented
			 *    by ZooKeeperLeaderRetrievalService, which implements NodeCacheListener, a listener interface provided
			 *    by Curator. When the data of the watched znode changes, nodeChanged() of
			 *    ZooKeeperLeaderRetrievalService is called back, which in turn notifies the corresponding
			 *    LeaderRetrievalListener through notifyLeaderAddress().
			 * 3. ZooKeeperLeaderRetrievalService is also an implementation of LeaderRetrievalService.
			 * 4. resourceManagerLeaderRetriever keeps listening. Whenever a change occurs, notifyLeaderAddress() of
			 *    the ResourceManagerLeaderListener is called.
			 */

			// Note: start the TaskSlotTable
			// tell the task slot table who's responsible for the task slot actions
			taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());

			// Note: start the JobLeaderService
			// start the job leader service
			jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());

			// Note: initialize the FileCache
			fileCache = new FileCache(taskManagerConfiguration.getTmpDirectories(), blobCacheService.getPermanentBlobService());
		} catch(Exception e) {
			handleStartTaskExecutorServicesException(e);
		}
	}
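
The listener mechanism used for the ResourceManager leader can be reduced to a small callback pattern. The sketch below is only an illustration: SimpleLeaderListener, SimpleLeaderRetriever and LeaderListenerDemo are hypothetical names, there is no ZooKeeper involved, and "forward only when the address actually changed" is a design choice of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class LeaderListenerDemo {
    /** Feeds a sequence of observed leader addresses and records each (re)connection attempt. */
    static List<String> reconnectsFor(String... observedAddresses) {
        List<String> reconnects = new ArrayList<>();
        SimpleLeaderRetriever retriever = new SimpleLeaderRetriever();
        // The listener plays the role of the worker: a new leader means "connect and register again".
        retriever.start(address -> reconnects.add("connect to " + address));
        for (String address : observedAddresses) {
            retriever.nodeChanged(address);
        }
        return reconnects;
    }

    public static void main(String[] args) {
        System.out.println(reconnectsFor("rm-1", "rm-1", "rm-2")); // [connect to rm-1, connect to rm-2]
    }
}

/** Hypothetical callback, loosely modelled on the listener passed to start(...) above. */
interface SimpleLeaderListener {
    void notifyLeaderAddress(String leaderAddress);
}

/** Hypothetical retrieval service: forwards a leader address only when it actually changes. */
class SimpleLeaderRetriever {
    private SimpleLeaderListener listener;
    private String lastLeaderAddress;

    void start(SimpleLeaderListener listener) {
        this.listener = Objects.requireNonNull(listener);
    }

    /** Called by whatever watches the leader record (a ZooKeeper node watcher in the text above). */
    void nodeChanged(String newLeaderAddress) {
        if (listener != null && !Objects.equals(newLeaderAddress, lastLeaderAddress)) {
            lastLeaderAddress = newLeaderAddress;
            listener.notifyLeaderAddress(newLeaderAddress);
        }
    }
}
```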

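Once the TaskExecutor object has been instantiated and started, its onStart() method is executed:

  • Start the services: startTaskExecutorServices()
    • Monitor the ResourceManager (connect to the ResourceManager, register (with a registration timeout), and listen for RM changes)
      • Heartbeat between Flink's master and worker nodes:
        • 1. The ResourceManager starts and launches its HeartbeatManager, which every 10 s iterates over the registered TaskExecutors and sends each of them a heartbeat request
        • 2. The TaskExecutor starts, launches the registration timeout check (5 min) and registers once startup has finished. From the moment it receives heartbeat requests, the heartbeat between RM and TaskManager is being maintained
        • 3. Each time the TaskManager receives a heartbeat from the ResourceManager, it resets the timeout task.
    • Start the TaskSlotTable service: it contains a timeout check service internally
    • Monitor the JobLeaderService: starts a listener (when a running JobMaster moves to another node, the JobLeaderService receives the notification and handles it)
    • Start the FileCache service: a cache service for resources
  • Start registering: startRegistrationTimeout()
    • If the steps above take longer than 5 min, registration times out (the 5 min registration timeout check)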
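3. Summary

  • As the worker node of a Flink cluster, the TaskManager is mainly responsible for managing slot resources and executing the actual tasks, while keeping up communication with the JobManager.

  • The startup entry class of the TaskManager is TaskManagerRunner.

  • The TaskManager startup process mainly consists of:

    • Loading the configuration (arguments passed to main + the flink-conf.yaml file) and initializing the plugin service and the file system service
    • Starting the TaskManager inside the installed security context
    • Instantiating the TaskManagerRunner object and then calling start(), which sends a START message to the TaskExecutor endpoint.
      • TaskManagerRunner contains many basic services (HA / RPC / HeartbeatServices / the Blob service for large files)
      • Starting the TaskManager, which finally returns the TaskExecutor responsible for running multiple tasks.
        • Initializing TaskManagerServices (which contains many service components, such as shuffleEnvironment, TaskSlotTable and JobLeaderService)
        • Creating two heartbeat managers: JobManagerHeartbeatManager and ResourceManagerHeartbeatManager
    • Once the TaskExecutor has been instantiated and started, its onStart() method runs and starts four services: the heartbeat service, the service that manages the mapping between tasks and slots, the JobMaster service and the file cache service.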

    2024-04-06 11:18:02       36 阅读