sedna-ji-fl-controller-optimization-v1.1-en
When a federated learning CR is modified with kubectl edit FederatedLearningJob, the update path cannot refresh pod parameters the way joint inference does for its Deployment (see this issue and this blog), so the only option was to delete all the pods (there is no way to target a single pod by name) and recreate them with the new config. In the joint inference controller, because the Deployment name is fixed, you can do:
deployment, err := c.deploymentsLister.Deployments(service.Namespace).Get(workerName)
to fetch that Deployment, patch its parameters, and update it. But a generated pod's name carries a five-character random suffix, so you cannot reliably get a pod by name, and I found that the controller's pod-naming logic did not actually take effect (code below):
"WORKER_NAME": "aggworker-" + utilrand.String(5)
This line effectively does nothing; Kubernetes assigns the pod name.
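To illustrate why the random suffix defeats lookup, here is a minimal sketch. The alphabet and helper below are stand-ins for k8s.io/apimachinery's utilrand.String, not the actual implementation; the point is that each generation yields a different name, so a controller has no stable name to Get later.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Stand-in alphabet in the same spirit as utilrand.String's;
// the exact character set here is an assumption for illustration.
const alphabet = "bcdfghjklmnpqrstvwxz2456789"

// randomSuffix mimics utilrand.String(n): n random characters.
func randomSuffix(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = alphabet[rand.Intn(len(alphabet))]
	}
	return string(b)
}

func main() {
	// Two "generations" of the same worker get different names,
	// so a later Get-by-name has nothing stable to target.
	fmt.Println("aggworker-" + randomSuffix(5))
	fmt.Println("aggworker-" + randomSuffix(5))
}
```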
Tracing the cause: in pkg/globalmanager/runtime/worker.go, the injectWorkerParam function never sets pod.ObjectMeta.Name. I think it should be:
pod.ObjectMeta.Name = workerParam.Env["WORKER_NAME"]
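A minimal sketch of the proposed fix, using simplified stand-in types rather than Sedna's real corev1.Pod and WorkerParam (the field shapes here are assumptions for illustration):

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes and Sedna types used in
// pkg/globalmanager/runtime/worker.go; only the fields relevant to
// the proposed fix are modeled.
type ObjectMeta struct{ Name, GenerateName string }
type Pod struct{ ObjectMeta ObjectMeta }
type WorkerParam struct{ Env map[string]string }

// injectWorkerName sketches the fix: copy the WORKER_NAME env entry
// into the pod's metadata so the pod name is deterministic.
func injectWorkerName(pod *Pod, p *WorkerParam) {
	if name, ok := p.Env["WORKER_NAME"]; ok && name != "" {
		pod.ObjectMeta.Name = name
		// Name takes precedence; clear GenerateName so the API
		// server does not append its own random suffix.
		pod.ObjectMeta.GenerateName = ""
	}
}

func main() {
	pod := &Pod{ObjectMeta: ObjectMeta{GenerateName: "aggworker-"}}
	param := &WorkerParam{Env: map[string]string{"WORKER_NAME": "aggworker-abc12"}}
	injectWorkerName(pod, param)
	fmt.Println(pod.ObjectMeta.Name) // aggworker-abc12
}
```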
Then the pod name can be controlled via "WORKER_NAME". Once names are stable, the controller can tell which worker in the CR changed, delete only that pod, roll the config, and recreate it, instead of deleting everything.
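To show what that targeted rollout could look like, here is a sketch against an in-memory stand-in for the pod API; the real controller would go through client-go's pod client, and the parameter names EXIT_ROUND and EPOCHS are made up for illustration.

```go
package main

import "fmt"

// In-memory stand-in for the pod API, keyed by the now-stable
// worker name. Only delete/create are modeled.
type Pod struct {
	Name string
	Env  map[string]string
}

type podStore map[string]*Pod

func (s podStore) deletePod(name string) { delete(s, name) }
func (s podStore) createPod(p *Pod)      { s[p.Name] = p }

// rollWorker replaces only the named worker pod with new parameters,
// leaving every other worker untouched.
func rollWorker(s podStore, name string, newEnv map[string]string) {
	s.deletePod(name)
	s.createPod(&Pod{Name: name, Env: newEnv})
}

func main() {
	store := podStore{
		"aggworker-0":   {Name: "aggworker-0", Env: map[string]string{"EXIT_ROUND": "100"}},
		"trainworker-0": {Name: "trainworker-0", Env: map[string]string{"EPOCHS": "5"}},
	}
	// Only the aggregation worker's config changed in the CR,
	// so only that pod is rolled.
	rollWorker(store, "aggworker-0", map[string]string{"EXIT_ROUND": "200"})
	fmt.Println(store["aggworker-0"].Env["EXIT_ROUND"], store["trainworker-0"].Env["EPOCHS"])
}
```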
Later, Tang Ming noted that unlike inference services, federated learning is more like a Kubernetes Job: one-shot work that runs to completion. It may therefore be better to disallow in-place edits and ask users to delete the job and redeploy when parameters change, which makes sense to me. Given that pods do not really support spec updates anyway, forcing updates this way goes against Kubernetes' design intent. So the next optimization direction is tightening access control on the resource; I am still gathering material.
Compared to the open-source summer work this is a small delta, hence v1.1; v2 will diverge from the v1 idea, so this likely won’t go into a PR either.