Executing each task without restarting pods for every task #42
There is a configuration that allows Airflow to run in Celery worker mode; in that setup, each task runs in one of the long-lived worker pods. I have a project that deploys Airflow that way. Feel free to look at the examples there, or check out the official chart here (I prefer mine :)). Another way to do this would be to write a custom operator that execs a command in a pre-running pod using the Kubernetes API (official client, or mine). But that would be very problematic, since it breaks the Kubernetes separation principle. Is there a specific reason you would prefer not to restart the pod?
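For illustration only, here is a minimal sketch of what such an exec-into-a-running-pod call might look like with the official `kubernetes` Python client; the pod name, namespace, and command are hypothetical placeholders and not part of either project:

```python
# Sketch: exec a command in an already-running pod via the official
# `kubernetes` Python client. Pod name, namespace, and command are hypothetical.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_incluster_config()  # use config.load_kube_config() when running outside the cluster
core = client.CoreV1Api()

output = stream(
    core.connect_get_namespaced_pod_exec,
    name="prewarmed-worker",   # hypothetical pre-running worker pod
    namespace="airflow",       # hypothetical namespace
    command=["airflow", "tasks", "run", "my_dag", "my_task", "2021-01-01"],
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
)
print(output)
```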
Yes. I am on Kubernetes and have different tenants/teams. We are orchestrating long and short processes but not doing any heavy computing, so spinning up pods for every task would probably slow us down. Using Celery would be a good approach, but my question is how I tell the scheduler to delegate tasks to a certain worker. What I am ultimately trying to do is make deployments without restarting Airflow processes. I originally wanted to use containerization for this: if I can treat each DAG as a container / Airflow worker, and Airflow knows to use a specific worker for a specific DAG, then we could deploy changes to workers without restarting the webserver / scheduler.
Hmm, I see. I think that with Celery you cannot specify which worker does which task. Also, after working with Celery for about a year, I found that it is very slow at scheduling tasks. The requirement for synchronization (RabbitMQ + DB) usually ends up slowing down the process; when there are many small tasks, I found that this sometimes drives scheduling latency up to the order of minutes. My recommendation would be to restart the pod every time. It has a few advantages:
Further, I should note that it is possible to make deployments without restarting Airflow. In general, you do this by creating a database, associated with the deployment, that uses a PVC (I use a Postgres StatefulSet). This DB runs external to the scheduler/webserver/etc. You can also persist the Airflow logs using this package. Feel free to try the example I have here: it is a Helm chart, using helmfile to manage the deployment. It should provide you with production-ready Airflow whose DB and logs persist through deployments.
You can also find the stable official chart here; it should do the same.
Hmm, ok. Would a Job differ from a Pod in terms of speed? I am using the LocalExecutor now, but I need to scale this workload because I will have many clients. Is a Job faster than a Pod? A pod is still launched for a Job, I believe. I guess I will try to test it. But I am also curious about the following:
```python
from airflow.models.baseoperator import BaseOperator

class CustomOperator(BaseOperator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.queue = 'client'  # `queue` is now an attribute on the BaseOperator class in Airflow 2.1.0
```
```yaml
command: ["airflow", "celery", "worker", "-q", "client"]
```

Although the issue with this is that I would still have to use a PV to mount the client DAG to the scheduler, I think.
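For reference, a minimal sketch of the same idea from the DAG side, assuming the CeleryExecutor is configured and a worker was started with `airflow celery worker -q client`; the DAG and task names are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch: with the CeleryExecutor, setting `queue` on an operator routes the task
# to workers subscribed to that queue (i.e. workers started with `-q client`).
with DAG("client_dag", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    client_task = BashOperator(
        task_id="client_task",
        bash_command="echo 'running on a client-dedicated worker'",
        queue="client",  # hypothetical queue name matching the worker's -q flag
    )
```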
@LamaAni So it is interesting how it is being used here with Kedro's KubernetesPodOperator: they are launching it per node. Would you see any issues around telling the whole DAG run to execute that way? I might try just returning a different operator on certain steps to tell it not to spin up a pod for a specific task.
Oh, I see the confusion now. The setup I was talking about is the one that uses the KubernetesExecutor, which launches a short-lived worker pod per task: when there are no tasks running, there are no worker pods running. Note that the KubernetesExecutor differs from the KubernetesPodOperator and the KubernetesJobOperator. The Kubernetes executor will run any operator, and the task does not have to be packaged as an image. See here for more information about the Kubernetes executor: https://airflow.apache.org/docs/apache-airflow/stable/executor/kubernetes.html
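As a rough sketch of what that looks like in practice (assuming the scheduler is configured with `executor = KubernetesExecutor` in airflow.cfg, or the equivalent `AIRFLOW__CORE__EXECUTOR=KubernetesExecutor` environment variable), any ordinary operator then runs in its own short-lived worker pod; the DAG below is hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def hello():
    # Under the KubernetesExecutor this callable executes inside a worker pod
    # created for this task and removed when the task finishes.
    print("hello from a per-task worker pod")

with DAG("k8s_executor_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="hello", python_callable=hello)
```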
Issue closed as stale.
ORIGINAL POST FROM: @teaglebuilt
I am looking for a way to execute each task without restarting pods for every task, even if it is a Job. Do you know how this could be possible?
Originally posted by @teaglebuilt in #40 (comment)