B2B Engineering Insights & Architectural Teardowns

REST job submission instead of SSH in a data pipeline

Transitioning from SSH to REST-based job submission changes the behavior of the data pipeline at the architectural level. This is about manageability, fault tolerance, and resource control.

The problem does not show up immediately; it appears once the system hits a scale limit. In this case, over 700 jobs were executed via SSH on EMR clusters, covering everything from Spark and MapReduce to arbitrary CLI commands. The approach seemed straightforward: connect to the master node and execute a command. However, as the number of pipelines grew, SSH became a source of instability. The connection was stateful: if the client failed (for example, a pod in Kubernetes), the job could continue running, hang, or leave behind "zombie" processes. At the same time, the attack surface grew: direct access to production clusters, key management, and a complex access model. At some point, this became a blocker for further evolution of the infrastructure.

The solution was not to "improve SSH" but to abandon it. The team moved to REST-based job submission via API. The key idea was to shift job lifecycle management to the server side: instead of holding a connection open, the client sends an HTTP request and the system tracks execution on its own. This reduces coupling between components and makes behavior predictable. For the Hadoop stack (Spark, Hive, MapReduce), APIs already existed: Livy and HiveServer2. The main challenge lay elsewhere: more than 300 tasks were arbitrary shell commands with no REST interface. For these, YARN Distributed Shell was used, which allows any shell script to run inside a YARN container with resource management and a standard lifecycle. The trade-off is clear: an abstraction layer is added, but in return a unified execution model emerges.
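The submit-then-poll pattern described above can be sketched against Livy's batch endpoints (`POST /batches`, `GET /batches/{id}/state`). This is a minimal illustration, not the team's implementation: the host name, the set of states treated as terminal, and the helper names `http_json`, `submit_batch`, and `wait_for_batch` are assumptions made for the example.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Hypothetical endpoint; Livy listens on port 8998 by default.
LIVY_URL = "http://livy.example.internal:8998"

# States treated as terminal in this sketch (assumption: adjust to your Livy version).
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def http_json(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request and decode the JSON response (stdlib only)."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode())


def submit_batch(file: str, class_name: str, args: list,
                 call: Callable[..., dict] = http_json) -> int:
    """POST /batches: the server owns the job from here; we keep only its id."""
    payload = {"file": file, "className": class_name, "args": args}
    return call("POST", f"{LIVY_URL}/batches", payload)["id"]


def wait_for_batch(batch_id: int,
                   call: Callable[..., dict] = http_json,
                   poll_seconds: float = 10.0) -> str:
    """Poll GET /batches/{id}/state until a terminal state is reported."""
    while True:
        state = call("GET", f"{LIVY_URL}/batches/{batch_id}/state")["state"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

Because the client holds only a batch id, a restarted client can call `wait_for_batch` again with the same id and pick up where it left off, which is the property the stateful SSH session lacked. The `call` parameter exists only so the transport can be swapped out in tests.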

Architecturally, the key element is an intermediary service: a REST gateway for job execution. It accepts requests from the orchestrator (Airflow), handles authentication, dispatches tasks to compute engines (EMR/YARN, Trino, Snowflake), and tracks their status. This eliminates the need for direct access to clusters. Airflow no longer maintains SSH sessions; it operates through the API. Even if the orchestrator restarts, the job keeps running, and its state can be retrieved through the same API. This pattern has long been discussed in the industry as a way to reduce coupling and improve the resilience of distributed systems.
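The gateway's responsibilities (authenticate, dispatch to an engine, track state) can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class and method names, the token check, and the state strings are assumptions for illustration, and a real gateway would persist job records and expose these methods over HTTP.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class JobRecord:
    engine: str          # e.g. "yarn", "trino", "snowflake"
    engine_job_id: str   # id returned by the compute engine
    state: str = "SUBMITTED"


@dataclass
class JobGateway:
    """Hypothetical in-memory gateway; a real one would persist job records."""
    valid_tokens: set
    # Engine name -> function that starts the job and returns the engine's job id.
    engines: Dict[str, Callable[[dict], str]]
    jobs: Dict[str, JobRecord] = field(default_factory=dict)

    def _authenticate(self, token: str) -> None:
        if token not in self.valid_tokens:
            raise PermissionError("unknown service token")

    def submit(self, token: str, engine: str, spec: dict) -> str:
        """Start a job on the chosen engine and return a gateway-level job id."""
        self._authenticate(token)
        if engine not in self.engines:
            raise ValueError(f"unsupported engine: {engine}")
        engine_job_id = self.engines[engine](spec)
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = JobRecord(engine, engine_job_id)
        return job_id

    def record_state(self, job_id: str, state: str) -> None:
        """Called by the gateway's own tracking loop, not by clients."""
        self.jobs[job_id].state = state

    def status(self, token: str, job_id: str) -> str:
        """Any authenticated client can look up a job by id, even after a restart."""
        self._authenticate(token)
        return self.jobs[job_id].state
```

The design point is that job state lives in the gateway rather than in the orchestrator's process, so an orchestrator that restarts needs only the job id to resume tracking.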

The implementation proved non-trivial. The migration affected hundreds of jobs and several regions with different network topologies. One of the unexpected effects was the emergence of hidden problems. For example, after transitioning to YARN, tasks began to fail due to virtual memory (vmem) constraints. Previously, SSH bypassed these limits by executing commands directly on the master node. After migration, resources became genuinely controlled. It was necessary to disable vmem checks, following AWS recommendations, as they could produce false positives. A second class of problems involved network dependencies. Some jobs depended on specific routing (for example, access to key management services), which was not explicitly documented. When moved to other clusters, this broke. This highlighted an important point: SSH masks infrastructural dependencies, while a stricter execution model exposes them.
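As an illustration of the vmem fix, the NodeManager's virtual memory check is controlled by the Hadoop property `yarn.nodemanager.vmem-check-enabled`. The sketch below builds an EMR configuration object for the `yarn-site` classification that turns it off. How the team actually applied the setting is not described in the source, so treat this as one plausible form rather than their configuration.

```python
import json

# EMR configuration classification for yarn-site.xml.
# Assumption: the setting is applied at cluster creation via EMR configurations.
emr_configurations = [
    {
        "Classification": "yarn-site",
        "Properties": {
            # Disable the NodeManager's virtual memory limit check.
            "yarn.nodemanager.vmem-check-enabled": "false",
        },
    }
]

# Serialize to the JSON form that EMR accepts for configurations.
configurations_json = json.dumps(emr_configurations, indent=2)
print(configurations_json)
```

Physical memory limits remain enforced with this change; only the virtual memory check is switched off.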

The results showed up at several levels. Security became simpler: SSH access to production clusters was eliminated, authentication moved to service-to-service tokens, and auditing through API logs was introduced. Resource management became predictable: all tasks run in YARN containers, with no competition for the master node. Reliability increased because the lifecycle is managed server-side: jobs survive client failures and terminate correctly. Observability also changed: instead of connecting to nodes manually, structured logs, statuses, and metrics are available via the API.

No specific numerical metrics for these improvements are provided. Still, the description makes clear that the changes affected key system properties: reliability, observability, and resource management. This is an evolutionary improvement of the architecture, not just a change of interface.

The main conclusion is that moving from SSH to REST-based job submission changes not only how tasks are launched but also the system model itself. SSH is convenient at the start but scales poorly and hides problems. The REST approach, especially combined with a resource manager such as YARN, makes the system more explicit and therefore more manageable.
