Robotic bin packing in warehouses demands a careful balance between two often conflicting goals: maximizing space utilization and minimizing operational time. STEP (Space-Time Efficient Packing) addresses this trade-off by learning to select not just which item to pack, but also which face to grasp it from. Each choice influences both the quality of placement and the time required to execute it. Our method uses a preference-conditioned multi-objective policy that weighs packing efficiency against time cost according to user-specified preferences. Although STEP is trained and evaluated in simulation, its design is grounded in real-world robotic constraints, including grasp direction, object surface type, and the impact of transport failures. These factors are abstracted directly into the time model used during training, so that the policy reflects practical deployment conditions.
Different grasp directions directly affect how quickly and reliably the robot can pick an item:
Importantly, reorientation times are not universal: they depend on the specific robotic setup, hardware, and motion planning constraints. STEP abstracts all of these strategies into a unified time cost, which is explicitly taken into account when planning for bin packing. Each graspable face is treated as a separate candidate with its own time cost.
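The per-face time cost described above can be pictured with a minimal sketch. The face names (`top`, `side`, `bottom`), the reorientation times, and the base pick-and-place time are all hypothetical placeholders, since the actual values depend on the robot and planner; only the idea of one candidate per graspable face, each with its own time cost, comes from the text.

```python
from dataclasses import dataclass

# Hypothetical reorientation times in seconds per grasped face.
# Real values depend on the robot, gripper, and motion planner.
REORIENT_TIME_S = {"top": 0.0, "side": 4.0, "bottom": 8.0}


@dataclass(frozen=True)
class FaceCandidate:
    item_id: int
    face: str            # face the gripper attaches to (assumed naming)
    base_time_s: float   # assumed pick-and-place time without reorientation

    @property
    def time_cost_s(self) -> float:
        # Unified time cost: base motion time plus face-specific overhead.
        return self.base_time_s + REORIENT_TIME_S[self.face]


def enumerate_candidates(item_ids, base_time_s=5.0):
    """Create one candidate per (item, graspable face) pair."""
    return [
        FaceCandidate(item_id, face, base_time_s)
        for item_id in item_ids
        for face in REORIENT_TIME_S
    ]


candidates = enumerate_candidates([0, 1])
for c in candidates:
    print(c.item_id, c.face, c.time_cost_s)
```

With two items in the buffer and three assumed faces, this yields six candidates, each carrying its own time cost.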
Transport speed also affects reliability:
The interaction between the suction cup and the box surface directly influences the safety and reliability of transport, and these effects depend on the specific setup and item. Since STEP explicitly optimizes for operational time, such behaviors are abstracted into face-dependent time penalties during bin packing. In training, we defined three surface categories (smooth, plastic-wrapped, and package-labeled) and randomly assigned each box face to one of these categories in simulation, with each category given a fixed time penalty.
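A minimal sketch of this surface assignment is shown below. The three category names follow the text; the penalty values and the uniform sampling are assumptions made for illustration, not the values or distribution used in training.

```python
import random

# Fixed time penalty per surface category, in seconds.
# The categories come from the text; the numbers are hypothetical.
SURFACE_PENALTY_S = {
    "smooth": 0.0,
    "plastic_wrapped": 1.5,
    "package_labeled": 3.0,
}


def assign_surfaces(num_faces=6, rng=None):
    """Randomly assign one surface category to each box face (uniform, assumed)."""
    if rng is None:
        rng = random.Random()
    categories = list(SURFACE_PENALTY_S)
    return [rng.choice(categories) for _ in range(num_faces)]


def face_penalties(surfaces):
    """Look up the fixed time penalty for each face's surface category."""
    return [SURFACE_PENALTY_S[s] for s in surfaces]


surfaces = assign_surfaces(rng=random.Random(0))
print(list(zip(surfaces, face_penalties(surfaces))))
```

Passing a seeded `random.Random` keeps the assignment reproducible across simulation runs.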
STEP frames bin packing as a multi-candidate, multi-objective selection problem that aims to balance space utilization and operational time. Each graspable face of each item in the buffer is treated as a distinct candidate in the selection process.
The policy receives as input:
The Transformer-Select module processes the bin and item-face features to produce embeddings.
These embeddings are then combined with the preference vector in the actor and critic heads to compute scores over all item-face candidates.
The highest-scoring candidate is selected; the bin state is updated, and a new item enters the buffer.
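The selection step can be sketched as follows. The `scores` dictionary stands in for the output of the actor head, which is not reproduced here; the function name, the buffer representation, and the refill logic are assumptions for illustration only.

```python
from collections import deque


def select_and_refill(buffer, scores, incoming):
    """Pick the highest-scoring (item, face) candidate and refill the buffer.

    buffer:   list of item ids currently available
    scores:   dict mapping (item_id, face) -> score (stand-in for policy output)
    incoming: deque of item ids waiting to enter the buffer
    """
    best_item, best_face = max(scores, key=scores.get)
    # Remove the packed item; a real system would also update the bin state here.
    new_buffer = [item for item in buffer if item != best_item]
    if incoming:
        new_buffer.append(incoming.popleft())
    return (best_item, best_face), new_buffer


buffer = [1, 2, 3]
scores = {(1, "top"): 0.2, (2, "side"): 0.9, (3, "top"): 0.5}
chosen, buffer = select_and_refill(buffer, scores, deque([4]))
print(chosen, buffer)
```

Here item 2 grasped from its side face has the highest score, so it leaves the buffer and item 4 takes its place.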
Space utilization and operational time are inherently conflicting in robotic bin packing:
STEP resolves this conflict through a preference vector, which defines how much weight to place on each objective. By tuning this vector, the policy can prioritize space efficiency, time efficiency, or a balance of both within a single framework.
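To see how a preference vector can shift the chosen action, consider a simple linear scalarization of the two objectives. This is an illustrative assumption: the text only states that the vector sets the weight on each objective, and in STEP the vector conditions a learned policy rather than a hand-written formula. The candidate names and the normalized values are made up.

```python
def scalarize(space_gain, time_cost, preference):
    """Combine both objectives into one score (assumed linear weighting).

    preference = (w_space, w_time); inputs are assumed normalized to [0, 1].
    """
    w_space, w_time = preference
    return w_space * space_gain - w_time * time_cost


def pick(candidates, preference):
    """Return the name of the candidate with the highest scalarized score."""
    return max(
        candidates,
        key=lambda name: scalarize(*candidates[name], preference),
    )


# Hypothetical candidates: name -> (space_gain, time_cost)
candidates = {
    "reorient_then_place": (0.30, 0.8),  # tighter fit, slower
    "place_directly": (0.20, 0.2),       # looser fit, faster
}

print(pick(candidates, (1.0, 0.0)))  # space-focused preference
print(pick(candidates, (0.0, 1.0)))  # time-focused preference
```

Under a space-focused preference the slower, tighter placement wins; under a time-focused preference the direct placement wins.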
The Pareto front below illustrates the achievable trade-offs. STEP-n denotes a policy with n items in the buffer in the semi-online setting. The figure shows results for buffers of size 1, 3, and 5, each evaluated across preference vectors in which the weight on each objective ranges from 0 to 1. Larger buffers provide more candidate choices and improve space utilization, while the preference vector governs how space is traded off against time.
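For readers unfamiliar with how such a front is extracted from evaluation runs, the sketch below filters a set of (space utilization, operational time) points down to the non-dominated ones. The numbers are invented for illustration and are not results from STEP.

```python
def pareto_front(points):
    """Return non-dominated (utilization, time) points.

    Higher utilization is better and lower time is better. A point is
    dominated if another point is at least as good on both objectives
    and strictly better on at least one.
    """
    front = []
    for i, (u_i, t_i) in enumerate(points):
        dominated = any(
            (u_j >= u_i and t_j <= t_i) and (u_j > u_i or t_j < t_i)
            for j, (u_j, t_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((u_i, t_i))
    return sorted(front)


# Made-up evaluation points: (space utilization, time per item in seconds)
points = [(0.70, 10.0), (0.75, 12.0), (0.72, 13.0), (0.80, 15.0)]
print(pareto_front(points))
```

In this toy set, the point (0.72, 13.0) drops out because (0.75, 12.0) packs more densely in less time.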
We compare STEP-1 against three strong baselines, each with a different strategy for handling grasping and reorientation:
Robotic bin packing is widely deployed in warehouse automation, with current systems achieving robust performance through heuristic and learning-based strategies. These systems must balance compact placement with rapid execution, where actions such as selecting alternative items or reorienting them can improve space utilization but introduce additional time. We propose a selection-based formulation that explicitly reasons over this trade-off: at each step, the robot evaluates multiple candidate actions, weighing expected packing benefit against estimated operational time. This enables time-aware strategies that selectively accept increased operational time when it yields meaningful spatial improvements. Our method, STEP (Space-Time Efficient Packing), uses a Transformer-based policy conditioned on dynamic preferences, generalizes across candidate set sizes, and integrates with standard placement modules. It achieves higher packing density without compromising operational time.