
iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes

Bokui Shen*, Fei Xia*, Chengshu Li*, Roberto Martín-Martín*, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D'Arpino, Sanjana Srivastava, Lyne P. Tchapmi, Micael E. Tchapmi, Kent Vainio, Li Fei-Fei, Silvio Savarese

Abstract— We present iGibson, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects. The scenes are replicas of 3D scanned real-world homes, aligning the distribution of objects and layout to that of the real world. iGibson integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality visual virtual sensor signals (RGB, depth, segmentation, LiDAR, flow, among others), ii) domain randomization to change the materials of the objects (both visual texture and dynamics) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) an intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors. iGibson is open-sourced with comprehensive examples and documentation. For more information, visit our project website: http://svl.stanford.edu/igibson/.

    I. INTRODUCTION

Simulation environments have proliferated over the last few years as a way to train robots and interactive agents in a rapid and safe manner. In these environments, agents learn to control physical interactions [1, 2], navigate based on sensor signals [3, 4, 5, 6, 7], or plan long-horizon tasks [8, 9, 10, 11]. In simulation, agents learn to control motion and perform interactions that actively change the input sensor signals or change the state of the environment towards a desired configuration, capabilities at the core of what any embodied agent needs to achieve.

However, existing simulation environments that combine physics simulation and robotic tasks often cater to a narrow set of tasks and include only clean, small-scale scenes [12, 13, 14, 15, 16, 17]. The few simulation environments that include large scenes such as homes or offices either disable the possibility of changing the scene, focusing only on navigation (e.g. Habitat [18]), or use game engines and simplified modes of interaction (e.g. AI2Thor [19], VirtualHome [20]). These simulators do not support the development of end-to-end sensorimotor control loops for tasks that require rich interaction with the scene. Such tasks are difficult to accomplish in the aforementioned simulators; moreover, simplified modes of interaction can lead to difficulties in transferring the learned interaction policy into actionable real robot commands.

∗ Equal contribution. All authors are with the Stanford Vision & Learning Laboratory, Stanford University.

Fig. 1: Robot interacting in iGibson. It operates in the kitchen of one of iGibson's fifteen fully interactive scenes, planning an interaction with the arm using an integrated sampling-based motion planner and receiving a first-person view of the interaction. Bottom: The same scene can be randomized with different materials and/or object models, resulting in endless variations of the scenes that help train more robust and generalizable solutions.

We present iGibson, a large-scale fully interactive simulation environment for the development of embodied interactive agents (Fig. 1). iGibson contains fifteen fully interactive and photorealistic scenes that we generated by annotating 3D reconstructions of real-world scans and converting them into fully interactive scene models. In this process, we respect the original object-instance layout and object-category distribution. The object models are extended from open-source datasets [21, 22, 15] enriched with annotations of material and dynamic properties. iGibson's physics-based renderer leverages the extra information provided in the material annotation (maps of metallic, roughness and normals) to generate high-quality virtual images. To further facilitate the training of more robust visuomotor agents, iGibson offers domain randomization procedures for materials (both visual appearances and dynamics properties) and object shapes while respecting the distribution of object placements and preserving interactability. iGibson is also equipped with a 2D interface that allows human users to easily interact with the scenes and modify them, enabling efficient collection of human demonstrations for imitation learning.

To summarize, iGibson provides these novel features that facilitate developing and training new robotic solutions:
1) Fifteen fully interactive, visually realistic scenes representing real-world homes with furniture and articulated object models annotated with materials and dynamics properties.
2) Capabilities to import models from CubiCasa5K [23] and 3D-Front [24], giving access to more than 8000 additional interactive home scenes.
3) Realistic virtual sensor signals, including high-quality RGB images from a physics-based renderer, depth maps, 1-beam and 16-beam virtual LiDAR signals, semantic/instance/material segmentation, optical and scene flow, and surface normals.
4) Domain randomization for visual textures, dynamics properties and object instances, yielding endless variations of the scenes.
5) A human-computer interface for humans to provide demonstrations of fully physical interactions with the scenes.
6) Integration with sampling-based motion planners to facilitate motion of robot bases (navigation in 2D layouts) and arms (interaction in 3D space).
We demonstrate the benefits of these novel features in a comprehensive set of experiments in which visual agents are trained for navigation and interactive tasks. Our experiments show that iGibson enables researchers to 1) train more robust and generalizable visually-guided policies thanks to its domain randomization mechanisms, 2) develop LiDAR-guided policies based on virtual signals that transfer directly to the real world, 3) collect human demonstrations and train imitation learning policies, and 4) learn intermediate visual representations linked to interactability of the scene that accelerate training of downstream manipulation tasks. iGibson is open-source and academically developed, and available at http://svl.stanford.edu/igibson/.

    II. RELATED WORK

The list of physics simulators and simulation environments has grown significantly in recent years. Here, we make the following distinction between the two: a physics simulator is an engine capable of computing the physical effect of actions on an environment (e.g. motion of bodies when a force is applied, or flow of liquid particles when being poured) [27, 28, 29, 30, 31, 32]. On the other hand, a simulation environment is a framework that includes a physics simulator, a renderer of virtual signals, and a set of assets (i.e. models of scenes, objects, robots) ready to be used to study and develop solutions for different tasks. Both are crucial for advancing embodied AI and robotics. Here, we focus on the discussion of simulation environments.

Several simulation environments have been proposed recently to study manipulation with stationary arms [13, 12, 17, 33, 16]. Most of them are based on Bullet [27] or MuJoCo [28] for physics simulation, and use the default renderer or a Unity [30] plugin as the renderer. Different from these simulation environments, iGibson focuses on large-scale (house-size) scenes and includes fifteen fully interactive scenes where researchers can develop solutions for navigation, manipulation and mobile manipulation.

Closer to iGibson are simulation environments that include large-scale realistic scenes (e.g. homes or offices), which we summarize and compare in Table I. Gibson [25] (now Gibson v1) was the precursor of iGibson. It includes over 1400 3D-reconstructed floors of homes and offices with real-world object distribution and layout. Although Gibson incorporates PyBullet as its physics engine, each scene is one single fully rigid object. Thus, it does not allow agents to interact with the scenes, restricting its use to only navigation. A similar environment is Habitat [18]. Despite its high rendering speed, Habitat uses non-interactive assets from Gibson v1 [25] and Matterport [34] and therefore only supports navigation tasks. Recent work [35] introduced an extension to the Gibson v1 environment to support Interactive Navigation, where parts of the reconstructions corresponding to five object classes (chairs, tables, desks, sofas, and doors) in several Gibson static models were segmented and replaced with interactive versions. This enabled navigation agents to interact with the scene, and thus allowed for the first benchmark for Interactive Navigation. The goal of iGibson is to support not only (interactive) navigation, but also manipulation and mobile manipulation at scene level, with the necessary simulation engine, scene and object assets, and annotations.

Some simulation environments have been proposed recently for scene-level interactive tasks, such as Sapien [15], AI2Thor [19] and ThreeDWorld (TDW) [26]. Sapien focuses on interaction with articulated objects and introduces the PartNet-Mobility dataset; in iGibson, we enrich this set of articulated objects with materials and dynamics properties (weight, friction). TDW is a multi-modal simulator with audio, high-quality visuals, and simulation of flexible materials and liquids via Nvidia Flex [31]. In contrast to Sapien and TDW, iGibson includes fully interactive scenes aligned with real object distribution and layout as part of the environment. Similar to iGibson, AI2Thor includes multiple models of rooms that can be interacted with. However, unlike iGibson, AI2Thor's main mode of interaction is through predefined discrete actions: interactable objects are annotated with the possible actions they can receive. When the agent is close enough to an object and the object is in the right state (precondition), the agent can select a predefined action, and the object is "transitioned" to the next state (postcondition). AI2Thor's main focus is navigation and task planning of sequences of actions to change object states.

Different from existing simulation environments, the main specialty of iGibson is interactivity: enabling realistic interactions in large scenes. In iGibson's fully interactive scenes, users can develop robust navigation, manipulation and mobile manipulation solutions. To this end, iGibson is equipped with its own fast and open-source physics-based renderer, domain randomization, integration with sampling-based motion planners, and intuitive mouse-keyboard interfaces for human demonstrations. We analyze the benefits of these novel and unique features via experiments in Sec. IV.


TABLE I: Comparison of Simulation Environments

Feature                       | iGibson (ours)  | Gibson [25] | Habitat [18] | Sapien [15]   | AI2Thor [19] | VirtualH [20] | TDW [26]
Provided Large Scenes (RS,AS) | 15,-            | 1400,-      | -            | -             | -,17*        | -,7           | -,0
Provided Objects (O,M)        | 570,Yes         | -,-         | -,-          | 2346,No       | 609,Yes      | 308,No        | 200,Yes
Modes of Interaction          | F               | -           | -*           | F             | F,PA         | F,PA          | F
Type of Simulation            | RBF             | RBF         | RBF          | RBF           | RBF,PP       | PP            | RBF,PS
Type of Rendering             | PBR             | IBR         | PBR          | PBR,RT        | PBR          | PBR           | PBR
Virtual Sensor Signals        | RGB,D,N,SS,L,FL | RGB,D,N,SS  | RGB,D,SS     | RGB,D,SS      | RGB,D,SS     | RGB,D,SS,FL   | RGB,D,SS
Domain Randomization          | S,O,M           | -           | -            | -             | S            | S             | S,O
Speed                         | ++              | +           | +++          | ++(PBR)/-(RT) | +            | +             | +
Human-Simulator Interface     | MK              | -           | MK           | -             | MK           | NL            | VR
Integrated Motion Planner     | Yes             | No          | No           | No            | No           | No            | No
Specialty                     | PI              | N           | Fast,N       | AO,RT         | OS,TP        | OS,TP         | A,F

Provided Large Scenes: number of included models of large (house-size) Real-world Scenes (RS) or Artificially-created Scenes (AS), in number of floors (- = none, * = assumed 7 rooms/floor as in iGibson).
Provided Objects: number of included models of Objects (O), and whether they are annotated with Material (M) information (texture and dynamical properties).
Modes of Interaction: ways for an agent to change the state of the scene; F: Contact Forces, PA: Predefined Actions (* = main mode does not allow interactions).
Type of Simulation: RBF: Rigid-Body Physics, PP: Pre/Postconditions, PS: Particle Simulation (fluids).
Type of Rendering: PBR: Physics-Based Rendering, IBR: Image-Based Rendering, RT: Ray Tracing.
Virtual Sensor Signals generated by the simulator: RGB: Color Images, D: Depth, N: Normals, SS: Semantic Segmentation, L: LiDAR, FL: Flow (optical and/or scene).
Domain Randomization: randomized changes of scenes (S), object models (O), or materials (M), including texture and dynamical properties.
Human-Simulator Interface: interfaces for humans to control interactions in simulation; MK: Mouse/Keyboard, NL: Natural Language, VR: Virtual Reality.
Specialty: differentiating feature or task (subjective); PI: Physical Interaction, N: Navigation, AO: Articulated Objects, RT: Ray Tracing, OS: Object States, TP: Task Planning, A: Audio, F: Fluids.

    III. IGIBSON SIMULATION ENVIRONMENT

In this section, we discuss the main structure, properties and features of iGibson that support training of robust sensor-guided policies for navigation and manipulation.

    A. Simulation Characteristics and API

At the highest level, iGibson follows the OpenAI Gym [36] convention, which is the most common instantiation of the discrete-time reinforcement learning loop. The environment receives an action and returns a new observation, reward and additional meta-information (e.g. if the episode has ended). Environments are specified with config files that determine scenes, tasks, robot embodiments, sensors, etc. Given a config file, iGibson creates an Environment that contains a Task and a Simulator. The Simulator contains a Scene, with a list of interactive Objects and one or more Robots. It also contains a Renderer that generates virtual visual signals for the agents. The Task defines the reward, initial and termination conditions for the scene and the agents. While modular and easy to extend, most users may only need to interface with the Environment class after instantiating it with the appropriate config.
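For concreteness, the following is a minimal usage sketch of this Gym-style loop. The module path, the class name (iGibsonEnv) and the config file name are assumptions based on the description above; the released documentation has the exact names.

```python
# Minimal sketch of the Gym-style interaction loop described above.
# The import path, the `iGibsonEnv` class name, and the config file name are
# assumptions; check the iGibson documentation for the exact interface.
from igibson.envs.igibson_env import iGibsonEnv

env = iGibsonEnv(config_file="turtlebot_point_nav.yaml", mode="headless")

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()           # replace with a trained policy
    obs, reward, done, info = env.step(action)   # standard Gym return signature
    if done:
        obs = env.reset()
env.close()
```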

iGibson comes with multiple easy-to-use configs, demos and Docker [37] files. It has been extensively adopted to train visuomotor policies that successfully transfer to the real world [38, 39, 40, 41], and was the platform for the iGibson Sim2Real Challenge at CVPR20 [42]. iGibson is easily parallelizable and supports off-screen rendering on clusters.

    B. Fully Interactive Assets

iGibson provides fifteen high-quality fully interactive scenes (see Fig. 2), populated with interactable objects. The scenes are interactive versions of fifteen 3D reconstructed scenes included in the Gibson v1 dataset. To preserve the real-world layout and distribution of objects, we follow a semi-automatic annotation procedure. This process is radically different from the annotation we performed for the Interactive Gibson Benchmark [35]. Instead of segmenting the original scenes and replacing parts of the meshes with interactive object models, we create fully interactive counterparts of the 3D reconstruction from scratch. This eliminates the need to fix artifacts in the original mesh due to reconstruction noise or segmentation error, and allows us to improve the overall quality of the scenes.

The scene generation process is composed of two annotation phases. First, the layout of the scene is annotated with floors, walls, doors and window openings. Then, all objects are annotated with 3D bounding boxes and class labels. We annotate bounding boxes for 57 different object classes, including all furniture types (doors, chairs, tables, cabinets, TVs, shelves, stoves, sinks, ...) and some small objects (plants, laptops, speakers, ...); see the project website for the complete list. Annotating class-labeled bounding boxes allows us to scale and use different models of the same object class, while maintaining the real-world distribution of objects in the scene. In this way, we are able to generate realistic randomized versions of the scenes (see Sec. III-D). To achieve the highest quality, for each class-labeled bounding box, we select a best-fitting object model. The scene is also annotated with lights, with which we generate light probes for physics-based rendering (see Sec. III-C). We also bake in a realistic ray-traced ambient light and other light effects in the walls, floors and ceilings.

The object models are curated from open-source datasets: ShapeNet [21], PartNet Mobility [15, 43], and SketchFab (https://sketchfab.com/feed). To preserve the visual realism of the original reconstruction, we improve the object visual quality by annotating different parts of the models with photorealistic materials, which are then used by iGibson's physics-based renderer. We utilize materials from CC0Textures (https://cc0textures.com/), including wood, marble, metal, etc. To achieve a high degree of physics realism, we curate a mapping from visual materials to friction coefficients (https://www.engineersedge.com/coeffients_of_friction.html). We additionally compute the collision mesh, center of mass and inertia frame for each link of all objects. To assign realistic mass and density to different objects, we take the median values of the top 20 search results from Amazon Marketplace Web Service (https://developer.amazonservices.com/).

Fig. 2: Fifteen fully interactive iGibson scenes. The scenes are copies of real-world reconstructions that preserve the layout, distribution and size of objects. The objects (rigid and articulated) are curated from open-source datasets and annotated with material and dynamics properties for visual and physical realism.

Additionally, we provide compatibility with the CubiCasa5K [23] and 3D-Front [24] repositories of home scenes. We use their scene layouts and populate them with our annotated object models, leading to more than 8000 additional interactive home scenes. These scenes contain fewer objects than the fifteen iGibson scenes, but provide a very large number of additional models to train tasks. For additional information on our procedure to generate 3D scenes from the annotations in these datasets and the type of data included, we refer to Appendix VI-D.

The fully interactive scenes we include in iGibson enable learning of interactive tasks in large realistic home scenes; in Sec. IV-D we show that the scenes can be used to learn a useful visual representation that accelerates the learning of downstream manipulation tasks.

    C. Virtual Sensors

A crucial component of iGibson is the generation of high-quality virtual sensor signals, i.e. images and point clouds, for the simulated robots. In the following, we summarize the most relevant of these signal generators (Fig. 3).

Physics Based Rendering: In iGibson, we include an open-source physics-based renderer, which implements an approximation of BRDF models [44] with spatially varying material maps including roughness, metallic and tangent-space surface normals, extending Michał Siejak's work (https://github.com/Nadrin/PBR).

LiDAR Sensing: Many real-world robots are equipped with LiDAR sensors for obstacle detection. In iGibson, we support virtual LiDAR signals, with both 1 beam (e.g. Hokuyo, https://www.hokuyo-aut.jp/search/single.php?serial=166) and 16 beams (e.g. Velodyne VLP-16, https://velodynelidar.com/products/puck/). The virtual LiDAR signals are point clouds that represent the locations hit by LiDAR rays originating from the sensor location. We also include a simple drop-out sensor noise model to emulate the common failure case in real sensors in which some of the laser pulses do not return. Additionally, we provide the functionality to turn the 1-beam LiDAR scans into occupancy maps around the robot, which are bird's-eye view images with three types of pixels indicating free, occupied, or unknown space.
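A minimal sketch of how a single-beam scan can be converted into such an occupancy map is shown below; the map size, resolution and maximum range are illustrative defaults, not iGibson's internal values.

```python
import numpy as np

def scan_to_occupancy(ranges, angles, map_size=128, resolution=0.05, max_range=5.0):
    """Convert a 1-beam LiDAR scan (in the robot frame) into a bird's-eye occupancy map.

    Pixel values: 0.0 = free, 1.0 = occupied, 0.5 = unknown (never traced).
    map_size, resolution and max_range are illustrative defaults.
    """
    grid = np.full((map_size, map_size), 0.5, dtype=np.float32)
    center = map_size // 2
    for r, a in zip(ranges, angles):
        hit = r < max_range
        r = min(r, max_range)
        # Trace free space along the ray in small steps.
        for d in np.arange(0.0, r, resolution):
            x = int(center + (d * np.cos(a)) / resolution)
            y = int(center + (d * np.sin(a)) / resolution)
            if 0 <= x < map_size and 0 <= y < map_size:
                grid[y, x] = 0.0
        if hit:  # mark the endpoint of a returned pulse as an obstacle
            x = int(center + (r * np.cos(a)) / resolution)
            y = int(center + (r * np.sin(a)) / resolution)
            if 0 <= x < map_size and 0 <= y < map_size:
                grid[y, x] = 1.0
    return grid
```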

Fig. 3: Robot interacting in iGibson (large picture: 3rd person view) and virtual sensor signals generated. Policies and solutions can make use of the following channels (from top to bottom, left to right): RGB, depth, semantic/instance segmentation, normals, 16-beam LiDAR (point cloud), 1-beam LiDAR (also as occupancy map). Not depicted: optical/scene flow, joint encoders for the robot's and objects' joints, poses, wrenches, contact points, and map localization.

Additional Visual Channels: In addition to RGB and LiDAR, we support a wide range of visual modalities, such as depth maps, optical/scene flow and normals, and segmentation of semantic class, instance, material and movable parts. These modalities can support research topics such as depth/segmentation/normal/affordance prediction [45, 46, 47], action-conditioned flow prediction [48], multi-modal pose estimation [49, 50, 51], and visuomotor policy training assuming perfect vision systems [35, 52].

    D. Domain Randomization

It is standard practice in robot learning to partially randomize the environment's parameters in order to make the policy more robust [53, 54, 55, 56]. With the model being trained on a wide distribution of environments, it will be more likely to generalize to unknown evaluation environments. The evaluation environment may be the real world if we aim to train in simulation and transfer the policy to a real robot. In iGibson, we include domain randomization that leads to an endless variation of visual appearance, dynamics properties and object instances with the same scene layout.

First, we provide object randomization. Our annotation provides class-labeled object bounding boxes based on the original 3D reconstructions. When we instantiate a scene in iGibson, we can inject any object model of a certain class into the bounding boxes of that class. For example, for a bounding box labeled as "table", any table model can be scaled and fit into it. This randomization maintains the semantic layout of the scenes (i.e. the object categories remain at the same 3D locations) while enabling near-infinite combinations of object instances. It provides strong variation in depth maps and LiDAR signals that helps robustify policies based on these observations (see Sec. IV-A, IV-B).
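The object randomization step can be illustrated with the following sketch, which fills each class-labeled bounding box with a randomly chosen model of that class, scaled to fit the box; the data structures and field names are illustrative, not iGibson's internal ones.

```python
import random

def randomize_scene_objects(bounding_boxes, model_library, rng=None):
    """Sketch of object randomization: every class-labeled bounding box is filled
    with a randomly chosen model of that class, scaled to fit the box.

    bounding_boxes: list of dicts with "category", "center" (x, y, z), "size" (x, y, z).
    model_library: dict mapping a category to a list of {"id", "size"} entries.
    These structures are illustrative placeholders for iGibson's internal assets.
    """
    rng = rng or random.Random(0)
    placements = []
    for box in bounding_boxes:
        candidates = model_library[box["category"]]
        model = rng.choice(candidates)            # swap the object instance
        scale = tuple(b / m for b, m in zip(box["size"], model["size"]))
        placements.append({"model_id": model["id"],
                           "position": box["center"],
                           "scale": scale})       # anisotropic scale to fit the box
    return placements
```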

Second, we provide material randomization. While the object and scene models have been annotated with a high-quality appropriate material for each object part, we provide a mechanism to randomize the specific material model associated with each object part (e.g. associating a different type of wood or metal). The effect is a stark color randomization that still represents plausible material combinations. This randomization generates strong variations in the RGB images and helps robustify policies based on this observation (see Sec. IV-A). Moreover, based on our curated mapping from visual materials to dynamics properties, we can randomize the dynamics properties of all object links.

    E. Motion Planning

Motion planners provide collision-free trajectories to move a robot from an initial to a final configuration [57]. They can be used to generate collision-free navigation paths for robot bases and collision-free motion paths for robot arms. In iGibson, we include implementations of the most popular sampling-based motion planners: rapidly-exploring random trees (RRT [58]) and its bidirectional variant (BiRRT [59]), and lazy probabilistic road-maps (lazyPRM [60]), adapted from [61]. Sampling-based motion planners can produce rather suboptimal and intricate paths. To alleviate this, we include acceleration-bounded shortcuts [62] for smoother paths.
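As a self-contained illustration of the kind of sampling-based planner integrated in iGibson, the sketch below implements a basic 2D RRT; iGibson's planners are adapted from [61] and additionally include BiRRT, lazyPRM and shortcut smoothing.

```python
import math
import random

def rrt_2d(start, goal, is_free, step=0.2, goal_tol=0.3,
           max_iters=2000, bounds=((0.0, 10.0), (0.0, 10.0))):
    """Minimal 2D RRT sketch (illustrative, not iGibson's implementation).

    start, goal: (x, y) tuples; is_free(p) returns True if point p is collision-free.
    Returns a list of waypoints from start to goal, or None on failure.
    """
    nodes = [start]
    parents = {0: None}
    for _ in range(max_iters):
        # Sample a random point, biased towards the goal 5% of the time.
        sample = goal if random.random() < 0.05 else (
            random.uniform(*bounds[0]), random.uniform(*bounds[1]))
        # Find the nearest existing node and steer towards the sample by one step.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0:
            continue
        new = (near[0] + step * (sample[0] - near[0]) / d,
               near[1] + step * (sample[1] - near[1]) / d)
        if not is_free(new):
            continue
        parents[len(nodes)] = i_near
        nodes.append(new)
        if math.dist(new, goal) < goal_tol:
            # Reconstruct the path by following parent pointers back to the start.
            path, i = [goal], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None
```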

    F. Human-iGibson Interface

While the main purpose of iGibson is to develop and train simulated robots, it is very helpful for humans to be able to navigate and interact in iGibson scenes. This can be used to evaluate the difficulty or feasibility of a task, change the scene into a better initial state, or generate demonstrations that can be used for imitation learning. While sophisticated interfaces such as virtual reality or a 3D mouse may provide a more intuitive and natural experience, most users do not have the necessary hardware. In iGibson, we provide a human-iGibson interface based on mouse and key commands on a viewer window. The user can navigate and interact with (pull, push, pick and place) objects. This is a natural interface to efficiently collect demonstrations of manipulation tasks for imitation. Moreover, the human-iGibson mouse interface integrates with the motion planner, by which the user can command the robot into desired base and/or arm configurations. We verify that this interface facilitates efficient development of interactive robotic solutions in Sec. IV-C.

    IV. EXPERIMENTS

The goal of our experiments is to answer how iGibson's features help to develop AI agents. Specifically, we examine:
• (Sec. IV-A) Does domain randomization help visual navigation agents generalize to unseen scenes?
• (Sec. IV-B) Can policies using iGibson-generated signals (LiDAR) transfer to the real world?
• (Sec. IV-C) Can the human-iGibson interface be used to efficiently train imitation learning agents?
• (Sec. IV-D) Does the full interactability of the scenes allow agents to learn visual representations that accelerate learning of downstream manipulation tasks?

    A. Domain Randomization for Visual Navigation

In the first set of experiments, we evaluate the benefits brought by our integrated domain randomization feature, both in material and object shape. We compare the generalization capabilities of visual-based reinforcement learning policies trained with and without domain randomization.

First, we evaluate the performance of policies trained in iGibson for PointGoal tasks [63]. Here, a mobile robot (Locobot) needs to navigate to a goal location while using virtual depth maps to avoid collisions. The robot makes decisions at 10 Hz in the form of twist commands. The observations for the policy include the depth maps, the robot's linear and angular velocities, the goal location in the robot's reference frame, and the location of the next 10 waypoints in the shortest path between the robot's current location and the goal location, separated by 0.2 m. The path is computed using one of the included motion planners (see Sec. III-E) based on iGibson's provided traversability maps. Note that the traversability maps only depict the room layout and do not include objects, so the robot mainly relies on depth maps for obstacle avoidance. The task is successful if the robot gets closer than 0.36 m to the goal location (the size of the robot). The reward is shaped by the geodesic distance to the goal and has a collision penalty [18].
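One common form of such goal-distance shaping with a collision penalty is sketched below; the coefficients are illustrative, not the exact values used in our experiments (which follow [18] and the released configs).

```python
def point_nav_step_reward(prev_geodesic, curr_geodesic, collided,
                          success_dist=0.36, collision_penalty=-0.1,
                          success_reward=10.0):
    """Sketch of a shaped PointGoal reward: progress along the geodesic path,
    a penalty on collisions, and a bonus when the goal is reached.
    collision_penalty and success_reward are illustrative coefficients."""
    reward = prev_geodesic - curr_geodesic        # positive when moving towards the goal
    if collided:
        reward += collision_penalty
    success = curr_geodesic < success_dist        # within the robot's footprint radius
    if success:
        reward += success_reward
    return reward, success
```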

Second, we evaluate the performance of policies trained in iGibson for the robot to navigate to an object (a lamp) using virtual RGB images. The goal is to have at least 5% of the image occupied by pixels of the target object. The same lamp is randomly sampled at different locations in the room in which the robot is initially placed. The observation for the policy only includes RGB images. We use the instance segmentation channel to compute the reward and success criterion, but do not provide it as input to the policy.
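This success criterion amounts to a simple pixel-fraction check on the instance segmentation channel, as sketched below.

```python
import numpy as np

def object_nav_success(instance_seg, target_id, threshold=0.05):
    """Success test for object navigation: at least 5% of the image pixels
    belong to the target object, identified here by its instance id."""
    return np.mean(instance_seg == target_id) >= threshold
```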

We train in eleven fully interactive scenes and evaluate in four held-out scenes with held-out visual textures. Refer to Appendix VI-A.1 for more details of the experiments.

Results: For both experiments, iGibson's domain randomization helps policies generalize better. For PointGoal navigation based on depth images, the performance goes from 0.27 to 0.40 SPL [63] and from 31.25% to 44.75% success rate when using randomization, indicating that the larger variety of shapes observed in the training process generates more robust depth-based policies. The full table is shown in Table II. For object navigation based on RGB images, the performance goes from 49.75% to 57.5% success rate, indicating that material randomization helps in obtaining RGB-based policies that are more generalizable to unseen scenes and textures. The full table is shown in Table III.

    B. LiDAR-Based Point-to-Point Navigation

In the second set of experiments, we examine the efficacy of our alternative virtual sensing modalities to train sensorimotor policies in iGibson, and assess how well those policies transfer to the real world without adaptation. Here, we use virtual LiDAR signals to train PointGoal navigation policies. The observations for the policy include a 1D LiDAR scan with 512 laser rays, the robot's linear and angular velocities, and the goal location in the robot's reference frame. To focus on sim2real transferability, we train in our scene Rs_int, for which we have the real-world counterpart (see Fig. 4, left). For the evaluation, we sample 15 pairs of initial and goal robot locations and test our trained policies three times on each of them, both in simulation and in real, leading to 45 episodes in the real and simulated apartment.

Fig. 4: Left: Robot navigating the real-world counterpart of the iGibson scene Rs_int. The robot executes a policy trained in simulation with virtual LiDAR signals, without domain adaptation. The quality and realism of iGibson allows for zero-shot policy transfer. Right: Imitation learning for pick-and-place operations. Robot executing the imitation policy trained using 50 demonstrations collected with our human-iGibson interface. The interface allows users to efficiently provide demonstrations (< 15 sec/demo) in continuous or discrete (enabled by motion planners) action spaces to train robots for navigation and manipulation tasks.

Results: The LiDAR-based agent achieves a 33% success rate in Rs_int in iGibson, while in the real-world apartment, the same agent achieves a 24% success rate. Most failures occur for the same pairs of initial and goal locations in iGibson and in the real world. The slight drop in performance suggests that the LiDAR signals generated in iGibson are realistic enough to allow for zero-shot policy transfer.

    C. Imitation Learning: Human Demonstrated Manipulation

In the third set of experiments, we evaluate the usability of the human-iGibson interface to efficiently collect demonstrations for imitation learning. We collect 50 demonstrations of pick-and-place operations: pick a mug and place it in the sink (Fig. 4, right). Each demo requires less than 15 s, and we store them as pairs of state (object position) and action (desired delta translation to add to the object position in the next step). We use 20 different mug models in this process.

After collecting demonstrations, we train a behavioral cloning policy that maps states to actions. The action space contains two parts: a 3-dimensional continuous space representing the desired delta of the object position (to be added to the current object position), and a 1-dimensional discrete space representing the open/close state of the gripper. We use a simulated mobile manipulator, a Fetch robot, for evaluation. The first value of the policy is interpreted as a desired location for the arm to move to and grasp. We use one of the integrated motion planners (BiRRT) to plan an arm motion trajectory to the object position and close the fingers. Afterwards, the policy outputs the desired motion (delta in position) for the current object position at 20 Hz. Assuming a firm grasp, this is the same as the desired end-effector motion. The resulting Cartesian position is given to an inverse kinematics solver that provides the joint states to move the end-effector, which we execute with a joint position controller. The policy also outputs whether to open the gripper, which indicates the end of an episode. The task succeeds if the policy brings the object (a mug) into the target area (the sink) within the time budget. We test generalization with 5 unseen mug models during evaluation.

Fig. 5: Left: Example result of interaction pretraining. The agent receives RGB input and predicts whether pixels are pushable (red: higher probability; blue: lower) (Sec. IV-D). The model learns to associate the edges of doors with the most pushable points. Right: Training curves (mean return vs. number of steps) for two interactive tasks (PushDrawer, PushCabinet) with and without interaction pretraining. While policies reach similar or higher final performance, using the pretrained representation accelerates training. The full interactability of iGibson allows the agent to first derive a visual interactive representation that accelerates training of downstream mobile manipulation tasks.

Results: After reaching 96.75% accuracy in training, we evaluate the policy on the simulated Fetch robot, achieving a 98% success rate over 100 evaluation episodes. This experiment indicates that the human-iGibson interface enables easy collection of effective demonstrations for imitation learning.

    D. Pretraining in Fully Interactive Scenes

In the fourth and final set of experiments, we evaluate the potential of using iGibson's fully interactive scenes to learn an intermediate visual representation that encodes the expected outcome of interactions with different objects. Such an intermediate visual representation may be used to accelerate robot learning of manipulation tasks, since they typically require the agent to associate visual observations with promising areas of interaction to change the state of the scene towards a manipulation goal.

To learn such representations, we set up a virtual agent that interacts with random points in the scenes and learns to predict the outcome of these interactions. The interaction is parameterized as a coordinate in the virtual agent's image observation space. We emulate a pushing interaction by displacing the corresponding 3D location of the selected pixel by 30 cm in the opposite direction of the surface normal, applying a maximum force of 60 N (a common payload of commercial robots). A motion of the point by more than 10 cm is considered a success. We sample 10 random pushes at each location, and 4,000 locations in each scene. We use the images annotated with interaction successes/failures to train a U-Net [64]-based visual encoder that predicts heatmaps of expected interaction success from RGB input.
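The self-supervised labeling loop can be sketched as follows; apply_push stands in for the physical probe executed in the simulator (maximum force 60 N) and is a hypothetical callback, as are the array layouts.

```python
import numpy as np

def collect_push_labels(normals, apply_push, num_probes=10,
                        push_dist=0.30, success_thresh=0.10):
    """Probe random pixels and label them as interactable (1) or not (0).

    normals: (H, W, 3) surface normals for the current RGB observation.
    apply_push(pixel, direction, distance): hypothetical callback that performs
    the push in simulation and returns how far the contacted point moved (meters).
    Unprobed pixels are left as -1.
    """
    h, w, _ = normals.shape
    labels = np.full((h, w), -1, dtype=np.int8)
    for _ in range(num_probes):
        v, u = np.random.randint(0, h), np.random.randint(0, w)
        push_dir = -normals[v, u]                 # push against the surface normal
        moved = apply_push((v, u), push_dir, push_dist)
        labels[v, u] = int(moved > success_thresh)
    return labels
```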

For the second phase, we train two policy networks for two manipulation tasks, respectively (PushDrawer, PushCabinet). The goal is to close the drawers or the cabinets. The outputs of the policy are points to interact with (push), which are given to one of our integrated motion planners to generate an arm motion [10]. We use DQN [65] as the policy learning algorithm. The predicted interaction heatmaps are used to gate the Q-value maps predicted by the network.
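A simple way to realize this gating is to multiply the per-pixel Q-value map by the predicted interactability heatmap before selecting the pixel to push; the multiplicative form shown here is an assumption, and the exact gating used in our implementation may differ.

```python
import torch

def select_push_pixel(q_map, interactability, eps=1e-6):
    """Gate a per-pixel Q-value map (H, W) with the pretrained interactability
    heatmap (H, W) and return the (row, col) of the best gated action.
    The multiplicative gating is an illustrative assumption."""
    gated = q_map * (interactability + eps)
    flat_idx = torch.argmax(gated)
    return divmod(int(flat_idx), q_map.shape[1])
```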

Results: Fig. 5 (left) depicts the result of the pretrained visual model. We observe that the heatmap has stronger activation at the edge of the door than in the area closer to the hinge, and closed cabinets are estimated as not pushable, demonstrating a correct correlation with the best areas to push to successfully cause motion (further visualizations on the project website). For both downstream tasks, we observe that using the pre-trained representation significantly accelerates training (Fig. 5, right). This suggests that the full interactability of iGibson can help agents learn useful visual representations for downstream mobile manipulation tasks.

    V. CONCLUSION

We presented iGibson, a novel simulation environment for developing interactive robotic agents in large-scale realistic scenes. iGibson includes fifteen fully interactive scenes, and novel capabilities to generate high-quality virtual sensor signals, domain randomization, integration with motion planners, and an efficient human-iGibson interface. Through experiments, we showcased that iGibson helps to develop robust policies for navigation and manipulation. We hope that iGibson can aid researchers in solving complex robotics problems in large-scale realistic scenes.

    VI. ACKNOWLEDGEMENT

We thank Nvidia, Google, ONR MURI (N00014-14-1-0671), ONR (1165419-10-TDAUZ), Panasonic (1192707-1-GWMSX), Qualcomm and Samsung for their support on this project.

REFERENCES
[1] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[2] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[3] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., "Learning to navigate in complex environments," arXiv preprint arXiv:1611.03673, 2016.
[4] E. Parisotto and R. Salakhutdinov, "Neural map: Structured memory for deep reinforcement learning," arXiv preprint arXiv:1702.08360, 2017.
[5] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
[6] W. B. Shen, D. Xu, Y. Zhu, L. J. Guibas, L. Fei-Fei, and S. Savarese, "Situational fusion of visual representation for visual navigation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2881–2890.
[7] K. Chen, J. P. de Vicente, G. Sepulveda, F. Xia, A. Soto, M. Vázquez, and S. Savarese, "A behavioral approach to visual navigation with graph localization networks," arXiv preprint arXiv:1903.00445, 2019.
[8] C. R. Garrett, T. Lozano-Perez, and L. P. Kaelbling, "Ffrob: Leveraging symbolic planning for efficient task and motion planning," The International Journal of Robotics Research, vol. 37, no. 1, pp. 104–136, 2018.
[9] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese, "Neural task programming: Learning to generalize across hierarchical tasks," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[10] F. Xia, C. Li, R. Martín-Martín, O. Litany, A. Toshev, and S. Savarese, "Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation," arXiv preprint arXiv:2008.07792, 2020.
[11] E. Li, R. Martín-Martín, F. Xia, and S. Savarese, "Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators," in 2019 Conference on Robot Learning (CoRL), 2019.
[12] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, "Rlbench: The robot learning benchmark & learning environment," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020.
[13] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, "Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning," in Conference on Robot Learning, 2020, pp. 1094–1100.
[14] Y. Urakami, A. Hodgkinson, C. Carlin, R. Leu, L. Rigazio, and P. Abbeel, "Doorgym: A scalable door opening environment and baseline agent," arXiv preprint arXiv:1908.01887, 2019.
[15] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang et al., "Sapien: A simulated part-based interactive environment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11097–11107.
[16] Y. Lee, E. S. Hu, Z. Yang, A. Yin, and J. J. Lim, "IKEA furniture assembly environment for long-horizon complex manipulation tasks," arXiv preprint arXiv:1911.07246, 2019.
[17] Y. Zhu, J. Wong, A. Mandlekar, and R. Martín-Martín, "robosuite: A modular simulation framework and benchmark for robot learning," arXiv preprint arXiv:2009.12293, 2020.
[18] Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, "Habitat: A Platform for Embodied AI Research," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[19] E. Kolve et al., "Ai2-thor: An interactive 3d environment for visual ai," arXiv preprint arXiv:1712.05474, 2017.
[20] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba, "Virtualhome: Simulating household activities via programs," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8494–8502.
[21] A. X. Chang et al., "Shapenet: An information-rich 3d model repository," arXiv preprint arXiv:1512.03012, 2015.
[22] X. Wang, B. Zhou, Y. Shi, X. Chen, Q. Zhao, and K. Xu, "Shape2motion: Joint analysis of motion parts and attributes from 3d shapes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8876–8884.
[23] A. Kalervo, J. Ylioinas, M. Häikiö, A. Karhu, and J. Kannala, "Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis," in Scandinavian Conference on Image Analysis. Springer, 2019, pp. 28–40.
[24] H. Fu, B. Cai, L. Gao, L. Zhang, C. Li, Z. Xun, C. Sun, Y. Fei, Y. Zheng, Y. Li et al., "3d-front: 3d furnished rooms with layouts and semantics," arXiv preprint arXiv:2011.09127, 2020.
[25] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese, "Gibson env: Real-world perception for embodied agents," in Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[26] C. Gan, J. Schwartz, S. Alter, M. Schrimpf, J. Traer, J. De Freitas, J. Kubilius, A. Bhandwaldar, N. Haber, M. Sano et al., "Threedworld: A platform for interactive multi-modal physical simulation," arXiv preprint arXiv:2007.04954, 2020.
[27] E. Coumans et al., "Bullet physics library," Open source: bulletphysics.org, vol. 15, no. 49, p. 5, 2013.
[28] E. Todorov, T. Erez, and Y. Tassa, "Mujoco: A physics engine for model-based control," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 5026–5033.
[29] J. Lee, M. X. Grey, S. Ha, T. Kunz, S. Jain, Y. Ye, S. S. Srinivasa, M. Stilman, and C. K. Liu, "Dart: Dynamic animation and robotics toolkit," Journal of Open Source Software, vol. 3, no. 22, p. 500, 2018.
[30] "Unity," https://unity.com/, accessed: 2020-10-30.
[31] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox, "Gpu-accelerated robotic simulation for distributed reinforcement learning," arXiv preprint arXiv:1810.05762, 2018.
[32] "Ode," https://ode.org/, accessed: 2020-10-30.
[33] Z. Erickson, V. Gangaram, A. Kapusta, C. K. Liu, and C. C. Kemp, "Assistive gym: A physics simulation framework for assistive robotics," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 10169–10176.
[34] A. Chang et al., "Matterport3d: Learning from rgb-d data in indoor environments," in 2017 International Conference on 3D Vision (3DV). IEEE, 2017, pp. 667–676.
[35] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese, "Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 713–720, 2020.
[36] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "Openai gym," arXiv preprint arXiv:1606.01540, 2016.
[37] D. Merkel, "Docker: lightweight linux containers for consistent development and deployment," Linux Journal, vol. 2014, no. 239, p. 2, 2014.
[38] X. Meng, N. Ratliff, Y. Xiang, and D. Fox, "Scaling local control to large-scale topological navigation," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 672–678.
[39] ——, "Neural autonomous navigation with riemannian motion policy," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8860–8866.
[40] K. Kang, S. Belkhale, G. Kahn, P. Abbeel, and S. Levine, "Generalization through simulation: Integrating simulated and real data into deep reinforcement learning for vision-based autonomous flight," International Conference on Robotics and Automation (ICRA), 2019.
[41] N. Hirose, F. Xia, R. Martín-Martín, A. Sadeghian, and S. Savarese, "Deep visual mpc-policy learning for navigation," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3184–3191, 2019.
[42] "iGibson Sim2Real Challenge at CVPR2020." [Online]. Available: http://svl.stanford.edu/igibson/challenge.html
[43] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su, "Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 909–918.
[44] C. Schlick, "An inexpensive brdf model for physically-based rendering," in Computer Graphics Forum, vol. 13, no. 3. Wiley Online Library, 1994, pp. 233–246.
[45] L. Porzi, S. R. Bulo, A. Penate-Sanchez, E. Ricci, and F. Moreno-Noguer, "Learning depth-aware deep representations for robotic perception," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 468–475, 2016.
[46] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova, "Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8001–8008.
[47] C. Xie, Y. Xiang, A. Mousavian, and D. Fox, "The best of both modes: Separately leveraging rgb and depth for unseen object instance segmentation," in Conference on Robot Learning. PMLR, 2020, pp. 1369–1378.
[48] I. Nematollahi, O. Mees, L. Hermann, and W. Burgard, "Hindsight for foresight: Unsupervised structured dynamics models from physical interaction," arXiv preprint arXiv:2008.00456, 2020.
[49] C. Choi and H. I. Christensen, "Rgb-d object pose estimation in unstructured environments," Robotics and Autonomous Systems, vol. 75, pp. 595–613, 2016.
[50] C. Wang, R. Martín-Martín, D. Xu, J. Lv, C. Lu, L. Fei-Fei, S. Savarese, and Y. Zhu, "6-pack: Category-level 6d pose tracker with anchor-based keypoints," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 10059–10066.
[51] Y. Hu, J. Hugonot, P. Fua, and M. Salzmann, "Segmentation-driven 6d object pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3385–3394.
[52] M. Yan, Q. Sun, I. Frosio, S. Tyree, and J. Kautz, "How to close sim-real gap? Transfer with segmentation!" arXiv preprint arXiv:2005.07695, 2020.
[53] F. Sadeghi and S. Levine, "Cad2rl: Real single-image flight without a single real image," arXiv preprint arXiv:1611.04201, 2016.
[54] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
[55] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, "Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12627–12637.
[56] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
[57] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.
[58] S. M. LaValle, "Rapidly-exploring random trees: A new tool for path planning," Department of Computer Science, Iowa State University, Tech. Rep., 1998.
[59] A. H. Qureshi and Y. Ayaz, "Intelligent bidirectional rapidly-exploring random trees for optimal motion planning in complex cluttered environments," Robotics and Autonomous Systems, vol. 68, pp. 1–11, 2015.
[60] R. Bohlin and L. E. Kavraki, "Path planning using lazy prm," in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), vol. 1. IEEE, 2000, pp. 521–528.
[61] Caelan Reed Garrett, "PyBullet Planning," https://pypi.org/project/pybullet-planning/, 2018.
[62] K. Hauser and V. Ng-Thow-Hing, "Fast smoothing of manipulator trajectories using optimal bounded-acceleration shortcuts," in 2010 IEEE International Conference on Robotics and Automation. IEEE, 2010, pp. 2493–2498.
[63] P. Anderson et al., "On evaluation of embodied navigation agents," arXiv preprint arXiv:1807.06757, 2018.
[64] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[65] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[66] S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille, "Weight standardization," arXiv preprint arXiv:1903.10520, 2019.
[67] T. Haarnoja et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[68] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "Ros: an open-source robot operating system," in ICRA Workshop on Open Source Software, vol. 3, no. 3.2. Kobe, Japan, 2009, p. 5.
[69] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

    APPENDIX

    A. Details on the Experimental Evaluation

In this section, we provide additional information on the experiments presented in Sec. IV, such as training processes and architectures, or system characteristics.

    1) Domain Randomization for Visual Navigation: ForPointGoal navigation, as mentioned in the main paper, theobservations for the policy include the depth maps, therobot’s linear and angular velocities, the goal location inthe robot’s reference frame, and the location of the next 10waypoints in the shortest path between the robot’s currentlocation and the goal location. We concatenate the robot’slinear and angular velocities, the goal location, and thelocation of the next 10 waypoints together and denote themby sensor observations. For both depth maps and sensorobservations, we adopt a frame stack of 4. The encoder of thepolicy network consists of 3 parts: (a) a 9-layer ResNet withWeight Standardization [66] and GroupNorm, followed by a3-layer Conv1D block to encode depth maps; (b) a 2-layerMLP to encode sensor observations; (c) a 2-layer fusion MLPto encode the concatenation of depth maps embedding andsensor observations embedding. The learning rate for SACis 1e-4. For object navigation, the observation only includesa frame stack of 4 RGB images. The encoder of the policynetwork is a 9-layer ResNet with Weight Standardizationand GroupNorm, followed by a 3-layer Conv1D block toencode RGB images. The learning rate for SAC is 5e-4. Toaccelerate the training of our visual policies, we implement a


    To accelerate the training of our visual policies, we implement a multi-GPU distributed reinforcement learning pipeline based on SAC [67]. iGibson can be easily parallelized and deployed on large computing clusters, which makes training highly efficient.
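    To make the encoder structure concrete, the following PyTorch sketch implements a simplified two-stream encoder with Weight Standardization [66] and GroupNorm. It is a sketch rather than the released implementation: the small convolutional trunk stands in for the 9-layer ResNet followed by the Conv1D block, and the hidden sizes and the sensor-observation dimensionality (two velocities, a 2-D goal, and ten 2-D waypoints) are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization [66]: each filter is normalized
    to zero mean and unit variance before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def ws_gn_block(c_in, c_out, stride=1, groups=16):
    return nn.Sequential(
        WSConv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.GroupNorm(groups, c_out),
        nn.ReLU(inplace=True),
    )

class NavEncoder(nn.Module):
    """Two-stream encoder: stacked depth frames go through a convolutional
    trunk, stacked sensor observations through an MLP, and a small MLP fuses
    the two embeddings."""
    def __init__(self, frame_stack=4, sensor_dim=24, feat_dim=256):
        super().__init__()
        # Simplified stand-in for the 9-layer ResNet + Conv1D block.
        self.depth_trunk = nn.Sequential(
            ws_gn_block(frame_stack, 32, stride=2),
            ws_gn_block(32, 64, stride=2),
            ws_gn_block(64, 128, stride=2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # 2-layer MLP for velocities, goal, and waypoints (24-D assumed).
        self.sensor_mlp = nn.Sequential(
            nn.Linear(frame_stack * sensor_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # 2-layer fusion MLP over the concatenated embeddings.
        self.fusion = nn.Sequential(
            nn.Linear(128 + 128, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )

    def forward(self, depth, sensors):
        # depth: (B, frame_stack, H, W); sensors: (B, frame_stack * sensor_dim)
        return self.fusion(torch.cat([self.depth_trunk(depth),
                                      self.sensor_mlp(sensors)], dim=-1))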

    Tables II and III include a breakdown of the results of this experiment for different scenes and phases of the training process.

    2) LiDAR-Based Point-to-Point Navigation: As mentioned in the main paper, the observations for the policy include a 1D LiDAR scan with 512 laser rays, the robot's linear and angular velocities, and the goal location in the robot's reference frame. As in PointGoal navigation (Sec. VI-A.1), we concatenate the robot's linear velocity, angular velocity, and the goal location, and denote them the sensor observations. The controller runs at 10 Hz. For both LiDAR scans and sensor observations, we adopt a frame stack of 8. The encoder of the policy network consists of three parts: (a) a 3-layer MLP to encode LiDAR scans; (b) a 3-layer MLP to encode sensor observations; and (c) a 3-layer fusion MLP to encode the concatenation of the LiDAR-scan embedding (flattened along the temporal dimension) and the sensor-observation embedding. The learning rate for SAC is 1e-4.
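    A compact PyTorch sketch of this LiDAR encoder is given below; the hidden-layer widths and the 4-dimensional sensor observation (linear velocity, angular velocity, 2-D goal) are assumptions, while the three-MLP structure and the frame stack of 8 follow the text.

import torch
import torch.nn as nn

def mlp(sizes):
    """Stack of Linear + ReLU layers with the given widths."""
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    return nn.Sequential(*layers)

class LidarNavEncoder(nn.Module):
    """Encodes a stack of 8 LiDAR scans (512 rays each) plus the matching
    stack of sensor observations."""
    def __init__(self, frame_stack=8, n_rays=512, sensor_dim=4, feat_dim=256):
        super().__init__()
        self.scan_mlp = mlp([frame_stack * n_rays, 512, 256, 128])     # 3 layers
        self.sensor_mlp = mlp([frame_stack * sensor_dim, 64, 64, 64])  # 3 layers
        self.fusion = mlp([128 + 64, feat_dim, feat_dim, feat_dim])    # 3 layers

    def forward(self, scans, sensors):
        # scans: (B, frame_stack, n_rays); sensors: (B, frame_stack, sensor_dim)
        z = torch.cat([self.scan_mlp(scans.flatten(1)),
                       self.sensor_mlp(sensors.flatten(1))], dim=-1)
        return self.fusion(z)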

    As the real robot platform we use a Locobot (http://www.locobot.org/), a non-holonomic mobile base, with an additionally mounted Hokuyo LiDAR sensor (see Fig. 4, left). The simulated agent matches the characteristics of the real robot and sensor. The robot is controlled via ROS [68]. Processing the sensor data requires more computation than the Locobot's onboard computer provides; thus, we send the data to a desktop computer that hosts the policy and generates commands that are sent back to the robot. We use the system we developed for the CVPR challenge "Sim2Real with iGibson" [42] (http://svl.stanford.edu/igibson/challenge.html).
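    A minimal sketch of this off-board control loop is shown below, assuming standard ROS topics (/scan, /odom, /cmd_vel) and a placeholder policy object; the actual system built for the challenge may use different topics and message handling.

#!/usr/bin/env python
# Off-board inference sketch: the desktop machine subscribes to the robot's
# LiDAR and odometry, evaluates the policy, and publishes velocity commands.
import rospy
from sensor_msgs.msg import LaserScan
from nav_msgs.msg import Odometry
from geometry_msgs.msg import Twist

class RemotePolicyNode:
    def __init__(self, policy):
        self.policy = policy
        self.scan, self.odom = None, None
        rospy.Subscriber("/scan", LaserScan, self.on_scan)
        rospy.Subscriber("/odom", Odometry, self.on_odom)
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)

    def on_scan(self, msg):
        self.scan = msg.ranges

    def on_odom(self, msg):
        self.odom = msg.twist.twist

    def step(self, _event):
        if self.scan is None or self.odom is None:
            return
        lin, ang = self.policy(self.scan, self.odom)  # runs on the desktop GPU
        cmd = Twist()
        cmd.linear.x, cmd.angular.z = lin, ang
        self.cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("remote_policy")
    node = RemotePolicyNode(policy=lambda scan, odom: (0.0, 0.0))  # dummy policy
    rospy.Timer(rospy.Duration(0.1), node.step)  # 10 Hz, matching the controller rate
    rospy.spin()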

    3) Imitation Learning: Human Demonstrated Manipulation: We collected 50 demonstrations of pick-and-place operations, demonstrated with our Human-iGibson interface, in which a mug is picked and placed in the sink (Fig. 4, right), using 20 different mug models. We split the demonstrations into train/val/test sets of 35, 10, and 5 demos, containing 1095, 304, and 157 state-action pairs, respectively. The state includes the object position, and the action includes the desired delta translation to add to the object position in the next step and whether to open the gripper. The starting position of the mug is randomized within a 50 cm by 10 cm area.

    We trained a policy using behavioral cloning to map states to actions. The policy network is composed of three MLP layers with ReLU activations. We used an MSE loss for the delta-translation action and a cross-entropy loss for the binary gripper action. We used the Adam optimizer [69] with a learning rate of 0.1 and trained the policy for 1000 epochs, stopping once the validation loss plateaued.
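    A minimal PyTorch sketch of this behavioral-cloning setup is shown below. The three-layer MLP, the MSE and binary cross-entropy losses, and the Adam learning rate of 0.1 follow the text; the hidden width, the 3-D state/action dimensions, and the single fused output head are assumptions.

import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Three MLP layers with ReLU mapping the mug position to a 3-D delta
    translation and a gripper-open logit (widths and dimensions assumed)."""
    def __init__(self, state_dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 1),   # delta translation + gripper logit
        )

    def forward(self, state):
        out = self.net(state)
        return out[..., :3], out[..., 3]

def bc_loss(policy, state, delta_gt, grip_gt):
    """MSE on the translation action, binary cross entropy on the gripper."""
    delta_pred, grip_logit = policy(state)
    return (nn.functional.mse_loss(delta_pred, delta_gt)
            + nn.functional.binary_cross_entropy_with_logits(grip_logit, grip_gt))

policy = BCPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=0.1)  # learning rate from the text
state, delta_gt = torch.randn(32, 3), torch.randn(32, 3)   # dummy batch
grip_gt = torch.randint(0, 2, (32,)).float()
loss = bc_loss(policy, state, delta_gt, grip_gt)
optimizer.zero_grad()
loss.backward()
optimizer.step()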

    For evaluation, we deployed the policy on the simulated Fetch robot and achieved a 98% success rate over 100 evaluation episodes. The two failed episodes were caused by the robot prematurely opening its gripper and the mug being dropped outside the sink. The robot has a time budget of 500 action steps (25 seconds) to accomplish the task.

    Fig. 6: Execution of (a) the PushDrawer task, in which the agent successfully pushes the top drawers in, and (b) the PushCabinet task, in which the agent pushes the cabinet doors closed.

    4) Pretraining in Fully Interactive Scenes: In this section we show the details of the network used in pretraining and the details of policy training in the downstream tasks.

    The network used in pretraining has a UNet structure. The input to the UNet is an RGB image of size 128 × 128, and the output is a binary mask indicating which areas are interactable. The UNet consists of an encoder with a ResNet9 architecture, four (upsampling, convolution) blocks, and finally a readout module of two convolutional layers that predicts the binary mask.

    For the downstream tasks, we focus on pushing. The observation space is RGB images of size 128 × 128, and the action space is a point on the image. To give the reader an intuitive sense, example trajectories of the two tasks are shown in Fig. 6. The baseline model for these tasks is a DQN with dense Q-value prediction. The policy network uses a 6-layer convolutional neural network to predict an array of Q-values with the same shape as the input image. The agent picks the pixel with the highest associated Q-value and uses a motion planner to plan a push motion. The method that integrates pretraining modifies the baseline by masking the predicted Q-values with the predicted interaction mask: only pixels predicted to be interactable keep their Q-values, and the rest are zeroed out. Both setups use Q-learning with ε-greedy exploration to update the network; we use a learning rate of 1e-3, a discount factor of 0.99, and ε = 0.2 for exploration, and the policies are trained for 50,000 steps.
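    The masking step can be illustrated with a few lines of NumPy: the predicted interaction mask zeroes out Q-values at non-interactable pixels before the argmax, and an ε-greedy rule occasionally samples a random pixel instead. Shapes and thresholds follow the description above; this is a sketch, not the training code.

import numpy as np

def select_push_pixel(q_values, interaction_mask=None):
    """Pick the pixel to push. `q_values` is the dense H x W Q-map predicted
    by the DQN head; `interaction_mask` is the binary map from the pretrained
    UNet. With pretraining, Q-values outside the predicted interactable region
    are zeroed out before taking the argmax."""
    q = np.asarray(q_values, dtype=np.float32)
    if interaction_mask is not None:
        q = q * (np.asarray(interaction_mask) > 0.5)
    return np.unravel_index(np.argmax(q), q.shape)  # (row, col) of the chosen pixel

def epsilon_greedy_pixel(q_values, interaction_mask=None, eps=0.2, rng=None):
    """Epsilon-greedy exploration over pixels (eps = 0.2 as in the text)."""
    rng = rng or np.random.default_rng()
    h, w = np.asarray(q_values).shape
    if rng.random() < eps:
        return int(rng.integers(h)), int(rng.integers(w))
    return select_push_pixel(q_values, interaction_mask)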

    As shown in the main paper, with pretraining the agent learns faster. Fig. 6 depicts different stages of the interactions learned by an agent using our learned intermediate representation.

    B. Physics Based Rendering

    In this section we include additional information about the shading model and rendering process we use in iGibson.

    Shading Models: To improve realism when rendering images of objects, we represent the surface material properties of each object using four layers: 1) a metallic layer (single-channel image), 2) a roughness layer (single-channel image), 3) albedo (three-channel image), and 4) a tangent-space normal map (three-channel image).


    TABLE II: Quantitative results on PointGoal navigation.

    Domain          Training   Rs_int         Beechwood_0_int   Merom_0_int    Pomaria_0_int   Overall
    Randomization   Steps      SPL    SR      SPL    SR         SPL    SR      SPL    SR       SPL    SR
    w/o             400k       0.41   42%     0.21   21%        0.22   22%     0.27   27%      0.28   28%
    w/              400k       0.35   36%     0.25   26%        0.32   33%     0.25   25%      0.29   30%
    w/o             800k       0.50   56%     0.14   17%        0.22   25%     0.23   27%      0.27   31.15%
    w/              800k       0.60   67%     0.37   40%        0.29   33%     0.36   39%      0.40   44.75%

    TABLE III: Quantitative results on object navigation.

    Domain          Training   Rs_int   Beechwood_0_int   Merom_0_int   Pomaria_0_int   Overall
    Randomization   Steps      SR       SR                SR            SR              SR
    w/o             100k       43%      34%               47%           45%             42.25%
    w/              100k       40%      45%               52%           51%             47%
    w/o             200k       50%      46%               48%           55%             49.75%
    w/              200k       55%      54%               64%           57%             57.5%

    The information in these layers is combined by our physics-based rendering model. To generate correct lighting effects on the objects' surfaces, we create environment maps that light the scene. We also pre-integrate the Cook-Torrance specular bidirectional reflectance distribution function (BRDF) for varying roughness values and viewing directions to accelerate and enable real-time rendering. The results are saved into a two-dimensional look-up-table texture, which is used to scale and add bias to the Fresnel reflectance at normal incidence (F0) during rendering. F0 is then used to calculate Schlick's approximation of the Fresnel factor as follows:

    F(θ) = F0 + (1 − F0)(1 − cos θ)^5

    To calculate the specular highlights, we pre-filter the environment cube map using importance sampling of the GGX normal distribution function. The results are saved as an image for faster retrieval at render time. For the diffuse term, we use quasi-Monte Carlo sampling with a Hammersley sequence to approximate the integral.
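    As a minimal illustration, the following Python snippet evaluates Schlick's approximation and combines a pre-filtered environment color with a (scale, bias) sample from the pre-integrated BRDF look-up table, following the standard split-sum formulation; the renderer's actual shader code may differ in its details.

import numpy as np

def schlick_fresnel(cos_theta, f0):
    """Schlick's approximation F(theta) = F0 + (1 - F0)(1 - cos(theta))^5."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def split_sum_specular(prefiltered_color, f0, brdf_lut_sample):
    """Combine the pre-filtered environment color with the pre-integrated
    BRDF look-up value (scale, bias): specular = prefiltered * (F0*scale + bias)."""
    scale, bias = brdf_lut_sample
    return prefiltered_color * (f0 * scale + bias)

# Example: a dielectric surface (F0 ~= 0.04) viewed at a grazing angle.
print(schlick_fresnel(cos_theta=0.1, f0=0.04))   # reflectance rises toward 1
print(split_sum_specular(np.array([0.8, 0.8, 0.8]), 0.04, (0.9, 0.05)))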

    Light Probe Generation: To generate light probes, we use Blender (https://www.blender.org/) to bake high-resolution, high-dynamic-range (HDR) environment textures within the iGibson scenes. The light sources are artistically designed and placed on the ceiling.

    Shadow Mapping: To generate shadows, we use shadow mapping and simulate a uniform light in the +z direction with an orthographic projection. All objects in the scenes are shadow casters except for the ceiling. The shadows do not necessarily match real-world indoor lighting, but they create a realistic enough effect that helps with depth perception.

    C. iGibson Performance

    In this section we evaluate the speed of the iGibson simulation environment, analyzing the time for rendering, for physics simulation, and for their combination.

    1) Rendering Performance: We evaluate the performance and speed of our novel iGibson renderer under different settings. iGibson offers many configurable options to control the rendering quality. We benchmark two typical use cases.

    TABLE IV: Rendering Speed of iGibson Renderer [fps].

    Preset         Modality        Mean    Max      Min
    VisualRL       RGB             409.8   1142.7   270.4
                   Normal          530.9   1142.0   283.7
                   Point Cloud     530.3   1129.0   282.1
                   Semantic Mask   529.4   1140.4   281.7
                   Optical Flow    528.1   1129.1   281.6
                   Scene Flow      526.4   1129.9   280.1
    HighFidelity   RGB             188.4   289.3    114.3
                   Normal          219.6   310.1    148.9
                   Point Cloud     240.4   345.1    174.7
                   Semantic Mask   221.5   313.8    163.5
                   Optical Flow    240.1   348.9    170.8
                   Scene Flow      240.5   345.9    173.9

    TABLE V: Simulator Step Speed and Full Step Speed [Hz].

    Step type                Robot        Mean   Max   Min
    Simulator Physics Step   With robot   175    311   130
                             Scene only   310    797   92
    Simulator Full Step      With robot   100    150   68
                             Scene only   136    230   84

    The first use case is reinforcement learning (VisualRL), where the goal is to generate simulated sensor data (low resolution) as fast as possible. The second use case (HighFidelity) requires higher-quality, as-photorealistic-as-possible images while maintaining a minimum rendering speed; it is common when training perceptual solutions (e.g., segmentation, object detection) and when collecting human demonstrations, which benefits from photorealistic images and effects such as shadows.

    For the first use case, VisualRL, we render 128 × 128 images with physically based rendering on, multi-sample anti-aliasing off, and shadow mapping off. For the second use case, HighFidelity, we render 512 × 512 images with physically based rendering, multi-sample anti-aliasing, and shadow mapping all turned on. The results of our benchmark of the rendering speed for different modalities under these settings are shown in Table IV.
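    For reference, the two presets can be summarized as plain configuration dictionaries; the key names below are descriptive placeholders rather than the renderer's actual option names.

# Benchmark presets expressed as plain dictionaries (key names illustrative only).
VISUAL_RL = {
    "image_size": (128, 128),
    "physically_based_rendering": True,
    "msaa": False,            # multi-sample anti-aliasing off
    "shadow_mapping": False,
}
HIGH_FIDELITY = {
    "image_size": (512, 512),
    "physically_based_rendering": True,
    "msaa": True,
    "shadow_mapping": True,
}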


    Fig. 7: Examples of iGibson scenes created based on the CubiCasa5K and 3D-Front annotations. We provide over 8000 additional fully interactive iGibson scenes based on these two datasets.

    2) Physics Simulator and Full Simulator Performance: The size and number of objects included in our scenes are beyond what is typically included in other simulators or other projects based on PyBullet. We benchmarked the performance of the physics simulator and the final performance of iGibson, combining physics and rendering. In our experiments, the physics simulation timestep is set to 1/120 s. Each simulator step includes four steps of the physics simulator and one rendering pass, corresponding to rendering at 30 fps in simulation. The results of our analysis are shown in Table V. We achieve an average of 100 Hz for the full simulator step with a robot, and 136 Hz without a robot, which corresponds to 3.33× and 4.53× real time, respectively.
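    The stepping scheme and the reported real-time factors follow directly from these numbers, as the short calculation below shows.

# Stepping scheme used in the benchmark: physics at 1/120 s of simulated time,
# one rendering pass every four physics steps, i.e. sensor data at 30 fps.
PHYSICS_TIMESTEP = 1.0 / 120.0
PHYSICS_STEPS_PER_RENDER = 4
RENDER_TIMESTEP = PHYSICS_STEPS_PER_RENDER * PHYSICS_TIMESTEP   # = 1/30 s

def real_time_factor(full_steps_per_wallclock_second):
    """Simulated seconds advanced per wall-clock second."""
    return full_steps_per_wallclock_second * RENDER_TIMESTEP

print(round(real_time_factor(100), 2))  # 3.33x with the robot in the scene
print(round(real_time_factor(136), 2))  # 4.53x scene only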

    D. Integration of Additional Datasets of Scenes

    1) Integration of CubiCasa5K: CubiCasa5K [23] is a dataset of five thousand annotated floor plans of real-world homes in Finland. The annotated floor plans are semi-automatically generated and include the structural elements (walls, doors, windows) and the position and size of fixed furniture items (closets, toilets, benches, embedded cabinets, counters, sinks, . . . ). We convert these annotations into fully interactive iGibson 3D scenes with a two-step procedure: 1) generate a building based on the annotation of structural elements, and 2) populate the building with object models from our dataset based on the poses and sizes described in CubiCasa5K. Some of the CubiCasa5K floor plans include two separate floors; since we do not include outdoor navigation in iGibson, we split them into two separate indoor scenes. In total, we offer 6297 iGibson scenes based on the real-world layouts of CubiCasa5K. Fig. 7 depicts some examples of the scenes created based on the CubiCasa5K dataset.
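    As an illustration of step 2 of this conversion, the sketch below places object models at the annotated poses and rescales them to the annotated sizes. All class and function names here are hypothetical stand-ins, not part of the released pipeline.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FurnitureAnnotation:
    category: str
    pose: Tuple[float, float, float]   # x, y, yaw from the floor plan
    size: Tuple[float, float, float]   # annotated bounding-box extents

@dataclass
class PlacedObject:
    model_id: str
    pose: Tuple[float, float, float]
    scale: Tuple[float, float, float]

def populate_scene(furniture: List[FurnitureAnnotation],
                   models_by_category: Dict[str, Tuple[str, Tuple[float, float, float]]]
                   ) -> List[PlacedObject]:
    """Step 2 of the conversion: pick an object model of the annotated category
    and place it at the annotated pose, rescaled to the annotated size.
    (Step 1, building walls/doors/windows from the structural annotation, is a
    mesh-generation routine omitted here.)"""
    placed = []
    for item in furniture:
        model_id, native_size = models_by_category[item.category]
        scale = tuple(s / n for s, n in zip(item.size, native_size))
        placed.append(PlacedObject(model_id, item.pose, scale))
    return placed

# Toy usage with made-up entries:
models = {"sink": ("sink_01", (0.6, 0.5, 0.3))}
print(populate_scene([FurnitureAnnotation("sink", (1.0, 2.0, 0.0), (0.6, 0.5, 0.3))], models))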

    2) Integration of 3D-Front: 3D-Front [24] is a large dataset of layouts with room models populated with furniture. The layouts were created by artists and interior designers. It includes 18,797 rooms in around 10,000 houses, with 7,302 furniture models. We convert the static 3D-Front scenes into fully interactive iGibson 3D scenes with a two-step procedure: 1) we keep the original structural meshes of the buildings as visual meshes while procedurally generating collision meshes that approximate the geometry of the structural elements, and 2) we populate the buildings with object models from our dataset based on the poses and sizes described in 3D-Front.

    There are four challenges we faced when integrating 3D-Front and converting its scenes into interactive scenes for iGibson. First, some 3D-Front scenes contain objects of undefined categories, corresponding to one-of-a-kind pieces. We skip these scenes, since we cannot generate appropriate low-poly collision meshes for these objects. Second, some of the 3D data within 3D-Front is corrupted or contains shape errors (see the issue reported at https://github.com/3D-FRONT-FUTURE/3D-FRONT-ToolBox/issues/2#issuecomment-682678930). Third, the kitchen cabinets in 3D-Front are not annotated as individual objects; instead, the entire kitchen furniture is a single object with each panel (front and lateral panels, internal shelves) annotated as a separate part. This prevents us from generating interactive versions of the kitchen cabinets. We include two alternative versions of these scenes: a) a version with non-interactive kitchen cabinets, and b) a version without any kitchen cabinets. We expect this problem to be solved in future annotations of 3D-Front. Fourth, while the 3D-Front dataset includes a layout description of rooms and elements, including their positions and sizes, the furniture pieces are sometimes defined as overlapping significantly with each other (see the issue reported at https://github.com/3D-FRONT-FUTURE/3D-FRONT-ToolBox/issues/4). This has a severe effect on our physics simulation as it tries to resolve the penetrating contacts. To alleviate this issue, we remove objects that overlap with others by more than 80% of their volume. For objects that overlap less than 80%, we reduce their size (by no more than 20%) until the overlap is resolved. After all the aforementioned filtering and fixing processes, we obtained 2239 scenes with correct, non-overlapping object placements, ready to be used in interactive tasks.
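    The overlap-resolution heuristic can be sketched as follows for axis-aligned bounding boxes; this is a simplified illustration of the rule described above (remove at more than 80% overlap, shrink by at most 20%), not the exact procedure used to process 3D-Front.

from dataclasses import dataclass, replace

@dataclass
class Box:
    center: tuple   # (x, y, z) of the object's axis-aligned bounding box
    size: tuple     # (sx, sy, sz) extents

def overlap_fraction(a: Box, b: Box) -> float:
    """Fraction of a's volume covered by b (axis-aligned boxes only)."""
    inter = 1.0
    for ca, sa, cb, sb in zip(a.center, a.size, b.center, b.size):
        lo, hi = max(ca - sa / 2, cb - sb / 2), min(ca + sa / 2, cb + sb / 2)
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    return inter / (a.size[0] * a.size[1] * a.size[2])

def resolve_overlaps(boxes, max_overlap=0.8, max_shrink=0.2, step=0.02):
    """Drop boxes that overlap another by more than 80% of their volume;
    otherwise shrink them (around their center, by at most 20%) until the
    overlap disappears. Residual overlaps at the 20% limit are kept as-is."""
    kept = []
    for box in boxes:
        others = [b for b in boxes if b is not box]
        worst = max((overlap_fraction(box, b) for b in others), default=0.0)
        if worst > max_overlap:
            continue                       # too entangled: remove the object
        original_size, scale = box.size, 1.0
        while worst > 0.0 and scale > 1.0 - max_shrink + 1e-9:
            scale -= step
            box = replace(box, size=tuple(s * scale for s in original_size))
            worst = max((overlap_fraction(box, b) for b in others), default=0.0)
        kept.append(box)
    return kept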

    3) Comparing iGibson, CubiCasa5K and 3D-Front Scenes: There are two significant differences between the 15 fully interactive iGibson scenes and the ones obtained from the integration of CubiCasa5K and 3D-Front. The first is the density of objects: our curated iGibson scenes contain a density of 75 objects per room, which is more realistic and significantly higher than the density in 3D-Front rooms (37) and CubiCasa5K rooms (32). The second difference is in the albedo and material effects of walls and floors. In the 15 iGibson scenes, we bake the effects of ambient lighting and of the light sources on walls, floors, and ceilings into an RGB texture map, which provides a more realistic lighting effect. In contrast, the structural elements of CubiCasa5K and 3D-Front scenes do not have this enhanced diffuse color channel. Baking light sources into CubiCasa5K and 3D-Front scenes would require lighting information that is not contained in these datasets; it would take a manual annotation of light locations and intensities in thousands of scenes, an effort beyond the scope of our integration of these datasets. Despite these differences, we believe that access to these two large datasets of scenes significantly complements the smaller number (15) of higher-quality scenes provided with iGibson.

