Pranav Agarwal

I am a Ph.D. candidate at Mila, advised by Sheldon Andrews and Samira Ebrahimi Kahou, working on reinforcement learning for robotic applications and character animation. My research focuses on designing efficient reinforcement learning algorithms that leverage large generative models. Towards this goal, I have worked on generative models (transformers) for sample-efficient reinforcement learning and on automating reward modeling (from large offline trajectories) for complex robotic applications such as excavator automation.

Previously, I was a student researcher at Inria, where I collaborated with Natalia Díaz-Rodríguez and Raoul de Charette. I completed my Bachelor's in Electronics and Communication Engineering at IIIT Guwahati, where I was awarded the President's Gold Medal. During my bachelor's, I worked as a research intern at SUTD with Professor Gemma Roig.

I'm available for internships and open to collaborations. Feel free to reach out!

[ Email  /  CV  /  Github  /  Twitter  /  Google Scholar  /  Linkedin ]


News

Research

Transformers in Reinforcement Learning: A Survey
Pranav Agarwal, Aamer Abdul Rahman, Pierre-Luc St-Charles, Simon J.D. Prince, Samira Ebrahimi Kahou
In submission (2023).
[ Paper ]

Transformers have significantly impacted domains like natural language processing, computer vision, and robotics, improving performance compared to other neural networks. This survey explores their use in reinforcement learning (RL), where they address challenges such as unstable training, credit assignment, interpretability, and partial observability. It provides an overview of RL, discusses challenges faced by classical RL algorithms, and examines how transformers are well-suited to tackle these challenges. The survey covers the application of transformers in representation learning, transition and reward function modeling, and policy optimization within RL. It also discusses efforts to enhance interpretability and efficiency through visualization techniques and tailored adaptations for specific applications. Limitations and potential for future breakthroughs are assessed as well.

Learning to Play Atari in a World of Tokens
Pranav Agarwal, Sheldon Andrews, Samira Ebrahimi Kahou
International Conference on Machine Learning (ICML), 2024.
[ Paper / Code / Webpage / Slides ]

Model-based reinforcement learning agents utilizing transformers have shown improved sample efficiency due to their ability to model extended context, resulting in more accurate world models. However, for complex reasoning and planning tasks, these methods primarily rely on continuous representations. This complicates modeling of discrete properties of the real world such as disjoint object classes between which interpolation is not plausible. In this work, we introduce discrete abstract representations for transformer-based learning (DART), a sample-efficient method utilizing discrete representations for modeling both the world and learning behavior. We incorporate a transformer-decoder for auto-regressive world modeling and a transformer-encoder for learning behavior by attending to task-relevant cues in the discrete representation of the world model. For handling partial observability, we aggregate information from past time steps as memory tokens. DART outperforms previous state-of-the-art methods that do not use look-ahead search on the Atari 100k sample efficiency benchmark with a median human-normalized score of 0.790 and beats humans in 9 out of 26 games.
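The pipeline described above (discrete tokens in, autoregressive world-model rollouts, a policy attending to tokens, memory tokens for partial observability) can be caricatured in a few lines. The sketch below is purely illustrative and is not the paper's code: the quantizer, the hand-written transition rule, and the stub policy are all placeholder assumptions standing in for the learned transformer components.

```python
# Toy sketch of a DART-style loop: observations are quantized into
# discrete tokens, a (stub) autoregressive world model predicts next
# tokens, and a (stub) policy acts on current tokens plus memory tokens.

def quantize(obs, n_bins=16):
    """Map continuous features in [0, 1) to discrete token ids (toy VQ)."""
    return tuple(min(int(x * n_bins), n_bins - 1) for x in obs)

def world_model_step(tokens, action, n_bins=16):
    """Stand-in for the transformer-decoder world model: a fixed rule."""
    return tuple((t + action) % n_bins for t in tokens)

def policy(tokens, memory):
    """Stand-in for the transformer-encoder policy over token sequences."""
    return (sum(tokens) + sum(memory)) % 4

def imagine_rollout(obs, horizon=3):
    """Roll out an imagined trajectory in token space, keeping memory tokens."""
    tokens, memory, trajectory = quantize(obs), [], []
    for _ in range(horizon):
        action = policy(tokens, memory)
        trajectory.append((tokens, action))
        memory.append(tokens[0])  # aggregate past information as memory tokens
        tokens = world_model_step(tokens, action)
    return trajectory

traj = imagine_rollout([0.1, 0.5, 0.9])
```

In the actual method, both stubs are learned transformers and the memory tokens summarize past time steps; here they only mark where those components sit in the loop.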

Empowering Clinicians with MeDT: A Framework for Sepsis Treatment
Aamer Abdul Rahman, Pranav Agarwal, Vincent Michalski, Rita Noumeir, Philippe Jouvet, Samira Ebrahimi Kahou
NeurIPS 2023 Goal-Conditioned Reinforcement Learning Workshop (Spotlight).
[ Paper / Code / Webpage / Slides ]

Offline reinforcement learning is promising for safety-critical tasks like clinical decision support, but faces challenges of interpretability and clinician interactivity. To overcome these, the proposed Medical Decision Transformer (MeDT) utilizes a goal-conditioned RL paradigm for sepsis treatment recommendations. MeDT employs the decision transformer architecture, considering factors like treatment outcomes, patient acuity scores, dosages, and current/past medical states to provide a holistic view of the patient's history. This enhances decision-making by allowing MeDT to generate actions based on user-specified goals, ensuring clinician interactivity and addressing sparse rewards. Results from the MIMIC-III dataset demonstrate MeDT's effectiveness in producing interventions that either outperform or compete with existing methods, offering a more interpretable, personalized, and clinician-directed approach.

TPTO: A Transformer-PPO based Task Offloading Solution for Edge Computing Environments
Niloofar Gholipour, Marcos Dias de Assuncao, Pranav Agarwal, Julien Gascon-Samson, Rajkumar Buyya
IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS).
[ Paper / Code / Webpage / Slides ]

Emerging applications in healthcare, autonomous vehicles, and wearable assistance require interactive and low-latency data analysis services. Unfortunately, cloud-centric architectures cannot fulfill the low-latency demands of these applications, as user devices are often distant from cloud data centers. Edge computing aims to reduce the latency by enabling processing tasks to be offloaded to resources located at the network’s edge. However, determining which tasks must be offloaded to edge servers to reduce the latency of application requests is not trivial, especially if the tasks present dependencies. This paper proposes a Deep Reinforcement Learning (DRL) approach called TPTO, which leverages Transformer Networks and Proximal Policy Optimization (PPO) to offload dependent tasks of IoT applications in edge computing. We consider users with various preferences, where devices can offload computation to an edge server via wireless channels. Performance evaluation results demonstrate that under fat application graphs, TPTO is more effective than state-of-the-art methods, such as Greedy, HEFT, and MRLCO, by reducing latency by 30.24%, 29.61%, and 12.41%, respectively. In addition, TPTO presents a training time approximately 2.5 times faster than an existing DRL approach.

Automatic Evaluation of Excavator Operators using Learned Reward Functions
Pranav Agarwal, Marek Teichmann, Sheldon Andrews, Samira Ebrahimi Kahou
NeurIPS 2022 Reinforcement Learning for Real Life Workshop.
[ Paper / Code / Video / Slides ]

Training novice users to operate an excavator and learn its different skills requires the presence of expert teachers. Given the complexity of the task, skilled experts are comparatively expensive to find, as the training process is time-consuming and requires precise focus. Moreover, since human assessment tends to be biased, the evaluation process is noisy and leads to high variance in the final scores of operators with similar skills. In this work, we address these issues and propose a novel strategy for the automatic evaluation of excavator operators. We take into account the internal dynamics of the excavator and a safety criterion at every time step to evaluate performance.

Goal-constrained Sparse Reinforcement Learning for End-to-End Driving
Pranav Agarwal, Pierre de Beaucorps, Raoul de Charette
In submission (2021).
[ Paper / Code / Video ]

Deep reinforcement learning for end-to-end driving is limited by the need for complex reward engineering. Sparse rewards can circumvent this challenge but suffer from long training times and lead to sub-optimal policies. In this work, we explore full-control driving with only a goal-constrained sparse reward and propose a curriculum learning approach for end-to-end driving using only navigation-view maps, which benefit from a small virtual-to-real domain gap. To address the complexity of multiple driving policies, we learn concurrent individual policies selected at inference by a navigation system. We demonstrate the ability of our proposal to generalize to unseen road layouts and to drive significantly longer than during training.

Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models
Pranav Agarwal, Alejandro Betancourt, Vana Panagiotou, Natalia Diaz-Rodriguez
Machine Learning in Real Life (ML-IRL) ICLR 2020 Workshop.
[ Paper / Code / Video / Slides ]

In this paper, we attempt to show the biased nature of currently existing image captioning models and present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions. We further exploit state-of-the-art pre-trained image captioning and object recognition networks to annotate our images and show the limitations of existing works. Furthermore, in order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF). Existing image captioning metrics can evaluate a caption only in the presence of its corresponding annotations; however, SF allows evaluating captions generated for images without annotations, making it highly useful for real-life generated captions.
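An annotation-free, object-based score of this kind can be sketched very simply. The toy function below is my own illustrative approximation, not the paper's SF metric: it scores a caption by the fraction of detector-found objects it mentions, with the function name, matching rule, and weighting all assumed for illustration.

```python
# Toy object-based "semantic fidelity": compare objects an object
# detector reports for an image against the words of a generated
# caption, so no ground-truth caption is required.

def semantic_fidelity(detected_objects, caption):
    """Fraction of detected objects mentioned in the caption (toy score)."""
    if not detected_objects:
        return 0.0
    words = set(caption.lower().split())
    hits = sum(1 for obj in detected_objects if obj.lower() in words)
    return hits / len(detected_objects)

score = semantic_fidelity(["dog", "frisbee", "park"], "A dog catches a frisbee")
```

A real metric would match synonyms and plural forms rather than exact words; the point is only that the reference side comes from a detector, not from human annotations.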

Learning to synthesize faces using voice clips for cross-modal biometric matching
Pranav Agarwal, Soumyajit Poddar, Anakhi Hazarika, Hafizur Rahaman
2019 IEEE Region 10 Symposium (TENSYMP).
[ Paper / Code ]

In this paper, a framework for cross-modal biometric matching is presented, where faces of an individual are generated from his/her voice clips and the synthesized faces are then tested using a face classification network. We explore advancements in Convolutional Neural Networks (CNNs) for feature extraction and generative networks for image synthesis. In our experiments, we compare the performance of Variational Autoencoders (VAE), Conditional Generative Adversarial Networks (C-GAN), and Regularized Conditional Generative Adversarial Networks (RC-GAN). We show that RC-GAN, i.e., C-GAN with a regularization term added to its loss, generates faces corresponding to the true identity of the voice clips with the best accuracy of 84.52%, while VAE generates less noise-prone images with the highest PSNR of 28.276 decibels but an accuracy of 72.61%.

Research Student
CM-Labs
Jan 2022
PhD Student
Mila, Québec
Jan 2022
Research Assistant
Inria, Paris
May 2019 - April 2021
B.Tech ECE
IIIT Guwahati
Aug 2015 - May 2019
Research Intern
Singapore University of Technology and Design
May 2018 - Aug 2018
Research Intern
Indian Institute of Science, Bangalore
May 2017 - Aug 2017