
Work Experience

  • Jul 2018 - Jul 2019

    Google Research

    AI Resident

  • Jan 2018 - Apr 2018

    Microsoft

    Software Engineering Intern

  • May 2017 - Aug 2017

    Google

    Software Engineering Intern

  • Jan 2017 - Mar 2017

    Microsoft

    Software Engineering Intern

  • Dec 2014 - Jan 2015

    C.E.S.A.R - Recife Center for Advanced Studies and Systems

    Software Engineering Intern

Education

  • Ph.D. Present

    Ph.D. student in Computer Science & Engineering

    Paul G. Allen School of Computer Science & Engineering

    University of Washington

  • M.Sc. 2018

    M.Sc. in Electronic and Computer Engineering

    Instituto Tecnológico de Aeronáutica - ITA

  • B.S. 2017

    B.S. in Computer Engineering

    Instituto Tecnológico de Aeronáutica - ITA

Honors, Awards and Fellowships

  • 2019
    Honorable Mention for Best Paper Award
    For Research Inspired by Human Language Learning and Processing at CoNLL 2019
  • 2018
    1st place at Hack for Good
    Project Intercept: fighting human trafficking with natural language understanding
  • 2017
    Best Graduation Thesis
    Instituto Tecnológico de Aeronáutica - ITA, Computer Engineering
  • 2017-2018
    Fast.ai International Fellowship
  • 2017
    1st place at Deep Learning Hackathon - Cotidiano
    AI-powered nutrition tracking from photographs.
  • 2016
    1st Place at Microsoft College Code Competition
    MSFT3C São José dos Campos, Brazil
  • 2015
    Honorable mention - top 15
    ACM ICPC South America / Brazil finals
  • 2013
    Honorable mention - top 4
    ACM ICPC Brazilian First Phase
  • 2012
    Gold Medal
    International Mathematical Kangaroo - Kangourou sans frontières
  • 2012
    Gold Medal
    Brazilian Astronomy Olympiad - OBA
  • 2012
    Gold Medal
    Regional Mathematical Olympiad (Rio de Janeiro State) - OMERJ
  • 2012
    Silver Medal
    Brazilian Chemistry Olympiad - OBQ
  • 2012
    Silver Medal
    Regional Chemistry Olympiad (Rio de Janeiro State) - OQRJ
  • 2012
    Bronze Medal
    Brazilian Mathematical Olympiad - OBM
  • 2012
    Bronze Medal
    Brazilian Physics Olympiad - OBF
  • 2011
    Gold Medal
    Brazilian Astronomy Olympiad - OBA
  • 2011
    Gold Medal
    Brazilian Physics Olympiad - OBF
  • 2011
    Gold Medal
    Regional Chemistry Olympiad (Rio de Janeiro state) - OQRJ
  • 2011
    Gold Medal
    Great Mathematical Olympiad (regional) - GOM
  • 2011
    Silver Medal
    Brazilian Mathematical Olympiad - OBM
  • 2011
    Silver Medal
    Regional Mathematical Olympiad (Rio de Janeiro state) - OMERJ
  • 2010
    Silver Medal
    Brazilian Mathematical Olympiad for Public Schools - OBMEP
  • 2010
    Bronze Medal
    Brazilian Astronomy Olympiad - OBA
  • 2009
    Gold Medal
    Brazilian Mathematical Olympiad - OBM
  • 2009
    Gold Medal
    Regional Mathematical Olympiad (Minas Gerais State) - OMM
  • 2008
    Gold Medal
    Brazilian Mathematical Olympiad for Public Schools - OBMEP
  • 2007
    Gold Medal
    Regional Mathematical Olympiad (Minas Gerais) - OMM
  • 2007
    Gold Medal
    Brazilian Mathematical Olympiad for Public Schools - OBMEP
  • 2006
    Honorable Mention
    Brazilian Mathematical Olympiad - OBM

Publications

Probing Text Models for Common Ground with Visual Representations

Gabriel Ilharco, Rowan Zellers, Ali Farhadi, Hannaneh Hajishirzi.

Abstract

Vision, as a central component of human perception, plays a fundamental role in shaping natural language. To better understand how text models are connected to our visual perceptions, we propose a method for examining the similarities between neural representations extracted from words in text and objects in images. Our approach uses a lightweight probing model that learns to map language representations of concrete words to the visual domain. We find that representations from models trained on purely textual data, such as BERT, can be nontrivially mapped to those of a vision model. Such mappings generalize to object categories that were never seen by the probe during training, unlike mappings learned from permuted or random representations. Moreover, we find that the context surrounding objects in sentences greatly impacts performance. Finally, we show that humans significantly outperform all examined models, suggesting considerable room for improvement in representation learning and grounding.
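
As a rough illustration of the probing setup above, the sketch below maps precomputed word representations (e.g., from BERT) to precomputed object features with a lightweight linear probe and evaluates retrieval on held-out categories. The dimensions, the MSE objective and the nearest-neighbor evaluation are simplifying assumptions, not the paper's exact configuration.

    # Minimal probing sketch in PyTorch (assumes precomputed features;
    # the paper's exact probe and objective may differ).
    import torch
    import torch.nn as nn

    text_dim, vision_dim = 768, 2048            # e.g., BERT base / CNN features

    probe = nn.Linear(text_dim, vision_dim)     # lightweight probing model
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    def train_step(text_feats, vision_feats):
        # text_feats: [B, text_dim] representations of concrete words
        # vision_feats: [B, vision_dim] features of the corresponding objects
        optimizer.zero_grad()
        loss = loss_fn(probe(text_feats), vision_feats)
        loss.backward()
        optimizer.step()
        return loss.item()

    @torch.no_grad()
    def retrieval_accuracy(text_feats, vision_feats):
        # Evaluate on object categories never seen during probe training:
        # rank candidate object features by cosine similarity to the mapped word.
        mapped = nn.functional.normalize(probe(text_feats), dim=-1)
        targets = nn.functional.normalize(vision_feats, dim=-1)
        sims = mapped @ targets.t()
        preds = sims.argmax(dim=-1)
        return (preds == torch.arange(len(preds))).float().mean().item()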

Evaluating NLP Models via Contrast Sets

Matt Gardner et al.

Abstract

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets - up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
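
For concreteness, the sketch below shows how a contrast set can be scored with a consistency-style metric: a model is credited only when it labels the original instance and all of its perturbations correctly. The data layout and field names are hypothetical; the released benchmarks define their own formats.

    # Illustrative contrast-set evaluation (field names are hypothetical).
    def contrast_consistency(model, contrast_sets):
        # A model is consistent on a contrast set only if it predicts the gold
        # label for the original instance and for every perturbation of it.
        consistent = 0
        for cs in contrast_sets:
            examples = [cs["original"]] + cs["perturbations"]
            if all(model.predict(ex["text"]) == ex["gold_label"] for ex in examples):
                consistent += 1
        return consistent / len(contrast_sets)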

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, Noah Smith.

Abstract

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.
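
The two sources of randomness studied above can be made explicit in code. The sketch below is a generic PyTorch illustration (in the style of models that return a loss, as in Hugging Face Transformers), not the paper's exact pipeline: one seed controls weight initialization of the fresh task head, while a separate seed controls only the order in which training data is presented.

    import torch
    from torch.utils.data import DataLoader

    def fine_tune(model_fn, train_dataset, init_seed, data_order_seed, epochs=3):
        torch.manual_seed(init_seed)               # controls weight initialization
        model = model_fn()                         # e.g., BERT plus a fresh task head

        generator = torch.Generator().manual_seed(data_order_seed)
        loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                            generator=generator)   # controls data order only

        optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
        for _ in range(epochs):
            for batch in loader:
                optimizer.zero_grad()
                loss = model(**batch).loss
                loss.backward()
                optimizer.step()
        return model

    # Sweeping init_seed with data_order_seed fixed (and vice versa) isolates how
    # much each factor contributes to the variance in validation performance.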

Large-scale representation learning from visually grounded untranscribed speech

Gabriel Ilharco, Yuan Zhang, Jason Baldridge.
23rd Conference on Computational Natural Language Learning (CoNLL 2019)
Honorable Mention for Best Paper Award in Research Inspired by Human Language Learning and Processing

Abstract

Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art results, improving recall in the top 10 from 29.6% to 49.5%. We also obtain human ratings on retrieval outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, finding that automatic evaluation substantially underestimates the quality of the retrieved results.
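
A dual encoder of this kind is typically trained with an in-batch softmax over audio-image similarities. The sketch below subtracts a margin from the positive logits as a rough stand-in for the masked margin softmax loss; the paper's masking of incidentally matching pairs and its hyperparameters are omitted.

    import torch
    import torch.nn.functional as F

    def margin_softmax_loss(audio_emb, image_emb, margin=0.2):
        # audio_emb, image_emb: [B, D] L2-normalized embeddings of paired items.
        logits = audio_emb @ image_emb.t()                     # [B, B] similarities
        eye = torch.eye(logits.size(0), device=logits.device)
        logits = logits - margin * eye                         # margin on positives
        labels = torch.arange(logits.size(0), device=logits.device)
        # Symmetric objective: retrieve images from audio and audio from images.
        return F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)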

General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping

Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge.
NeurIPS Visually Grounded Interaction and Language Workshop, 2019

Abstract

In instruction conditioned navigation, agents interpret natural language and their surroundings to navigate through an environment. Datasets for studying this task typically contain pairs of these instructions and reference trajectories. Yet, most evaluation metrics used thus far fail to properly account for the latter, relying instead on insufficient similarity comparisons. We address fundamental flaws in previously used metrics and show how Dynamic Time Warping (DTW), a long-known method of measuring similarity between two time series, can be used to evaluate navigation agents. To this end, we define the normalized Dynamic Time Warping (nDTW) metric, which softly penalizes deviations from the reference path, is naturally sensitive to the order of the nodes composing each path, is suited for both continuous and graph-based evaluations, and can be efficiently calculated. Further, we define SDTW, which constrains nDTW to only successful paths. We collect human similarity judgments for simulated paths and find nDTW correlates better with human rankings than all other metrics. We also demonstrate that using nDTW as a reward signal for Reinforcement Learning navigation agents improves their performance on both the Room-to-Room (R2R) and Room-for-Room (R4R) datasets. The R4R results in particular highlight the superiority of SDTW over previous success-constrained metrics.
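
The sketch below spells out the idea: classic Dynamic Time Warping between the agent path and the reference path, exponentially normalized by the reference length and a success threshold distance, with SDTW gating the score on episode success. The distance function and the threshold value are placeholders; see the paper for the exact definitions.

    import math

    def dtw(query, reference, dist):
        # Classic O(|Q||R|) dynamic program for Dynamic Time Warping.
        nq, nr = len(query), len(reference)
        cost = [[math.inf] * (nr + 1) for _ in range(nq + 1)]
        cost[0][0] = 0.0
        for i in range(1, nq + 1):
            for j in range(1, nr + 1):
                d = dist(query[i - 1], reference[j - 1])
                cost[i][j] = d + min(cost[i - 1][j],       # advance the query
                                     cost[i][j - 1],       # advance the reference
                                     cost[i - 1][j - 1])   # advance both
        return cost[nq][nr]

    def ndtw(query, reference, dist, success_threshold=3.0):
        return math.exp(-dtw(query, reference, dist) /
                        (len(reference) * success_threshold))

    def sdtw(query, reference, dist, success_threshold=3.0):
        # nDTW restricted to successful episodes (final node close to the goal).
        success = dist(query[-1], reference[-1]) <= success_threshold
        return ndtw(query, reference, dist, success_threshold) if success else 0.0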

Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

Vihan Jain*, Gabriel Magalhaes*, Alexander Ku*, Ashish Vaswani, Eugene Ie, and Jason Baldridge.
57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)

Abstract

Advances in learning and representations have reinvigorated work that connects language to other modalities. A particularly exciting direction is Vision-and-Language Navigation (VLN), in which agents interpret natural language instructions and visual scenes to move through environments and reach goals. Despite recent progress, current research leaves unclear how much of a role language understanding plays in this task, especially because dominant evaluation metrics have focused on goal completion rather than the sequence of actions corresponding to the instructions. Here, we highlight shortcomings of current metrics for the Room-to-Room dataset (Anderson et al., 2018b) and propose a new metric, Coverage weighted by Length Score (CLS). We also show that the existing paths in the dataset are not ideal for evaluating instruction following because they are direct-to-goal shortest paths. We join existing short paths to form more challenging extended paths to create a new dataset, Room-for-Room (R4R). Using R4R and CLS, we show that agents that receive rewards for instruction fidelity outperform agents that focus on goal completion.

Transferable Representation Learning in Vision-and-Language Navigation

Haoshuo Huang, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason Baldridge, Eugene Ie
IEEE International Conference on Computer Vision (ICCV 2019)

Abstract

Vision-and-Language Navigation (VLN) tasks such as Room-to-Room (R2R) require machine agents to interpret natural language instructions and learn to act in visually realistic environments to achieve navigation goals. The overall task requires competence in several perception problems: successful agents combine spatio-temporal, vision and language understanding to produce appropriate action sequences. Our approach adapts pre-trained vision and language representations to relevant in-domain tasks making them more effective for VLN. Specifically, the representations are adapted to solve both a cross-modal sequence alignment and sequence coherence task. In the sequence alignment task, the model determines whether an instruction corresponds to a sequence of visual frames. In the sequence coherence task, the model determines whether the perceptual sequences are predictive sequentially in the instruction-conditioned latent space. By transferring the domain-adapted representations, we improve competitive agents in R2R as measured by the success rate weighted by path length (SPL) metric.
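
As a rough sketch of the cross-modal sequence alignment task described above, the model below scores whether an instruction embedding matches a sequence of visual frames; it would be trained with binary cross-entropy on matched and mismatched pairs. The encoders, dimensions and pairing logic are illustrative placeholders, not the paper's architecture.

    import torch
    import torch.nn as nn

    class AlignmentScorer(nn.Module):
        def __init__(self, text_dim=768, frame_dim=2048, hidden=512):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, hidden)
            self.frame_rnn = nn.LSTM(frame_dim, hidden, batch_first=True)
            self.classifier = nn.Linear(2 * hidden, 1)

        def forward(self, instruction_emb, frame_seq):
            # instruction_emb: [B, text_dim]; frame_seq: [B, T, frame_dim]
            _, (h, _) = self.frame_rnn(frame_seq)
            visual = h[-1]                              # [B, hidden]
            text = self.text_proj(instruction_emb)      # [B, hidden]
            return self.classifier(torch.cat([text, visual], dim=-1))  # alignment logit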

Gesture Recognition for Brazilian Sign Language using Deep Learning

Gabriel Ilharco Magalhaes
M.Sc. Thesis, Instituto Tecnológico de Aeronáutica - ITA, May 2018 (Portuguese).

Abstract

In a world where more than 70 million people rely on sign language to communicate, a system capable of recognizing and translating gestures to written or spoken language has great social impact. Despite rights claimed in recent decades, the deaf community still faces many challenges due to communication barriers. Gesture recognition, crucial for translation, is an active research topic in the computer vision and machine learning communities, and has been studied for decades. Among the most common approaches to this task are electronic gloves with sensors, approaches based on depth cameras, and approaches based on simple color cameras. The last has the advantage of completeness: in many sign languages, including Brazilian Sign Language, the subject of this study, other parts of the body, such as the face and its expressions, are needed to recognize some gestures. Additionally, it relies only on commonly available technology, in contrast to the other approaches.

We present a state-of-the-art approach, using a simple color camera, for real-time static and continuous recognition of gestures from Brazilian Sign Language. For static recognition, we create a dataset with 33,000 examples of 30 gestures; for continuous recognition, we create a dataset with 2,000 videos containing phrases with 72 distinct gestures. Both datasets were built without the restrictions on clothing, background, lighting or distance between the camera and the user that are commonly found in other studies. We propose end-to-end systems for each case using Deep Learning: for the former, a system based on a Deep Convolutional Residual Neural Network; for the latter, a hybrid architecture using Long Short-Term Memory (LSTM) cells on top of convolutional layers. Our method shows state-of-the-art accuracy in both cases and is capable of running in real time on a GPU.
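
For the continuous case, the hybrid architecture mentioned above can be sketched as convolutional features extracted per frame followed by an LSTM over time. The backbone, layer sizes and per-frame output below are assumptions for illustration, not the thesis' exact model.

    import torch
    import torch.nn as nn
    import torchvision

    class GestureSequenceModel(nn.Module):
        def __init__(self, num_gestures=72, hidden=256):
            super().__init__()
            backbone = torchvision.models.resnet18(weights=None)
            backbone.fc = nn.Identity()                 # 512-d features per frame
            self.backbone = backbone
            self.lstm = nn.LSTM(512, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, num_gestures)

        def forward(self, frames):
            # frames: [B, T, 3, H, W] video clip
            b, t = frames.shape[:2]
            feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
            out, _ = self.lstm(feats)
            return self.classifier(out)                 # per-frame gesture logits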

Gesture Recognition for Brazilian Sign Language

Gabriel Ilharco Magalhaes
B.S. Thesis, Instituto Tecnológico de Aeronáutica - ITA, Dec 2017 (Portuguese).

Abstract

In a world where more than 70 million people rely on sign language to communicate, a system capable of recognizing and translating gestures to written or spoken language has great social impact. Despite rights claimed in recent decades, the deaf community still faces many challenges due to communication barriers. Gesture recognition, crucial for translation, is an active research topic in the computer vision and machine learning communities, and has been studied for decades. Among the most common approaches to this task are electronic gloves with sensors, approaches based on depth cameras, and approaches based on simple color cameras. The last has the advantage of completeness: in many sign languages, including Brazilian Sign Language, the subject of this study, other parts of the body, such as the face and its expressions, are needed to recognize some gestures. Additionally, it relies only on commonly available technology, in contrast to the other approaches.

We present a state-of-the-art approach, using a simple color camera, for real-time static recognition of gestures from Brazilian Sign Language. We create a dataset with 33,000 examples of 30 gestures, without the restrictions on clothing, background, lighting or distance between the camera and the user that are commonly found in other studies. We propose an end-to-end system using Deep Convolutional Residual Neural Networks, without the need for the laboriously engineered pipelines and feature extraction steps used by traditional approaches. The proposed system shows robustness in classifying the 30 gestures, obtaining 99.83% accuracy on the test set, and runs in real time (26.2 ms per frame) on an NVIDIA K80 GPU.

* indicates equal contribution.

Research Summary

Humans learn from rich streams of perceptual data, and most of our knowledge reflects that. The way we process, store and communicate information is not just limited to a finite set of symbols that compose a language: colors, shapes, sounds, textures and physical metaphors are deeply ingrained in how we understand and talk about the world. Without grounding, language understanding models that learn from only textual data are incapable of fully sharing our semantic interpretation of symbols and their compositions.

Ultimately, it is desirable for AI systems not only to understand and process multimodal data - including visual, acoustic and symbolic signals - but to do so in a unified, coherent framework. Grounding natural language through multimodal tasks is an appealing research direction which, thanks to recent advances in neural representations and learning, is taking its first steps in a renewed, energetic infancy.

  • Artificial Intelligence
  • Multimodal Machine Learning
  • Natural Language Grounding
  • Computer Vision
  • Natural Language Processing
  • Self-supervised Deep Learning
  • Real-time Gesture Recognition for Brazilian Sign Language

    Robust recognition using Deep Learning

    In a world where more than 70 million people rely on sign language to communicate, a system capable of recognizing and translating gestures to written or spoken language has great social impact. Despite rights claimed in recent decades, the deaf community still faces many challenges due to communication barriers. Gesture recognition, crucial for translation, is an active research topic in the computer vision and machine learning communities, and has been studied for decades. Among the most common approaches to this task are electronic gloves with sensors, approaches based on depth cameras, and approaches based on simple color cameras. The last has the advantage of completeness: in many sign languages, including Brazilian Sign Language, the subject of this study, other parts of the body, such as the face and its expressions, are needed to recognize some gestures. Additionally, it relies only on commonly available technology, in contrast to the other approaches.

    This project presents a state-of-the-art approach, using a simple color camera, for real-time static and continuous recognition of gestures from Brazilian Sign Language. For static recognition, we created a dataset with 33,000 examples of 30 gestures; for continuous recognition, we created a dataset with 2,000 videos containing phrases with 72 distinct gestures. Both datasets were built without the restrictions on clothing, background, lighting or distance between the camera and the user that are commonly found in other studies. We propose end-to-end systems for each case using Deep Learning: for the former, a system based on a Deep Convolutional Residual Neural Network; for the latter, a hybrid architecture using Long Short-Term Memory (LSTM) cells on top of convolutional layers. Our method shows state-of-the-art accuracy in both cases and is capable of running in real time on a GPU.

  • Deep Steganography

    Hiding Images in Plain Sight

    Open-source implementation of the paper Hiding Images in Plain Sight: Deep Steganography, by Shumeet Baluja (Google), at NIPS 2017. This project is part of the Global NIPS Paper Implementation Challenge. The implementation is available here.

    Abstract: Steganography is the practice of concealing a secret message within another, ordinary, message. Commonly, steganography is used to unobtrusively hide a small message within the noisy regions of a larger image. In this study, we attempt to place a full size color image within another image of the same size. Deep neural networks are simultaneously trained to create the hiding and revealing processes and are designed to specifically work as a pair. The system is trained on images drawn randomly from the ImageNet database, and works well on natural images from a wide variety of sources. Beyond demonstrating the successful application of deep learning to hiding images, we carefully examine how the result is achieved and explore extensions. Unlike many popular steganographic methods that encode the secret message within the least significant bits of the carrier image, our approach compresses and distributes the secret image's representation across all of the available bits.

  • Snap & Eat

    Nutrition tracking from food photos

    Winner of the Deep Learning Hackathon by Cotidiano.

    The implementation is available here.

    According to the World Health Organization, worldwide obesity has nearly tripled since 1975. In the United States, almost 75% of the population is overweight and more than half of the population is obese (OECD). Today, many diseases that were previously thought to be hereditary are now shown to be connected to biological dysfunction related to nutrition.

    Although being healthy and eating better is something the vast majority of the population wants, doing so usually requires great effort and organization. The lack of an easy and simple way to track nutrition information about the food you eat can easily lead to low engagement. By providing a very easy and fun way to keep track of what the user eats, we can largely improve engagement and directly attack one of the largest health problems in the world.

    Snap & Eat is a web application that tracks the user's food intake from pictures. We use state-of-the-art deep learning techniques to recognize dishes, making instant nutrition estimates from the user's meals.

    The app also suggests meals based on the user's income, and is capable of showing places nearby that serve those dishes.

    The system is implemented in PyTorch using the fastai library, relying on Jupyter Notebooks for prototyping. For the web app, we use Flask and Node.js.

    We use an Aggregated Residual Convolutional Neural Network (ResNeXt-101), with 101 layers, pretrained on the ImageNet dataset. We fine-tune the model on the Food-101 dataset, which contains more than 100 thousand images of 101 types of dishes. We achieve a significant improvement in accuracy (71% in our work, compared to 50.1% in Bossard et al., 2014); a minimal fine-tuning sketch in PyTorch is shown after this project list.

  • Quiros

    Sign Language recognition from a sensor glove

    Project Quiros is a low-cost system created to improve communication for sign language users. It is designed to recognize specific sign language hand gestures and transform them into words (audio) and text. The project was developed with a total of R$600 (Brazilian reais), about US$150 (US dollars). The system uses flex sensors, contact sensors, an accelerometer and a gyroscope to capture the position of the hand and send it via Bluetooth to an external device (such as a PC or a mobile device). After recognizing a gesture, the system displays the corresponding message in text and audio.

    The project was presented to the president of Brazil, Dilma Rousseff, at the Indústria do Futuro (Industry of the Future) technology fair in Belo Horizonte, MG, Brazil, in 2014.

  • Key Face

    Detecting facial keypoints using images from a camera

    Real-time facial landmark detection using the Viola-Jones algorithm for face detection and a convolutional neural network.

    The project is publicly available here.

  • Breast Bot Sensor

    A breast cancer diagnosis system

    Created by LIKA and C.E.S.A.R, this project proposes the development of a biosensor for early-stage, non-invasive detection of breast cancer based on blood samples, with the help of synthetic biology and robotics. The project won a Silver Medal at the 2014 iGEM worldwide synthetic biology competition. One of its main components is an electrochemical DNA biosensor capable of recognizing a biomarker for breast cancer that is highly expressed in the early stages of this type of cancer.

    During my internship at C.E.S.A.R., I worked on software and hardware for reading and processing data from the biosensor, using an LMP91000EVM and a SPIO-4 Digital Controller. The gathered data was then processed by a neural network from the FANN (Fast Artificial Neural Network) library, generating an output corresponding to the diagnosis. I also worked on a graphical interface that allows the user to see the data being read in real time.

  • ITAbits

    Instituto Tecnológico de Aeronáutica's software development group

    A big part of ITAbits is game development, which we believe is an inspiring way to teach new programmers how to code. The organization also prepares students for hackathons and programming contests, such as the Brazilian OBI and the ACM International Collegiate Programming Contest, and provides introductory programming courses.

    From March 2014 to August 2015, I was the president of ITAbits, responsible for leading and managing the organization and its projects, training programs, workshops, hackathons, coding dojos and general events.

    From 2013 to 2015, I also worked on internal projects, primarily written in C++ and C#, and taught introduction to programming, algorithms and data structures to freshman students.

  • Robot Soccer

    Very Small Size division

    A team of autonomous robots (about 7.5×7.5×7.5 cm) for playing soccer in the IEEE Very Small Size category. The robots are remotely controlled by a computer, which processes images from a video camera placed above the field and sends commands to the robots.

    I worked on the strategy and motion planning branches of the project.

  • RayTracer

    Ray tracing from scratch in C++

    RayTracer is an implementation of an image generator based on the ray tracing technique, using a simplified model of the world that contains only spheres, planes and triangles.

    The code is publicly available here.
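
As referenced in the Snap & Eat project above, a minimal version of its fine-tuning recipe looks as follows: a ResNeXt-101 pretrained on ImageNet, adapted to the 101 dish classes of Food-101. The sketch uses torchvision for self-containment; the original project was built with the fastai library, so the actual training loop and augmentations differ.

    import torch
    import torch.nn as nn
    import torchvision
    from torchvision import transforms

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    train_set = torchvision.datasets.Food101(root="data", split="train",
                                             download=True, transform=transform)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    model = torchvision.models.resnext101_32x8d(weights="IMAGENET1K_V1")
    model.fc = nn.Linear(model.fc.in_features, 101)     # 101 dish classes

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for images, labels in loader:                       # one epoch, for illustration
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()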

Contact Info