Ujjwal Singh Rao (@brightertiger) | Data Science Leader & ML Engineer

About Me

Greetings! I'm Ujjwal (@brightertiger), and I extend a warm welcome to my website.

As a data science leader with 12 years of professional experience, I've delved into projects encompassing big data analytics, predictive modeling, machine learning, deep learning, and natural language processing. I completed my education at the Indian Institute of Technology (IIT) Kharagpur in 2013. During my leisure time, I enjoy participating in Kaggle competitions as @brightertiger, contributing to open source projects on GitHub, and sharing knowledge with the machine learning community.

If you come across any collaboration opportunities, don't hesitate to get in touch!

Professional Experience

MSCI Vice President

2024 - Present

I am a member of the Data Extraction team, leading initiatives in AI-powered document intelligence and workflow automation. My work involves:

Building LLM agents for answering complex questions across millions of financial documents using agentic RAG architectures and vector databases. These systems enable accurate information retrieval and synthesis at scale.
Designing and deploying LLM agents for automating end-to-end workflows, reducing manual intervention and accelerating data extraction processes across the organization.
Developing Retrieval Augmented Generation (RAG) pipelines using Large Language Models (LLMs) for fetching data and information from financial documents with high precision and reliability.

HERE Maps Lead Data Scientist

2023 - 2024

I was a member of the Map Observables team, tasked with constructing Self-Driving Maps for BMW's Urban Cruise Control. My work involved:

Tackling global-scale challenges by harnessing petabytes of data to create high-definition maps for autonomous driving. I improved key performance indicators such as False Positives, False Negatives, and Accuracy by more than 50% compared to legacy systems.
Applying machine learning algorithms and XGBoost models to integrate data observations from diverse input sources, including dashcams and overhead imagery. This process makes it possible to deduce the accurate location and attributes of road signs.
Crafting innovative graph-based solutions to counteract positional observation drift from drive-based data sources used in map content. This implementation resulted in a notable reduction of False Positives by around 5%, surpassing the performance of radial search-based clustering.
Constructing a question-answering engine using LLAMA over extensive product and data requirement documents for data validations. This tool empowers users to efficiently search through these documents, extracting details and significantly enhancing productivity.
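As an illustration of why a graph-based approach can cope with positional drift, here is a minimal sketch that links observations lying within a distance threshold and treats connected components as clusters. It is not the production method: the function name, the 2-D coordinates, and the `max_link_distance` parameter are all assumptions made for the example.

```python
from itertools import combinations
from math import hypot


def cluster_observations(points, max_link_distance):
    """Group 2-D observations into clusters via connected components.

    Two observations are linked when they lie within max_link_distance
    of each other; clusters are the connected components of that graph.
    All names and parameters here are hypothetical.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        if hypot(x1 - x2, y1 - y2) <= max_link_distance:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for index in range(len(points)):
        clusters.setdefault(find(index), []).append(index)
    return sorted(clusters.values())


# Four observations drifting along a line, plus one unrelated observation.
observations = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(cluster_observations(observations, max_link_distance=1.5))
```

In this toy input the drifting observations chain together into one cluster, whereas a fixed-radius search of 1.5 around the first point would only capture the first two.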

Gojek Tech Senior Data Scientist

2021 - 2023

I was a member of the Care Tech team, where I leveraged machine learning, deep learning, and natural language processing techniques to extract insights and facilitate automation. This involved analyzing customer service interactions across diverse channels such as email, in-app requests, chat, Twitter, and more. My work involved:

Facilitating AI/ML-driven intent detection through the implementation of multilingual NLP models. I developed intent classification models based on XLM-RoBERTa to support various languages, including Bahasa and English, achieving an accuracy rate exceeding 80%. Additionally, I deployed these models into production using TorchScript and MLflow.
Constructing named entity recognition (NER) models based on IndoBERT, utilizing open-source IndoNLU datasets. These models were designed to identify entities such as food, quantity, date, and chit-chat within text utterances.
Enhancing the search experience for help center articles by incorporating tags to encompass semantic diversity in search queries. I implemented a TF-IDF and Logistic Regression pipeline to extract pertinent keywords for each article, contributing to an improved search functionality.
Establishing a pipeline for issue discovery to identify emerging themes in service tickets and app reviews. I implemented topic modeling with the pyLDAvis and BERTopic libraries. Additionally, I trained sentence transformer models using SetFit for better results.
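To make the keyword-tagging idea for help center articles concrete, the following is a simplified sketch of TF-IDF scoring in plain Python. It covers only the TF-IDF half of the pipeline described above (the Logistic Regression step is omitted), and the whitespace tokenizer, the function name, and the sample articles are assumptions for illustration.

```python
import math
from collections import Counter


def top_keywords(articles, k=2):
    """Return the k highest-scoring TF-IDF terms for each article."""
    # Naive whitespace tokenization; a real pipeline would do more cleaning.
    tokenized = [text.lower().split() for text in articles]
    n_docs = len(tokenized)
    # Document frequency: number of articles containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


articles = [
    "refund order refund payment",
    "driver late order",
    "payment failed order",
]
print(top_keywords(articles))
```

Terms that appear in every article (here, "order") score zero, so they drop out of the tags in favor of terms that distinguish one article from the rest.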

American Express 2014 - 2021

7+ years of progressive experience

Senior Data Scientist 2018 - 2021

I was part of the data science team working on the Natural Language Understanding (NLU) layer of the AskAmex chatbot. My work involved:

Training transformer-based models (such as BERT, DistilBERT, and RoBERTa) for intent classification. I removed label noise from training datasets using various robust machine learning techniques, which led to a 5% increase in prediction accuracy.
Building human-in-the-loop (HITL) pipelines for collecting labeled data at a minimal cost. I used weak supervision and active learning strategies to filter relevant data points for annotation. I built various interactive tools to help data labelers work efficiently. I introduced best practices and quality checks in the annotation pipelines to ensure high-quality output.
Collaborating with product teams to improve customer experience. I built interactive tools to visualize the performance of servicing journeys. These tools helped identify the edge cases that often lead to automation failures. I introduced tracking around sentiment level KPIs (apart from automation) to holistically capture the channel performance.
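One common active learning strategy for choosing which data points to annotate is least-confidence sampling. The sketch below is a generic illustration of that strategy, not the pipeline used at the time; the function name, the `budget` parameter, and the sample probabilities are hypothetical.

```python
def select_for_annotation(probabilities, budget):
    """Pick the indices of the examples the model is least sure about.

    probabilities: one list of class probabilities per unlabeled example.
    budget: how many examples can be sent to annotators.
    """
    # Least-confidence score: 1 minus the top predicted probability.
    uncertainty = [1.0 - max(probs) for probs in probabilities]
    # Most uncertain first; ties are broken by original position.
    ranked = sorted(range(len(probabilities)), key=lambda i: (-uncertainty[i], i))
    return ranked[:budget]


model_outputs = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.60, 0.40],
    [0.99, 0.01],
]
print(select_for_annotation(model_outputs, budget=2))
```

Examples the model already classifies confidently are skipped, so the annotation budget goes to the points most likely to change the model.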

Data Scientist 2017 - 2018

I was part of the data science team working on an offer recommendation engine for the mobile app and website. My work involved:

Building factorization machine models to predict click-through rate. I built Spark-based feature engineering pipelines to process terabytes of clickstream data for training these models. The models were part of the final stacked ensemble that got deployed in production.
Optimizing impression caps on offers to drive higher overall engagement on the channel. I built XGBoost models to analyse the sensitivity of click-through rate with respect to impressions. I used the partial dependence plots from these models to identify the impression cap that maximised the F-beta score.

Senior Data Analyst 2015 - 2017

I was part of the modeling team working on up-sell and cross-sell targeting via email campaigns. My work involved:

Building artificial neural network-based models. These were binary classification models which predicted the probability of an existing customer taking up a more premium product. These models replaced the legacy logistic regression models by delivering better performance while simultaneously driving operational efficiency.
Migrating the legacy data transformation and feature engineering pipelines from SAS to Python to support the deployment of the above-mentioned neural network models in production. I also enabled automated re-training pipelines to address data drift.

Data Analyst 2014 - 2015

I joined the customer marketing team focusing on international markets (non-US). I worked on:

Targeting strategy for dynamic email campaigns in partnership with Movable Ink. The focus was to increase customer spending on small merchants in the UK. I analyzed transaction data to understand the location and category preferences of the customers. The analysis generated content-based recommendations displayed to the customer via dynamic emails. The open and click rates for these campaigns were significantly higher than the long-term average.
Supporting a joint venture with Gurunavi. Amex partnered with Gurunavi to offer dining recommendations to customers in Japan. I designed customer segments by clustering spending patterns across various industry verticals. The customer segments mapped to different personas, each of which received an exclusive set of restaurant recommendations.

Education

Georgia Institute of Technology

Master of Science in Analytics

2020 - 2025

Indian Institute of Technology, Kharagpur

Dual Degree (BS + MS) in Economics

2008 - 2013

Summer Internship at Reserve Bank of India (2012)
Summer Internship at Economic Advisory Council to PM (2013)
Concentration vs Inequality Measures of Market Structure: An Exploration of Indian Manufacturing

Open Source & GitHub (@brightertiger)

Active contributor to the machine learning and data science community through open source projects and repositories.

GitHub Profile: github.com/brightertiger

Explore my repositories, contributions, and open source projects in machine learning, data science, and AI.

Open Source Packages

expstats

A/B Testing Calculator & Statistical Significance Analysis for Python

A unified Python library for experiment analysis, sample size calculation, and statistical significance testing. Features conversion rate analysis, revenue/magnitude testing, survival analysis, sequential testing, Bayesian A/B testing, and stakeholder report generation.
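For readers unfamiliar with conversion rate analysis, here is a minimal standalone sketch of a two-sided, two-proportion z-test using only the standard library. It does not use or represent the expstats API; the function name and arguments are assumptions chosen for the example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a, conv_b: number of conversions in each variant.
    n_a, n_b: number of visitors in each variant.
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The sketch assumes samples large enough for the normal approximation to hold and does not guard against a pooled rate of exactly 0 or 1.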

pygarble

Detect gibberish, garbled text, and nonsense with high precision

A zero-dependency Python library for identifying random character sequences, keyboard mashing, encoding errors, and text corruption. Features 24 detection strategies including Markov chains, n-gram analysis, mojibake detection, and homoglyph detection with 99.5% precision.

Kaggle Achievements (@brightertiger)

Competitions Master

3 Gold, 12 Silver and 4 Bronze medals across various machine learning competitions as @brightertiger

Ranked 3rd / 1621 in Jigsaw Multilingual Toxic Comment Classification Challenge
Ranked 6th / 3308 in SIIM-ISIC Melanoma Detection Challenge
Ranked 16th / 3943 in TalkingData AdTracking Fraud Detection Challenge

Technical Articles

PyExpStats: AB Testing Done Right

A cohesive Python library that brings together sample size calculations, statistical analysis, and result interpretation for A/B testing—supporting conversion, magnitude, and timing experiments with both frequentist and sequential methods.

PyGarble: Detect Gibberish Text In

A Python library that detects gibberish, keyboard mashing, and corrupted text with 70%+ accuracy using 9 complementary detection strategies including keyboard pattern detection, vowel ratio analysis, and ensemble methods.
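As a rough illustration of what a vowel ratio check can look like, here is a small standalone sketch. It is not PyGarble's implementation or API; the function names and the `low` and `high` thresholds are hypothetical values picked for the example, and the check only considers the vowels a, e, i, o, u.

```python
def vowel_ratio(text):
    """Fraction of alphabetic characters that are vowels (0.0 if none)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in "aeiou" for ch in letters) / len(letters)


def looks_garbled(text, low=0.2, high=0.7):
    """Flag text whose vowel ratio falls outside a plausible range.

    The thresholds are illustrative assumptions, not tuned values.
    """
    ratio = vowel_ratio(text)
    return not (low <= ratio <= high)


for sample in ["hello world", "xkcdqzrtp"]:
    print(sample, looks_garbled(sample))
```

A single heuristic like this misfires on acronyms and very short strings, which is one reason to combine several strategies in an ensemble.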