Rajagopal B.

About Me

Rajagopal is a Data Engineer with deep expertise in Python, relational databases, and the design of scalable data pipelines. His work spans data modeling and the transformation of large, complex datasets (both structured and unstructured) using distributed systems such as Apache Spark and Flink. He is adept at query optimization and at building efficient data solutions in cloud environments like AWS, and is a collaborative problem-solver focused on feature development and data integrity within fast-paced, AI-driven teams.

Skills

Apache Flink, Pinot, Kafka, Spark, PySpark, Scala, Power BI, Tableau, BigQuery, Databricks, Golang, Looker, Presto, Hive, Hadoop, GraphX, Elasticsearch, SmartGWT, MapReduce, Dataproc, Pandas, NumPy

Work history

LinkedIn
Staff Data Engineer
2024 - Present (1 year)
Bangalore, India
  • Spearheaded the setup of core processes and best practices for the Trust Data Science team, including data governance, reliability improvements, and foundational dataset development.

  • Developed a real-time metrics dashboard by building a Flink-based streaming service to process live event data and generate platform health metrics, significantly improving incident response.

  • Led an org-wide initiative to overhaul data logging practices, enabling self-serve access to key metrics and reducing dependency on the DS team.

  • Developed a centralized Supertable for the Trust Review Operations team as part of a data democratization initiative, saving 200+ man-hours annually.
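
The streaming work above centers on windowed aggregation over live event data. As a rough illustration only (the actual Flink job is not shown here), this is a pure-Python sketch of the tumbling-window counting such a job performs; the event types are hypothetical:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size

def tumbling_window_counts(events):
    """Aggregate (timestamp, event_type) pairs into per-window counts,
    mimicking what a Flink tumbling-window job would compute."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, event_type in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start][event_type] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(5, "login"), (12, "error"), (61, "login"), (65, "login")]
print(tumbling_window_counts(events))
# → {0: {'login': 1, 'error': 1}, 60: {'login': 2}}
```

In a real Flink pipeline the same grouping is expressed with `keyBy` plus a tumbling event-time window; the sketch only shows the shape of the computation.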

Apache Flink, Pinot, Kafka, Spark, PySpark, Trino, Airflow, Java, Python, Scala, Power BI, Tableau, GCP, BigQuery, Databricks, Data Engineering, Data Governance, Datasets, Dashboard Development
Uber
Data Engineer
2019 - 2024 (5 years)
Bangalore, India
  • Built and maintained a central warehouse that powers foundational datasets, ensuring high data quality and freshness within defined SLAs for 50+ Tier1/Tier2 datasets.

  • Designed a scalable analytics query gateway service for the Customer Obsession domain at Uber, enabling flexible metric querying across multiple data sources and reducing onboarding time.

  • Led a data governance project to enhance data management practices by establishing clear ownership for datasets and implementing Time to Live (TTL) policies.

  • Built anomaly detection systems that proactively detected business issues, saving ~$500K.

  • Built a data validation utility to compare datasets, which saved ~50 man-hours monthly.

  • Implemented least-privilege access controls for all datasets, preventing accidental data loss that would take ~1 month to recover from.

  • Worked closely with ML teams to build features for ML models such as R&A issuance, a program with an annual budget of $1 billion.

  • Designed GDPR-compliant ETL pipelines that adhere to security best practices.

  • Owned 100+ Tier 1/Tier 2 datasets, responsible for their SLAs and performance optimization.
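
A dataset-comparison utility like the one mentioned above typically diffs two tables on a key column, reporting missing and mismatched rows. A minimal pure-Python sketch, with hypothetical column names (the production tool would run this logic over Spark/Hive tables):

```python
def compare_datasets(left, right, key):
    """Diff two datasets (lists of dicts) on a key column, reporting rows
    missing from either side and rows present in both with differing values."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    return {
        "missing_in_right": sorted(left_by_key.keys() - right_by_key.keys()),
        "missing_in_left": sorted(right_by_key.keys() - left_by_key.keys()),
        "mismatched": sorted(
            k for k in left_by_key.keys() & right_by_key.keys()
            if left_by_key[k] != right_by_key[k]
        ),
    }

# hypothetical rows keyed on "id"
left = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
right = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]
print(compare_datasets(left, right, "id"))
# → {'missing_in_right': [3], 'missing_in_left': [4], 'mismatched': [2]}
```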

Apache Flink, Apache Kafka, Pinot, Apache Spark, Golang, PySpark, Java, Python, Scala, Looker, GCP, BigQuery, Presto, Airflow, Kubernetes, Big Data, AWS, SQL, Hive, Spark Streaming, Apache Hudi, Data Engineering, ETL Pipelines, Data Governance, Data Management, Datasets, Data Validation, Performance Optimization
Walmart
Data Engineer | Big Data Intern
2016 - 2019 (3 years)
Remote
  • Developed a real-time audience size estimation system that reduced processing time from 2 hours to 5 minutes with a small error margin.

  • Co-developed an ETL pipeline in Apache Spark for rule-based campaigns, improving efficiency at larger scales.

  • Co-developed an engineering pipeline to use Machine Learning models for campaigns, impacting revenue by $10M.

  • Developed an application that analyzes log messages and alerts when an application goes down.

  • Improved the existing algorithm's runtime by 3x and reduced its disk usage by 20x.

  • Worked in the Walmart Ad Exchange (WMX) team, building data pipelines for audience targeting and campaign measurement.

  • Co-implemented a propensity-segment pipeline, an ML-based audience targeter worth millions of dollars to Walmart, using data sketches to estimate audience size in real time.

  • Migrated data between clusters and updated pipeline logic and code whenever upstream data sources or infrastructure changed.
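
The real-time audience size estimates above rely on data sketches. For illustration only, here is a toy HyperLogLog-style sketch in pure Python; it omits the small- and large-range corrections a production sketch (e.g. in a library or in Spark) would apply:

```python
import hashlib

def _hash(value):
    """Deterministic 160-bit hash of a value."""
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

def estimate_cardinality(values, bucket_bits=10):
    """HyperLogLog-style distinct-count estimate: each bucket tracks the
    maximum 'rank' (trailing-zero run length + 1) of hashes routed to it."""
    m = 1 << bucket_bits
    buckets = [0] * m
    for v in values:
        h = _hash(v)
        bucket = h & (m - 1)          # low bits pick the bucket
        rest = h >> bucket_bits       # remaining bits determine the rank
        rank = 1
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        buckets[bucket] = max(buckets[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # standard HLL bias constant
    return int(alpha * m * m / sum(2.0 ** -b for b in buckets))

# 10,000 distinct values: the estimate lands within a few percent
print(estimate_cardinality(range(10000)))
```

The appeal of sketches for audience sizing is that they use fixed memory (here 1024 buckets) regardless of how many members the audience has, and two sketches can be merged with a bucket-wise max.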

USC Information Sciences Institute
Graduate Data Scientist
2015 - 2016 (1 year)
Remote
  • Implemented clustering and graph-algorithm techniques to identify human traffickers, reporting findings to DARPA.

  • Worked on DARPA-funded projects to detect online fraudulent activities in various domains like human trafficking, selling illegal weapons, and counterfeit electronics.

  • Gathered information from different websites, performed entity linking across them, and visualized the results.

Fiorano
Software Developer
2013 - 2014 (1 year)
Hyderabad, India
  • Developed the front end of a web application to manage APIs.

  • Built the client side of an API management web application using SmartGWT and Java, enabling API service providers to create proxy projects with customized quotas, data output formats, and added security features, and to modify user permissions.

Java, SmartGWT, Object-oriented Programming (OOP), Web Applications, APIs, API Management
Flipkart
Intern
2013 - 2013
Bangalore, India
  • Developed a sales dashboard showing which products were selling fastest, with filters for category, location, and time.

  • Developed a VBScript module that sends topline management a daily email update on sales.

  • Developed an inventory planning dashboard that estimates required inventory from past sales and the current pace of selling.

VBScript, Dashboard Development

Showcase

Clustering, Graph Analytics on Human Trafficking

This implementation was part of a human trafficking project. Grouped together ads that are syntactically similar, since such ads are most likely posted by the same operator. Modeled the data as a graph using Apache Spark GraphX and used PageRank and connected components to identify organized groups and major operators. The work produced strong results and was featured in Forbes. Tech stack: Spark, GraphX, Scala, Clustering. Link (pdf): http://iswc2015.semanticweb.org/sites/iswc2015.semanticweb.org/files/93670175.pdf
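
The two graph techniques named here can be sketched compactly. This is plain Python for illustration, not the GraphX implementation the project actually used:

```python
def connected_components(edges):
    """Union-find over undirected edges; returns node groups as frozensets."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return [frozenset(g) for g in groups.values()]

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed edge list."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for a, b in edges:
        out[a].append(b)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for m in out[n]:
                    new[m] += share
            else:  # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank
```

In the project, connected components surface the clusters of ads operated together, while PageRank ranks the most central ("big") operators within the graph.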

ClickThrough Prediction Rate

This was a Kaggle project. Predicted whether an ad would be clicked, using a feature set and comparing classifiers such as Decision Tree, Random Forest, and SVM. Used PCA for feature selection, plotted each classifier's results as ROC curves, and found Random Forest to be the best classifier. Tech stack: Python, Machine Learning.
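
The ROC comparison used to pick the best classifier reduces to a rank statistic: AUC is the probability that a random positive scores above a random negative. A small self-contained sketch, with made-up labels and scores (a real project would use a library routine such as scikit-learn's):

```python
def roc_auc(labels, scores):
    """AUC via the pairwise rank formulation: fraction of (positive, negative)
    pairs where the positive outscores the negative (ties count half)."""
    pairs = 0
    wins = 0.0
    for li, si in zip(labels, scores):
        for lj, sj in zip(labels, scores):
            if li == 1 and lj == 0:
                pairs += 1
                if si > sj:
                    wins += 1
                elif si == sj:
                    wins += 0.5
    return wins / pairs

labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.6, 0.4]  # hypothetical classifier outputs
print(roc_auc(labels, scores))  # → 0.8333...
```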

Recommendation System

Using the Audioscrobbler dataset, recommended artists to users with a latent factor model, achieving a mean per-user AUC of 0.96 (1.0 is best). Also implemented user-user and item-item collaborative filtering on an IMDB movie dataset. Tech stack: Python, Spark MLlib, Scala.
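
The user-user collaborative filtering mentioned here can be sketched in a few lines. The ratings matrix below is hypothetical, not the IMDB data, and the production version ran on Spark MLlib:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(target, ratings, top_n=2):
    """User-user CF: score items the target hasn't rated by
    similarity-weighted ratings from other users."""
    scores = {}
    for other, their in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], their)
        for item, r in their.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# hypothetical user -> {movie: rating} matrix
ratings = {
    "alice": {"matrix": 5, "inception": 4},
    "bob": {"matrix": 5, "inception": 5, "memento": 4},
    "carol": {"titanic": 5, "notebook": 4},
}
print(recommend("alice", ratings))  # → ['memento', 'titanic']
```

Item-item CF is the mirror image: compute similarities between item rating vectors instead of user vectors, which scales better when users far outnumber items.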

Education

MSc Data Informatics
University of Southern California
2015 - 2016 (1 year)
BE Electrical and Electronics Engineering
BITS Pilani - India
2009 - 2013 (4 years)