Shreya S.

Denton, TX, United States of America

About Me

Shreya is an IT professional specializing in database administration, data analytics, and Big Data technologies. She has extensive hands-on expertise with the Hadoop ecosystem, including Apache Spark, MapReduce, Spark Streaming, PySpark, Hive, HDFS, Kafka, Sqoop, and Oozie. Shreya designs and implements end-to-end ETL workflows using Azure Data Factory and Databricks, leveraging PySpark and Spark SQL for scalable data processing. She is also skilled in developing CI/CD scripts and managing automated deployment pipelines within Azure environments.

AI, ML & LLM

Apache Airflow, Machine Learning, Artificial Neural Networks (ANN), Naive Bayes

Other

Data Analytics, Big Data, Streaming Data, Kafka, Snowflake, PySpark, Hadoop, HDFS (Hadoop Distributed File System), Apache Spark, MapReduce, Spark Streaming, Hive, Sqoop, Oozie, ETL Pipelines, Teradata, HBase, Apache Pig, Apache Flume, Apache Cassandra, Apache Impala, Apache ZooKeeper, NumPy, Pandas, SciPy, scikit-learn, Matplotlib, Seaborn, Scala, Linux, Shell Scripting, Support Vector Machines (SVM), Decision Tree, Random Forest, K-Nearest Neighbors (KNN), Gradient Boosting, MapR, Tableau, OLAP, Netezza, Resilient Distributed Datasets (RDD)

Work history

Charter Communications
Senior Database Engineer/Administrator
2025 - Present
Remote
  • Using Azure Data Factory (ADF) to orchestrate and automate data ingestion pipelines from diverse source systems into Snowflake.

  • Developing robust ADF pipelines and using Databricks with PySpark for scalable data transformation, cleansing, and aggregation.

  • Building and managing Databricks clusters and integrating Kafka for streaming ingestion.

  • Integrated LLM-based automation within data quality and monitoring workflows, using OpenAI APIs to auto-summarize pipeline alerts and anomaly reports in Databricks (a sketch of this pattern follows this list).

  • Designed and deployed AI-driven metadata generation scripts that automatically tagged datasets and lineage in Snowflake, improving data discoverability and governance.

  • Collaborated with ML engineers to build feature-ready datasets for AI/ML pipelines, ensuring scalable ingestion from streaming and batch data sources.

  • Partnered with analytics teams to fine-tune LLM prompts for telecom-specific use cases.

  • Collaborating with data scientists and business stakeholders to design analytical data models in Snowflake that support self-service BI, machine learning, and real-time dashboards.

  • Led the development of a Kafka-Spark-Snowflake prototype to simulate real-time data ingestion and analytics for Big Data consulting use cases (see the streaming sketch after this list).

  • Migrated legacy Cosmos DB event sourcing components into a modern Snowflake-based architecture using Snowpipe for real-time ingestion and DBT for data modeling.

  • Implemented Azure Active Directory (AAD) integration for secure access control across services including Databricks, ADF, and Snowflake.

  • Supported large-scale SQL environments involving complex queries, stored procedures, triggers, and performance tuning across multiple servers and databases.

  • Supporting microservices deployment and orchestration in containerized environments using Docker and Kubernetes.

  • Migrated legacy ETL workflows to modern ETL pipelines using ADF, DBT, and Snowflake, significantly improving pipeline maintainability, scalability, and auditability.

  • Designed and implemented incremental data loading strategies and integrated Azure Key Vault for secure credential management in ADF and Databricks.

  • Using Snowflake Streams and Tasks for real-time change data capture (CDC) and developing robust data quality checks and validations using DBT tests.

  • Created parameterized and dynamic ADF pipelines and built reusable PySpark modules for complex joins, aggregations, and data enrichment operations (an example module is sketched after this list).
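
A minimal sketch of the LLM-based alert summarization described above, assuming the openai Python SDK; the model name, system prompt, and alert payload are illustrative rather than the production configuration.

    # Summarize a raw pipeline alert with an LLM (illustrative sketch).
    # Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    client = OpenAI()

    def summarize_alert(alert_text: str) -> str:
        """Return a short, human-readable summary of a raw pipeline alert."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system", "content": "Summarize data-pipeline alerts in two sentences."},
                {"role": "user", "content": alert_text},
            ],
        )
        return response.choices[0].message.content

    # e.g. summarize_alert("Task copy_orders failed: Snowflake COPY rejected 1,204 rows")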
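
A sketch of the Kafka-Spark-Snowflake prototype's data path, assuming the Spark-Snowflake connector; the broker, topic, and table names are placeholders, and credentials (handled via Key Vault per the bullets above) are omitted.

    # Kafka -> Spark Structured Streaming -> Snowflake (illustrative sketch).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-snowflake").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "pipeline-events")            # hypothetical topic
        .load()
        .select(col("value").cast("string").alias("payload"))
    )

    # Placeholder connection options; real values would come from Azure Key Vault.
    sf_options = {
        "sfURL": "<account>.snowflakecomputing.com",
        "sfDatabase": "ANALYTICS",
        "sfSchema": "RAW",
        "sfWarehouse": "LOAD_WH",
    }

    def write_batch(batch_df, batch_id):
        # The Spark-Snowflake connector writes in batches, so streaming goes through foreachBatch.
        (batch_df.write.format("net.snowflake.spark.snowflake")
            .options(**sf_options)
            .option("dbtable", "PIPELINE_EVENTS")
            .mode("append")
            .save())

    events.writeStream.foreachBatch(write_batch).start().awaitTermination()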
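
And a minimal sketch of a reusable PySpark enrichment module of the kind mentioned in the last bullet; the function and column names are hypothetical.

    # Reusable join-and-aggregate helper (illustrative sketch).
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def enrich_and_aggregate(facts: DataFrame, dims: DataFrame,
                             join_keys: list, group_cols: list, amount_col: str) -> DataFrame:
        """Left-join a fact DataFrame to a dimension DataFrame, then aggregate per group."""
        joined = facts.join(dims, on=join_keys, how="left")
        return (joined.groupBy(*group_cols)
                      .agg(F.sum(amount_col).alias("total_amount"),
                           F.count("*").alias("row_count")))

    # Usage: enrich_and_aggregate(orders_df, customers_df, ["customer_id"], ["region"], "order_total")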

Database Engineering, Database Administration (DBA), Azure Data Factory, Data Pipelines, Snowflake, PySpark, Azure Databricks, Data Transformation, Data Cleansing, Data Aggregation, Kafka, Kafka Streams, OpenAI, Large Language Models (LLMs), AI/ML, Big Data, Spark, Snowpipe, Data Build Tool (dbt), Data Modeling, Azure Cosmos DB, Azure Active Directory, MySQL, Performance Tuning, SQL Stored Procedures, SQL Triggers, Docker, Kubernetes, Microservices, ETL, ETL Pipelines, Data Loading, Azure Key Vault, CDC
Walmart
Senior Database Engineer
2024
Bentonville, United States of America
  • Developed scalable Spark applications using Scala on Google Cloud Dataproc to process batch and streaming data from multiple RDBMS and messaging sources.

  • Designed and implemented real-time data pipelines using Kafka (hosted on GKE) and Spark Structured Streaming to process event-driven datasets.

  • Integrated Google Pub/Sub with Apache Spark for ingestion of real-time messages from various streaming sources, enabling seamless data movement in GCP.

  • Installed and configured Kafka Manager for consumer lag monitoring, topic management, and partition analysis within GCP Compute Engine clusters.

  • Created end-to-end AI data pipelines for training and inference workflows using Spark MLlib, integrating with Vertex AI for model deployment.

  • Built and deployed Scala-based microservices to consume real-time data streams and perform intelligent transformations before persisting into BigQuery.

  • Developed custom Machine Learning models using MLlib to classify streaming data for anomaly detection and user behavior prediction.

  • Created user-defined functions (UDFs) in Scala for custom business logic used within Spark and SQL transformations in BigQuery (a Python analogue is sketched after this list).

  • Integrated Cloud Functions to trigger downstream AI processes based on file drops or Pub/Sub events, enhancing automation and responsiveness (a handler sketch follows this list).

  • Leveraged Google Cloud IAM for fine-grained access control and KMS for encrypting sensitive data within pipelines and ML models.

  • Built a Spark MLlib-based recommendation engine prototype, trained it on customer interaction data stored in BigQuery, and exposed it via REST APIs.

  • Implemented BigQuery ML for in-database machine learning, delivering scalable insights without data movement and integrating the results with dashboards (a training sketch follows this list).

  • Developed advanced AI models on GCP Vertex AI, integrating with Dataproc for large-scale model training and Cloud Storage for dataset versioning.

  • Designed ELT automation scripts in Scala and Python to move and transform data from Cloud SQL, GCS, and external APIs.

  • Delivered detailed technical documentation and design artifacts for AI-driven data pipelines, including data flow diagrams, transformation logic, and operational runbooks.
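
A Python analogue of the Scala UDF work described above (the original was Scala; PySpark is used here for consistency with the rest of these sketches, and the normalization rule is hypothetical).

    # A Spark UDF wrapping custom business logic (illustrative sketch).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    @udf(returnType=StringType())
    def normalize_sku(raw_sku):
        """Trim, upper-case, and zero-pad a SKU code (hypothetical rule)."""
        if raw_sku is None:
            return None
        return raw_sku.strip().upper().zfill(10)

    df = spark.createDataFrame([(" ab123 ",), (None,)], ["sku"])
    df.select(normalize_sku("sku").alias("sku_clean")).show()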
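
A sketch of a Pub/Sub-triggered Cloud Function handler in the legacy background-function style; the downstream call is a placeholder.

    # Pub/Sub-triggered Cloud Function (illustrative sketch).
    import base64

    def handle_event(event, context):
        """Entry point: `event` carries the Pub/Sub message, `context` its metadata."""
        payload = base64.b64decode(event["data"]).decode("utf-8") if "data" in event else ""
        print(f"Received message {context.event_id}: {payload}")
        # A real handler would trigger the downstream AI step here, e.g. a Dataproc
        # job or a Vertex AI pipeline run (omitted in this sketch).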
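
And a minimal sketch of training an in-database model with BigQuery ML from Python, assuming the google-cloud-bigquery client; the project, dataset, and feature names are placeholders.

    # Train a BigQuery ML model without moving data out of BigQuery (illustrative sketch).
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project

    create_model_sql = """
    CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM `my-project.analytics.customer_features`
    """

    client.query(create_model_sql).result()  # blocks until training completes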

Database Engineering, Database Administration (DBA), Google Cloud Dataproc, Spark, Scala, RDBMS, Google Kubernetes Engine (GKE), Kafka, Data Pipelines, GCP, Apache Spark, Google Pub/Sub, Vertex AI, AI Model Integration, MLlib, GCP BigQuery, Microservices, Machine Learning, Apache Airflow, Directed Acyclic Graphs (DAG), User-Defined Functions (UDF), Google Cloud Functions, Cloud Key Management Service (KMS), Identity & Access Management (IAM), REST APIs, Recommender Engine, Dataproc, AI Model Training, AI Modeling, Python, ELT
CGI
Database Engineer
2021 - 2023 (2 years)
Hyderabad, India
  • Designed, developed, and deployed batch and streaming pipelines using AWS services.

  • Developed data pipelines using cloud and container services such as Docker and Kubernetes, AWS Glue, and PySpark jobs on an EMR cluster (a Glue job skeleton appears after this list).

  • Designed and developed monitoring solutions using Amazon CloudWatch, AWS IAM, AWS Glue, and Amazon QuickSight.

  • Used Lambda, Glue, EMR, EC2, and EKS for data processing and developed data marts, data lakes, and data warehouses using AWS services.

  • Maintained the Hadoop cluster on AWS EMR and migrated an existing on-premises application to AWS.

  • Created, debugged, scheduled, and monitored Airflow jobs for ETL batch processing, loading results into Snowflake for analytical workloads (an example DAG follows this list).

  • Built ETL pipelines for data ingestion, transformation, and validation on AWS, working alongside data stewards to satisfy data compliance requirements.

  • Designed and developed end-to-end ETL pipelines using Informatica and Python and implemented data validation and cleansing frameworks.

  • Optimized data transformation logic and SQL scripts, improving ETL performance and reducing load times by over 25%.

  • Automated recurring data ingestion workflows using Azure Data Factory (ADF) and Airflow, integrating structured and unstructured datasets across on-prem and cloud systems.

  • Developed Spark applications using Scala and Java and implemented Apache Spark data processing to handle data from various RDBMS and streaming sources.
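
A skeleton of a Glue-style PySpark job like those described above; the catalog database, table, and S3 bucket are placeholders, and the script assumes the AWS Glue job runtime (the awsglue libraries are only available there).

    # AWS Glue PySpark job skeleton (illustrative sketch).
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.transforms import Filter
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the Glue Data Catalog, drop rows with null keys, write Parquet to S3.
    dyf = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders")
    cleaned = Filter.apply(frame=dyf, f=lambda row: row["order_id"] is not None)
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/orders/"},  # hypothetical bucket
        format="parquet",
    )
    job.commit()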
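
And a minimal Airflow DAG of the kind used to schedule the Snowflake loads; the task logic and connection details are placeholders.

    # Daily ETL DAG feeding Snowflake (illustrative sketch).
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_and_transform(**_):
        print("extract from source, transform, stage files")  # placeholder for real ETL logic

    with DAG(
        dag_id="daily_snowflake_load",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        etl = PythonOperator(task_id="extract_and_transform",
                             python_callable=extract_and_transform)
        # A SnowflakeOperator from the Snowflake provider would typically follow
        # to run the COPY/MERGE step.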

Cyient
Python Developer
2020 - 2021 (1 year)
Hyderabad, India
  • Built a web application using Django, Flask, Jinja, Python, WSGI, Redis, PostgreSQL, and DynamoDB.

  • Wrote Python scripts to parse XML documents and load the data into the database (a runnable sketch follows this list).

  • Developed web-based applications using Python, CSS, and HTML.

  • Developed applications with XML, JSON, XSL (PHP, Django, Python, Rails).

  • Wrote subqueries, stored procedures, triggers, cursors, and functions on MySQL and PostgreSQL databases.

  • Developed web-based applications using Python, Django, PHP, C++, XML, CSS, HTML, DHTML, JavaScript, and jQuery.

  • Worked in WAMP (Windows, Apache, MySQL, and Python/PHP) and LAMP (Linux, Apache, MySQL, and Python/PHP) architectures.

  • Developed views and templates with Python and Django's view controllers and templating language to create a user-friendly website interface (a minimal view sketch follows this list).

  • Worked with various Python IDEs, including PyCharm, PyScripter, Spyder, PyStudio, and PyDev.
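
A runnable sketch of the XML parse-and-load scripts mentioned above; the original work targeted MySQL/PostgreSQL, but sqlite3 is used here so the example is self-contained, and the document structure is hypothetical.

    # Parse an XML document and load its rows into a database (illustrative sketch).
    import sqlite3
    import xml.etree.ElementTree as ET

    XML_DOC = """
    <books>
      <book id="1"><title>Dune</title><author>Frank Herbert</author></book>
      <book id="2"><title>Hyperion</title><author>Dan Simmons</author></book>
    </books>
    """

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")

    root = ET.fromstring(XML_DOC)
    rows = [(int(b.get("id")), b.findtext("title"), b.findtext("author"))
            for b in root.iter("book")]
    conn.executemany("INSERT INTO books VALUES (?, ?, ?)", rows)
    conn.commit()

    print(conn.execute("SELECT * FROM books").fetchall())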
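
And a minimal sketch of a Django view plus URL route like those described above; the app, template path, and data are hypothetical, and a real project would keep urlpatterns in urls.py.

    # Django view and route (illustrative sketch).
    from django.shortcuts import render
    from django.urls import path

    def book_list(request):
        """Render templates/books/list.html with a context dict."""
        books = [{"title": "Dune"}, {"title": "Hyperion"}]  # placeholder for an ORM queryset
        return render(request, "books/list.html", {"books": books})

    urlpatterns = [path("books/", book_list, name="book-list")]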

Python, Django, Flask, Jinja, Web Server Gateway Interface (WSGI), Redis, PostgreSQL, DynamoDB, Web App Development, HTML, CSS, Python Scripting, XML, Document Parsing, Data Loading, JSON, XSL, PHP, Rails, SQL Stored Procedures, SQL Triggers, SQL Functions, MySQL, jQuery, JavaScript, C++, DHTML, LAMP, WAMP, PyCharm, Spyder

Education

MSc Computer Science
University of North Texas
2023 - 2024 (1 year)
B.Tech Computer Science
Gokaraju Rangaraju Institute of Engineering and Technology (GRIET) - India
2016 - 2020 (4 years)