Mariel is a qualified Data Scientist specialized in Artificial Intelligence, with 7+ years of experience in creative modeling of complex data for classification and prediction using Python and R. Mariel is proficient in data analysis and in communicating results to audiences with diverse backgrounds. Contributed to multiple research projects while a Bioinformatics Ph.D. candidate in the School of Mathematics, Statistics and Applied Mathematics at the National University of Ireland, Galway, with peer-reviewed publications in Artificial Intelligence and Bioinformatics.
The aim of this project is to improve risk prediction in kidney transplant patients using gene expression data. The project is based on the development of an Optimised Ensemble Learning System, composed of five supervised learning algorithms and a binary comparator, for predicting antibody-mediated rejection from gene expression data. Performed data pre-processing and analysis, applied two random oversampling approaches to deal with imbalanced data, and developed in Python an Ensemble Learning System with a Particle Swarm Optimisation (PSO) approach for hyper-parameter optimisation. Performed statistical analysis and interpretation of the results, and co-wrote the paper. On publicly available data, the Ensemble-PSO achieved higher predictive performance than the Ensemble alone when a random oversampling technique was used. Technologies used in the project: Python, Sklearn, NumPy, Pandas.
The application takes as user input two folders of vocalization files from two study groups and a file with the complete vocalization alphabet for the specific model; the R Shiny app then generates results for Entropy analysis, Markov Models, Linear Models, and Classification. The user can select the Entropy levels and pseudocount values for the analysis and download all results as a PDF report. I developed a Python implementation of Shannon’s Entropy function at four levels of complexity and used the Reticulate R package to provide the Python interface for the R Shiny code. Using the Reticulate interface and the transition probability matrices generated in Python, I developed a Markov Chain model in R to analyze call transitions, using Jensen-Shannon divergence as a measure of similarity between transition probabilities. I also developed in R a mixed-effects linear model to assess the appropriate entropy level for classification analysis, and a classification analysis using Boruta, a wrapper algorithm built around the Random Forest approach. Finally, I developed the R Shiny dashboard application. The application is already in use in collaborative work with a Psychiatry research group studying autism spectrum disorder. Technologies used in the project: R, R Shiny, Boruta, Reticulate, Python, NumPy, Pandas.
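As an illustration of the information-theoretic quantities named above, the following is a minimal NumPy sketch, not the project's actual code. The function names, the first-order transition model, the default pseudocount, and the choice to compare whole matrices by flattening and renormalizing them are all assumptions made for this example.

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # terms with p = 0 contribute nothing to the sum
    return float(-(p * np.log2(p)).sum())

def transition_matrix(sequence, alphabet, pseudocount=1.0):
    """Row-normalized first-order transition probabilities between calls.

    The pseudocount (assumed > 0 here) keeps unseen transitions from
    producing empty rows.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    counts = np.full((len(alphabet), len(alphabet)), pseudocount, dtype=float)
    for current, following in zip(sequence, sequence[1:]):
        counts[index[current], index[following]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(M) - (H(P) + H(Q)) / 2."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    p, q = p / p.sum(), q / q.sum()  # illustrative: treat each input as one distribution
    m = 0.5 * (p + q)
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))

# Hypothetical call sequences over a three-letter alphabet.
alphabet = ["a", "b", "c"]
group_1 = transition_matrix(list("abcabcabca"), alphabet)
group_2 = transition_matrix(list("aaabbbcccab"), alphabet)
divergence = js_divergence(group_1, group_2)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no shared support), which makes values comparable across groups.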
Tertiary protein structure prediction is one of the most challenging problems in structural bioinformatics because of the combinatorial explosion of plausible shapes that a protein can assume. In this project, a new computational approach to this problem was proposed, using information from the Protein Data Bank about the neighborhood of amino acids (protein building blocks) and their propensity to assume a certain three-dimensional structure. The project involved developing in Python two metaheuristics to optimize the physicochemical function that governs the folding mechanism leading to the active form of the protein, which is a possible therapeutic target for personalized medicine. I developed a Python implementation of a Particle Swarm Optimisation (PSO) approach to Protein Structure Prediction, an NP-complete problem, using information from the Angle Probability List. Performed data analysis and interpretation of results, and co-wrote the paper. My work on the PSO approach helped demonstrate the improvement in predictive performance obtained by using the Angle Probability List. The work was published in the peer-reviewed Journal of Computational Biology and Chemistry and has 31 citations. Technologies used in the project: Python, SciPy, NumPy, Artificial Intelligence, Optimization, Metaheuristics.
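For readers unfamiliar with PSO, the sketch below shows the general shape of a global-best particle swarm optimizer in NumPy. It is a generic illustration under stated assumptions, not the published method: the objective is a simple stand-in (a sphere function) rather than a protein energy function, the Angle Probability List is not used, and the function name and the inertia, cognitive, and social coefficients are illustrative choices.

```python
import numpy as np

def pso_minimize(objective, bounds, n_particles=30, n_iters=200,
                 inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Minimize `objective` inside box `bounds` with a global-best PSO."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    dim = low.size

    # Random starting positions inside the box, zero initial velocities.
    pos = rng.uniform(low, high, size=(n_particles, dim))
    vel = np.zeros_like(pos)

    # Best position found so far by each particle, and by the whole swarm.
    best_pos = pos.copy()
    best_val = np.array([objective(x) for x in pos])
    g = best_val.argmin()
    gbest_pos, gbest_val = best_pos[g].copy(), best_val[g]

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity mixes momentum, pull to personal best, pull to swarm best.
        vel = (inertia * vel
               + cognitive * r1 * (best_pos - pos)
               + social * r2 * (gbest_pos - pos))
        pos = np.clip(pos + vel, low, high)  # keep particles inside the box

        vals = np.array([objective(x) for x in pos])
        improved = vals < best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = vals[improved]

        g = best_val.argmin()
        if best_val[g] < gbest_val:
            gbest_pos, gbest_val = best_pos[g].copy(), best_val[g]

    return gbest_pos, float(gbest_val)

# Stand-in objective: a sphere function whose minimum is 0 at the origin.
def sphere(x):
    return float(np.sum(x ** 2))

solution, value = pso_minimize(sphere, bounds=[(-5.0, 5.0)] * 3)
```

In a structure prediction setting, each particle would instead encode a candidate conformation (for example, a vector of torsion angles) and the objective would be an energy function; those details are outside this sketch.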
2020 - Present (3 years)
Create and implement data analysis pipelines, including data access, ingestion, munging / manipulation / cleansing, analysis / modelling, testing, deployment / integration into business applications and services.
Enhance operational aspects of businesses by increasing control of the company's data.
Work in cross-functional teams to provide data-driven solutions for increased efficiency and productivity.
Researched automated classification of ultrasonic vocalization patterns in a genetic mouse model of autism spectrum disorder.
Applied information-theoretic and statistical approaches to the study of vocalization patterns in wild-type and heterozygous Tbx1 mice, which show distinct phenotypes in terms of neuronal development and social communication. Supervisor: Pilib Ó Broin, PhD
Performed data analysis and interpretation of results, and co-wrote the paper.