Shih-Hsuan is an entrepreneur, data scientist, and top competitor in machine learning competitions. He specializes in analyzing data pipelines and modeling business problems to deliver data projects with business impact. He built a real-time analytics system that monitored a national product roll-out and provided decision support. Shih-Hsuan excels at sales forecasting, niche image classification, short-text classification, and conditional text generation, and has a strong foundation in AI, machine learning, and statistics.
Built data pipelines to merge data from different sources across the company into a data warehouse.
Developed an automatic NLP merchandise classification system, including setting up the annotation procedure, data quality control, and the experimentation workflow.
Built a real-time analytics system that monitored a national product roll-out and provided decision support.
Challenge description: "In this third challenge based on the YouTube 8M dataset, Kagglers will localize video-level labels to the precise time in the video where the label appears and do this at an unprecedented scale. To put it another way, at what point in the video does the cat sneeze?" Solution: To deal with the limited number of annotated segments, video-level models were pre-trained on the YouTube-8M frame-level features dataset to create meaningful video representations from frames. The weights of these pre-trained models were then used to build two types of segment classifiers: context-aware and context-agnostic.
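The sketch below (PyTorch, assumed stack) illustrates the weight-reuse idea: a video-level model pre-trained on frame features lends its encoder to a segment classifier, where the context-aware variant also consumes a whole-video representation. Layer sizes, module names, and the mean-pooling choice are illustrative assumptions, not the actual competition code.

```python
import torch
import torch.nn as nn

FRAME_DIM, HIDDEN, N_CLASSES = 1152, 512, 1000  # YouTube-8M-like dimensions (illustrative)

class VideoLevelModel(nn.Module):
    """Pools frame features into a clip representation, then classifies.
    Assumed to be pre-trained on the YouTube-8M frame-level dataset."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(FRAME_DIM, HIDDEN), nn.ReLU())
        self.head = nn.Linear(HIDDEN, N_CLASSES)

    def represent(self, frames):                 # frames: (batch, time, FRAME_DIM)
        return self.encoder(frames).mean(dim=1)  # average-pool over time

    def forward(self, frames):
        return self.head(self.represent(frames))

class SegmentClassifier(nn.Module):
    """Initialized from the pre-trained video-level encoder.
    context_aware=True additionally feeds the whole-video representation."""
    def __init__(self, pretrained: VideoLevelModel, context_aware: bool):
        super().__init__()
        self.backbone = pretrained               # reuse pre-trained weights
        self.context_aware = context_aware
        in_dim = HIDDEN * 2 if context_aware else HIDDEN
        self.segment_head = nn.Linear(in_dim, N_CLASSES)

    def forward(self, segment_frames, video_frames=None):
        seg = self.backbone.represent(segment_frames)    # short annotated segment
        if self.context_aware:
            ctx = self.backbone.represent(video_frames)  # full-video context
            seg = torch.cat([seg, ctx], dim=-1)
        return self.segment_head(seg)
```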
This open-source project builds models that automatically paraphrase English sentences. It fine-tunes a pretrained T5 transformer model on several public paraphrase datasets. The fine-tuned model produces paraphrases that are both semantically and grammatically correct. Two fine-tuned models have been published on the Hugging Face Model Hub.
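A hedged sketch of querying such a fine-tuned model with the Hugging Face transformers library follows. The model ID is a placeholder rather than the actual published checkpoint, and the "paraphrase: " prefix is a common T5 convention that may differ from the one used in training.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder ID: substitute the checkpoint actually published on the Hub.
model_id = "your-username/t5-paraphraser"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "paraphrase: The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Sampling with several return sequences yields diverse paraphrase candidates.
outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=3,
)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))
```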
Inspired by recent developments in self-supervised learning for computer vision, I speculated that an unsupervised/self-supervised domain adaptation approach might help in these cases. We take a model pre-trained on ImageNet and run self-supervised learning on an unlabeled dataset from a different domain, hoping that this process transfers some general computer-vision knowledge into the new domain. The goal is greater label efficiency in the downstream tasks within the new domain. My preliminary experiments show visible improvements from the self-supervised domain adaptation approach using images from the downstream task. With longer pre-training and larger unlabeled datasets, further improvements are likely.
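One simple instantiation of this idea (an assumption for illustration, not necessarily the exact pretext task used in the experiments) is to continue training an ImageNet-pretrained backbone on unlabeled target-domain images with rotation prediction, then reuse the adapted backbone for the downstream task:

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(pretrained=True)          # start from ImageNet weights
backbone.fc = nn.Linear(backbone.fc.in_features, 4)  # 4-way head: 0/90/180/270 degrees
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def rotated_batch(images):
    """Build the self-supervised task: rotate each image (C, H, W) by a
    random multiple of 90 degrees and use the rotation index as the label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)]
    )
    return rotated, labels

# `unlabeled_loader` is assumed: a DataLoader yielding batches of
# target-domain image tensors with no labels.
def adapt(unlabeled_loader, epochs=5):
    backbone.train()
    for _ in range(epochs):
        for images in unlabeled_loader:
            rotated, labels = rotated_batch(images)
            loss = criterion(backbone(rotated), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

After adaptation, the rotation head is discarded and the backbone's weights initialize the downstream model, which is where the label-efficiency gain would show up.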
High-ranking results in forecasting competitions:
1. Corporación Favorita Grocery Sales Forecasting: predicting sales for a large grocery chain; placed 20th out of 1,671 teams
2. Recruit Restaurant Visitor Forecasting: predicting how many future visitors a restaurant will receive; placed 21st out of 1,248 teams
3. Web Traffic Time Series Forecasting: forecasting future traffic to Wikipedia pages; placed 43rd out of 1,095 teams
Education
Master's Degree in Applied Statistics
Australian National University
2014 - 2015 (1 year)
Bachelor of Science Degree in Computer Science and Information Engineering