Shih-hsuan L.

Senior Data Scientist

Keelung, Taiwan

About Me

Shih-Hsuan is an entrepreneur, data scientist, and top competitor in machine learning competitions. He specializes in analyzing data pipelines and modeling business problems to deliver data projects with business impact. He built a real-time analytics system that monitored national product roll-out and provided decision support. Shih-Hsuan excels at sales forecasting, niche image classification, short text classification, and conditional text generation, along with AI, ML, and statistics.

Work history

Veritable Technology, Co.
Data Scientist and Founder
2018 - Present (6 years)
  • Won seventh place in the third YouTube Video Understanding Challenge and published a paper in its ICCV 2019 workshop.

  • Assisted clients that required expertise in data science, machine learning, and artificial intelligence.

  • Created open source research projects, indie data products, and public technical notes and tutorials to help democratize AI.

Chief Data Scientist
2017 - 2018 (1 year)
  • Built data pipelines to merge data from different sources across the company into a data warehouse.

  • Developed an automatic NLP merchandise classification system, including setting up an annotation procedure, data quality control, and experiment processes.

  • Built a real-time analytics system that monitored national product roll-out and provided decision support.

Senior Data Scientist
2015 - 2016 (1 year)
  • Developed a customer churn prediction system for a mobile phone company.

  • Developed a sales and inventory monitoring and forecasting system for a smart vending machine company.

  • Implemented anomaly detection algorithms in the company's analytics SaaS product.

Software Engineer
2013 - 2015 (2 years)
  • Maintained the back end of the company's NLP public opinion analysis product.

  • Developed data visualizations for the customer-facing dashboard.

  • Maintained the scraping system and merged it with the firehoses from commercial data providers.


Seventh Place Solution to The Third YouTube-8M Video Understanding Challenge

Challenge description: "In this third challenge based on the YouTube 8M dataset, Kagglers will localize video-level labels to the precise time in the video where the label appears and do this at an unprecedented scale. To put it another way, at what point in the video does the cat sneeze?"

Solution: To deal with the limited number of annotated segments, video-level models were pre-trained on the YouTube-8M frame-level features dataset to create meaningful video representations from frames. The weights of the two models were used to build two types of segment classifiers: context-aware and context-agnostic.

Paraphrasing English Sentences

This open-source project builds models that automatically paraphrase English sentences. It fine-tunes a pretrained T5 transformer model on several public paraphrase datasets. The fine-tuned model can create paraphrases that are both semantically and grammatically correct. Two fine-tuned models have been published on the Hugging Face Model Hub.

Self-Supervised Domain Adaptation

Inspired by recent developments in self-supervised learning in CV, I speculated that an unsupervised/self-supervised domain adaptation approach might help when labels in a new domain are scarce. I take a model pre-trained on ImageNet and run self-supervised learning on an unlabeled dataset from a different domain, hoping that this process will transfer some general CV knowledge into the new domain. The goal is to achieve better label efficiency in downstream tasks within the new domain.

My preliminary experiments show visible improvements from the self-supervised domain adaptation approach using images from the downstream task. With longer pre-training and bigger unlabeled datasets, further improvements are probably achievable.

Forecasting Challenges

High-ranking results in forecasting competitions:
1. Corporación Favorita Grocery Sales Forecasting: predicting sales for a large grocery chain (placed 20th out of 1,671 teams)
2. Recruit Restaurant Visitor Forecasting: predicting how many future visitors a restaurant will receive (placed 21st out of 1,248 teams)
3. Web Traffic Time Series Forecasting: forecasting future traffic to Wikipedia pages (placed 43rd out of 1,095 teams)


Master's Degree in Applied Statistics
Australian National University
2014 - 2015 (1 year)
Bachelor of Science Degree in Computer Science and Information Engineering
National Taiwan University
2004 - 2009 (5 years)