- Databases: MongoDB, PostgreSQL, DBeaver, Neo4j
- Data visualization: Power BI
- Cloud-based platforms: AWS, Databricks, Google Cloud Platform, Alibaba Cloud
- Other: Excel (VLOOKUP), KNIME (KNIME Python integration), Godot, SAS, Celonis (process mining), Docker
Experience
2025/1-
Washington University in St. Louis Research Data Analyst (Jan.2025-Present)
- REACT-APAC (Societal Resilience: Analysis of Concurrent Threats in the Asia Pacific Region):
Data Visualization: Create interactive dashboards in Shiny for Python to show the timeline of disasters; map disaster data with geoPython.
Data Reporting: Present the visualizations to PIs and other stakeholders.
- Goualougo Triangle Ape Project:
Trained an open-source wild animal identification model on camera trap data, achieving 85% overall accuracy on the species of interest.
- St. Louis Dashboard:
Develop an ETL pipeline on Google Cloud Platform that automates data transformation and validation,
with an admin user interface for no-code customization of the pipeline.
2024/9-2024/12
Northwestern University Teaching Assistant (Sept.2024-Dec.2024)
- Applied Statistics with R, Database Systems, Practical Machine Learning
- Held sessions to provide additional help and answer questions.
- Explained course material and concepts to students.
- Organized and oversaw group discussions and projects.
2024/6-2024/8
Baker Tilly Data Scientist Intern (June.2024-Aug.2024)
- Developed performance metrics to identify underutilized offices and
used geospatial data in Python to analyze how office location affects office
utilization by employees; the work is projected to improve cost efficiency by 10%
- Performed data validation and testing; told the data story through visualizations in Power BI
2024/3-2024/10
Realix AI Data Scientist Intern (March.2024-Oct.2024)
- Prompt engineering: fine-tuned an LLM, increasing the BLEU score by 20%
- Handled text processing, data cleaning, version control, and seed transcript generation. Designed and built an automated ETL pipeline using the AWS SDK for Python,
transforming raw user conversation data into ready-to-use training data.
2022
ARK.IO Data Analyst Intern (Jul.2022-Sep.2022)
- Monitored and managed an Alibaba Cloud database (memory and CPU utilization, indexes, etc.),
retrieved and updated data using SQL queries
- Generated timely reports on user and post data with Tableau
- Identified the most active users and created banners for them to boost user interaction; stored the data in the cloud for future analysis
Education
Northwestern University, Chicago, IL
M.S. Data Science with Specialization in AI, Sep.2023-Aug.2024
University of California, Santa Barbara, Santa Barbara, CA
Major: Statistics and Data Science | Minor: Labor Studies, Sep.2019-June.2023
Projects
2024:
Neural Networks: Explored neural network architectures through experiments (including CNN, RNN, LSTM, etc.).
[link]
2023:
German Credit Risk: The primary objective of this project is to develop a predictive model on the German credit risk dataset that accurately
forecasts the credit amount that potential borrowers are likely to receive.
[link]
Chicago Schools Data EDA: My main goal is to gain insight into educational equality in Chicago. To that end,
I examine student demographics and performance by location, and ask whether a school's
location and student body affect its college enrollment rate.
[link]
2022:
Survival Analysis of Political Leaders:
This data set documents party leadership succession in 23 parliamentary
democracies (as defined by Lijphart 1999). It has 25 columns and 4559 rows,
covering each leader’s country, party information, name, sex, and term
information, plus a status indicator that is 1 if the leader is still in office
and 0 if they are out of office. Many values are missing, however, because
information is lacking for some countries. In this project, we use tenure
(the leader’s time in office, in years) as the time variable and status as the
censoring indicator. The original paper studied the effect of succession on
terms; in this project, however, we ask whether there is any relationship
between calendar time and the length of tenure (for example, whether a more
recent election or in-office year is associated with a shorter term).
(Horiuchi and Laing, 2015)
[link]
Ground Ozone Data Time Series Analysis:
This report examines the monthly average of ground-level ozone in Los Angeles,
California from 2000 to 2020. After checking data transformations and differencing,
I find that the original time series does not need a transformation and that
it has a seasonal pattern but no significant trend. I identified candidate models
by examining the ACF and PACF of the differenced time series;
the best model, with the lowest AICc, is
a seasonal ARIMA model with (p, d, q)×(P, D, Q) = (1, 0, 1)×(1, 1, 2).
[link]