Marco Sburlino

© 2026 Marco Sburlino. All rights reserved.

Portfolio

Projects

A showcase of data science applications, interactive demos, and research papers. Filter by type, technology, or category to find what interests you.

Speakrow.com - Your Linguistic Reach

Full App

Speakrow.com is a web app that shows what share of the world's population you can communicate with, based on the languages you speak. Select the languages you know, and an interactive world map updates instantly to show where you can be understood. The app also calculates your overall global coverage, assigns a survival score, and breaks down your reach by continent with summary statistics.
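As a rough illustration of the coverage calculation described above, here is a minimal Python sketch. It is not Speakrow's actual code: the country names, populations, and speaker shares are invented, and the rule of taking the best single language per country (to avoid double-counting multilingual speakers) is an assumption made for this example.

```python
# Toy data: every figure below is invented for illustration, not a real statistic.
# name: (continent, population in millions, {language: share of population speaking it})
COUNTRIES = {
    "Northland": ("Europe", 10.0, {"english": 0.5, "german": 0.9}),
    "Eastland": ("Europe", 30.0, {"english": 0.2}),
    "Southland": ("South America", 60.0, {"spanish": 0.95, "english": 0.1}),
}


def reach_by_country(selected):
    """People (in millions) reachable per country.

    Assumption: reach in a country is the share of the best single selected
    language, a conservative choice that avoids double-counting people who
    speak several of the selected languages.
    """
    reach = {}
    for name, (_continent, population, shares) in COUNTRIES.items():
        best = max((shares.get(lang, 0.0) for lang in selected), default=0.0)
        reach[name] = population * best
    return reach


def global_coverage(selected):
    """Fraction of the total (toy) population that can be reached."""
    total = sum(population for _c, population, _s in COUNTRIES.values())
    return sum(reach_by_country(selected).values()) / total


def coverage_by_continent(selected):
    """Fraction of each continent's (toy) population that can be reached."""
    reach = reach_by_country(selected)
    reached, totals = {}, {}
    for name, (continent, population, _shares) in COUNTRIES.items():
        totals[continent] = totals.get(continent, 0.0) + population
        reached[continent] = reached.get(continent, 0.0) + reach[name]
    return {continent: reached[continent] / totals[continent] for continent in totals}


if __name__ == "__main__":
    languages = {"english", "spanish"}
    print(f"Global coverage: {global_coverage(languages):.0%}")
    for continent, share in coverage_by_continent(languages).items():
        print(f"{continent}: {share:.0%}")
```

Adding a language can only raise or keep the coverage under this rule, which matches the intuition behind an "instantly updating" map.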

Web Application
Claude Code
Geography
Fun Apps
Data Pipelines

Forecasting Daily Electricity Load in Denmark: A Comparative Analysis with Renewable Energy Integration

Paper

This report documents my university project for the Predictive Analytics course at Copenhagen Business School. The study forecasts daily electricity load in Denmark and compares traditional time series approaches with models that integrate renewable energy generation data. Using a five-year dataset (2016–2020) from the Open Power System Data platform, the pipeline began with exploratory data analysis, which revealed strong weekly seasonality and structural breaks, followed by stationarity testing, which showed that first-order differencing was needed. The core analysis trained and evaluated three modeling techniques: a Seasonal Naive baseline, standard and seasonal ARIMA models, and a Dynamic Regression model with wind and solar generation as regressors. The models were checked with Ljung-Box diagnostic tests and compared across forecast horizons. Auto-ARIMA performed best for short-term predictions, while the Dynamic Regression model was more accurate for longer 30-day forecasts because it captured weather-driven demand fluctuations.
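To make the baseline concrete, here is a small Python sketch of a Seasonal Naive forecast with a weekly period, together with first-order differencing. The project itself was done in R, so this is an illustration of the two ideas rather than the report's code. The function names are hypothetical, and mean absolute error is used only as an example metric, since the metric used in the report is not stated here.

```python
def seasonal_naive(history, horizon, period=7):
    """Forecast by repeating the last full season of observed values.

    With daily data and period=7, the forecast for next Monday is simply
    the value observed on the most recent Monday, and so on.
    """
    if len(history) < period:
        raise ValueError("need at least one full season of history")
    last_season = history[-period:]
    return [last_season[step % period] for step in range(horizon)]


def first_difference(series):
    """First-order differencing: the change from each day to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def mean_absolute_error(actual, forecast):
    """Average absolute gap between observed and forecast values."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


if __name__ == "__main__":
    # Two weeks of made-up daily load values, purely for demonstration.
    load = [100, 102, 101, 103, 99, 80, 78, 101, 103, 102, 104, 100, 81, 79]
    print(seasonal_naive(load, horizon=10))
    print(first_difference(load)[:5])
```

A baseline like this is useful mainly as a yardstick: a more complex model is only worth keeping if it beats the repeated last week on held-out data.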

R
Predictive Analytics
Time Series
ARIMA
Dynamic Regression
Modeling Traffic Delay Severity Caused by Accidents: A Machine Learning Approach

Paper

This project investigates the use of machine learning to classify the severity of traffic delays caused by roadway accidents, using only features available at the time of the incident. The problem addressed is the need to identify high-impact events early enough to support traffic management and routing decisions. The research question is how accident-related delay severity can be predicted from real-time features, with a focus on minimizing false negatives for high-severity cases. Concepts applied include supervised classification, class balancing, feature engineering, and model validation. The analysis is based on the US Accidents dataset of over 7.7 million records, which was cleaned, binarized, and balanced before training four models. Histogram-Based Gradient Boosting (HGBoost) achieved the highest recall at 0.79, outperforming Random Forest, Logistic Regression, and Multilayer Perceptron, which showed higher accuracy but lower sensitivity to severe cases. These results suggest that HGBoost is best suited to applications that prioritize the identification of high-severity delays. It is the recommended model when recall is the primary objective and training efficiency also matters.
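The trade-off between accuracy and recall mentioned above can be seen in a short Python sketch. The labels and predictions are made up for illustration and are unrelated to the US Accidents data or to the models in the paper; the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # a severe case the model missed
        elif pred == positive:
            fp += 1  # a false alarm
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model caught."""
    tp, _fp, fn, _tn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of all predictions that were correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # 1 = high-severity delay, 0 = low-severity delay (toy labels).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    cautious = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]      # flags more events
    conservative = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # flags fewer events
    for name, preds in [("cautious", cautious), ("conservative", conservative)]:
        print(name, "recall:", recall(y_true, preds), "accuracy:", accuracy(y_true, preds))
```

In this toy case the conservative model is more accurate overall yet misses half of the severe events, which is why recall is the better yardstick when false negatives are the costly error.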

Python
Machine Learning
Supervised Learning
Logistic Regression
Random Forest

PII Masking and Risk Assessment in Unstructured Text: An NLP-Based Approach

Paper

This report presents a university project for our Natural Language Processing and Text Analytics course at Copenhagen Business School, completed by our team: Eduard Aguado, Federico De Marinis, Noe Juarez, and Marco Sburlino. Our objective was to develop an automated system for detecting Personally Identifiable Information (PII) in unstructured text and classifying documents by privacy risk level to support GDPR-compliant data handling. Using the PII Masking 300K dataset, we designed an end-to-end pipeline that included data cleaning, exploratory analysis, BIO-based token labeling, weighted risk scoring, and binary risk classification (Low vs. High). We implemented and evaluated multiple approaches: TF-IDF with Logistic Regression and Multinomial Naive Bayes as baselines, a fine-tuned DistilBERT model for contextual token classification, and GPT-3.5 via API for zero-shot PII extraction. Models were compared using recall, F1-score, and accuracy, with particular emphasis on minimizing false negatives in high-risk cases. DistilBERT achieved the strongest overall performance, demonstrating the advantage of transformer-based architectures for context-aware PII detection and privacy risk assessment in text.
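The BIO labeling and weighted risk scoring steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's pipeline: the entity labels, the weights, and the threshold are all hypothetical, and the real PII Masking 300K label set and the scoring rule used in the report may differ.

```python
# Hypothetical weights per entity type; higher means more sensitive.
RISK_WEIGHTS = {"NAME": 1.0, "EMAIL": 2.0, "ID": 3.0}


def bio_tags(tokens, spans):
    """Turn entity spans into one BIO tag per token.

    spans: list of (start_token, end_token_exclusive, label).
    The first token of an entity gets B-<label>, the rest get I-<label>,
    and every token outside an entity gets O.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for index in range(start + 1, end):
            tags[index] = f"I-{label}"
    return tags


def risk_score(tags, weights=RISK_WEIGHTS):
    """Sum the weight of each entity, counting an entity once at its B- tag."""
    return sum(weights.get(tag[2:], 0.0) for tag in tags if tag.startswith("B-"))


def risk_level(score, threshold=3.0):
    """Binary risk class; the threshold is an assumed value for this sketch."""
    return "High" if score >= threshold else "Low"


if __name__ == "__main__":
    tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com"]
    spans = [(1, 3, "NAME"), (4, 5, "EMAIL")]
    tags = bio_tags(tokens, spans)
    score = risk_score(tags)
    print(list(zip(tokens, tags)))
    print(score, risk_level(score))
```

Counting each entity once at its B- tag keeps a long multi-token name from inflating the score relative to a single-token identifier.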

Python
Natural Language Processing
Cybersecurity
Logistic Regression
TF-IDF