Projects | Marco Sburlino

Full App

Koinkat

Koinkat is a local-first, multi-currency personal finance manager built as a native desktop app. It connects to European banks via PSD2, also supports manually maintained accounts, tracks net worth across currencies at daily FX rates, and categorizes transactions with a learning rule engine. All data is stored on-device, with no cloud, no user accounts, and no telemetry.
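As a rough illustration of what tracking net worth across currencies at daily FX rates involves, here is a minimal Python sketch. It is not Koinkat's actual code: the function name `net_worth_in_eur`, the `DAILY_RATES_TO_EUR` table, and the rates in it are all made up for the example.

```python
from datetime import date
from decimal import Decimal

# Illustrative rates only (not real market data):
# the EUR value of one unit of each currency on a given day.
DAILY_RATES_TO_EUR = {
    date(2024, 1, 2): {
        "EUR": Decimal("1"),
        "USD": Decimal("0.92"),
        "DKK": Decimal("0.134"),
    },
}


def net_worth_in_eur(balances, on_day, rates=DAILY_RATES_TO_EUR):
    """Convert every balance to EUR at that day's rate and add them up."""
    day_rates = rates[on_day]  # raises KeyError if no rates exist for that day
    return sum(amount * day_rates[currency] for currency, amount in balances.items())


# Hypothetical account balances, keyed by currency code.
balances = {"EUR": Decimal("1000"), "USD": Decimal("200"), "DKK": Decimal("1500")}
total = net_worth_in_eur(balances, date(2024, 1, 2))
print(total)
```

`Decimal` is used instead of `float` so that money amounts do not pick up binary rounding errors.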

Full App

Speakrow.com - Your Linguistic Reach

Speakrow.com is a web app that shows what share of the world's population you can communicate with based on the languages you speak. You select the languages you know, and the app instantly updates an interactive world map to show where you can communicate. It also calculates your overall global coverage, provides a survival score, and breaks down your reach by continent with simple summary statistics.

Paper

Forecasting Daily Electricity Load in Denmark: A Comparative Analysis with Renewable Energy Integration

This report outlines my university project for the Predictive Analytics course at Copenhagen Business School. The study aimed to forecast daily electricity load in Denmark by comparing traditional time series approaches with models that integrate renewable energy generation data. The project used a five-year dataset (2016–2020) from the Open Power System Data platform. Exploratory data analysis identified strong weekly seasonality and structural breaks, and stationarity testing showed that first-order differencing was needed. The core analysis involved training and evaluating three primary modeling techniques: a Seasonal Naive baseline, standard and seasonal ARIMA models, and a Dynamic Regression model incorporating wind and solar generation variables. The models were validated with Ljung-Box diagnostic tests and compared across different forecast horizons. Auto-ARIMA performed best for short-term predictions, while the Dynamic Regression model was more accurate for longer 30-day forecasts because it captured weather-driven demand fluctuations.
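For readers unfamiliar with the Seasonal Naive baseline, a minimal Python sketch follows. It assumes a weekly period of 7 days, in line with the weekly seasonality noted above. The function names and the toy numbers are illustrative and are not taken from the report.

```python
def seasonal_naive(history, horizon, period=7):
    """Forecast each future day with the value observed one season earlier.

    With period=7, next Monday is forecast with last Monday's load, and so on.
    Beyond one season, the last observed season is repeated.
    """
    if len(history) < period:
        raise ValueError("need at least one full season of history")
    last_season = history[-period:]
    return [last_season[step % period] for step in range(horizon)]


def mean_absolute_error(actual, forecast):
    """Average absolute gap between observed and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


# Toy daily load values (two weeks), not real data.
history = [100, 102, 101, 103, 104, 90, 88,
           101, 103, 102, 104, 105, 91, 89]
forecast = seasonal_naive(history, horizon=10)
error = mean_absolute_error([102, 103, 102], forecast[:3])
print(forecast, error)
```

Any more elaborate model, such as ARIMA or a dynamic regression, has to beat this baseline on the same horizon to justify its extra complexity.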

Paper

Modeling Traffic Delay Severity Caused by Accidents: A Machine Learning Approach

This project investigates the use of machine learning to classify the severity of traffic delays caused by roadway accidents based on features available at the time of the incident. The problem addressed is the need for timely identification of high-impact events to support traffic management and routing decisions. The research question is how well accident-related traffic delay severity can be predicted from real-time features, with a focus on minimizing false negatives for high-severity cases. Concepts applied include supervised classification, class balancing, feature engineering, and model validation. The analysis is based on the US Accidents dataset containing over 7.7 million records, which was cleaned, binarized, balanced, and used to train four models. Histogram-Based Gradient Boosting (HGBoost) achieved the highest recall at 0.79, outperforming Random Forest, Logistic Regression, and Multilayer Perceptron, which showed higher accuracy but lower sensitivity to severe cases. These results suggest that HGBoost is best suited for applications where identifying high-severity delays is the priority. It is the recommended model when recall is the primary objective and training efficiency also matters.
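The trade-off between accuracy and recall mentioned above can be made concrete with a small Python sketch. The labels and predictions below are invented for illustration and are not results from the project. Here `1` stands for a high-severity delay.

```python
def recall_and_accuracy(y_true, y_pred, positive=1):
    """Return (recall for the positive class, overall accuracy)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return recall, correct / len(pairs)


# Invented example: 4 severe cases (1) and 6 non-severe cases (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cautious = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]      # flags many cases as severe
conservative = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # rarely flags a case as severe

cautious_scores = recall_and_accuracy(y_true, cautious)
conservative_scores = recall_and_accuracy(y_true, conservative)
print(cautious_scores, conservative_scores)
```

In this toy case the second set of predictions is more accurate overall yet misses three of the four severe cases, which is why a model can lead on accuracy and still be the wrong choice when false negatives are costly.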

Paper

PII Masking and Risk Assessment in Unstructured Text: An NLP-Based Approach

This report presents a university project for our Natural Language Processing and Text Analytics course at Copenhagen Business School, completed by our team: Eduard Aguado, Federico De Marinis, Noe Juarez, and Marco Sburlino. Our objective was to develop an automated system for detecting Personally Identifiable Information (PII) in unstructured text and classifying documents by privacy risk level to support GDPR-compliant data handling. Using the PII Masking 300K dataset, we designed an end-to-end pipeline that included data cleaning, exploratory analysis, BIO-based token labeling, weighted risk scoring, and binary risk classification (Low vs. High). We implemented and evaluated multiple approaches: TF-IDF with Logistic Regression and Multinomial Naive Bayes as baselines, a fine-tuned DistilBERT model for contextual token classification, and GPT-3.5 via API for zero-shot PII extraction. Models were compared using recall, F1-score, and accuracy, with particular emphasis on minimizing false negatives in high-risk cases. DistilBERT achieved the strongest overall performance, demonstrating the advantage of transformer-based architectures for context-aware PII detection and privacy risk assessment in text.
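To illustrate two of the pipeline steps named above, BIO-based token labeling and weighted risk scoring, here is a minimal Python sketch. The entity labels, the weights in `RISK_WEIGHTS`, and the threshold are hypothetical placeholders. They do not reproduce the dataset's label set or the values used in the report.

```python
def bio_tags(tokens, spans):
    """Label tokens with the BIO scheme.

    spans: (start, end_exclusive, label) triples over token indices.
    The first token of an entity gets B-<label>, the rest I-<label>,
    and every token outside an entity gets O.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical weights: more sensitive PII types count more.
RISK_WEIGHTS = {"NAME": 1, "EMAIL": 2, "SSN": 5}


def risk_score(tags, weights=RISK_WEIGHTS):
    """Add one weight per detected entity (counted at its B- tag)."""
    return sum(weights.get(tag[2:], 0) for tag in tags if tag.startswith("B-"))


def risk_level(score, threshold=5):
    """Binary risk class; the threshold is an assumed value."""
    return "High" if score >= threshold else "Low"


tokens = ["Contact", "Anna", "Berg", "at", "anna@example.com"]
tags = bio_tags(tokens, [(1, 3, "NAME"), (4, 5, "EMAIL")])
score = risk_score(tags)
print(tags, score, risk_level(score))
```

Counting each entity once at its `B-` tag keeps a multi-token name from being weighted more heavily than a single-token one.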

Python

Natural Language Processing