
This report presents a university project for the Natural Language Processing and Text Analytics course at Copenhagen Business School, completed by Eduard Aguado, Federico De Marinis, Noe Juarez, and Marco Sburlino. Our objective was to build an automated system that detects Personally Identifiable Information (PII) in unstructured text and classifies documents by privacy risk level, in support of GDPR-compliant data handling. Using the PII Masking 300K dataset, we designed an end-to-end pipeline covering data cleaning, exploratory analysis, BIO-based token labeling, weighted risk scoring, and binary risk classification (Low vs. High). We implemented and evaluated three families of approaches: TF-IDF features with Logistic Regression and Multinomial Naive Bayes as baselines, a fine-tuned DistilBERT model for contextual token classification, and GPT-3.5 via API for zero-shot PII extraction. We compared the models on recall, F1-score, and accuracy, placing particular emphasis on minimizing false negatives in high-risk cases, since a missed high-risk document is the costlier error for privacy compliance. DistilBERT achieved the strongest overall performance, demonstrating the advantage of transformer-based architectures for context-aware PII detection and privacy risk assessment in text.
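The labeling and scoring steps of the pipeline can be illustrated with a minimal sketch. It assumes entity spans are given as token indices, and that a document's risk score is the sum of per-entity weights compared against a fixed threshold. The label names, the weights in `RISK_WEIGHTS`, and the value of `HIGH_RISK_THRESHOLD` are hypothetical placeholders; the summary above does not state the values used in the project.

```python
from typing import Dict, List, Tuple

# Hypothetical per-entity weights (assumption, not the project's actual values).
RISK_WEIGHTS: Dict[str, float] = {"NAME": 1.0, "EMAIL": 2.0, "PHONE": 2.0, "SSN": 5.0}
# Hypothetical cut-off between Low and High risk (assumption).
HIGH_RISK_THRESHOLD = 3.0


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, label) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def risk_score(tags: List[str]) -> float:
    """Sum the weight of each entity, counted once at its B- tag."""
    return sum(RISK_WEIGHTS.get(tag[2:], 0.0) for tag in tags if tag.startswith("B-"))


def risk_level(tags: List[str]) -> str:
    """Map the weighted score to the binary Low/High label."""
    return "High" if risk_score(tags) >= HIGH_RISK_THRESHOLD else "Low"


tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com"]
spans = [(1, 3, "NAME"), (4, 5, "EMAIL")]
tags = bio_tags(tokens, spans)
print(tags, risk_score(tags), risk_level(tags))
```

Counting each entity once at its `B-` tag keeps a long multi-token name from outweighing a single-token identifier. Other aggregation schemes, such as counting every tagged token or capping repeated entity types, are equally plausible.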