Paper

This report presents a university project for our Natural Language Processing and Text Analytics course at Copenhagen Business School, completed by our team: Eduard Aguado, Federico De Marinis, Noe Juarez, and Marco Sburlino. Our objective was to develop an automated system for detecting Personally Identifiable Information (PII) in unstructured text and classifying documents by privacy risk level to support GDPR-compliant data handling. Using the PII Masking 300K dataset, we designed an end-to-end pipeline that included data cleaning, exploratory analysis, BIO-based token labeling, weighted risk scoring, and binary risk classification (Low vs. High). We implemented and evaluated multiple approaches: TF-IDF with Logistic Regression and Multinomial Naive Bayes as baselines, a fine-tuned DistilBERT model for contextual token classification, and GPT-3.5 via API for zero-shot PII extraction. Models were compared using recall, F1-score, and accuracy, with particular emphasis on minimizing false negatives in high-risk cases. DistilBERT achieved the strongest overall performance, demonstrating the advantage of transformer-based architectures for context-aware PII detection and privacy risk assessment in text.
| Abbreviation | Definition |
|---|---|
| PII | Personally Identifiable Information |
| BIO | Beginning-Inside-Outside tagging scheme |
| NER | Named Entity Recognition |
| GDPR | General Data Protection Regulation |
| TF-IDF | Term Frequency–Inverse Document Frequency |
| BERT | Bidirectional Encoder Representations from Transformers |
| GPT | Generative Pre-trained Transformer |
| API | Application Programming Interface |
Cybersecurity is a growing concern as organizations increasingly rely on digital systems to store and process sensitive information. Cybercrime is projected to cause global damages exceeding $9 trillion in 2024, making it one of the most pressing global risks (World Economic Forum 2023). The growing use of unstructured text such as emails, support tickets, and chat logs has further intensified the risk of Personally Identifiable Information (PII) exposure, raising serious privacy concerns, including risks of identity theft, financial loss, discrimination, and reputational harm. One example is the cyberattack on the UK’s Legal Aid Agency, which compromised over two million sensitive records and highlighted the urgent need for more effective protection measures (“Fraud and Extortion Risk After Cyberattack on Legal Aid Agency” 2025).
To address this, automated PII masking techniques have gained importance. These methods aim to detect and redact sensitive information before it is stored or shared. Traditional rule-based approaches offer limited flexibility and often fail in multilingual or ambiguous contexts. Machine learning and Natural Language Processing methods, particularly Named Entity Recognition (NER), offer more adaptive and scalable solutions by learning to identify PII from labeled data.
This project investigates the use of machine learning and natural language processing models to classify documents by privacy risk based on the PII found in multilingual text. The main objectives are to train a sequence labeling model using BIO (Beginning-Inside-Outside) tagging for PII detection and to develop a document-level risk classifier using features derived from the detected entities. This study examines how accurately models can detect and assess the privacy risk of unstructured text containing PII.
Despite these regulatory safeguards, PII masking techniques have become increasingly important. The work of Kulkarni Poornima and N. K. Cauvery addresses the challenges of detecting PII in large volumes of unstructured text, proposing a hybrid unsupervised model called C-PIIM to enhance PII protection (Kulkarni and K 2021). Their findings indicate that personal emails contain the highest concentration of PII, followed by work emails, with email headers containing more PII than message bodies due to metadata such as sender and recipient information. The proposed model outperforms traditional hierarchical clustering methods in clustering quality but is limited to detecting direct identifiers, excluding indirect ones.
Additionally, applications such as NVIDIA's Morpheus have demonstrated the potential of NLP for detecting different categories of sensitive information (such as credit card numbers, passwords, and user IDs) and for supporting real-time threat detection. Morpheus uses AI and GPU acceleration to inspect network traffic with minimal delay. Moreover, Mohammedi (2023) explores several NLP-based text anonymization methods, including suppression, pseudonymization, noising, and generalization, with the aim of protecting PII and achieving GDPR compliance. The study emphasizes the use of Microsoft’s open-source NLP tool, Presidio, for automated text redaction and anonymization, and shows that combining these methods improves PII protection while preserving data usefulness (Mohammedi 2023).
BIO tagging is a widely adopted scheme in natural language processing
for identifying both the type and span of entities within text
sequences. In the context of privacy protection, it is particularly
effective for detecting PII in unstructured text, enabling accurate
masking or redaction. The scheme assigns each token a label indicating
its position within an entity: B (beginning) for the first
token, I (inside) for subsequent tokens, and O
(outside) for tokens not part of any entity.
| Token | BIO Label | Masked Output |
|---|---|---|
| My | O | My |
| name | O | name |
| is | O | is |
| John | B-PER | [NAME] |
| Doe | I-PER | [NAME] |
| and | O | and |
| my | O | my |
| email | O | email |
| is | O | is |
| john.doe | B-EMAIL | [EMAIL] |
| @ | I-EMAIL | [EMAIL] |
| gmail.com | I-EMAIL | [EMAIL] |
| . | O | . |
Table 1 presents an example of a BIO-tagged sentence and its corresponding masked output. Each PII sub-token is labeled based on its position within the entity (beginning or inside) followed by the entity type. After labeling, a masked version of the sentence is generated, in which detected PII tokens are replaced with bracketed entity types, while non-PII tokens are preserved. This masked output is useful for producing anonymized text where PII is systematically redacted.
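The masking step illustrated in Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the project's implementation: the `PLACEHOLDERS` mapping and the `mask_tokens` helper are hypothetical names, and each PII token is replaced individually, mirroring the per-token layout of the table.

```python
# Hypothetical mapping from entity types to mask placeholders (mirrors Table 1).
PLACEHOLDERS = {"PER": "[NAME]", "EMAIL": "[EMAIL]"}


def mask_tokens(tokens, bio_labels):
    """Replace every token tagged B-* or I-* with its entity placeholder."""
    if len(tokens) != len(bio_labels):
        raise ValueError("tokens and labels must have the same length")
    masked = []
    for token, label in zip(tokens, bio_labels):
        if label == "O":
            masked.append(token)  # non-PII tokens are kept unchanged
        else:
            entity_type = label.split("-", 1)[1]  # "B-PER" -> "PER"
            # Fall back to the bracketed type name for unmapped entity types.
            masked.append(PLACEHOLDERS.get(entity_type, f"[{entity_type}]"))
    return masked


tokens = ["My", "name", "is", "John", "Doe", "and", "my", "email", "is",
          "john.doe", "@", "gmail.com", "."]
labels = ["O", "O", "O", "B-PER", "I-PER", "O", "O", "O", "O",
          "B-EMAIL", "I-EMAIL", "I-EMAIL", "O"]
masked = mask_tokens(tokens, labels)
print(" ".join(masked))
# My name is [NAME] [NAME] and my email is [EMAIL] [EMAIL] [EMAIL] .
```

A production system might instead collapse each contiguous entity span into a single placeholder; the per-token variant is shown here only because it matches the table.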
This project aims to address two tasks. The first involves assigning a risk level, from 1 (low) to 3 (high), to each observation based on the frequency and sensitivity of the detected PII entities on the masked text. Weights were defined for each entity type according to its practical privacy relevance, with more sensitive entities such as passwords receiving higher weights than less critical ones like usernames. These weighted frequencies were aggregated to compute a cumulative score for each document, which was then mapped to a discrete risk category using defined thresholds.
The second task focuses on predicting the assigned risk level using the original version of the text. The objective is to train a model to infer privacy risk from anonymized content, simulating a real-world deployment scenario where raw data is inaccessible. Since the test set was labeled during the first task, it was used to evaluate the model’s ability to predict risk levels accurately.
Customer support platforms routinely handle vast volumes of unstructured textual data in the form of support tickets, chat transcripts, and complaint emails. These documents often contain sensitive user details, such as names, addresses, phone numbers, IDs, or payment data, which must be handled with strict privacy safeguards to comply with data protection laws such as the General Data Protection Regulation (GDPR).
In this context, our system provides a practical solution: it automatically detects Personally Identifiable Information (PII) and classifies the associated document’s privacy risk level as Low-Risk or High-Risk. This allows organizations to:
- Automatically redact or mask sensitive information before storage or sharing.
- Assign higher-risk cases to specialized privacy-aware workflows.
- Reduce legal exposure and improve customer trust by safeguarding data in real time.
This use case exemplifies how NLP can be applied to automated compliance, data minimization, and secure document handling in live enterprise settings.
The dataset used in this project is the PII Masking 300K dataset introduced by AI4Privacy (2023). It is designed for training and evaluating models in the task of detecting and masking Personally Identifiable Information (PII) in text. The dataset comprises 225,405 annotated text samples spanning six languages (English, French, German, Italian, Dutch, and Spanish), with localized content across eight jurisdictions. Each entry consists of synthetic or semi-synthetic text generated using proprietary algorithms, ensuring no privacy violations.
The dataset is divided into two sub-corpora: OpenPII-220K, which contains 27 general PII types such as names, emails, phone numbers, IDs, and passwords, and FinPII-80K, which includes approximately 20 additional types specific to financial and insurance domains. It comprises over 30 million tokens, with around 7.6 million labeled as PII. The annotations were validated through a human-in-the-loop process, achieving a token-level accuracy of 98.3% on a manually reviewed sample. A predefined training/test split of 78.8% and 21.2% is provided.
As seen in Figure 1, each instance contains the following
fields: source_text, which represents the original unmasked
input; target_text, which contains the PII-masked version
of the text; and privacy_mask and span_labels,
which specify the locations and categories of the identified PII spans.
The field mbert_text_tokens provides the tokenized version
of the input text, aligned with the multilingual BERT tokenizer, while
mbert_bio_labels contains the corresponding BIO-format
annotations used for sequence labeling. Each example also includes a
unique id and a language tag to support
multilingual training and evaluation.
Although the raw data initially appeared clean, a sanity check was
performed to verify the absence of missing values and exact duplicates.
To conduct this, the training and test datasets were merged and
examined, confirming that no such issues were present. A second check
was then carried out to detect empty strings in the
source_text field, revealing 25 instances, which were
subsequently removed. Following these steps, the dataset was deemed
fully cleaned and ready for preprocessing.
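The sanity checks described above could look roughly like the following sketch, which uses plain Python dictionaries in place of the actual dataset objects. The `sanity_check` helper and the toy records are hypothetical, and treating whitespace-only strings as empty is an assumption not stated in the text.

```python
import json


def sanity_check(records):
    """Count missing values and exact duplicates, and drop empty source_text rows."""
    # A record counts as "missing" if any of its fields is None.
    n_missing = sum(1 for r in records if any(v is None for v in r.values()))

    # Exact duplicates are detected by comparing a canonical JSON serialization.
    seen, n_duplicates = set(), 0
    for r in records:
        key = json.dumps(r, sort_keys=True)
        if key in seen:
            n_duplicates += 1
        seen.add(key)

    # Assumption: whitespace-only strings are treated like empty strings.
    cleaned = [r for r in records if (r.get("source_text") or "").strip()]
    return n_missing, n_duplicates, cleaned


# Toy stand-ins for the merged training and test splits.
train = [{"id": 1, "source_text": "Call me at 555-0100."},
         {"id": 2, "source_text": ""}]
test = [{"id": 3, "source_text": "My username is jdoe."}]

n_missing, n_duplicates, cleaned = sanity_check(train + test)
print(n_missing, n_duplicates, len(cleaned))
```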
First, to understand the structure and composition of our data, an initial exploratory analysis was conducted. Since the dataset contains text in multiple languages, the language distribution was examined to identify the most predominant language and assess balance across categories. While the overall distribution was relatively even, the models were restricted to English-language texts to ensure consistency and reduce complexity during preprocessing and modeling.
Secondly, to analyze the structure and distribution of PII types,
each category was plotted to assess its frequency. As shown in Figure 2, the most common entities were
time, username, and email, which
are generally considered lower risk. In the mid-frequency range (7,500
to 10,000 occurrences), 17 other PII types appear, including higher-risk
entities such as passport and IP. The least
frequent PII categories were typically the most sensitive, such as
password and cardissuer.
Finally, to assess PII density at the document level, a histogram was plotted showing the number of PII entities per document. As illustrated in Figure 3, the distribution is right-skewed, with most documents containing fewer than 10 PII instances and a long tail extending up to 35 entities.
Accurate tokenization and label alignment are essential for sequence tagging tasks (Lample et al. 2016), and the NER (Named Entity Recognition) pipeline provides a structured approach to achieve this. First, the raw text was cleaned to ensure consistency across the corpus and prepare it for modeling by applying standard NLP preprocessing steps: lowercasing, trimming whitespace, and normalizing punctuation. After cleaning, each document was split on whitespace, and common English stopwords were removed to focus on more relevant words.
Furthermore, to reflect the importance of each PII type in a document, each type was assigned a weight according to its risk level, and its frequency was taken into account through a log-scaled inverse-frequency formula. Weights range from 1 to 3, with 3 denoting the highest risk. This ensures that a text containing passwords and IP addresses contributes more to the document’s overall risk level than one containing only email addresses and names. The weights are used to compute a numeric risk score for each document by summing the weighted PII counts.
As a final preprocessing step, each document’s risk score is converted into a category, "High" or "Low", reflecting the privacy risk the document carries and serving as the target for the subsequent supervised classification. To ensure class balance, the cut-off was set at the 50th percentile, so that a document is labeled according to whether its risk score falls above or below the median. The resulting class distributions for both training and test sets are shown in Figure 4.
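The scoring and labeling logic can be illustrated with a short sketch. The weights in `RISK_WEIGHTS` are hypothetical values on the 1 to 3 scale and are not the project's actual weights; the log-scaled inverse-frequency adjustment is omitted because its exact form is not given, and assigning scores equal to the median to "Low" is likewise an assumption.

```python
from statistics import median

# Hypothetical weights on the 1-3 scale; not the project's actual values.
RISK_WEIGHTS = {"PASSWORD": 3, "IP": 3, "EMAIL": 1, "USERNAME": 1}


def risk_score(pii_counts, weights=RISK_WEIGHTS, default_weight=1):
    """Sum the weighted PII counts of one document."""
    return sum(weights.get(pii_type, default_weight) * count
               for pii_type, count in pii_counts.items())


def label_by_median(scores):
    """Label a score 'High' if it lies strictly above the median, else 'Low'."""
    cutoff = median(scores)
    return ["High" if score > cutoff else "Low" for score in scores]


# Per-document PII counts (toy data).
documents = [
    {"PASSWORD": 1, "IP": 1},
    {"EMAIL": 2},
    {"USERNAME": 1},
    {"PASSWORD": 2, "EMAIL": 1},
]
scores = [risk_score(doc) for doc in documents]
risk_labels = label_by_median(scores)
print(scores)       # [6, 2, 1, 7]
print(risk_labels)  # ['High', 'Low', 'Low', 'High']
```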
Once the dataset has been preprocessed and risk labels assigned, an
additional step is required to prepare it for the second task, which
involves predicting risk levels on test data using the non-masked text.
Specifically, tokens must be lemmatized to reduce each word to its base
form (e.g., "running" becomes "run") and then converted into clean,
space-separated strings. To do so, the small English spaCy
model was used. This process produces a more compact and consistent
vocabulary, enabling more effective model training and improving
classification performance (Manning et al. 2008).
After cleaning and preprocessing the dataset, the next step was to select, train, and evaluate classification models. Three approaches were applied. First, two binary classification models using TF-IDF features were implemented: Logistic Regression and Multinomial Naive Bayes. Second, a pre-trained transformer-based language model (BERT) was used to enhance the model’s contextual understanding. Finally, the OpenAI GPT-3.5 API was employed to identify PII in the text, and its performance was compared to the previous models.
For the first modeling approach, TF-IDF (Term Frequency–Inverse
Document Frequency) was chosen, as it evaluates the importance of a word
in a document relative to the entire corpus, unlike the Bag-of-Words
model (Manning et al. 2008). This
is essential for risk identification, as it reduces the influence of
generic language while emphasizing terms that may indicate the presence
of sensitive information. TfidfVectorizer was used to
extract features with a specific configuration. The
ngram_range was set to (1, 3) to capture unigrams, bigrams,
and trigrams, modeling both individual terms and short sequences.
max_df=0.9 was used to remove very common terms appearing
in over 90% of documents. max_features was set to
5,000 to retain the most informative terms and reduce
dimensionality. Finally, sublinear_tf=True applied
logarithmic scaling to mitigate the dominance of high-frequency
terms.
With this setup, logistic regression was selected as the first classification model. A machine learning pipeline was implemented to integrate the TF-IDF feature extraction step with the logistic regression classifier. The model was trained on the lemmatized text, with TF-IDF converting the input into weighted feature vectors, and logistic regression used to predict the risk level. Finally, L1 regularization was applied to promote sparsity and reduce overfitting, which is particularly important when working with high-dimensional text data. The second model using the TF-IDF approach was Multinomial Naive Bayes (MNB), a probabilistic classifier that assumes all features (e.g., word occurrences) are conditionally independent given the class (Manning et al. 2008). The model was configured with a small smoothing parameter to increase sensitivity to subtle differences in word frequency.
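A minimal sketch of the TF-IDF and logistic regression pipeline, using the vectorizer settings reported above, is given below. The `liblinear` solver is an assumption (scikit-learn requires a solver that supports the L1 penalty, and the one actually used is not stated), the regularization strength is left at its default, and the toy texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # Vectorizer settings as reported in the text.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), max_df=0.9,
                              max_features=5000, sublinear_tf=True)),
    # Assumption: liblinear is one of the solvers that supports the L1 penalty.
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Invented, already-lemmatized toy documents with their risk labels.
texts = [
    "password reset request ip address login",
    "credit card number passport detail",
    "password leak ip address server",
    "meeting schedule time username",
    "newsletter email time update",
    "username email greeting time",
]
labels = ["High", "High", "High", "Low", "Low", "Low"]

pipeline.fit(texts, labels)
predictions = pipeline.predict(texts)
print(list(predictions))
```

Wrapping both steps in a single `Pipeline` keeps the vocabulary fitted on the training data only, which avoids leaking test-set statistics into the features.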
BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google that processes input text contextually (Devlin et al. 2019). It captures the meaning of each token based on its surrounding context, for example, distinguishing between “bank” in “river bank” and “bank account.” This is achieved through its bidirectional architecture, which considers both left and right context simultaneously, making BERT particularly effective for classification tasks. As a result, fine-tuned BERT models often outperform traditional approaches such as logistic regression or Naive Bayes (Devlin et al. 2019).
DistilBERT was selected as the model for the BERT-based approach. As Sanh et al. (2019) describe, DistilBERT is a compact version of BERT that retains most of its performance while being significantly smaller and faster. The objective of the model is to assign BIO labels to individual tokens, rather than to classify entire documents. After token-level prediction, the "B-" and "I-" prefixes are stripped from the PII entities, and risk scores are computed based on predefined risk weights. These scores are then converted into "Low" or "High" risk categories, enabling document-level classification.
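The step from token-level predictions to per-type entity counts can be sketched as follows. The `entity_counts` helper is hypothetical; it counts one entity per contiguous span, which is an assumption, since the text does not state whether entities or individual tokens were counted.

```python
from collections import Counter


def entity_counts(bio_labels):
    """Count entities per type after stripping the B-/I- prefixes."""
    counts = Counter()
    previous_type = None
    for label in bio_labels:
        if label == "O":
            previous_type = None
            continue
        prefix, entity_type = label.split("-", 1)
        # A new entity starts at a B- tag, or at an I- tag that does not
        # continue an entity of the same type (a stray prediction).
        if prefix == "B" or entity_type != previous_type:
            counts[entity_type] += 1
        previous_type = entity_type
    return counts


predicted = ["O", "B-PER", "I-PER", "O", "B-EMAIL", "I-EMAIL", "I-EMAIL", "O", "B-PER"]
counts = entity_counts(predicted)
print(dict(counts))  # {'PER': 2, 'EMAIL': 1}
```

The resulting counts have the same shape as the inputs of the weighted risk score, so the document-level label follows from the same weights and threshold.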
To set up the model, the input text was first tokenized and aligned
with the original BIO labels, as DistilBERT (like all BERT-based models)
operates on subword tokens and requires token-level supervision. Second,
the TrainingArguments class from the
transformers library was used to define the training
configuration. These settings included evaluation at the end of each
epoch, a low learning rate with logging every 50 steps to monitor
training progress, and a weight_decay of 0.01, serving as
regularization to prevent overfitting. Finally, the Trainer
class, also from the transformers library, was initialized
to handle the training loop, model evaluation, and checkpoint
management.
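The label alignment step can be illustrated without loading a model. The sketch below follows a common convention for subword tokenizers, not necessarily the one used in this project: only the first subword of each word keeps the word's label, while special tokens and continuation subwords receive `-100`, the default ignore index of PyTorch's cross-entropy loss. The `word_ids` list stands in for the output of a fast tokenizer's `word_ids()` method; `align_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # skipped by PyTorch's cross-entropy loss by default


def align_labels(word_ids, word_label_ids):
    """Map word-level label ids onto subword tokens."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as [CLS] and [SEP]
        elif word_id != previous_word:
            aligned.append(word_label_ids[word_id])  # first subword keeps the label
        else:
            aligned.append(IGNORE_INDEX)  # later subwords of the same word are ignored
        previous_word = word_id
    return aligned


# Toy input: [CLS], word 0, word 1 split into two subwords, word 2, [SEP].
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [0, 1, 2]  # one integer label id per original word
aligned = align_labels(word_ids, word_label_ids)
print(aligned)  # [-100, 0, 1, -100, 2, -100]
```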
To evaluate the potential of a Large Language Model for the task of PII detection and privacy risk assessment, the GPT-3.5 model developed by OpenAI was integrated into the pipeline. The rationale behind this choice was based on the model’s ability to interpret natural language prompts in a zero-shot setting and produce structured responses. This makes it particularly suitable for identifying entities in text without the need for supervised fine-tuning.
Among the available large language models, GPT-3.5 was selected due to its reliable performance in comparable use cases, the availability of extensive documentation, and ease of access through the OpenAI API. Compared to more advanced models such as GPT-4, GPT-3.5 represents a cost-effective alternative with sufficient capabilities for the needs of this project (pricing details are given in the footnote at the end of the document). At the time of use, the API usage involved a small monetary cost, which was considered acceptable within the scope of the assignment.
To access the model, an OpenAI account was created and an API key was configured in the environment. The implementation relied on the use of the official Python package provided by OpenAI, which enabled programmatic access to the model as part of the processing pipeline.
Each document in the test set was submitted to the API through a prompt requesting the extraction of all PII types present in the text. The model was instructed to return the output in the form of a JSON dictionary, where each key represented a PII type (such as EMAIL, PASSWORD, or USERNAME), and the corresponding value indicated the number of times it occurred in the input. These PII types were consistent with those used in the earlier rule-based counting phase.
This approach enabled the extraction of PII information in a structured format that was directly compatible with the existing risk computation framework. Once the model returned the count of each PII type per document, these counts were used as inputs to the predefined risk scoring formula, which applied specific weights to each PII category based on its sensitivity. The resulting numeric score was then used to assign the binary risk label – either high or low – following the same thresholding strategy applied in the rule-based method. This allowed for a direct comparison of the LLM-driven classification results with those obtained from the traditional pipeline.
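Because model output is free text, the JSON returned by the API benefits from defensive parsing before it enters the risk computation. The following sketch shows one way to do this; `parse_pii_counts`, the `ALLOWED_TYPES` subset, and the example responses are all hypothetical, and the API call itself is deliberately left out.

```python
import json

# Hypothetical subset of the PII types accepted by the risk scoring step.
ALLOWED_TYPES = {"EMAIL", "PASSWORD", "USERNAME"}


def parse_pii_counts(raw_response, allowed_types=ALLOWED_TYPES):
    """Turn a model response into a {PII_TYPE: count} dictionary."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return {}  # the model did not return valid JSON
    if not isinstance(data, dict):
        return {}
    counts = {}
    for key, value in data.items():
        pii_type = str(key).strip().upper()
        # Keep only known types with positive integer counts (bool is excluded).
        is_count = isinstance(value, int) and not isinstance(value, bool)
        if pii_type in allowed_types and is_count and value > 0:
            counts[pii_type] = value
    return counts


parsed = parse_pii_counts('{"email": 2, "PASSWORD": 1, "FAVORITE_COLOR": 3}')
print(parsed)  # {'EMAIL': 2, 'PASSWORD': 1}
print(parse_pii_counts("Sorry, I cannot help with that."))  # {}
```

Returning an empty dictionary for malformed responses treats such documents as containing no detected PII; in a recall-oriented setting, retrying the request or flagging the document for review may be preferable.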
Aligned with the use case outlined in Section 3, evaluation metrics were selected to prioritize the accurate identification of high-risk text. For the token-level NER task, particular emphasis was placed on minimizing false negatives, where text containing sensitive PII is not labeled as risky. Therefore, recall was chosen as the primary evaluation metric, as performance in this context depends more on minimizing false negatives than on avoiding false positives. Recall is defined as follows:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

where $TP$ is the number of true positives and $FN$ the number of false negatives.
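Given the trade-off between precision and recall, where increasing recall may lead to more false positives, the F1-score was used to monitor the balance between these metrics. Although the primary focus is on minimizing false negatives, it is also important to control false positives, particularly in the document-level classification task, where precision helps reduce unnecessary escalations while ensuring that sensitive cases are correctly identified. Lastly, accuracy was included to assess the overall correctness of the model’s predictions. These metrics are defined as follows:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.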
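As a worked illustration of these metrics, the sketch below computes all four from the cells of a binary confusion matrix. The `classification_metrics` helper and the counts are invented for the example and do not correspond to any result reported in this paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, F1-score, and accuracy from confusion counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}


# Invented counts: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# recall: 0.800
# precision: 0.889
# f1: 0.842
# accuracy: 0.850
```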
Table 2 summarizes the performance of four models across four key metrics. The DistilBERT model achieved the highest recall (0.92), clearly outperforming the others. It demonstrates a strong ability to correctly identify positive cases while also maintaining high overall accuracy (0.89), making it the most effective of the compared models at minimizing false negatives.
Logistic Regression and Multinomial Naive Bayes, both using TF-IDF features, showed consistent and well-balanced results, scoring 0.85 and 0.83, respectively, on every evaluation metric. These results indicate strong baseline performance, with Logistic Regression slightly outperforming Naive Bayes on all metrics, reflecting lower rates of both false positives and false negatives.
The OpenAI GPT model recorded the lowest performance overall, with a recall and F1-score of 0.77 and an accuracy of 0.78. While its precision is relatively high at 0.82, the model underperforms in capturing positive cases, which is critical when prioritizing sensitivity in high-risk classification.
| Model | Recall | F1-score | Accuracy | Precision |
|---|---|---|---|---|
| Logistic Regression (TF-IDF) | 0.85 | 0.85 | 0.85 | 0.85 |
| Multinomial Naive Bayes (TF-IDF) | 0.83 | 0.83 | 0.83 | 0.83 |
| DistilBERT (BERT) | 0.92 | 0.87 | 0.89 | 0.84 |
| OpenAI GPT-3.5 API | 0.77 | 0.77 | 0.78 | 0.82 |
In addition to performance metrics, the computational efficiency of each model was evaluated by measuring total execution time. The traditional TF-IDF models demonstrated the fastest runtimes, with Logistic Regression completing in 27.2 seconds and Naive Bayes in 43.7 seconds. These results are expected, as both models operate on sparse vectorized inputs without any contextual embedding or external dependencies.
DistilBERT, while achieving the best predictive performance, required substantially more time, completing in 4,832.9 seconds. This increased runtime is attributable to the computational demands of transformer-based architectures, which process inputs using multi-head attention mechanisms across multiple layers. Despite this, DistilBERT remains the best model for deployment in time-sensitive environments due to its balance of accuracy and efficiency.
In contrast, the GPT-3.5 API approach took 22,071.1 seconds – over 4.5 times longer than BERT. This extended runtime is largely due to the API-based setup, where latency arises from network requests, external processing on OpenAI’s servers, and serialization of results. Furthermore, the model processes each document individually in a conversational interface, contributing to the slower overall throughput. While informative for comparison purposes, the LLM-based method may not currently be suitable for large-scale, real-time deployment scenarios due to this processing overhead.
As discussed previously, DistilBERT demonstrated the strongest overall performance among the four models, achieving the highest recall, F1-score, and accuracy (Table 2). Table 3 presents a detailed breakdown of its performance by class. The model performs particularly well on high-risk cases, with a precision of 0.99 and a recall of 0.87, resulting in an F1-score of 0.92. This indicates that DistilBERT is highly effective at correctly identifying high-risk instances with minimal false positives, which aligns with the evaluation priorities outlined in Section 4.5. In contrast, performance on low-risk cases reflects a trade-off: while recall is very high (0.98), precision decreases to 0.70, yielding a lower F1-score of 0.81. As illustrated in the confusion matrix in Figure 5, this reduction in precision is attributable to a higher number of false positives in the low-risk class, likely influenced by the model’s emphasis on maximizing sensitivity to high-risk instances.
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| High-Risk (1) | 0.99 | 0.87 | 0.92 | 5773 |
| Low-Risk (0) | 0.70 | 0.98 | 0.81 | 1798 |
| Accuracy | - | - | 0.89 | 7571 |
| Macro avg | 0.84 | 0.92 | 0.87 | 7571 |
| Weighted avg | 0.92 | 0.89 | 0.90 | 7571 |
The test set contains 7,571 records, of which 5,773 are labeled as high-risk and 1,798 as low-risk. This class imbalance may influence the model’s performance, particularly by reducing precision for the minority class. The confusion matrix in Figure 5 illustrates the model’s predictions compared to the actual labels, revealing a moderate false negative rate. It correctly classifies 5,022 high-risk instances but also shows a notable number of false negatives (751), where high-risk cases were misclassified as low-risk. Conversely, the model performs exceptionally well on low-risk classifications, correctly identifying 1,763 instances while producing only 35 false positives. These outcomes are consistent with the classification report in Table 3, where the recall for low-risk cases is nearly perfect (0.98), while the recall for high-risk cases is slightly lower at 0.87.
Overfitting was a primary concern and was addressed by evaluating classification performance on both training and test sets. As shown in Table 4 in the Appendix, there is no significant gap between training and test performance for any of the models. To mitigate overfitting, regularization techniques were applied. For example, since logistic regression is prone to overfitting in high-dimensional text settings, L1 regularization (Lasso) was used to penalize large coefficients and improve generalization.
Throughout this study, several limitations were encountered that could impact the accuracy and generalizability of the results. First, for privacy reasons, the dataset consisted of synthetic data generated using proprietary algorithms, rather than real-world data. Although the data was manually reviewed and achieved a validation accuracy of approximately 98.3%, its synthetic nature may limit its applicability to real-world scenarios.
Additional challenges were observed in the TF-IDF-based models and the GPT-3.5 API model. The TF-IDF models (Logistic Regression and Multinomial Naive Bayes) do not capture semantic or sequential relationships between words; they rely on word frequency rather than meaning or context. As for the GPT-3.5 API, its integration posed practical limitations: the model operates as a black box, lacks fine-tuning capabilities, and incurs latency and cost constraints. Moreover, its responses may vary across calls, introducing potential inconsistencies in the results.
This study evaluated four models for binary classification of privacy risk in text. Logistic Regression and Multinomial Naive Bayes were implemented using TF-IDF features, while DistilBERT was used for contextual token classification. These were compared with the GPT-3.5 API used for external PII extraction. As explained in Section 5.1, while the TF-IDF-based models offered competitive performance and interpretability, DistilBERT emerged as the best-performing approach, achieving the highest values across most evaluation metrics (see Table 4 in the Appendix), likely due to its ability to capture bidirectional context in high-dimensional text. These results suggest that transformer-based models are better suited for nuanced PII detection and risk classification tasks, especially when semantic context plays a key role.
Regarding future work, given the strong performance achieved by the models, it is reasonable to extend the project to more practical applications. One potential direction is to deploy the risk classification pipeline as an API integrated into corporate platforms like email or Microsoft Teams. The API would process raw text input, classify PII using the trained model, and return a risk label, triggering actions such as anonymization, encryption, or redaction. This would broaden the applicability of the classification models and contribute to enhancing corporate privacy compliance.
| Metric | Logistic Regression (Train) | Logistic Regression (Test) | Naive Bayes (Train) | Naive Bayes (Test) | DistilBERT (Train) | DistilBERT (Test) |
|---|---|---|---|---|---|---|
| Accuracy | 0.85 | 0.85 | 0.83 | 0.83 | 0.90 | 0.89 |
| Precision | 0.85 | 0.85 | 0.83 | 0.83 | 0.85 | 0.84 |
| Recall | 0.85 | 0.85 | 0.83 | 0.83 | 0.92 | 0.92 |
| F1 Score | 0.85 | 0.85 | 0.83 | 0.83 | 0.87 | 0.87 |
As of May 2025, the API pricing for GPT-3.5 is $0.50 per 1M input tokens and $1.50 per 1M output tokens.