Repository for testing Personally Identifiable Information (PII) detection methods, including:
- Regex- and spaCy NER-based detection
- Proximity analysis (context-aware pattern matches)
- Graph-based analysis (clustering related PII)
- Deduplication across methods
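
To make the list above concrete, here is a minimal sketch of regex detection, a proximity check, and deduplication. It is not the repository's detector: the names `PATTERNS`, `CONTEXT`, `WINDOW`, `Finding`, `regex_findings`, and `deduplicate` are hypothetical, the patterns are simplified for illustration, and spaCy NER and graph clustering are left out so the example runs with the standard library alone.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified patterns; a real detector would need stricter ones.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CVV": re.compile(r"\b\d{3,4}\b"),
}

# Proximity analysis: ambiguous labels only count when a keyword appears nearby.
CONTEXT = {"CVV": ("cvv", "cvc", "security code")}
WINDOW = 16  # number of characters inspected before a match (assumed value)


@dataclass(frozen=True)
class Finding:
    label: str
    start: int
    end: int
    text: str
    method: str  # which detection method produced the finding


def regex_findings(text):
    """Return regex matches, dropping ambiguous ones that lack nearby context."""
    findings = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            keywords = CONTEXT.get(label)
            if keywords:
                before = text[max(0, m.start() - WINDOW):m.start()].lower()
                if not any(k in before for k in keywords):
                    continue  # bare digits without a supporting keyword
            findings.append(Finding(label, m.start(), m.end(), m.group(), "regex"))
    return findings


def deduplicate(findings):
    """Keep the first finding per (label, span), whatever method reported it."""
    seen = {}
    for f in findings:
        seen.setdefault((f.label, f.start, f.end), f)
    return sorted(seen.values(), key=lambda f: f.start)
```

In this sketch the context window is what keeps a port number or the digits inside an SSN from being reported as a CVV, and `deduplicate` would collapse a span reported by both the regex pass and a second method (for example NER) into a single finding.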
conda activate gensec
python -m pip install -q pytest spacy networkx matplotlib
python -m spacy download en_core_web_sm -q

- PII_testing/PII_Logging_2.ipynb: Enhanced detector demo notebook
- tests/test_pii_logging.py: Pytests for detector behavior
- test_data/: Sample input files (logs, emails) for manual/CLI testing
From the Homeworks directory:
conda activate gensec
pytest -q

Use the ready-made inputs under test_data/:
- test_data/log_with_pii.txt: Contains email, phone, card, CVV, SSN
- test_data/log_without_pii.txt: Clean operational logs
- test_data/email_with_pii.eml: Names, emails, phone, SSN, address
- test_data/email_without_pii.eml: Clean message
- test_data/mixed_network_and_address.log: IPv4/IPv6, address, license
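
For a quick manual check of these files, a small script along the following lines could report which PII categories appear in each one. This is only a sketch: `PII_PATTERNS` and `scan_directory` are hypothetical names, only two simplified patterns are included, and it does not call the repository's detector.

```python
import re
from pathlib import Path

# Hypothetical, deliberately small pattern set; not the repository's detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_directory(directory):
    """Map each file name in `directory` to the sorted PII labels found in it."""
    report = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        text = path.read_text(encoding="utf-8", errors="replace")
        report[path.name] = sorted(
            label for label, pattern in PII_PATTERNS.items() if pattern.search(text)
        )
    return report
```

Calling `scan_directory("test_data")` from the directory that contains test_data/ would return a dictionary keyed by file name. Under these two patterns, the files described as clean would be expected to map to an empty list.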
Open the notebook in this folder to explore the detector with examples:
jupyter lab PII_testing/PII_Logging_2.ipynb

Ensure the gensec environment is active so spaCy and its model are available.