How to Automatically Remove Sensitive Metadata from PDFs for Compliance
PDFs embed metadata (creator names, timestamps, GPS coordinates, software versions) that can leak sensitive data. For organizations subject to GDPR, HIPAA, or enterprise compliance policies, automating metadata removal is essential.
Why metadata matters for compliance
Documents shared with clients, regulators, or partners must not expose internal data. Metadata can include:
- Author and creator names
- Modification dates and application paths
- EXIF data from embedded images (GPS, camera info)
- XMP and custom document properties
Manual cleanup does not scale. Automate with an API that strips metadata on every document before storage or distribution.
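To get a rough sense of which of the items above a given file exposes, a quick audit can scan the raw bytes for common markers. The sketch below is a heuristic only, using the standard library: the marker list is illustrative rather than exhaustive, the function name `find_metadata_markers` is made up for this example, and metadata stored inside compressed object streams will not be visible to it. It is not a substitute for actually stripping the metadata.

```python
import re

# Info-dictionary keys and XMP/EXIF markers that commonly carry identifying data.
# Illustrative list, not exhaustive.
MARKERS = {
    "Author": rb"/Author\b",
    "Creator": rb"/Creator\b",
    "Producer": rb"/Producer\b",
    "CreationDate": rb"/CreationDate\b",
    "ModDate": rb"/ModDate\b",
    "XMP packet": rb"<\?xpacket\s+begin=",
    "EXIF header": rb"Exif\x00\x00",
}


def find_metadata_markers(pdf_bytes: bytes) -> list[str]:
    """Return the names of metadata markers visible in the raw bytes."""
    return [name for name, pattern in MARKERS.items()
            if re.search(pattern, pdf_bytes)]


sample = b"%PDF-1.7\n1 0 obj\n<< /Author (Jane Doe) /ModDate (D:20240101120000Z) >>\nendobj\n"
print(find_metadata_markers(sample))  # ['Author', 'ModDate']
```

An empty result does not prove a file is clean, but a non-empty one is a cheap signal that a document should not leave the building yet.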
Automatic metadata scrubbing with PDF Squeezer
Metadata stripping is on by default on the compress endpoint; passing stripMetadata=true makes the intent explicit. The API removes metadata from the PDF and its embedded images in one request:
curl -X POST https://api.pdfsqueezer.io/v1/compress \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "stripMetadata=true" \
  -o document_clean.pdf
Integrate into your pipeline
Hook the API into S3 triggers, webhooks, or batch jobs. Every PDF that passes through gets scrubbed automatically:
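# Example: process uploads in a queue
import os
import requests

API_KEY = os.environ['PDFSQUEEZER_API_KEY']  # env var name is an assumption

def process_document(file_path):
    with open(file_path, 'rb') as f:
        r = requests.post(
            'https://api.pdfsqueezer.io/v1/compress',
            files={'file': f},
            headers={'Authorization': f'Bearer {API_KEY}'},
            # Sent as a form field, matching the curl example's -F flag
            data={'stripMetadata': 'true'},
            timeout=60,
        )
    if r.ok:
        # save_to_compliant_storage is your own storage helper (not shown)
        save_to_compliant_storage(r.content)
    else:
        r.raise_for_status()  # surface failures instead of dropping them silently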
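For the batch-job case, one possible shape is a small driver that walks a folder and hands each file to a scrubbing function. This is a sketch under assumptions: `scrub_directory` and its `scrub` parameter are hypothetical names, and `scrub` stands in for whatever wraps the API call (it takes raw PDF bytes and returns cleaned bytes). Passing it in as an argument keeps the driver testable without network access.

```python
from pathlib import Path
from typing import Callable


def scrub_directory(src: Path, dst: Path,
                    scrub: Callable[[bytes], bytes]) -> dict[str, list[str]]:
    """Scrub every PDF in src and write the results to dst.

    Files that fail are recorded and never written to dst, so an
    unscrubbed document cannot slip into the clean folder.
    """
    dst.mkdir(parents=True, exist_ok=True)
    report: dict[str, list[str]] = {"ok": [], "failed": []}
    for pdf in sorted(src.glob("*.pdf")):
        try:
            cleaned = scrub(pdf.read_bytes())
        except Exception as exc:  # one bad file should not stop the batch
            report["failed"].append(f"{pdf.name}: {exc}")
            continue
        (dst / pdf.name).write_bytes(cleaned)
        report["ok"].append(pdf.name)
    return report
```

The returned report lists which files were cleaned and which failed, which is useful as an audit trail when compliance reviews ask what was processed.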