How to Automatically Remove Sensitive Metadata from PDFs for Compliance

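PDFs embed metadata (creator names, timestamps, software versions, and GPS coordinates carried by embedded images) that can leak sensitive data. For teams working under GDPR, HIPAA, or enterprise compliance policies, automating metadata removal closes that gap without relying on manual review.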

Why metadata matters for compliance

Documents shared with clients, regulators, or partners must not expose internal data. Metadata can include:

Author and creator names
Modification dates and application paths
EXIF data from embedded images (GPS, camera info)
XMP and custom document properties

Manual cleanup does not scale. Automate with an API that strips metadata on every document before storage or distribution.
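To get a feel for what a given file carries, a rough audit can scan the raw bytes for the standard document information keys and for an XMP packet. The sketch below is a heuristic, not a parser: `find_metadata_markers` is a hypothetical helper, it only sees uncompressed dictionaries, and it does not inspect EXIF data inside embedded images.

```python
import re

# Keys from the PDF document information dictionary that most often
# carry identifying details.
INFO_KEYS = (b"Author", b"Creator", b"Producer", b"CreationDate", b"ModDate")


def find_metadata_markers(pdf_bytes: bytes) -> list[str]:
    """Return the names of metadata markers visible in the raw PDF bytes.

    Heuristic only: dictionaries stored in compressed object streams are
    invisible to this scan, so an empty result is not proof of a clean file.
    """
    found = []
    for key in INFO_KEYS:
        # Require a non-alphanumeric character after the key so that a longer
        # name starting with the same letters does not count as a match.
        if re.search(rb"/" + key + rb"(?![A-Za-z0-9])", pdf_bytes):
            found.append(key.decode("ascii"))
    # XMP metadata is stored as an XML packet inside a stream.
    if b"<x:xmpmeta" in pdf_bytes or b"<?xpacket" in pdf_bytes:
        found.append("XMP")
    return found
```

Running it on a file before and after scrubbing gives a quick sanity check, for example `find_metadata_markers(open("document.pdf", "rb").read())`.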

Automatic metadata scrubbing with PDF Squeezer

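stripMetadata defaults to true on the compress endpoint, so metadata is stripped unless you turn it off; passing it explicitly makes the intent visible in your code. The API removes metadata from the PDF and its embedded images in one request: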

curl -X POST https://api.pdfsqueezer.io/v1/compress \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "stripMetadata=true" \
  -o document_clean.pdf

Integrate into your pipeline

Hook the API into S3 triggers, webhooks, or batch jobs. Every PDF that passes through gets scrubbed automatically:

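# Example: process uploads in a queue
import requests

API_KEY = "YOUR_API_KEY"  # load from a secret store in production

def process_document(file_path):
    with open(file_path, 'rb') as f:
        r = requests.post(
            'https://api.pdfsqueezer.io/v1/compress',
            files={'file': f},
            headers={'Authorization': f'Bearer {API_KEY}'},
            # Send as a form field, matching -F in the curl example
            data={'stripMetadata': 'true'},
            timeout=60,
        )
    # Fail loudly so an unscrubbed document is never stored by accident
    r.raise_for_status()
    # save_to_compliant_storage is your own storage hook, defined elsewhere
    save_to_compliant_storage(r.content)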
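
For the batch-job case, one possible shape is a small directory sweep that hands each PDF to a scrubbing function and writes the result elsewhere. This is a minimal sketch under assumptions: `scrub_directory` and its `scrub` parameter are hypothetical names, and `scrub` stands for any callable that takes raw PDF bytes and returns cleaned bytes, such as a thin wrapper around the API call.

```python
from pathlib import Path
from typing import Callable


def scrub_directory(src: Path, dst: Path,
                    scrub: Callable[[bytes], bytes]) -> list[Path]:
    """Scrub every *.pdf in src and write the results into dst.

    Files that already exist in dst are skipped, so the job can be re-run
    safely. Any exception raised by scrub propagates and stops the sweep,
    which keeps failures visible instead of silently skipping documents.
    """
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / pdf.name
        if out.exists():
            continue  # processed on an earlier run
        out.write_bytes(scrub(pdf.read_bytes()))
        written.append(out)
    return written
```

Passing the scrubber in as a parameter keeps the sweep independent of the HTTP client, so the same loop can be exercised in tests with a stub before it is pointed at the real endpoint.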
