
A while back, I opened the console for one of my busier S3 buckets and there, nestling among hundreds of perfectly ordinary files, sat something called quarterly_report_20260305.pdf. It looked exactly like the mystery jar of jam you find at the back of the fridge after a family gathering: you know it wasn’t there yesterday, yet nobody owns up to putting it there. In shared buckets, this sort of thing happens all the time, and the console, bless it, only tells you the last modified date. It never whispers who the actual culprit was.
That is where Amazon Athena and S3 server access logs come in. With a few straightforward steps, you can turn those silent log files into a clear answer delivered in plain SQL. The whole process feels pleasantly like detective work you can finish while your tea is still hot, and the walkthrough below should take no more than ten minutes from start to finish.
The tech stack
Amazon S3 is the sturdy cupboard that holds nearly everything these days, yet for all its virtues, it keeps the identity of each uploader politely hidden. To make that identity visible, you must first switch on server access logging. Once enabled, every request (upload, download, delete) is written as plain text and dropped into a bucket you choose. Logging itself is free to enable, though you do pay the modest storage cost of the log files in the destination bucket, a small price for sanity.
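To give you a feel for what lands in that bucket, here is a single log line with every value invented for illustration, laid out in the documented space-delimited order: bucket owner, bucket, timestamp, remote IP, requester, request ID, operation, key, and so on, one field per column of the table we will create shortly.

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be shared-team-bucket [05/Mar/2026:10:15:30 +0000] 192.0.2.44 arn:aws:iam::111122223333:user/alice 3E57427F33A59F07 REST.PUT.OBJECT quarterly_report_20260305.pdf "PUT /shared-team-bucket/quarterly_report_20260305.pdf HTTP/1.1" 200 - - 482943 52 12 "-" "aws-cli/2.15.30" - HostIdExample123= SigV4 TLS_AES_128_GCM_SHA256 AuthHeader s3.eu-west-2.amazonaws.com TLSv1.3 - -

One glance explains why nobody parses these by hand, and why the regex in Step 2 exists.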
Enter Amazon Athena, a serverless query service that lets you run ordinary SQL straight against those log files without moving a single byte. The first time I used it, I felt the same quiet thrill you get when you finally locate the right key in a drawer full of oddments: everything suddenly becomes possible with almost no effort.
Step 1. Ensure logging is enabled
Go to the Properties tab of your source bucket and turn on Server access logging. Point the logs at a separate bucket (I like to call mine something obvious like logs-companyname), give them a clean prefix such as access-logs/, and choose the date-based partitioning option for the log object key format, since the partitioned Athena table in the next step expects the logs to land in yyyy/MM/dd folders.
A small practical note I learned after waiting in vain one afternoon: the logs can take up to a few hours to appear, rather like post that has gone round the long way. They do not backfill old activity either; the history starts only from the moment you switch logging on. Also, check the target bucket’s policy; a missing permission once left me staring at empty folders and muttering mild British curses at the screen.
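That missing permission is worth spelling out. S3 now delivers access logs through the logging.s3.amazonaws.com service principal, so the destination bucket needs a policy along these lines (the bucket name, prefix, and account ID here are placeholders you must swap for your own):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ServerAccessLogsPolicy",
      "Effect": "Allow",
      "Principal": { "Service": "logging.s3.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-company-logs/access-logs/*",
      "Condition": {
        "StringEquals": { "aws:SourceAccount": "111122223333" }
      }
    }
  ]
}

The console usually adds this for you when you enable logging through it, but if the logs bucket lives in another account or was set up by hand, this is the first thing to check.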
Step 2. Create the Athena table with partitioning
Now tell Athena how to read the logs. The following DDL uses partition projection, a quietly brilliant feature that means you never have to run MSCK REPAIR TABLE again. It works out the date folders automatically.
Run this in the Athena query editor (I have changed the names and paths from my real ones so nothing sensitive slips through):
CREATE DATABASE IF NOT EXISTS s3_logs_database;
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs_table (
  bucketowner string,
  bucket_name string,
  requestdatetime string,
  remoteip string,
  requester string,
  requestid string,
  operation string,
  key string,
  request_uri string,
  httpstatus string,
  errorcode string,
  bytessent bigint,
  objectsize bigint,
  totaltime string,
  turnaroundtime string,
  referrer string,
  useragent string,
  versionid string,
  hostid string,
  sigv string,
  ciphersuite string,
  authtype string,
  endpoint string,
  tlsversion string,
  accesspointarn string,
  aclrequired string
)
PARTITIONED BY (timestamp string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
)
LOCATION 's3://my-company-logs/access-logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.timestamp.type' = 'date',
  'projection.timestamp.range' = '2024/01/01,NOW',
  'projection.timestamp.format' = 'yyyy/MM/dd',
  'projection.timestamp.interval' = '1',
  'projection.timestamp.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://my-company-logs/access-logs/${timestamp}/'
);
Replace the LOCATION and storage.location.template with your own bucket and prefix. One wrinkle to watch for: when date-based partitioning is enabled for the log object key format, S3 inserts your account ID, Region, and source bucket name into the key ahead of the date folders, so those segments belong in your prefix too. The regex looks like the sort of thing a medieval monk would doodle in the margin, but it is the official AWS pattern and works reliably.
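Before hunting for any particular file, it is worth a quick smoke test to confirm the regex is actually parsing your logs. A simple profile of one day's traffic (the database, table, and date here match the examples above) does the job:

SELECT operation, count(*) AS requests
FROM s3_logs_database.access_logs_table
WHERE timestamp = '2026/03/05'
GROUP BY operation
ORDER BY requests DESC;

If it comes back with sensible operation names such as REST.GET.OBJECT and plausible counts, the table is wired up correctly. If it returns nothing, check the LOCATION and the partition date before blaming the regex.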
Step 3. Uncover the upload details
With the table ready, finding who dropped that quarterly report is almost childishly simple. Here is the query I use (again with changed details):
SELECT
  requestdatetime AS upload_time,
  requester,
  remoteip,
  operation,
  key AS file_path
FROM access_logs_table
WHERE timestamp = '2026/03/05'
  AND key LIKE '%quarterly_report_20260305%'
  AND operation LIKE '%PUT%'
LIMIT 10;
The requester column is the star of the show: it holds the ARN of the IAM user or assumed role that performed the action. The requestdatetime column gives the exact second in UTC, and operation confirms it was indeed an upload (simple uploads appear as REST.PUT.OBJECT, while large multipart uploads show up as REST.POST.UPLOAD and its relatives). I once traced a suspicious file only to discover it was my own weekend script behaving like an overenthusiastic puppy that had wandered off and come back with a stick.
If the requester column shows only a hyphen, the request was unauthenticated: the upload came through public access, a gentle reminder to tighten the bucket policy before the next family gathering in the cloud.
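To audit for that situation across a whole day rather than a single file, a small variation of the same query lists every unauthenticated request:

SELECT requestdatetime, remoteip, operation, key
FROM access_logs_table
WHERE timestamp = '2026/03/05'
  AND requester = '-'
LIMIT 20;

On a healthy private bucket this should return nothing at all; any rows it does return tell you exactly which objects the outside world has been touching.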
Cost saving tips with partitions
Athena charges five dollars per terabyte scanned. Without the timestamp filter, you can accidentally scan months of logs while hunting for one file, an expense roughly equivalent to buying an entire chocolate cake when you only wanted one slice. With partitioning, the query reads only the folder for that single day, and the bill usually stays under the price of a decent biscuit.
I speak from experience: one forgetful broad query once produced a polite note from AWS that made my eyebrows rise. Since then, I filter on timestamp first and sleep better at night.
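Because the partition key is a plain yyyy/MM/dd string, range comparisons work lexicographically, so you can still span several days without scanning everything. This sketch, using the same example table, totals uploads per requester for the first week of March while only touching seven day-folders:

SELECT requester, count(*) AS uploads
FROM access_logs_table
WHERE timestamp BETWEEN '2026/03/01' AND '2026/03/07'
  AND operation = 'REST.PUT.OBJECT'
GROUP BY requester
ORDER BY uploads DESC;

The Data scanned figure at the bottom of the Athena results pane is worth a glance after every query; it is the difference between a slice and the whole cake.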
Wrapping up
What began as a mild annoyance with a stray file has turned into a small daily pleasure. A few lines of SQL, a properly partitioned table, and the cloud cupboard becomes pleasantly transparent. The same technique helps with compliance checks, pipeline debugging, and the quiet satisfaction of knowing exactly what is going on in your storage.
If you have a few minutes this afternoon, turn on logging for a test bucket and give the table a try. You may find, as I did, that the small effort pays back in clarity and calm far beyond the few pennies it costs. And the next time something mysterious appears in your S3 cupboard, you will know exactly who left it there and when.
