Academic Project · Big Data / Streaming
Real-Time Anomaly Detection
A streaming pipeline that processes user clicks on an e-commerce site and flags abnormal behavior (bots, scrapers, credential stuffing) in real time: if a user generates more than 50 clicks within a single one-minute window, the system raises an alert.
The Pipeline
1. Ingest
Kafka
Every click event (user_id, event, product_id, timestamp) is published to a Kafka topic. A Python producer simulates normal traffic; a separate anomaly simulator injects 10,000 rapid clicks for testing.
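A minimal sketch of the traffic simulator, assuming epoch-millisecond timestamps and a topic named "clicks" (both assumptions; the report's producer may differ). The event generation is plain Python; the actual publish step, shown in comments, would use a Kafka client such as kafka-python against a running broker:

```python
import random
import time

def make_click_event(user_id: str, product_id: str) -> dict:
    # Build one event matching the (user_id, event, product_id, timestamp) schema.
    return {
        "user_id": user_id,
        "event": "click",
        "product_id": product_id,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds, used as event time
    }

def simulate_normal_traffic(n_users: int = 5, n_events: int = 20) -> list:
    # Generate a small batch of plausible click events spread across users/products.
    users = [f"user-{i}" for i in range(n_users)]
    products = [f"prod-{i}" for i in range(10)]
    return [make_click_event(random.choice(users), random.choice(products))
            for _ in range(n_events)]

# Publishing with kafka-python would look like this (requires a broker):
#
# import json
# from kafka import KafkaProducer
# producer = KafkaProducer(
#     bootstrap_servers="localhost:9092",  # broker address is an assumption
#     value_serializer=lambda v: json.dumps(v).encode("utf-8"),
# )
# for event in simulate_normal_traffic():
#     producer.send("clicks", value=event)  # topic name "clicks" is an assumption
# producer.flush()
```

The anomaly simulator is the same generator pinned to one user in a tight loop.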
2. Process
Flink
PyFlink consumes the stream, assigns event-time watermarks, and groups clicks into 1-minute tumbling windows per user. A Count-Min Sketch keeps the per-user counting memory-efficient.
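The Count-Min Sketch trades exact counts for bounded memory: it may overestimate a user's clicks but never underestimates, so it cannot miss a heavy clicker. A minimal standalone sketch (the report's implementation and hash choice may differ):

```python
import hashlib

class CountMinSketch:
    """Approximate per-key counter: may overestimate, never underestimates."""

    def __init__(self, width: int = 1024, depth: int = 4):
        # depth independent hash rows, each `width` counters wide.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        # One column index per row, using a salted hash per row.
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8,
                                salt=str(row).encode()).digest()
            yield row, int.from_bytes(h, "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Min across rows limits the damage from hash collisions.
        return min(self.table[row][col] for row, col in self._indexes(key))
```

Memory is fixed at width × depth counters regardless of how many distinct users appear in the window, which is the point of using it over a plain dict.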
3. Alert
Detection
When a window closes with more than 50 clicks for a user, Flink emits an anomaly alert containing the user_id, window range, and click count.
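The windowing-plus-threshold logic can be sketched as a batch equivalent in plain Python: bucket each event into its 1-minute tumbling window by integer-dividing the timestamp, then flag any (user, window) pair above the threshold. Names and the alert shape here are illustrative, not the job's actual output schema:

```python
from collections import defaultdict

WINDOW_MS = 60_000   # 1-minute tumbling windows
THRESHOLD = 50       # clicks per window before alerting

def detect_anomalies(events):
    """Batch equivalent of the Flink job: count clicks per (user, window),
    emit an alert for every window that exceeds THRESHOLD."""
    counts = defaultdict(int)
    for e in events:
        window_start = (e["timestamp"] // WINDOW_MS) * WINDOW_MS
        counts[(e["user_id"], window_start)] += 1
    return [
        {"user_id": user, "window": (start, start + WINDOW_MS), "clicks": n}
        for (user, start), n in counts.items()
        if n > THRESHOLD
    ]
```

The streaming version differs only in *when* the count is finalized: Flink waits for the watermark to pass the window end instead of seeing the whole dataset at once.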
Why Kafka + Flink
Kafka as the buffer
Kafka absorbs traffic spikes and stores events on disk. If Flink goes down, events pile up safely in Kafka and get processed when it comes back. It decouples ingestion from processing completely.
Flink for stateful streaming
Flink processes events one by one with low, millisecond-scale latency. It handles event time natively and can close time windows correctly even when data arrives late or out of order, which matters for distinguishing real anomalies from false positives.
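The mechanism behind correct window closing is the watermark: it tracks the highest event time seen minus an allowed out-of-orderness delay, and a window is only finalized once the watermark passes its end. A toy version of that idea (the 5-second delay is an assumed parameter, not the project's configured value):

```python
def bounded_out_of_orderness_watermarks(timestamps, max_delay_ms=5_000):
    """Watermark after each event: highest event time seen so far,
    minus the out-of-orderness bound. Mirrors Flink's bounded-out-of-orderness strategy."""
    max_seen = float("-inf")
    watermarks = []
    for ts in timestamps:
        max_seen = max(max_seen, ts)          # late events never move the watermark back
        watermarks.append(max_seen - max_delay_ms)
    return watermarks

def window_is_closed(window_end_ms, watermark_ms):
    # A window [start, end) can be finalized once the watermark reaches its end.
    return watermark_ms >= window_end_ms
```

An event arriving slightly out of order (the 9,000 ms event after the 10,000 ms one) still lands in the right window because the window only closes when the watermark, not the wall clock, passes its end.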
What happens when things break
If Flink crashes
Events accumulate in Kafka (disk retention). When Flink restarts, it resumes from its last checkpoint and reprocesses the backlog with the same accuracy as live processing.
If Kafka goes down
Producers can't send events. They would need a local buffer with retry logic, or accept temporary data loss. This is the weakest link in the chain.
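The local-buffer idea can be sketched as a thin wrapper around the send call: failed events stay queued in memory (with a size cap, so an extended outage drops the oldest data rather than exhausting RAM) and are retried on the next send. `send_fn` stands in for a real Kafka client call and is an assumption:

```python
from collections import deque

class BufferingProducer:
    """Wraps an unreliable send function with a bounded in-memory buffer and retry.
    `send_fn` is a stand-in for a real Kafka client's send (an assumption)."""

    def __init__(self, send_fn, max_buffer=10_000):
        self.send_fn = send_fn
        self.buffer = deque(maxlen=max_buffer)  # oldest events dropped when full

    def send(self, event) -> None:
        self.buffer.append(event)
        self.flush()

    def flush(self) -> int:
        """Drain the buffer in order; stop at the first failure and keep the rest."""
        sent = 0
        while self.buffer:
            try:
                self.send_fn(self.buffer[0])
            except ConnectionError:
                break  # broker still down; retry on the next send/flush
            self.buffer.popleft()
            sent += 1
        return sent
```

This preserves event order and survives short outages, but it is still weaker than Kafka's disk-backed buffering: a producer crash loses whatever is in the in-memory deque.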
Project Report
Full report covering the architecture, event time vs processing time, Count-Min Sketch implementation, fault tolerance, and execution screenshots.