Academic Project · Big Data / Streaming
Real-Time Anomaly Detection
A streaming pipeline that processes user clicks on an e-commerce site and flags abnormal behavior (bots, scrapers, credential stuffing) in real time: if a user generates more than 50 clicks within a single one-minute window, the system raises an alert.
The Pipeline
1. Ingest
Kafka
Every click event (user_id, event, product_id, timestamp) is published to a Kafka topic. A Python producer simulates normal traffic; a separate anomaly simulator injects 10,000 rapid clicks for testing.
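A minimal sketch of the traffic simulator, assuming epoch-millisecond timestamps and a topic named "clicks" (both assumptions; the report's producer may differ). The event generation is plain Python; the actual publish step, shown in comments, would use a Kafka client such as kafka-python against a running broker:

```python
import random
import time

def make_click_event(user_id: str, product_id: str) -> dict:
    # Build one event matching the (user_id, event, product_id, timestamp) schema.
    return {
        "user_id": user_id,
        "event": "click",
        "product_id": product_id,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds, used as event time
    }

def simulate_normal_traffic(n_users: int = 5, n_events: int = 20) -> list:
    # Generate a small batch of plausible click events spread across users/products.
    users = [f"user-{i}" for i in range(n_users)]
    products = [f"prod-{i}" for i in range(10)]
    return [make_click_event(random.choice(users), random.choice(products))
            for _ in range(n_events)]

# Publishing with kafka-python would look like this (requires a broker):
#
# import json
# from kafka import KafkaProducer
# producer = KafkaProducer(
#     bootstrap_servers="localhost:9092",  # broker address is an assumption
#     value_serializer=lambda v: json.dumps(v).encode("utf-8"),
# )
# for event in simulate_normal_traffic():
#     producer.send("clicks", value=event)  # topic name "clicks" is an assumption
# producer.flush()
```

The anomaly simulator is the same generator pinned to one user in a tight loop.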
2. Process
Flink
PyFlink consumes the stream, assigns event-time watermarks, and groups clicks into 1-minute tumbling windows per user. A Count-Min Sketch keeps the per-user counting memory-efficient.
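The Count-Min Sketch trades exact counts for bounded memory: it may overestimate a user's clicks but never underestimates, so it cannot miss a heavy clicker. A minimal standalone sketch (the report's implementation and hash choice may differ):

```python
import hashlib

class CountMinSketch:
    """Approximate per-key counter: may overestimate, never underestimates."""

    def __init__(self, width: int = 1024, depth: int = 4):
        # depth independent hash rows, each `width` counters wide.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        # One column index per row, using a salted hash per row.
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8,
                                salt=str(row).encode()).digest()
            yield row, int.from_bytes(h, "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Min across rows limits the damage from hash collisions.
        return min(self.table[row][col] for row, col in self._indexes(key))
```

Memory is fixed at width × depth counters regardless of how many distinct users appear in the window, which is the point of using it over a plain dict.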
3. Alert
Detection
When a window closes with more than 50 clicks for a user, Flink emits an anomaly alert containing the user_id, window range, and click count.
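The windowing-plus-threshold logic can be sketched as a batch equivalent in plain Python: bucket each event into its 1-minute tumbling window by integer-dividing the timestamp, then flag any (user, window) pair above the threshold. Names and the alert shape here are illustrative, not the job's actual output schema:

```python
from collections import defaultdict

WINDOW_MS = 60_000   # 1-minute tumbling windows
THRESHOLD = 50       # clicks per window before alerting

def detect_anomalies(events):
    """Batch equivalent of the Flink job: count clicks per (user, window),
    emit an alert for every window that exceeds THRESHOLD."""
    counts = defaultdict(int)
    for e in events:
        window_start = (e["timestamp"] // WINDOW_MS) * WINDOW_MS
        counts[(e["user_id"], window_start)] += 1
    return [
        {"user_id": user, "window": (start, start + WINDOW_MS), "clicks": n}
        for (user, start), n in counts.items()
        if n > THRESHOLD
    ]
```

The streaming version differs only in *when* the count is finalized: Flink waits for the watermark to pass the window end instead of seeing the whole dataset at once.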
Why Kafka + Flink
Kafka as the buffer
Kafka absorbs traffic spikes and stores events on disk. If Flink goes down, events pile up safely in Kafka and get processed when it comes back. It decouples ingestion from processing completely.
Flink for stateful streaming
Flink processes events one by one with low, millisecond-scale latency. It handles event time natively and can close time windows correctly even when data arrives late or out of order, which matters for distinguishing real anomalies from false positives.
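The mechanism behind correct window closing is the watermark: it tracks the highest event time seen minus an allowed out-of-orderness delay, and a window is only finalized once the watermark passes its end. A toy version of that idea (the 5-second delay is an assumed parameter, not the project's configured value):

```python
def bounded_out_of_orderness_watermarks(timestamps, max_delay_ms=5_000):
    """Watermark after each event: highest event time seen so far,
    minus the out-of-orderness bound. Mirrors Flink's bounded-out-of-orderness strategy."""
    max_seen = float("-inf")
    watermarks = []
    for ts in timestamps:
        max_seen = max(max_seen, ts)          # late events never move the watermark back
        watermarks.append(max_seen - max_delay_ms)
    return watermarks

def window_is_closed(window_end_ms, watermark_ms):
    # A window [start, end) can be finalized once the watermark reaches its end.
    return watermark_ms >= window_end_ms
```

An event arriving slightly out of order (the 9,000 ms event after the 10,000 ms one) still lands in the right window because the window only closes when the watermark, not the wall clock, passes its end.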
What happens when things break
If Flink crashes
Events accumulate in Kafka (disk retention). When Flink restarts, it resumes from its last checkpoint and reprocesses the backlog with the same accuracy as live processing.
If Kafka goes down
Producers can't send events. They would need a local buffer with retry logic, or accept temporary data loss. This is the weakest link in the chain.
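The local-buffer idea can be sketched as a thin wrapper around the send call: failed events stay queued in memory (with a size cap, so an extended outage drops the oldest data rather than exhausting RAM) and are retried on the next send. `send_fn` stands in for a real Kafka client call and is an assumption:

```python
from collections import deque

class BufferingProducer:
    """Wraps an unreliable send function with a bounded in-memory buffer and retry.
    `send_fn` is a stand-in for a real Kafka client's send (an assumption)."""

    def __init__(self, send_fn, max_buffer=10_000):
        self.send_fn = send_fn
        self.buffer = deque(maxlen=max_buffer)  # oldest events dropped when full

    def send(self, event) -> None:
        self.buffer.append(event)
        self.flush()

    def flush(self) -> int:
        """Drain the buffer in order; stop at the first failure and keep the rest."""
        sent = 0
        while self.buffer:
            try:
                self.send_fn(self.buffer[0])
            except ConnectionError:
                break  # broker still down; retry on the next send/flush
            self.buffer.popleft()
            sent += 1
        return sent
```

This preserves event order and survives short outages, but it is still weaker than Kafka's disk-backed buffering: a producer crash loses whatever is in the in-memory deque.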
Project Report
Full report covering the architecture, event time vs processing time, Count-Min Sketch implementation, fault tolerance, and execution screenshots.