Fixing a Failing Notification System at Scale


How migrating from polling to WebSockets improved delivery and reduced costs

This blog post shares a real-world case study of fixing a failing notification system by replacing inefficient polling with WebSockets, implementing intelligent queuing, and ensuring reliable message delivery at scale.

Introduction

Scaling real-time systems sounds straightforward—until it isn’t. What seems like a “simple” implementation can silently fail under load, impacting users when it matters most. In our case, the notification system was dropping around 23% of messages during peak hours, especially late at night when traffic spiked unexpectedly.

The root cause? A polling-based architecture in which every client asked the server for updates every few seconds. While easy to implement, polling created unnecessary load, increased latency, and failed to deliver messages reliably in real time.

Identifying the Problem

Our system relied on frequent polling to check for new notifications. Every client called the API every 5 seconds, regardless of whether new data existed. This resulted in excessive server load, wasted resources, and delayed message delivery.

During peak usage, the system simply couldn’t keep up. Messages were delayed or dropped entirely, leading to a poor user experience and reduced engagement.
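
To put the polling overhead described above in rough numbers, here is a minimal sketch. The client count is a hypothetical figure chosen for illustration; the only value taken from this post is the 5-second interval.

```javascript
// Requests per second produced by clients that each poll on a fixed interval.
function pollingRequestsPerSecond(clientCount, intervalSeconds) {
    if (intervalSeconds <= 0) {
        throw new RangeError('intervalSeconds must be positive');
    }
    return clientCount / intervalSeconds;
}

// Hypothetical: 10,000 connected clients (not a figure from the case study).
const clients = 10000;
const rate = pollingRequestsPerSecond(clients, 5);

console.log(`${clients} clients polling every 5s = ${rate} requests/second`);
// prints "10000 clients polling every 5s = 2000 requests/second"
```

Each of those requests costs a round trip even when there is nothing new to return, so the load grows with the number of clients rather than with the number of notifications.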

Migrating to WebSockets

The first major improvement was replacing polling with WebSockets, enabling real-time, bidirectional communication between clients and the server.

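// Example using Socket.io

const http = require('http');
const { Server } = require('socket.io');

// Plain HTTP server for Socket.io to attach to
const server = http.createServer();

const io = new Server(server, {
    cors: {
        // '*' accepts any origin; restrict this to known origins in production
        origin: '*',
    },
});

io.on('connection', (socket) => {
    console.log('User connected:', socket.id);

    // Fires when the client closes the connection or times out
    socket.on('disconnect', () => {
        console.log('User disconnected:', socket.id);
    });
});

// Port 3000 is an arbitrary choice for this example
server.listen(3000);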
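
Once clients hold an open connection, the server can push a notification the moment it exists instead of waiting to be asked. One common way to target a single user is a per-user room, and the sketch below assumes that approach. The room naming scheme, the `notification` event name, and the `userId` handshake field are all assumptions for illustration, not details of the original system.

```javascript
// Hypothetical room naming scheme: one room per user.
function userRoom(userId) {
    return `user:${userId}`;
}

// Call once with the Socket.io server instance (`io`).
function registerUserRooms(io) {
    io.on('connection', (socket) => {
        // Assumes the client sent its user id in the handshake auth payload;
        // a real system would verify a token here instead of trusting it.
        const userId = socket.handshake.auth.userId;
        if (userId !== undefined) {
            socket.join(userRoom(userId));
        }
    });
}

// Push a notification to every open connection of a single user.
function pushNotification(io, userId, payload) {
    io.to(userRoom(userId)).emit('notification', payload);
}
```

With this in place, `pushNotification(io, 1, { message: 'New alert!' })` would reach every tab or device that user 1 currently has connected.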

We integrated Socket.io with a Redis adapter to support horizontal scaling. With the adapter in place, an event emitted on one server instance reaches clients connected to any other instance.
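
A minimal sketch of wiring such an adapter, assuming the `@socket.io/redis-adapter` and `redis` (v4) packages. This post does not say which adapter package or Redis client was used, so treat the package choice and the connection URL as assumptions.

```javascript
// Attach a Redis-backed adapter so events emitted on one instance
// are relayed to clients connected to the others.
async function attachRedisAdapter(io, redisUrl = 'redis://127.0.0.1:6379') {
    // Assumed packages: `redis` (v4 client) and `@socket.io/redis-adapter`.
    // Loaded here so they are only needed when the adapter is attached.
    const { createClient } = require('redis');
    const { createAdapter } = require('@socket.io/redis-adapter');

    // The adapter needs two connections: one to publish, one to subscribe.
    const pubClient = createClient({ url: redisUrl });
    const subClient = pubClient.duplicate();
    await Promise.all([pubClient.connect(), subClient.connect()]);

    io.adapter(createAdapter(pubClient, subClient));
    return { pubClient, subClient };
}
```

When several instances sit behind a load balancer, the Socket.io documentation also calls for sticky sessions unless clients connect with the WebSocket transport only.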

The impact was immediate: server load dropped by 40%, and the repeated polling calls were eliminated entirely.

Building Intelligent Queuing

Real-time delivery alone isn’t enough; systems must also handle spikes gracefully. To address this, we introduced a robust queuing mechanism using BullMQ for background job processing.

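// Example queue setup with BullMQ

const { Queue } = require('bullmq');

// BullMQ stores jobs in Redis; this points at a local instance
const notificationQueue = new Queue('notifications', {
    connection: {
        host: '127.0.0.1',
        port: 6379,
    },
});

// CommonJS has no top-level await, so enqueue inside an async function
async function enqueueNotification() {
    await notificationQueue.add('send-notification', {
        userId: 1,
        message: 'New alert!',
    });
}

enqueueNotification().catch(console.error);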

We implemented priority queues for urgent notifications and ensured the system could degrade gracefully under heavy load. This prevented system crashes and ensured critical messages were processed ahead of less urgent ones.
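
As a sketch of how priorities and load shedding might fit together: BullMQ accepts a numeric `priority` job option in which a lower number is processed sooner. The urgency levels, the numeric values, and the backlog threshold below are hypothetical; this post does not describe the actual policy, and how the backlog size is measured is left to the caller.

```javascript
// Hypothetical urgency levels. In BullMQ a lower number means higher priority.
const PRIORITY = new Map([
    ['urgent', 1],
    ['normal', 5],
    ['low', 10],
]);

// Build the job options for a given urgency, defaulting to "normal".
function jobOptionsFor(urgency) {
    return { priority: PRIORITY.get(urgency) ?? PRIORITY.get('normal') };
}

// Degrade gracefully: once the backlog passes a threshold, stop accepting
// low-urgency notifications so urgent ones still get through.
function shouldAccept(urgency, backlogSize, maxBacklog = 10000) {
    if (backlogSize < maxBacklog) {
        return true;
    }
    return urgency !== 'low';
}

// Enqueue through an existing BullMQ queue, or skip under heavy load.
// Returns the promise from queue.add, or null when the job was shed.
function enqueue(queue, urgency, payload, backlogSize) {
    if (!shouldAccept(urgency, backlogSize)) {
        return null;
    }
    return queue.add('send-notification', payload, jobOptionsFor(urgency));
}
```

Shedding only the lowest tier is one possible policy; delaying those jobs instead of dropping them would be a gentler alternative.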

Ensuring Delivery Guarantees

To make the system reliable, we added multiple layers of delivery assurance. This included retry mechanisms, failure handling, and visibility into message status.

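// Retry logic with exponential backoff
// (reuses notificationQueue from the queue setup above)

async function enqueueWithRetry() {
    await notificationQueue.add(
        'send-notification',
        { userId: 1, message: 'Retry example' },
        {
            attempts: 5, // total tries, including the first
            backoff: {
                type: 'exponential',
                delay: 1000, // base delay in milliseconds
            },
        }
    );
}

enqueueWithRetry().catch(console.error);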
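
To make the retry settings concrete, the sketch below computes the wait before each retry. It assumes BullMQ's built-in exponential strategy waits `delay * 2^(n - 1)` milliseconds before the n-th retry; check that against the BullMQ version you run.

```javascript
// Delay before each retry for an exponential backoff with the given base.
// `attempts` counts the first try, so there are attempts - 1 retries.
function backoffSchedule(attempts, baseDelayMs) {
    const delays = [];
    for (let retry = 1; retry < attempts; retry += 1) {
        delays.push(baseDelayMs * 2 ** (retry - 1));
    }
    return delays;
}

const schedule = backoffSchedule(5, 1000);
console.log(schedule); // [ 1000, 2000, 4000, 8000 ]

const totalWaitMs = schedule.reduce((sum, ms) => sum + ms, 0);
console.log(totalWaitMs); // 15000
```

Under that assumption, a notification that keeps failing is retried over roughly 15 seconds of waiting, not counting processing time, before its attempts run out.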

Failed messages were routed to a dead letter queue for further inspection. Additionally, we implemented real-time delivery tracking so both the system and users could monitor notification status.
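
One common way to build a dead letter queue on BullMQ is a second queue fed from the worker's `failed` event, and the sketch below assumes that pattern. The status names, the `dead-notification` job name, and the `Map` used as a status store are hypothetical stand-ins; this post does not describe how routing or tracking was implemented.

```javascript
// Hypothetical delivery states for tracking.
const STATUS = Object.freeze({
    RETRYING: 'retrying',
    DEAD: 'dead-lettered',
});

// True once a job has used every configured attempt.
// Assumes job.attemptsMade already includes the attempt that just failed.
function isExhausted(job) {
    const maxAttempts = job.opts?.attempts ?? 1;
    return job.attemptsMade >= maxAttempts;
}

// Decide what a failure means for tracking and routing.
function classifyFailure(job) {
    return isExhausted(job) ? STATUS.DEAD : STATUS.RETRYING;
}

// `worker` is a BullMQ Worker, `deadLetterQueue` a second BullMQ Queue,
// and `statusStore` anything with a Map-like set(key, value).
function attachDeadLetterRouting(worker, deadLetterQueue, statusStore) {
    worker.on('failed', async (job, err) => {
        if (!job) {
            return; // the job may no longer exist when the event fires
        }
        const status = classifyFailure(job);
        statusStore.set(job.id, status);

        if (status === STATUS.DEAD) {
            // Keep the original payload and the reason for later inspection.
            await deadLetterQueue.add('dead-notification', {
                original: job.data,
                reason: err.message,
            });
        }
    });
}
```

A production setup would likely persist the status somewhere shared, such as Redis or a database, so that clients can query it from any instance.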

Results After 30 Days

The improvements were measurable and significant. After implementing these changes, the system became more efficient, reliable, and cost-effective:

Notification delivery rate improved to 99.7%, server costs decreased by 35%, and user engagement increased by 28%. Additionally, the engineering team saved around 12 hours per week previously spent troubleshooting and maintaining the old system.

Conclusion

This experience highlights a crucial lesson in system design: simple solutions like polling may work initially, but they often become costly and inefficient at scale. Investing in the right architecture early—or fixing it decisively—can dramatically improve performance, reliability, and overall user experience.

By leveraging WebSockets, intelligent queuing, and delivery guarantees, we transformed a failing notification system into a scalable and resilient real-time infrastructure.

If you're building real-time features, it's worth asking: is your system truly real-time, or just pretending to be?

Stay tuned for more real-world backend scaling stories and happy coding!

Author

Masum Billah

Full-Stack Engineer