Disclaimer: This is the first post in a series introducing an evolving, public GitHub project: a web crawling system.
The reactive web crawling system lets anyone who creates an account maintain a dashboard of web crawler queries and results. A user can scrape data from any website – track price changes of sports equipment, collect news, or compare book reviews across different bookshops.
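As an illustration of the kind of query such a dashboard might run, here is a minimal sketch that extracts prices from an HTML snippet. It uses only the Python standard library (not the project's actual scrapy-based crawler), and the `price` class name and markup are illustrative assumptions, not any real shop's structure:

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of elements marked with class="price".

    The markup and the 'price' class name are illustrative
    assumptions, not taken from any real website.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text node belongs to a price element.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False


# A hypothetical product-listing fragment.
page = '<div><span class="price">49.99</span><span class="price">12.50</span></div>'
parser = PriceParser()
parser.feed(page)
print(parser.prices)  # → ['49.99', '12.50']
```

A real crawler query would fetch the page over HTTP and run on a schedule; this sketch only shows the extraction step that turns raw HTML into comparable data points.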
It is my most ambitious side project so far. Among the many reasons I keep devoting my free time to it:
- it lets me explore quite an interesting technology stack, which I introduce later in this post,
- on-demand web crawling could be quite a valuable tool for business users and data people,
- it’s fun to create a production-ready solution that costs literally nothing to run,
- I can focus on DDD and managing responsibilities without any pressure of time.
In this series I will cover many topics that I have stumbled upon during the development process, such as:
- defining requirements for a software project,
- following the common closure principle to create a reactive architecture,
- establishing communication between components via RabbitMQ,
- web crawling using Scrapy and Python,
- using Marten for C# as an easy way to get a free document DB,
- applying WebSockets and thereby removing HTTP communication,
- authentication and authorization over WebSockets using the external Auth0 provider,
- creating tests and documentation,
- deploying everything to the Heroku platform without even providing personal data.
See the GitHub repository.