Disclaimer: This is the first post in a series introducing an evolving, public GitHub project: a web crawling system.
The reactive web crawling system lets anyone who creates an account maintain a dashboard of web crawler queries and results. A user can scrape data from any website – track price changes of sports equipment, collect news, or compare book reviews across different bookshops.
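As an illustration of the kind of query such a dashboard might run, here is a minimal sketch that extracts prices from an HTML snippet. It uses only the Python standard library (not the project's actual scrapy-based crawler), and the `price` class name and markup are illustrative assumptions, not any real shop's structure:

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of elements marked with class="price".

    The markup and the 'price' class name are illustrative
    assumptions, not taken from any real website.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text node belongs to a price element.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False


# A hypothetical product-listing fragment.
page = '<div><span class="price">49.99</span><span class="price">12.50</span></div>'
parser = PriceParser()
parser.feed(page)
print(parser.prices)  # → ['49.99', '12.50']
```

A real crawler query would fetch the page over HTTP and run on a schedule; this sketch only shows the extraction step that turns raw HTML into comparable data points.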
It is my most ambitious side project so far. Among the many reasons I keep devoting my free time to it:
- it lets me explore quite an interesting technology stack, which I introduce later in this post,
- on-demand web crawling could be quite a valuable tool for business users and data people,
- it’s fun to create a production-ready solution that costs literally nothing to run,
- I can focus on DDD and managing responsibilities without any pressure of time.
In this series I will cover many topics that I have stumbled upon during the development process, such as:
- defining requirements for a software project,
- following the common closure principle to create a reactive architecture,
- establishing communication between components via RabbitMQ,
- web crawling using Scrapy and Python,
- using Marten for C# as an easy way to get a free document DB,
- applying WebSockets and thereby removing HTTP communication,
- authentication and authorization over WebSockets using the external Auth0 provider,
- creating tests and documentation,
- deploying everything to the Heroku platform without even providing personal data.
See the GitHub repository.