
Reactive web crawling system [1]: Introduction

Disclaimer: This is the first post in a series introducing an evolving, public GitHub project of a web crawling system.

The reactive web crawling system lets anyone who creates an account maintain a dashboard of web crawler queries and results. A user can scrape data from any website: track sports equipment price changes, collect news, or compare book reviews across different bookshops.

It is my most ambitious side project so far. Among the many reasons I keep devoting my free time to it:

  • it lets me explore quite an interesting technology stack, which I’ll introduce later in this post,
  • on-demand web crawling could be a valuable tool for business users and data people,
  • it’s fun to create a production-ready solution that costs literally zero money,
  • I can focus on DDD and managing responsibilities without any time pressure.

Figure: Runtime view of the system.

In this series I will cover many topics that I stumbled upon during development, such as:

  • defining requirements for a software project,
  • following the common closure principle to create a reactive architecture,
  • establishing communication between components via RabbitMQ (first sketch below),
  • web crawling using Scrapy and Python (second sketch below),
  • using Marten for C# as an easy route to a free document DB,
  • applying WebSockets and thereby removing HTTP communication (third sketch below),
  • authentication and authorization over web sockets using the external Auth0 provider,
  • creating tests and documentation,
  • deploying everything to the Heroku platform without even providing personal data.
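
To give a first taste of the component communication, here is a minimal sketch of publishing a crawl request to RabbitMQ with Python’s pika client. The queue name and message shape are my own assumptions for illustration, not the project’s actual contract:

```python
import json

import pika

# Connect to a local RabbitMQ broker.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Hypothetical queue name for this sketch.
channel.queue_declare(queue="crawl-requests", durable=True)

# Publish a crawl request as a JSON message.
request = {"url": "https://example.com", "selector": "p.price_color::text"}
channel.basic_publish(
    exchange="",
    routing_key="crawl-requests",
    body=json.dumps(request),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```

A consumer on the other end of the queue would pick the request up and trigger a crawl, which keeps the components decoupled and reactive.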
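The crawling itself will be handled by Scrapy. A minimal spider, pointed at Scrapy’s own demo site books.toscrape.com rather than the project’s real targets, looks like this:

```python
import scrapy

class BookReviewSpider(scrapy.Spider):
    """Hypothetical spider collecting book titles and prices."""

    name = "book_reviews"
    start_urls = ["http://books.toscrape.com/"]  # public demo site

    def parse(self, response):
        # Yield one item per book listed on the page.
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
```

Running it with `scrapy runspider` dumps the scraped items to a feed; in the system, those items would instead land on the user’s dashboard.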
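And to illustrate pushing results over WebSockets instead of HTTP, a bare-bones server using the Python websockets library might look like this (the payload and port are made up for the sketch):

```python
import asyncio
import json

import websockets

async def push_results(websocket):
    # Hypothetical payload: in the real system this would be a crawl result.
    await websocket.send(json.dumps({"query": "book prices", "status": "done"}))

async def main():
    # Serve on localhost:8765, an arbitrary choice for this sketch.
    async with websockets.serve(push_results, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
```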

See the GitHub repository.
