Developing a massive scraper where users could search more than 200 million people
Juan Pablo Balarini, CTO
Inception - Evolve
Massive scraping with scalable search engine
Ruby on Rails Python
The idea behind Wolfy was to use all the public data that’s already available on the internet to create a search engine where people could find information about companies and their employees.
One of Wolfy’s main technical challenges was to create a massive scraper that could process all the information we needed. It also had to run periodically and cost-efficiently in order to keep Wolfy’s information up to date.
Another important requirement was to find a person’s work email address, given their name and the company where they worked, while being polite to email service providers.
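To illustrate the first half of that requirement, here is a minimal sketch of how candidate work addresses could be generated from a name and a company domain. The function names and the list of address patterns are assumptions made for this example, not a description of Wolfy's actual logic. The verification step is left out on purpose: any check against a mail provider would need to be rate-limited to stay polite.

```python
import re
import unicodedata


def _slug(part):
    """Lowercase a name part and strip accents, spaces and punctuation."""
    ascii_part = (
        unicodedata.normalize("NFKD", part)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z]", "", ascii_part.lower())


def candidate_emails(first_name, last_name, domain):
    """Return common address patterns for a person at a company domain.

    The patterns below are illustrative assumptions; real companies
    may use other conventions.
    """
    first, last = _slug(first_name), _slug(last_name)
    if not first or not last:
        return []
    local_parts = [
        f"{first}.{last}",
        f"{first}{last}",
        f"{first[0]}{last}",
        first,
        f"{first}_{last}",
    ]
    domain = domain.strip().lower()
    return [f"{local}@{domain}" for local in local_parts]
```

Calling `candidate_emails("José", "Pérez", "example.com")` yields a short, ordered list of addresses that a separate, throttled verification step could then try one by one.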
We also faced the challenge of finding product/market fit. Wolfy targeted the Latin American market, and we found that most companies had to change at least part of their sales pipeline before they could use a product like Wolfy. They were relying on traditional methods such as buying outdated databases and making phone calls, instead of more modern approaches like cold emailing. This made the sales process difficult, because we first had to show customers that they could get better results by introducing new tools and processes into their sales pipeline.
We designed and implemented a web application where companies could apply advanced filters to find their target prospects. This was powered by a massive scraper and search engine that could handle complex, real-time queries.
In order to obtain all the information we needed, we had to create a massive parallel scraping architecture. The project used Python and Scrapy for the scraper part, since it allows for quick iterations and has a great community behind it.
The backend was implemented in Ruby on Rails and served as an API for our frontend. For the frontend we used Angular, since a Single Page Application fit our use case perfectly: fewer than 10 different screens with lots of expected user interaction on and between them. In retrospect, it was a great choice: the web application performed almost as fast as a desktop application, without any lag or delay between page changes. The code also turned out to be easy to maintain, since everything was divided into small components.
In order to handle the amount of data that Wolfy had to query and to improve speed, we had to design a sharded database using PostgreSQL. Each shard was responsible for handling data related to one specific country.
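As a rough illustration of country-based sharding, the sketch below routes a query to the shard that owns a given country. The shard names, connection strings, and country codes are placeholders invented for this example; the actual layout of Wolfy's shards is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    name: str
    dsn: str  # PostgreSQL connection string (placeholder value)


# Hypothetical mapping: one shard per country, keyed by ISO country code.
SHARDS = {
    "AR": Shard("shard_ar", "postgresql://db-ar.example.internal/wolfy"),
    "BR": Shard("shard_br", "postgresql://db-br.example.internal/wolfy"),
    "UY": Shard("shard_uy", "postgresql://db-uy.example.internal/wolfy"),
}


def shard_for(country_code):
    """Return the shard responsible for a country."""
    key = country_code.strip().upper()
    try:
        return SHARDS[key]
    except KeyError:
        raise LookupError(f"no shard configured for country {key!r}") from None


def group_by_shard(country_codes):
    """Group a multi-country search so each shard is queried only once."""
    plan = {}
    for code in country_codes:
        shard = shard_for(code)
        plan.setdefault(shard.name, []).append(code.strip().upper())
    return plan
```

With this kind of routing, a search restricted to one country touches a single database, while a multi-country search fans out only to the shards it actually needs.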
In order to improve scraping speed, we used a pool of thousands of IPs that were rotated between scrapers using proxies.
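A minimal sketch of this kind of rotation is shown below, assuming a simple round-robin policy; the class and function names and the proxy URLs are hypothetical. It writes the chosen proxy into a request's metadata under the `proxy` key, which is the key Scrapy's built-in HttpProxyMiddleware reads from `request.meta`.

```python
import itertools
import threading


class ProxyPool:
    """Round-robin pool of proxy URLs shared by several scraper workers."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("proxy pool must not be empty")
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()  # keeps rotation consistent across threads

    def next_proxy(self):
        """Return the next proxy URL in the rotation."""
        with self._lock:
            return next(self._cycle)


def assign_proxy(request_meta, pool):
    """Store the next proxy in a request's metadata dictionary."""
    request_meta["proxy"] = pool.next_proxy()
    return request_meta
```

Round-robin is only one possible policy; a production pool would likely also track failing or blocked proxies and take them out of rotation.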
We launched a platform where users could search more than 200 million people and their work emails, using data publicly obtained from the internet.
Wolfy allowed users to apply simple filters to search over millions of data points and find their target audience. A massive data pipeline was designed and implemented to scrape millions of web pages periodically and keep Wolfy’s information up to date. Our marketing strategy was to offer users a free trial and, once they had tried the product, a subscription to a recurring plan paid by credit card.
It was our first time raising capital, and we learned when to raise private funding vs public capital, how to make a good and effective pitch, and what’s important to an investor. Now we can share that experience with entrepreneurs and startups.
Another enriching experience was defining our business model and choosing between B2C and B2B. We went for the B2B model after concluding it would be the most efficient way to scale. This experience is key to helping entrepreneurs decide which option is most suitable for their project.
Wolfy was one of the first products that we crafted at Eagerworks, where we took an idea and turned it into a product that customers were paying for. We learned many lessons that, to this day, we are happy and proud to apply daily to our customers’ products.