The Appraisal Lane

How we scraped 10 million cars per day and contributed to a successful multi-million dollar exit.

A real-time used-car community that connects dealers and consumers with a live team of expert appraisers to create transparency and efficiency in the trade-in process.


Challenge

Scrape more than 52,000 US car dealer websites every day, covering more than 10 million cars in total, efficiently enough to provide real-time market insight.

Solution

We built a massively parallel scraper and a data pipeline to process and store the scraped information.

Outcome

Scalable and cost-effective scraper. Data pipeline and ETL process to handle and process billions of data points. Multi-million dollar exit.

"They are easy to work with; it's clear they have a great organizational culture, being professional, result-oriented, and fun to work with, all at the same time."

CHALLENGE

The Appraisal Lane wanted to build a platform that provided them with real-time analytics.

This included information about:

among others. 

Any customer could later run business analytics on a given US region and a given kind of car to learn specific properties of that market. Using this information they could improve their marketing efforts, detect trends, and, most importantly, improve the accuracy of their appraisals (so that they are closer to the real market price).

The main challenge was to build a massively parallel scraper that could handle scraping 10 million vehicles each day in a cost-effective way. Performance was a priority from the ground up, since we wanted to scrape as much as possible with as few AWS EC2 instances as possible.
Also, we would need to store, process, and make accessible (ETL) all that daily data for later usage (billions of car data points). This data needed to be queried in real time, so this was another important challenge to tackle.
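To give a feel for the scale, the daily target can be turned into a sustained scrape rate and a rough worker count. This is only a back-of-the-envelope sketch: the per-worker rate of 2 vehicles per second is a made-up assumption for illustration, not a figure from the project, and the function names are hypothetical.

```python
import math

def required_rate(vehicles_per_day, window_hours=24.0):
    """Average vehicles per second needed to finish within the time window."""
    return vehicles_per_day / (window_hours * 3600)

def workers_needed(vehicles_per_day, per_worker_rate, window_hours=24.0):
    """Smallest number of workers that meets the required rate.

    per_worker_rate is an assumed throughput in vehicles per second per worker.
    """
    rate = required_rate(vehicles_per_day, window_hours)
    return math.ceil(rate / per_worker_rate)

if __name__ == "__main__":
    # 10 million vehicles spread evenly over 24 hours
    print(round(required_rate(10_000_000), 2))   # about 115.74 per second
    # Hypothetical worker throughput of 2 vehicles per second
    print(workers_needed(10_000_000, 2.0))       # 58 workers
```

Shrinking the scraping window (for example, `window_hours=8`) raises the required rate proportionally, which is why throughput per instance mattered so much for cost.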

All the work we had done on Wolfy gave us the experience and know-how to tackle this big technical challenge.

SOLUTION

We built a distributed master-slave scraper architecture using Python as the main programming language.

We used Scrapy to build the actual scraper, Selenium to automate the browser processes, and Frontera to implement the crawl frontier.

Every process was containerized and managed with Kubernetes, running on AWS EKS. The solution was deployed using multiple AWS services, including EKS, ECR, EC2, Elasticsearch Service, a managed Redis service, RDS, and S3, among others.
A master-slave architecture was implemented so that every day the master process would spin up several EC2 machines, each running several slave spider processes. Each spider scraped several dealerships, always respecting each site's robots.txt file. When the processes finished, the master process would stop the idle instances to avoid extra costs.
Spiders saved all the scraped data to an RDS database that we later dumped to S3 files (as a backup).
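As a rough illustration of the two ideas above (the master handing each spider a batch of dealerships, and spiders honoring robots.txt), here is a minimal standard-library sketch. It is not the project's actual code: the function names, the `tal-bot` user agent, and the example URLs are all hypothetical, and the real system relied on Scrapy and Frontera rather than these helpers.

```python
from urllib.robotparser import RobotFileParser

def partition(dealerships, n_spiders):
    """Split the dealership list into n_spiders round-robin batches."""
    return [dealerships[i::n_spiders] for i in range(n_spiders)]

def is_allowed(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # Hypothetical dealer sites, split across 3 spider processes
    sites = [f"https://dealer{i}.example" for i in range(7)]
    for batch in partition(sites, 3):
        print(batch)

    robots = "User-agent: *\nDisallow: /admin/"
    print(is_allowed(robots, "tal-bot", "https://dealer0.example/inventory/used"))
    print(is_allowed(robots, "tal-bot", "https://dealer0.example/admin/login"))
```

Round-robin batching keeps the batches within one item of each other in size, which is a simple way to keep instances busy for a similar amount of time before the master stops them.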

Finally, we used the S3 dumps to perform an ETL process to load the scraped data into an Elasticsearch search engine. Using the processed data from Elasticsearch, we exposed all the important information through an API that customers could easily query.
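The transform step of such an ETL process could look roughly like the sketch below, which turns raw scraped rows into the newline-delimited JSON body that the Elasticsearch bulk API accepts (an action line followed by the document). The field names (`vin`, `make`, `model`, `state`, `price`), the index name, and the use of the VIN as document ID are assumptions made for illustration; the original schema is not described here.

```python
import json

def to_bulk_body(rows, index_name):
    """Build an Elasticsearch bulk-API body (NDJSON) from raw scraped rows."""
    lines = []
    for row in rows:
        # Normalize the raw values before indexing (assumed field names)
        doc = {
            "vin": row["vin"].strip().upper(),
            "make": row["make"].strip().lower(),
            "model": row["model"].strip().lower(),
            "state": row["state"].strip().upper(),
            "price_usd": float(row["price"]),
        }
        # Action line: index the document, using the VIN as its ID
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["vin"]}}))
        lines.append(json.dumps(doc))
    # The bulk API requires a trailing newline
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    raw = [{"vin": " 1hgcm82633a004352 ", "make": "Honda", "model": "Accord",
            "state": "tx", "price": "7500"}]
    print(to_bulk_body(raw, "vehicles"))
```

Using a stable ID per vehicle means re-running the load for the same dump overwrites documents instead of duplicating them.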

The Appraisal Lane currently has billions of vehicle data points generated by the scraping process, and they can query them to gain real-time insight into the market data.
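For example, a market question such as "what is the average asking price for a given model in a given state?" maps naturally onto an Elasticsearch filtered aggregation. The sketch below only builds the query body; the field names are hypothetical and assumed to be keyword fields, and the actual customer-facing API is not documented here.

```python
import json

def avg_price_query(state, make, model):
    """Build an Elasticsearch query body for the average price of one model in one state."""
    return {
        "size": 0,  # return only the aggregation, no individual documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"state": state}},
                    {"term": {"make": make}},
                    {"term": {"model": model}},
                ]
            }
        },
        "aggs": {"avg_price": {"avg": {"field": "price_usd"}}},
    }

if __name__ == "__main__":
    print(json.dumps(avg_price_query("TX", "honda", "accord"), indent=2))
```

Putting the conditions in a `filter` clause skips relevance scoring and lets Elasticsearch cache them, which helps keep queries over large indices fast.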

OUTCOME

We created an efficient and scalable scraper that covers the entire US car market while staying within the daily cost constraints set by the client.

TAL was the first client for which we built a dedicated team of 5 people that augmented, and integrated with, their existing team and processes.

The ETL process we created integrated with some of the client's existing data, and we were able to process and store more than 10 million data points per day (that’s more than 3.6 billion per year!).

Users can now create custom API calls that query billions of car data points (with sub-second response times) to get real-time information about the market and improve their sales process.

The Appraisal Lane had a successful exit and was bought by Reynolds and Reynolds (R&R) for a multi-million dollar sum, which led to R&R integrating TAL’s innovative product into their existing systems.