Outsourced vs. In-House: The Great Web Scraping Proxy Debate

On the subject of web scraping, there are a few key metrics that determine how scalable and robust your project is. Among those are the tool’s extraction speed, the number of concurrent requests you can make, and the size of your proxy pool.

In this article, we’ll focus on the last of these and dive into proxy infrastructure management, to help you find the kind of setup that keeps your scraping effective without wasting money.

It may sound like a rather dry subject, but it’s still an important one since it affects your bottom line.

Before we get started, here’s a friendly piece of advice: it’s worth getting a resilient and efficient proxy infrastructure even if your project isn’t a big one. It’s better to spend a bit more on superior measures than to cheap out and end up with a scraping project that can’t get you any valuable data.

· The must-have proxy features
· Building your own in-house solution
When should you choose this option?
How to do it?
· Using a pre-built proxy management product
When should you choose this option?
How to do it?

The must-have proxy features

Web scrapers are built to solve practical problems, so let’s start with those and work our way towards solutions.

First of all, you can’t expect to scrape too many pages without attracting some attention, and once you do, bye-bye website because it’ll block you. The first problem proxies solve is that they take the heat for you: the website sees, and eventually blocks, the proxy’s IP instead of yours. This way, your IP stays squeaky clean.

Second, if a proxy gets blocked, your number of options goes down. That can’t be outright solved, but with enough IPs, it’s more of a drop in the bucket than a water balloon to the face. The number of proxies you need has less to do with how many websites you scrape and more to do with how many pages on the same website you’re targeting, since that’s when blocking becomes a risk. Either way, you’ll need a proxy pool large enough to accommodate the project.
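To make the sizing idea concrete, here’s a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `estimate_pool_size`, the “safe” number of requests a single IP can send to one website per hour, and the 25% of spare IPs kept in reserve. Real limits vary from site to site, so treat the result as a starting point rather than a rule.

```python
import math


def estimate_pool_size(requests_per_hour, safe_requests_per_ip_per_hour, spare_ratio=0.25):
    """Estimate how many proxies a single target website needs.

    All inputs are assumptions you supply yourself; the safe per-IP rate
    in particular differs from one website to another.
    """
    if safe_requests_per_ip_per_hour <= 0:
        raise ValueError("safe_requests_per_ip_per_hour must be positive")
    # Minimum number of IPs so that no single one exceeds the safe rate.
    base = math.ceil(requests_per_hour / safe_requests_per_ip_per_hour)
    # Keep some spare IPs so a few blocked proxies don't stall the project.
    return base + math.ceil(base * spare_ratio)


# Hypothetical numbers: 6,000 pages per hour from one site, 300 per IP.
print(estimate_pool_size(6000, 300))
```

Notice that the estimate is per website: scraping ten different sites lightly needs far fewer IPs than hammering one site with the same total number of requests.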

Third, you need to consider how you’ll manage these IPs. Throwing everything at a webpage until something sticks might work, but it’s not efficient. Websites can have many built-in countermeasures, so you’ll need to get crafty too. The solution is an easy way of rotating proxies between your requests and adding random delays. This way, you can better disguise the scraping robot as a bunch of regular visitors.
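Rotation plus random delays could look something like the minimal sketch below, which uses only Python’s standard library. The proxy addresses are placeholders from a documentation-only IP range, and the helper names (`proxy_cycle`, `random_delay`, `fetch`, `scrape`) and the 1 to 4 second delay window are our own assumptions, not a recommendation for any particular website.

```python
import itertools
import random
import time
import urllib.request

# Hypothetical proxy addresses (documentation-only range); use your own pool.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_cycle(proxies):
    """Return an endless round-robin iterator over a shuffled copy of the pool."""
    shuffled = list(proxies)
    random.shuffle(shuffled)
    return itertools.cycle(shuffled)


def random_delay(min_s=1.0, max_s=4.0):
    """Pick a pause length so requests don't arrive at a fixed rhythm."""
    return random.uniform(min_s, max_s)


def fetch(url, proxy, timeout=10):
    """Fetch one URL through the given proxy and return the raw body."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def scrape(urls, proxies, fetcher=fetch, sleeper=time.sleep):
    """Fetch each URL through the next proxy, pausing randomly after each one."""
    rotation = proxy_cycle(proxies)
    results = {}
    for url in urls:
        proxy = next(rotation)
        try:
            results[url] = fetcher(url, proxy)
        except OSError as error:  # urllib's URLError is a subclass of OSError
            results[url] = error
        sleeper(random_delay())
    return results
```

Calling `scrape(["https://example.com/page1", "https://example.com/page2"], PROXIES)` sends each request through a different IP. The `fetcher` and `sleeper` parameters are there so you can swap in your own HTTP client, or skip the waiting while testing.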

Lastly, there’s the issue of proxy quality and their location. Some websites restrict their content based on geographical regions, while others can detect datacenter IPs and realize they’re proxies. The answer to both these problems is the use of residential proxies. These are the closest thing to real users, with their own location and ISP. Get a network of residential IPs from around the globe, and you’ll be able to access just about any content while staying incognito.
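If your pool mixes datacenter and residential IPs from different countries, picking the right ones for a given website can be as simple as a filter. The sketch below is only an illustration: the `Proxy` record, its fields, and the sample addresses are hypothetical, and real providers expose location and proxy type in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    url: str
    country: str       # two-letter country code, e.g. "US"
    residential: bool  # False means a datacenter IP


def pick_proxies(pool, country=None, residential_only=False):
    """Return the proxies that match the requested country and type."""
    wanted = country.upper() if country else None
    return [
        proxy
        for proxy in pool
        if (wanted is None or proxy.country == wanted)
        and (proxy.residential or not residential_only)
    ]


# Hypothetical pool; the addresses come from documentation-only ranges.
POOL = [
    Proxy("http://203.0.113.10:8080", "US", residential=True),
    Proxy("http://203.0.113.11:8080", "US", residential=False),
    Proxy("http://198.51.100.20:8080", "DE", residential=True),
]

# Residential IPs located in Germany, for geo-restricted German content.
german_residential = pick_proxies(POOL, country="de", residential_only=True)
```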

In short, you need a good number of quality proxies and the means to use them to their full potential. Easy as pie.

Building your own in-house solution

We’ll be up-front about this: creating your own proxy network takes time and quite a bit of work. Don’t expect to compete with existing proxy vendors after a few weekends of coding. Once you do build your solution, though, you have absolute control over it.

When should you choose this option?

If you’re only flirting with the idea of web scraping or have a small project (hundreds of requests), you won’t need something too complex. So creating your own infrastructure is totally an option. An equally valid option is to get the free version of a web scraping tool, just so you know.

Besides the low volume of requests, you also have to consider the types of pages you’re after. Plenty of websites are pretty lax about web scraping, and for those, the IPs you gather should be enough. But if you’re looking for data from big sites, like Amazon, LinkedIn or Google, a home-brewed proxy pool just won’t cut it.

How to do it?

Setting up datacenter proxies is fairly straightforward. You rent some servers in the cloud (we’re happy with Amazon Web Services), you configure your proxy server (HAProxy helps with that), you get some IPs from the cloud provider, and you’re set.

Residential proxies, though, are much harder to come by. You’d have to find real people with real devices who are willing to offer you temporary access to their internet connection. That implies installing certain software on their computers and, as you’d imagine, there isn’t a long line for that, unless you’re paying. It’s a lot of work, and it’s not worth it for a small scraping project. In fact, if you want more than just a few residential IPs, the best thing is to buy them. It’s a lot less effort.

Alternatively, you can try to get them for free. There are several free proxy options out there, but you get what you pay for with these IPs. You can expect plenty of IPs to be already blocked, unresponsive, or painfully slow. There’s also the threat of someone offering their IP with malicious intent, trying to infect users with malware. Always be careful with free proxies.
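If you do experiment with free proxies, it helps to weed out the dead and slow ones before they reach your scraper. Here’s a minimal sketch of such a check using Python’s standard library. The function names (`probe`, `usable_proxies`), the test URL, and the 2-second latency threshold are assumptions for illustration. Keep in mind that this only measures responsiveness; it can’t tell you whether a proxy is malicious or already blocked by the website you’re targeting.

```python
import time
import urllib.request


def probe(proxy, test_url="http://example.com/", timeout=5):
    """Return the response time in seconds, or None if the proxy fails."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    started = time.monotonic()
    try:
        with opener.open(test_url, timeout=timeout) as response:
            if response.status != 200:
                return None
    except OSError:  # covers timeouts, refused connections and HTTP errors
        return None
    return time.monotonic() - started


def usable_proxies(candidates, max_latency=2.0, prober=probe):
    """Keep proxies that answer within max_latency seconds, fastest first."""
    timed = []
    for proxy in candidates:
        latency = prober(proxy)
        if latency is not None and latency <= max_latency:
            timed.append((latency, proxy))
    return [proxy for latency, proxy in sorted(timed)]
```

Running `usable_proxies(my_free_list)` (where `my_free_list` is whatever list of proxy URLs you’ve collected) before each scraping session gives you a shorter but more dependable pool. Free IPs tend to go stale quickly, so it’s worth re-checking often.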

Using a pre-built proxy management product

In business, time means money, and if your time is the scarcer of the two, building your own system may end up more costly than just buying one. Getting an existing product means paying up front, so what matters most is that you get the functionality and stability you’ll need for the scraping project.

When should you choose this option?

As we said, if time is tight, or you don’t have the human resources to dedicate to creating a proxy management system, it’s easier to buy one. It would cut weeks of coding down to one or two afternoons spent getting familiar with the software and learning how to use it.

Additionally, professionally made proxy tools come with the features and IP pool necessary for large-scale projects. So, if you’re looking to send thousands upon thousands of requests, gather data from advanced websites, or extend the project, pre-built products tend to be more robust and scalable.

A point worth remembering is that you don’t necessarily have to pay for millions of requests. Plenty of web scraping APIs come with incorporated proxies and all you have to do is choose a plan that fits your needs.

How to do it?

Luckily, proxy management products are meant to make your job easier, so the hardest part is just choosing the right solution. If you’re planning to create your own web scraper, all you need are IPs, which aren’t hard to find.

Alternatively, if you’re also considering a pre-built data extraction tool, it’s best to look for one that has its own proxies. That would mean using one piece of software instead of two, which spares you the task of integrating them and keeps things efficient.

Here’s a list of proxy providers specifically selected for web scraping projects. Consider each option, its features, and pricing.

We hope that this article has given you some things to think about and some useful information. A topic that deserves its own article is the importance of the proxy pool and its composition. We’d recommend you read that too, so you’ll get an even better picture of how proxies fit into your web scraping endeavors.


