The goal of every web scraper is to not stand out, and instead do everything you can to blend into a website's normal traffic.
However, not being flagged as a scraper is getting harder and harder as anti-bot technologies become ever more sophisticated and more widely used.
Today, you can be detected by:
- IP Address
- TLS or TCP/IP fingerprint
- HTTP headers (values, order, and casing)
- Browser fingerprints
- Cookies/Sessions
In this guide we're going to share some of the common ways websites detect you as a scraper, and how to optimise your scrapers so that you can blend into a website's normal traffic and not get blocked.
- Header Optimisation
- Browser Fingerprinting
- TLS Fingerprinting
- Request Profiling
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Proxy Manager
Scraper Monitoring
Header Optimisation
The first step every developer should take when making their scrapers production ready is optimising the headers they send with their requests, so they don't get blocked.
The headers you use are one of the easiest ways for a website to detect that you are a scraper and not a real user.
You need to make sure the headers you send are like the headers a real web browser would send, and that they are consistent with the identity you are trying to present.
In our header optimisation guide, we go through in detail how you should optimise your headers when scraping, however, here are the main points:
1. Use Real Web Browser Headers
By default, most HTTP libraries (Python Requests, Scrapy, NodeJs Axios, etc.) either don't attach real browser headers to your requests or include headers that identify the library being used. Both immediately tell the website you are trying to scrape that you are a scraper, not a real user.
So you should send a real set of browser headers with every request, and vary them from request to request. For example, here are example headers for Chrome on a MacOS machine:
Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
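As a rough illustration, here is a minimal sketch (using the Python Requests library, with httpbin.org as a stand-in target) of attaching a browser-like header set to a request. The exact values should be rotated and kept up to date with current browser versions.

```python
import requests

# Example Chrome-on-macOS header set (rotate these and keep them current).
BROWSER_HEADERS = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"macOS"',
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}

# httpbin.org/headers echoes back the headers it received, which is handy for checking.
response = requests.get("https://httpbin.org/headers", headers=BROWSER_HEADERS)
print(response.json())
```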
2. Pay Attention To Header Order
Most web browsers attach headers in a certain order that doesn't change, however, a lot of HTTP clients either use their own header ordering or randomise the order, making web scrapers very easy to identify.
Some HTTP clients, like the popular Python Requests library, do not respect the header order you define in your request (see issue 5814), making it easier for websites to detect requests that use an unnatural header order.
To combat this you should use an HTTP client that respects the header order you define, so you can match it exactly to how a browser would send the headers. In the case of Python, the httpx library does respect header order, so it is a good alternative to Python Requests.
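Building on that, here is a minimal sketch (assuming the httpx package is installed) of defining the headers in the same order a real Chrome browser would send them, so the client can pass them through in that order:

```python
import httpx

# Headers listed in the order Chrome sends them; the ordering defined here is
# what we want the client to pass through on the wire.
ordered_headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}

with httpx.Client(headers=ordered_headers) as client:
    # httpbin.org/headers echoes the received headers back for inspection.
    response = client.get("https://httpbin.org/headers")
    print(response.status_code)
```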
3. Optimise Headers For Specific Websites
Depending on what website you are scraping, using specific header combinations may increase your scraping performance.
Sometimes websites require you to include specific headers when accessing lower-level pages on the site, or certain header combinations may increase your success rates (for example, setting the referer header to facebook.com versus google.com).
A big part of the reason why proxy services like ScraperAPI, Scrapingbee, or ScrapingAnt can get much higher performance out of data center proxies than you can is that they are much better at managing headers and have systems in place to constantly split test various header combinations to maximise performance.
Takeaways
- Use real browser header and user-agent combinations with every request.
- Make sure you use an HTTP client that respects header ordering so the header order looks natural.
- Optimise your headers for the specific websites you are scraping.
IP Addresses & Proxies
If you want to scrape a website quickly or at scale (more than a couple of thousand requests per day) then the next bottleneck you will run into is websites determining that you are a scraper from the IP address you are using.
A real human user will rarely request more than 5 pages per second from the same website.
However, a scraper making concurrent requests certainly can, and it is pretty obvious to the website that those requests aren't coming from a real user.
This means that if you want to disguise your scraper then you will have to start using proxies.
You have numerous options when it comes to picking a proxy solution for your scrapers, from using:
- Free proxy lists
- Data center proxies
- Residential or mobile proxies
- Proxy APIs
- Building your own proxy network
All of them have their own pros and cons, from better performance to lower or higher costs, but you will likely need a proxy solution at some point if you want to prevent your scrapers from getting blocked.
Depending on which option you go with, you will also need to build a system to manage your proxy infrastructure, including proxy selection, rotation, blacklisting, unblocking, etc.
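As a rough illustration of that management layer, here is a minimal sketch (the proxy URLs are placeholders, and the rotation/blacklisting logic is deliberately simplified) of rotating requests across a small proxy pool and dropping proxies that keep failing:

```python
import random
import requests

# Placeholder proxy endpoints; in practice these come from your proxy provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
failures = {proxy: 0 for proxy in PROXY_POOL}
MAX_FAILURES = 3  # blacklist a proxy after this many consecutive failures


def fetch(url):
    """Try the URL through random healthy proxies, blacklisting ones that keep failing."""
    healthy = [p for p in PROXY_POOL if failures[p] < MAX_FAILURES]
    random.shuffle(healthy)
    for proxy in healthy:
        try:
            response = requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
            if response.status_code == 200:
                failures[proxy] = 0  # reset the counter on success
                return response
            failures[proxy] += 1  # blocked or error page, count against the proxy
        except requests.RequestException:
            failures[proxy] += 1
    return None  # every proxy in the pool is currently blacklisted or failing
```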
Picking the best proxy provider for your particular use case is quite a big topic, so if you want more info then check out our proxy guides and tools:
- Proxy Pool Optimization Guide
- How To Pick The Best Proxy Solution For Your Use Case
- Proxy Provider Comparison Tool
Takeaways
Check out our Proxy Pool Optimization Guide for more information, but here are some of the key things to remember:
- If you are scraping protected websites, or scraping at scale, then it is very likely you will need to use proxies to disguise your scraper.
- The type of proxy you use (datacenter, ISP, residential, mobile, etc.) can have a big effect on your performance.
- You need a diversified pool of proxies (proxies from different subnets, not just different IP addresses).
- You need to put a system in place to manage your proxies, otherwise you will get poor performance and ultimately burn out the proxy pool.
Request Profiling
A common mistake we see developers make is sending requests that make it obvious to the target website that they come from a web scraper and not a real user.
Sometimes two users can be scraping the same website using the exact same header and IP systems, yet one gets consistently blocked while the other scrapes without any issues, solely because of how they structure their requests.
One user's requests are believable as a real user's, whereas the other's are obviously a scraper's.
Here are some of the common mistakes that can quickly give you away as a scraper:
1. Unrealistic URLs
The URL you use to make the request can often give you away as a scraper. The question you should be asking yourself is: are you sending requests to URLs that a real user would never use?
A common example of this is when scraping e-commerce sites. To keep their scrapers as simple as possible, a lot of developers will design their scrapers to request URLs built from a product's ASIN or ID number. For example: https://www.shop.com/product/[product_id]
These URLs can work, however, if you browse the website as a real user you will rarely see it format URLs like this, making it much easier for the website to detect you as a scraper.
Instead, you should make your URLs look like the URLs a real user would request. For example, https://www.walmart.com/ip/Surface-Bassu-Moisture-Conditioner-2-oz-Pack-of-2/643905888 will have a much higher success rate than https://www.walmart.com/ip/643905888, even though both URLs point to the same product.
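As a small illustration, here is a minimal sketch (the slug logic is simplified and purely illustrative) of building the user-style URL from the product name instead of requesting the bare ID URL:

```python
import re


def product_url(name, product_id):
    """Build a Walmart-style product URL from the product name and ID."""
    # Turn "Surface Bassu Moisture Conditioner 2 oz Pack of 2" into a URL slug.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-")
    return f"https://www.walmart.com/ip/{slug}/{product_id}"


print(product_url("Surface Bassu Moisture Conditioner 2 oz Pack of 2", "643905888"))
# https://www.walmart.com/ip/Surface-Bassu-Moisture-Conditioner-2-oz-Pack-of-2/643905888
```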
2. Request Patterns
The pattern in which you make requests can also be an easy giveaway that you are a scraper and not a real user.
For example, if you want to scrape all the products in a category from an e-commerce store, and you start with page 1 and product 1, then scrape the entire category in sequential order at a constant rate, it is highly likely the requests will look automated.
Instead of scraping every page and product in order, you should randomly scrape different pages/products and vary the interval between your requests (from seconds to minutes).
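Here is a minimal sketch (the URLs and delay range are illustrative) of shuffling the pages you scrape and adding a random delay between requests:

```python
import random
import time

import requests

# Illustrative category pages; in a real scraper these would be discovered from the site.
page_urls = [f"https://www.shop.com/category/widgets?page={n}" for n in range(1, 21)]

# Scrape pages in a random order rather than sequentially.
random.shuffle(page_urls)

for url in page_urls:
    response = requests.get(url)
    # ... parse the response here ...
    # Wait a random interval between requests so the timing isn't constant.
    time.sleep(random.uniform(2, 30))
```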
3. Location
A big giveaway for websites with a very specific geographic focus is making requests from a location that normal users would never use.
If you are scraping a South American real estate platform but you are making all your requests through Russian proxies, it will show up very quickly in the website's analytics that there is suspicious traffic coming from Russia. So they might decide to show more CAPTCHAs to Russian traffic in future, or block it completely.
Takeaways
- Use believable URLs that normal users would request when scraping a website, not the shortest/simplest ones to code into your scrapers.
- Randomise how you scrape a site and the intervals between your requests to make it less obvious the requests are automated.
- Make requests from locations where the website's real users actually live, not a country on the other side of the world.
Browser Fingerprinting
Increasingly, to combat websites using anti-bot technologies, lots of developers are turning to headless browsers like Puppeteer, Playwright, or Selenium to avoid getting blocked when scraping a website.
Using a headless browser does make your requests seem more like a real user's than using an HTTP client, however, headless browsers aren't a magic bullet and they open up a Pandora's box of ways for websites to test whether you are a scraper or a real user.
Modern anti-bot technologies use browser fingerprinting, can detect browser automation leaks, and integrate honeytraps and other challenges into the page that your scrapers could fail.
Here are some of the major issues:
1. Fixing Browser Leaks
Browsers expose information about themselves in the JavaScript execution context, which the website's own scripts can query to verify that the browser belongs to a real user and not a bot.
By default, most headless browsers leak information that tells the website that the browser is automated and not a real user. To avoid being blocked you need to patch these leaks and fortify your browser fingerprint so that your scraper isn't detected.
In our guide to fortifying your headless browser we go through in detail what some of these fingerprint leaks are, and how to patch them.
However, when you are using a headless browser for web scraping you should always use the stealth versions, as they often have the most common leaks fixed.
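For example, here is a minimal sketch (assuming the selenium-stealth package is installed; the fingerprint values are illustrative) of launching a Selenium Chrome session with some of the most common automation leaks patched:

```python
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without a visible window
driver = webdriver.Chrome(options=options)

# Patch common fingerprint leaks (navigator.webdriver, WebGL vendor, etc.).
# In production, make sure these values match the user-agent and OS you present.
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://example.com")
print(driver.title)
driver.quit()
```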
2. Consistent Identity
A common issue that many developers overlook is making sure that the identity you present through your headers, user-agent, browser, server, and proxy is consistent, with every part matching the others.
For example, if you are using a headless browser, you need to make sure the user-agent string you define matches the browser version you are actually using; similarly, your scraper shouldn't run on a Linux machine while its user-agent claims to be a Windows machine.
Inconsistencies like this won't happen for real users visiting a website, so if you don't present a consistent identity with every request then you are likely to get blocked.
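As a rough illustration, here is a minimal sketch (the profile values and proxy endpoints are placeholders) of picking one coherent profile per session so the user-agent, client hints, language, and proxy location never contradict each other:

```python
import random

import requests

# Each profile bundles values that belong together; never mix values across profiles.
PROFILES = [
    {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
        "sec-ch-ua-platform": '"macOS"',
        "accept-language": "en-US,en;q=0.9",
        "proxy": "http://us-proxy.example.com:8000",  # placeholder US proxy
    },
    {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
        "sec-ch-ua-platform": '"Windows"',
        "accept-language": "en-GB,en;q=0.9",
        "proxy": "http://uk-proxy.example.com:8000",  # placeholder UK proxy
    },
]

profile = random.choice(PROFILES)  # one identity for the whole session
session = requests.Session()
session.headers.update({
    "User-Agent": profile["user-agent"],
    "sec-ch-ua-platform": profile["sec-ch-ua-platform"],
    "Accept-Language": profile["accept-language"],
})
session.proxies = {"http": profile["proxy"], "https": profile["proxy"]}

# httpbin.org/headers echoes back what the site would see.
response = session.get("https://httpbin.org/headers")
print(response.json())
```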
Takeaways
When you start using headless browsers for web scraping you can get a performance boost, however, websites can still detect you, and the number of ways they can detect you is truly massive. Check out our guide to fortifying your headless browser, which goes into much more detail on the topic.
- Always use the stealth version of your automation library, be it puppeteer-stealth, playwright-stealth or selenium-stealth.
- Make sure the identity you present with each request is consistent across the headers, user-agent, browser, server and proxies you use.
FAQs
How do you not get blocked while scraping?
- IP Rotation
- Set a Real User Agent
- Set Other Request Headers
- Set Random Intervals In Between Your Requests
- Set a Referrer
- Use a Headless Browser
- Avoid Honeypot Traps
- Detect Website Changes
Web pages detect web crawlers and web scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If the website finds it suspicious, you receive CAPTCHAs and then eventually your requests get blocked since your crawler is detected.
How do I hide my IP when scraping a website?
Use IP rotation: send your requests through a series of different IP addresses using proxy servers or a virtual private network so your real IP stays hidden. That way you will be able to scrape most sites without an issue.
Where proxies provide a layer of protection by masking the IP address of your web scraper, a VPN also masks the data that flows between your scraper and the target site through an encrypted tunnel. This will make the content that you are scraping invisible to ISPs and anyone else with access to your network.
Why does Google block scraping?
IP blocks. When you scrape data with a bot, Google will block your IP address from any further scraping. This is because when you send multiple requests from the same IP address, the target website recognises the activity and bans you.
Does Google Search allow web scraping?
Yes, you can scrape Google search results (SERPs) by using a Google search scraper tool.
Does a VPN hide your IP from websites?
VPNs can hide your search history and other browsing activity, such as search terms, links clicked, and websites visited, as well as mask your IP address.
Does a VPN hide your source IP?
A VPN hides your IP address and encrypts your online activity for maximum privacy and security. It does this by connecting you to an encrypted, private VPN server instead of the ones owned by your ISP. This means your activity can't be tracked, stored, or mishandled by third parties.
Will private browsing hide my IP?
No. Incognito mode doesn't hide your info from websites, advertisers, your Internet Service Provider (ISP), or Big Tech companies. Even in Incognito mode, Google and others can still track you, and Incognito does not hide your IP address.
Is it ethical to web scrape a website?
All your data scraping efforts must be ethical. A few approaches to keep the web scraping process completely transparent and ethical: use a public API when available, and avoid scraping altogether if the data you're looking for is available through the API.
Is a proxy or a VPN better for scraping?
Cost is one factor: proxies can be free or low-cost, while VPNs can be a bit more expensive. This makes proxies a better option for tasks like web scraping, where you might want to source thousands or millions of different IPs for making automated requests.
Does Bing block scraping?
You may find it difficult to constantly scrape Bing search results to gather data for your research, analysis, or SEO campaigns, as you will get blocked quickly by their bot detection algorithm.
How does IP rotation help you avoid blocks?
IP rotation lets your web scraper use a different IP every time it makes a request to a website. This way, even if the website is blocking some of the IPs your web scraper is using, your scraper will be able to rotate to new IPs and avoid the blocks.
How do you know if you can scrape a website?
To check whether a website permits web scraping, append "/robots.txt" to the end of the website's URL and review the rules listed there. Always be aware of copyright and read up on fair use.
How many requests per second can you send when web scraping?
A safe figure would be around 1 request per second per IP. How does this affect your overall crawling speed? If you need to perform 10 million requests and spread them across 1,000 data center IPs at 1 request per second each, the job would take approximately 3 hours (10,000 requests per IP, or roughly 10,000 seconds).
Can websites use fingerprinting to detect web scraping?
Yes. Anti-bot tools such as F5's Application Security Manager (ASM) can identify web scraping attacks on the websites they protect by using information gathered about clients through fingerprinting or persistent identification.
- Reddit.