TL;DR
When you crawl the web to collect data, you should set a User Agent that identifies you, or one that hides the tool you are using. Here you can find how to set the User Agent in Python Requests, Scrapy, and Selenium.
What is the User Agent?
A User Agent is a string of text that is sent by a web browser – or by another tool making web requests, such as Scrapy – to a server as part of a web request. It identifies the browser and its version, operating system, and device information to the server.
The User Agent allows the server to determine which features and technologies the client browser supports and adjust the content accordingly.
The server may also use the User Agent to decide whether to decline the request, especially if the website owner uses automated tools such as Cloudflare Bot Management to prevent scraping.
Therefore, it is quite important to set your User Agent properly if you want to access the web in an automated way. We go over three of the tools that you could use to automatically crawl the web with Python, and explain how to set the User Agent in each of them.
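To make the server side of this concrete, here is a minimal sketch that uses only Python's standard library (urllib and http.server, which are not among the three tools covered in this post). It starts a toy local server that declines requests carrying urllib's default User Agent and accepts anything else. The class name UserAgentGate, the helper fetch_status, and the blocking rule are assumptions made up for illustration; real bot-detection services are far more elaborate.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class UserAgentGate(BaseHTTPRequestHandler):
    """Toy server: declines requests that carry urllib's default User Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        blocked = agent == "" or agent.lower().startswith("python-urllib")
        self.send_response(403 if blocked else 200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(agent.encode("utf-8"))  # echo what the server saw

    def log_message(self, *args):
        pass  # keep the example output quiet


# Ignore any system proxy so the request really goes to the local server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(url, user_agent=None):
    """Return the HTTP status of a GET, optionally with a custom User Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener.open(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


server = HTTPServer(("127.0.0.1", 0), UserAgentGate)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}/"

default_status = fetch_status(base_url)  # sends "Python-urllib/3.x"
custom_status = fetch_status(base_url, "Anything you want 1.0")

server.shutdown()
server.server_close()
print(default_status, custom_status)
```

The only difference between the two requests is the User Agent header, which is the same lever the sections below pull in Requests, Scrapy, and Selenium.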
Setting User Agent in Requests
Requests is a basic but powerful workhorse for getting data from the web with Python. You can set the User Agent for each single call:
import requests
url = 'https://randomds.com/'
headers = {
    'User-Agent': 'Anything you want 1.0',
    'From': 'youremail@gmail.com'
}
response = requests.get(url, headers=headers)
The From field is useful if you want to leave a contact address for the administrators of the websites you crawl. You can also set the User Agent for multiple calls using a session:
s = requests.Session()
s.headers.update(headers)
response = s.get(url)
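If you want to check which headers Requests would send without actually hitting the site, one option is to prepare the request and inspect it. The following is a small sketch under that assumption; it reuses the url and headers values from above and does not open any network connection.

```python
import requests

url = "https://randomds.com/"
headers = {
    "User-Agent": "Anything you want 1.0",
    "From": "youremail@gmail.com",
}

# Per-call headers: build and prepare the request without sending it.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])

# Session headers: a new session starts with the library's default User Agent
# (something like "python-requests/2.x") until you override it.
session = requests.Session()
default_agent = session.headers["User-Agent"]
session.headers.update(headers)

# Every request prepared through the session now carries the merged headers.
merged = session.prepare_request(requests.Request("GET", url))
print(default_agent, "->", merged.headers["User-Agent"])
```

The default value is worth noticing: unless you override it, every call announces that it comes from python-requests.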
Setting User Agent with Scrapy
Scrapy is a comprehensive framework to extract data from the web. To set your User Agent, locate the settings.py file in your Scrapy project, then uncomment and edit the USER_AGENT value:
USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
Setting the User Agent in Selenium
Selenium is the last resort of web scrapers and the workbench of people testing websites. If you want to set your User Agent using the Chrome bindings, you can do as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=Your Favorite agent")
driver = webdriver.Chrome(options=opts)
User Agent databases
If you are short on ideas, you can find plenty of ready-made User Agents to borrow. A simple Google search will turn up many; for your convenience, here are two resources: WhatIsMyBrowser and User Agents.
That’s it! Now you can choose your favorite identity when you crawl the web. I will publish further tutorials on web scraping; you can find them at this link (also listed below).
Related links
- More posts about scraping link
- Cloudflare Bot Management link
- Python Requests link
- Python Scrapy link
- Python Selenium link
- WhatIsMyBrowser User Agent Database link
- User Agents database link