RE: Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1


in utopian-io •  7 years ago 

Great post and well explained tutorial

As an addition to the web crawling tutorial: some websites will block you from scraping them when they detect that you are using a crawler. By default, Python's requests module sends each HTTP request with the user agent "python-requests/{version_of_module}", which makes a crawler easy to spot. To get around this, send the request with a custom user agent: "requests.get(URL, headers={'user-agent': 'YOUR USER AGENT'})"

The user agent is a string that identifies your browser and the operating system you are using.
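
To make the idea concrete, here is a minimal sketch rather than a definitive recipe. The URL, the user agent string, and the helper name build_request are placeholders I made up for illustration. The code only prepares the request so you can inspect the header that would be sent; it does not touch the network.

```python
import requests

# Hypothetical values for illustration only.
URL = "https://example.com/"
CUSTOM_USER_AGENT = "my-crawler/0.1 (contact: admin@example.com)"


def build_request(url, user_agent):
    """Prepare a GET request carrying a custom User-Agent header (no network I/O)."""
    request = requests.Request("GET", url, headers={"User-Agent": user_agent})
    return request.prepare()


# The user agent requests uses when you do not supply one yourself.
default_agent = requests.utils.default_user_agent()

prepared = build_request(URL, CUSTOM_USER_AGENT)

if __name__ == "__main__":
    print("default:", default_agent)
    print("custom: ", prepared.headers["User-Agent"])
    # To actually send it: requests.Session().send(prepared, timeout=10)
```

Header names are case-insensitive, so 'user-agent' and 'User-Agent' behave the same here.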


I haven't (deliberately!) explained how to use all the Requests attributes and methods, including setting a user agent. But please know that I am well aware of the (temporary) blocking some web servers apply when they detect crawls coming from a certain IP. Please also know that in those types of cases, setting a self-defined user agent on its own adds zero value: you will get blocked nonetheless.

There are several workarounds for that (e.g. block detection combined with using a multitude of VPNs and/or Onion IPs, and/or even IP spoofing). But since I'm an ethical person, in situations such as those, I ask myself "would it be OK if I used those techniques on this webserver?" And my answer to that is: "No, let's look somewhere else for the data I need, or contact the web admin of that webserver to discuss if they are willing to voluntarily provide me with the data I'm looking for."

As a rule of thumb I always try to treat people like I'd like them to treat me. That's not always possible if the other party thinks and feels differently, but in such cases I prefer to interact with people that do feel the same as I do.

Google's motto is (was?) "Don't be evil". Personally, I take it one step further: "Be Me, always, which equals: Be a Good Person".

@scipio