How to Build a Proxy IP Pool for Web Crawlers (Proxy IP Implementation Process)

in spider •  2 years ago 

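When scraping with a crawler, we often run into anti-crawling measures that block by IP address. With a large supply of usable proxy IPs, that problem largely goes away.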

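I previously tried building a proxy pool by scraping free proxy IPs from the web, but free IPs vary widely in quality. There are few of them, they are slow, and they expire quickly, so they cannot support fast, high-volume crawling.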

Paid proxies are of noticeably better quality. After testing several providers, I settled on 四叶天 proxy as the proxy platform.

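The 四叶天 proxy IP platform (a-2.cn) provides roughly 2 million unique short-lived, high-anonymity proxies per day, each with a lifetime of 1 to 30 minutes. With a total pool of over 2 million IPs, there are more than enough to work with. Pricing plans are flexible: billing is by IP count and duration, with daily, weekly, and monthly settlement, plus half-year and one-year plans. As long as the service meets the project's requirements and stays stable and high quality, the cost is worth paying.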

Among 四叶天's free proxy IPs, those that remain after filtering have an access success rate of roughly 98% or higher.

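First, the simplest way to judge a proxy server's stability is by its open ports. If the server has ports such as 80, 3389, 3306, or 22 open, it is running other services as well and is unlikely to go down. Servers belonging to governments or schools tend to be even more stable. Other ports may of course be open too.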
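A minimal sketch of such a port check, using only the Python standard library. The function names, the timeout, and the idea of treating the four ports above as a fixed list are assumptions made for illustration, not part of the original project.

```python
import socket

# Ports named in the text; using exactly this list is an assumption.
COMMON_PORTS = (80, 3389, 3306, 22)


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not open".
        return False


def open_ports(host, ports=COMMON_PORTS, timeout=2.0):
    """Return the ports from `ports` that accept a TCP connection."""
    return [p for p in ports if is_port_open(host, p, timeout)]
```

A proxy host for which `open_ports(host)` returns a non-empty list could then be ranked as a more stable candidate, following the heuristic above. A plain TCP connect only shows that something is listening; it does not identify the service behind the port.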

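Second, measure the server's access speed. Request several different URLs through the proxy and take the average, which gives a more reliable figure than a single request.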
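The averaging can be sketched as follows. This assumes proxies are given as `host:port` strings; `fetch_via_proxy` and `average_latency` are hypothetical helper names, and skipping failed requests rather than penalizing them is one possible choice, not necessarily what the original project does.

```python
import time
import urllib.request


def fetch_via_proxy(url, proxy, timeout=5.0):
    """Fetch url through proxy (assumed "host:port"); raises on failure."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        resp.read()


def average_latency(proxy, urls, fetch=fetch_via_proxy):
    """Mean seconds per successful request, or None if every request failed."""
    timings = []
    for url in urls:
        start = time.monotonic()
        try:
            fetch(url, proxy)
        except Exception:
            continue  # failed requests are skipped, not counted
        timings.append(time.monotonic() - start)
    return sum(timings) / len(timings) if timings else None
```

The `fetch` parameter exists so the timing logic can be exercised without a live proxy. In practice the failure count is worth tracking alongside the average, since a proxy that answers one URL quickly and fails the rest would otherwise look fast.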

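Third, the proxy IP's survival time: the longer it has lived, the more stable it is. This can only be calculated once your crawler is set up and running.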

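Fourth, re-check the proxy type. By visiting different http and https sites, determine whether the proxy actually works for http or for https and classify it accordingly. Use http proxies when requesting http URLs and https proxies when requesting https URLs, which raises the success rate.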
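One way this re-check might look, as a sketch. The test URLs are placeholders, the `host:port` proxy format is assumed, and `can_fetch` and `classify_proxy` are hypothetical names introduced here.

```python
import urllib.request

# Placeholder probe targets; any reliable http/https pair would do.
HTTP_TEST_URL = "http://example.com/"
HTTPS_TEST_URL = "https://example.com/"


def can_fetch(url, proxy, timeout=5.0):
    """Return True if url can be fetched through proxy (assumed "host:port")."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False


def classify_proxy(proxy, probe=can_fetch):
    """Return the set of schemes ("http", "https") the proxy handled."""
    kinds = set()
    if probe(HTTP_TEST_URL, proxy):
        kinds.add("http")
    if probe(HTTPS_TEST_URL, proxy):
        kinds.add("https")
    return kinds
```

The returned set can be used to file each proxy into an http pool, an https pool, or both. A single probe per scheme is fragile, so probing several sites per scheme, as the paragraph above suggests, would be a natural extension.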

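Based on these points, I rewrote my proxy IP pool project. It has collected 4,500+ IPs so far, of which about 1,200 are stable over the long term. That is not many, but they are very stable.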

#Proxy IP Pool Implementation Process#

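1. First, obtain the proxy server resources provided by the proxy platform

o Short-lived proxies are recommended. After purchase, get the API address from the dashboard and set parameters such as the IP whitelist.

2. Write the retrieved proxy servers into the squid configuration file

o Parse the proxy servers returned by the site and write them to /etc/squid/squid.conf according to a set of rules.

3. Reconfigure squid

o After writing the configuration file, reload the latest file; this does not interrupt service.

4. Update automatically by repeating steps 1-3

o Because the proxies provided by the site live for only 1-30 minutes (depending on the plan), a new batch of IPs has to be fetched at regular intervals.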
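Steps 2 and 3 can be sketched roughly as below. The post does not state the rules used to write the config, so this sketch assumes the API returns one `ip:port` per line and that each proxy becomes a squid `cache_peer` parent in round-robin mode. The helper names and the `base_conf` argument (the static part of your squid.conf) are hypothetical.

```python
import subprocess


def parse_proxies(text):
    """Parse "ip:port" lines (assumed API format); skip malformed lines."""
    proxies = []
    for line in text.splitlines():
        host, sep, port = line.strip().partition(":")
        if sep and host and port.isdigit():
            proxies.append((host, int(port)))
    return proxies


def render_peers(proxies):
    """Render one cache_peer line per proxy, then force all traffic via peers."""
    lines = [
        f"cache_peer {host} parent {port} 0 no-query round-robin name=proxy{i}"
        for i, (host, port) in enumerate(proxies)
    ]
    lines.append("never_direct allow all")
    return "\n".join(lines) + "\n"


def write_config(base_conf, proxies, path="/etc/squid/squid.conf"):
    """Write the static base config followed by the generated peer lines."""
    with open(path, "w") as f:
        f.write(base_conf.rstrip("\n") + "\n" + render_peers(proxies))


def reload_squid():
    """Ask the running squid to re-read its configuration file."""
    subprocess.run(["squid", "-k", "reconfigure"], check=True)
```

For step 4, a scheduler such as cron or a simple loop with `time.sleep` could fetch the API response, then call `parse_proxies`, `write_config`, and `reload_squid` at an interval shorter than the proxy lifetime of your plan. The `name=` option keeps peer names unique even when two proxies share an IP on different ports.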

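That covers how to build a proxy IP pool. As a crawler developer of many years, I recommend 四叶天 proxy IP (a-2.cn). If you have not tried it, give it a go; both its connection rate and its speed are quite good.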

