
Can't use https proxies along with reusing the same session within a script built upon asyncio

Posted on 2020-07-13 10:52


I'm trying to use https proxies within async requests that make use of the asyncio library. When it comes to using http proxies there is a clear instruction here, but I get stuck when it comes to https proxies. Moreover, I would like to reuse the same session rather than create a new one every time I send a request.

This is what I've tried so far (the proxies used within the script are taken directly from a free proxy site, so consider them placeholders):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

async def get_text(url):
    global proxies,proxy_url
    proxy = f'http://{proxy_url}'
    print("trying using:",proxy)
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()
        except Exception:
            proxy_url = proxies.pop()
            return await get_text(url)

async def field_info(field_link):              
    text = await get_text(field_link)          
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

if __name__ == '__main__':
    proxy_url = None
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()

How can I use https proxies within the script, along with reusing the same session?


Solution


The script creates a dictionary, proxy_session_map, where the keys are proxies and the values are sessions. That way we know which proxy belongs to which session.

If an error occurs while using a proxy, that proxy is added to the disabled_proxies set so it isn't used again:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

from random import choice

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

disabled_proxies = set()

proxy_session_map = {}

async def get_text(url):
    while True:
        try:
            available_proxies = [p for p in proxies if p not in disabled_proxies]

            if available_proxies:
                proxy = choice(available_proxies)
            else:
                # no working proxies left, fall back to a direct connection
                proxy = None

            # create sessions lazily, one per proxy (plus one proxy-less session)
            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout = aiohttp.ClientTimeout(total=5))

            print("trying using:",proxy)

            async with proxy_session_map[proxy].get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()

        except Exception as e:
            if proxy:
                print("error, disabling:",proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used proxy, so return empty string
                return ''


async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    tasks = [field_info(url) for url in links]

    await asyncio.gather(
        *tasks
    )

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())

Prints (for example):

trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrap instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?

... and so on.
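
As a follow-up note, aiohttp also accepts the proxy argument on each individual request, so another way to meet the original goal of reusing one session is to share a single ClientSession and rotate the proxy per request. The sketch below is not part of the answer above: the fetch helper is a hypothetical name, the proxies are the same placeholders as earlier, and error handling is deliberately minimal.

import asyncio
from random import choice

import aiohttp

# same placeholder proxies as above - swap in working ones before running
proxies = [
    'http://89.22.210.191:41258',
    'http://93.191.100.231:3128',
]

async def fetch(session, url):
    # pick a proxy for this request only; the session itself is shared
    proxy = choice(proxies)
    async with session.get(url, proxy=proxy, ssl=False) as resp:
        return await resp.text()

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    # one session reused by every request, closed once at the end
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=5)) as session:
        results = await asyncio.gather(*(fetch(session, url) for url in links), return_exceptions=True)
        for url, result in zip(links, results):
            if isinstance(result, Exception):
                print("request failed:", url, result)
            else:
                print("fetched", len(result), "characters from", url)

if __name__ == '__main__':
    asyncio.run(main())

The trade-off is that a failing proxy here simply fails the request; the disabled_proxies bookkeeping from the answer above would still be needed if you want automatic retries with a different proxy.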


