Published 2019-09-03 12:54
1. Open the O'Reilly free programming ebooks page:
http://www.oreilly.com/programming/free/
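Inspect the page in the browser's developer tools and run the following JS in the console to get the list of book download links. The snippet uses jQuery's $.map, so it relies on the page loading jQuery. It rewrites each book link by changing "free" to "free/files" and replacing everything from "csp" onward with "pdf".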
$.map($('body > article:nth-child(4) > div > section > div > a'), function(e){return e.href.replace(/free/, "free/files").replace(/csp.*/, "pdf")})
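The rewrite rule in the JS one-liner can also be reproduced outside the browser. Below is a minimal Python 3 sketch, assuming the book links on the page end in `.csp` (as the second script later in this post also assumes). `to_pdf_url` is a hypothetical helper name, not something from the original post.

```python
import re

def to_pdf_url(href):
    """Mirror the JS one-liner: 'free' -> 'free/files', then 'csp...' -> 'pdf'."""
    # count=1 matches JavaScript's non-global replace, which only
    # rewrites the first occurrence of the pattern.
    href = re.sub(r"free", "free/files", href, count=1)
    # 'csp.*' also swallows anything after the extension, e.g. a query string.
    return re.sub(r"csp.*", "pdf", href, count=1)
```

Links that contain neither "free" nor "csp" pass through unchanged, which is why a non-PDF landing-page link can survive in the output of the JS snippet.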
Running the JS snippet returns the following list:
["http://www.oreilly.com/programming/free/files/open-source-in-brazil.pdf",
"http://www.oreilly.com/programming/free/files/ten-steps-to-linux-survival.pdf",
"http://www.oreilly.com/programming/free/files/open-by-design.pdf",
"http://www.oreilly.com/programming/free/files/getting-started-with-innersource.pdf",
"http://www.oreilly.com/programming/free/files/microservices-in-production.pdf",
"https://info.lightbend.com/COLL-20XX-Developing-Reactive-Microservices_Landing-Page.html?lst=OR",
"http://www.oreilly.com/programming/free/files/microservices-antipatterns-and-pitfalls.pdf",
"http://www.oreilly.com/programming/free/files/microservices-vs-service-oriented-architecture.pdf",
"http://www.oreilly.com/programming/free/files/evolving-architectures-of-fintech.pdf",
"http://www.oreilly.com/programming/free/files/software-architecture-patterns.pdf",
"http://www.oreilly.com/programming/free/files/migrating-cloud-native-application-architectures.pdf",
"http://www.oreilly.com/programming/free/files/reactive-microservices-architecture-orm.pdf"]
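2. Write Python code to download the files:
First version: download directly with the urlretrieve function from the urllib library. The list may contain invalid entries (for example, a link that does not point to a PDF), so the loop checks each entry and skips those.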
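# Python 2 code (print statement, urllib.urlretrieve)
import urllib

path = "G:\\books\\auto_dowloading\\"

def downloading(books):
    for book in books:
        tmp = book.split("/")
        # skip entries that are not PDF links
        if '.pdf' not in book:
            continue
        print "downloading %s" % (tmp[-1])
        # save the file under its original name in the target directory
        urllib.urlretrieve(book, path + tmp[-1])
        print "download %s is over!" % (tmp[-1])
    print "all job done"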
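The first version targets Python 2. Here is a rough Python 3 sketch of the same idea, not the author's code: `plan_downloads` and `download_all` are hypothetical names, `urlretrieve` comes from `urllib.request`, and the filter uses `endswith(".pdf")` instead of the original substring check. The filtering is split from the download so it can be checked without touching the network.

```python
import os
from urllib.request import urlretrieve  # Python 3 home of urlretrieve

def plan_downloads(books, dest_dir):
    """Return (url, local_path) pairs, skipping entries that are not PDF links."""
    plan = []
    for book in books:
        if not book.endswith(".pdf"):
            continue  # e.g. a landing-page link that slipped into the list
        name = book.rsplit("/", 1)[-1]
        plan.append((book, os.path.join(dest_dir, name)))
    return plan

def download_all(books, dest_dir):
    """Download every PDF link in books into dest_dir, one after another."""
    os.makedirs(dest_dir, exist_ok=True)
    for url, target in plan_downloads(books, dest_dir):
        name = os.path.basename(target)
        print("downloading %s" % name)
        urlretrieve(url, target)
        print("download %s is over!" % name)
    print("all job done")
```

Calling `download_all(books, "books")` with the list from step 1 would then fetch each PDF into a local `books` directory (the directory name is only an example).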
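Second version: given the page URL, scrape the list of all book links, then pass the list to a process pool that calls the download function for each entry.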
# Python 2 code (print statement, urllib.urlopen / urllib.urlretrieve)
import urllib
import os
import re
from multiprocessing import Pool

path = "G:\\books_new\\"
job = []

def get_booklist(url):
    # fetch the listing page and collect every .csp book link
    page = urllib.urlopen(url)
    html = page.read()
    tmp = re.findall(r'http://.*?\.csp', html)
    # rewrite each book page link into its PDF download link
    tmp2 = [i.replace('free', 'free/files').replace('csp', 'pdf') for i in tmp]
    job.extend(tmp2)

def download_book(url, path=path):
    # skip anything that is not a PDF link
    if '.pdf' not in url:
        return
    name = url.split("/")[-1]
    print "downloading %s" % (name)
    urllib.urlretrieve(url, path + name)
    print "download %s is over!" % (name)

if __name__ == '__main__':
    get_booklist('http://www.oreilly.com/programming/free/')
    # one worker process per CPU core by default
    pool = Pool()
    pool.map(download_book, job)
    print('The documents have been downloaded successfully!')
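The scrape-then-pool flow can be sketched in Python 3 along these lines. This is an illustration under assumptions, not the author's script: `extract_pdf_urls` and `main` are hypothetical names, the regex is tightened so a match cannot run past the end of an attribute value, duplicate links are dropped, and the relative `books_new` directory stands in for the original Windows path.

```python
import os
import re
from multiprocessing import Pool
from urllib.request import urlopen, urlretrieve

def extract_pdf_urls(html):
    """Find .csp book links in html and rewrite them to PDF download links."""
    pages = re.findall(r'http://[^"\'\s]*?\.csp', html)
    urls = []
    for page in pages:
        # every match ends in 'csp', so swap the last three characters
        pdf = page.replace("/free/", "/free/files/", 1)[:-3] + "pdf"
        if pdf not in urls:  # keep first occurrence, preserve page order
            urls.append(pdf)
    return urls

def download_book(url, dest_dir="books_new"):
    """Download one PDF; runs inside a worker process."""
    name = url.rsplit("/", 1)[-1]
    urlretrieve(url, os.path.join(dest_dir, name))
    return name

def main(listing_url):
    # read() returns bytes in Python 3, so decode before applying the regex
    html = urlopen(listing_url).read().decode("utf-8", errors="replace")
    urls = extract_pdf_urls(html)
    os.makedirs("books_new", exist_ok=True)
    with Pool() as pool:
        # report each file as soon as its worker finishes
        for name in pool.imap_unordered(download_book, urls):
            print("download %s is over!" % name)
```

To run it, call `main('http://www.oreilly.com/programming/free/')` under an `if __name__ == '__main__':` guard, as the original script does. The guard matters on Windows, where worker processes re-import the module.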
Author: 239289
Link: https://www.pythonheidong.com/blog/article/84945/3c779125906fb6e2f94e/
Source: python黑洞网