1. What is a web crawler?
Description: in essence, a crawler is an automated program: it simulates a browser sending a request to a server and fetching the response resource.
The basic crawler workflow
A crawl generally follows four steps: send a request, receive the response, parse the content you need out of it, and store the data.
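A minimal sketch of those four steps, using httpbin.org as a stand-in target (the URL and the naive string slicing are illustrative assumptions, not from the original):

import requests

url = "http://httpbin.org/html"               # 1. pick a target URL (assumed)
response = requests.get(url)                  # 2. send the request, receive the response
# 3. parse: naive string slicing just to show the step; real crawlers use an HTML parser
title = response.text.split("<h1>")[1].split("</h1>")[0]
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)                            # 4. store the data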
The robots.txt protocol
A site publishes a robots.txt protocol file to constrain which data crawler programs may fetch.
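Python's standard library can read and honor such a file before crawling; a minimal sketch (the target site here is an assumed example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.jd.com/robots.txt")   # assumed example site
rp.read()
# can_fetch() reports whether the given user agent may crawl the given URL
print(rp.can_fetch("*", "https://www.jd.com/"))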
2. The HTTP protocol
import requests

'''
1. GET: by far the most common method. It sends a request to fetch some
resource from the server; the resource is returned to the client as a set of
HTTP headers plus presentation data (HTML text, an image, a video, etc.).
A GET request itself never carries presentation data in its body.
'''
res = requests.get("http://httpbin.org/get")
print("GET request >>>> res:", res.text)

'''
2. POST is similar to PUT in that both send data to the server, but POST
creates new content, much like a database INSERT. Almost all submit
operations today use POST. It carries a request body.
'''
res1 = requests.post("http://httpbin.org/post")
print("POST request >>>> res1:", res1.text)

'''
3. PUT sends data to the server to change existing information, like a
database UPDATE. It modifies content but does not add new kinds of data;
no matter how many times you repeat the same PUT, the result is the same.
'''
res2 = requests.put("http://httpbin.org/put")
print("PUT request >>>> res2:", res2.text)

'''
4. DELETE removes the specified resource.
'''
res3 = requests.delete("http://httpbin.org/delete")
print("DELETE request >>>> res3:", res3.text)

'''
5. HEAD is often overlooked, but it provides a lot of useful information,
especially under limited speed and bandwidth. Its main traits:
1. it requests only the resource's headers;
2. it can check whether a hyperlink is valid;
3. it can check whether a page has been modified;
4. it is often used by search robots to fetch page metadata, RSS feed
   information, or to pass security/authentication information.
'''
res4 = requests.head("http://httpbin.org/get")
print(">>>> res4:", res4.text)

'''
6. OPTIONS asks the server which methods and headers it supports for a given
resource or endpoint; it is used for querying only. A quick example:
'''
res5 = requests.options("http://httpbin.org/get")
print(">>>> res5:", res5.text)
D:\python3.6\python.exe F:/爬虫/request请求方式.py
GET request >>>> res: {
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/get"
}
POST request >>>> res1: {
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/post"
}
PUT request >>>> res2: {
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/put"
}
DELETE request >>>> res3: {
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/delete"
}
>>>> res4:
>>>> res5:
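The prints above dump the entire body; individual response attributes are often more useful, and they also explain why res4 printed nothing. A short sketch:

import requests

res = requests.head("http://httpbin.org/get")
# HEAD returns headers only, which is why res.text was empty above
print(res.status_code)              # 200
print(res.headers["Content-Type"])  # the media type of the omitted body
print(res.request.method)           # "HEAD"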
3. Attributes and methods of requests
Installing requests: pip install requests
import requests

######## 1. Basic GET request
# Example: fetch the JD homepage
# url = "https://www.jd.com/"
# res1 = requests.get(url)
# with open("jd.html", "w", encoding="utf-8") as f:
#     f.write(res1.text)

######## 2. GET with query parameters and request headers
# Example 1: fetch a Baidu image search page
# url2 = "https://image.baidu.com/"
# res2 = requests.get(url2,
#                     params={"wd": "刘传盛"},
#                     headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"})
# with open("baidu.html", "wb") as f:
#     f.write(res2.content)

# Example 2: fetch chouti.com
# url = "https://dig.chouti.com/"  # the site checks the UA (anti-crawling), so a User-Agent header is required
# res = requests.get(url,
#                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"})
# with open("chouti.html", "wb") as f:
#     f.write(res.content)

######## 3. The cookies parameter
# import uuid  # generates random UUID strings
# import requests
# url = 'http://httpbin.org/cookies'
# cookies = {"sbid": str(uuid.uuid4()), "a": "1"}
# res = requests.get(url, cookies=cookies)
# print(res.text)

######## 4. The session object
# Without a session you would have to carry cookies along by hand:
# res = requests.post("/login/")
# dic = {}
# requests.get("/index/", cookies=dic)
#
# A session stores cookies across requests automatically:
# session = requests.session()
# session.post("/login/")
# session.get("/index/")

######## 5. POST requests
# res1 = requests.post(url="http://httpbin.org/post?a=1", data={"name": "yuan"})
# print(res1.text)
#
# res2 = requests.post(url="http://httpbin.org/post?a=1", data={"name": "alex"})
# print(res2.text)

######## 6. IP proxies
res = requests.get('http://httpbin.org/ip',
                   proxies={'http': 'http://111.177.177.87:9999'}).json()
print(res)
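One thing these examples skip is failure handling: a slow or blocking site will otherwise hang the script or raise unchecked. A hedged sketch of timeouts and status checking (the timeout values are arbitrary examples):

import requests

try:
    # timeout=(connect seconds, read seconds); the numbers are arbitrary
    res = requests.get("http://httpbin.org/get", timeout=(3, 10))
    res.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
    print(res.json())
except requests.RequestException as e:
    print("request failed:", e)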
4. Fetching data with requests
import requests

######## 1. content and text
# response = requests.get("http://www.autohome.com.cn/beijing/")
# print(response.content)   # prints raw bytes
# print(response.encoding)  # prints the encoding requests guessed for the body
# print(response.text)      # prints the decoded text
# Saving the page, option 1: decode, then write text
# response.encoding = "gbk"  # the site is GBK-encoded, so set it explicitly
# with open("autohome.html", "w", encoding="gbk") as f:
#     f.write(response.text)
# Saving the page, option 2: write the raw bytes directly
# with open("autohome.html", "wb") as f:
#     f.write(response.content)

######## 2. Fetching images, audio, and video
# res = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1551598670426&di=cd0b0fe51a124afed16efad2269215ae&imgtype=0&src=http%3A%2F%2Fn.sinaimg.cn%2Fsinacn22%2F23%2Fw504h319%2F20180819%2Fb69e-hhxaafy7949630.jpg")
# with open("鞠.jpg", "wb") as f:
#     f.write(res.content)
#
# res1 = requests.get("http://y.syasn.com/p/p95.mp4")
# with open("xiao.mp4", "wb") as f:
#     for line in res1.iter_content():
#         f.write(line)

######## 3. JSON responses
# res = requests.get("http://httpbin.org/get")
# print(res.text)
# print(type(res.text))   # <class 'str'>
# import json
# print(json.loads(res.text))
# '''prints: {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding':
# 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'},
# 'origin': '61.144.173.127, 61.144.173.127', 'url': 'https://httpbin.org/get'}'''
# print(type(json.loads(res.text)))  # <class 'dict'>
# print("-----------")
# print(res.json())        # shortcut for json.loads(res.text)
# print(type(res.json()))  # <class 'dict'>

######## 4. Redirects
# res = requests.get("http://www.jd.com/")
# print(res.history)      # the redirect responses followed along the way
# print(res.text)
# print(res.status_code)  # 200
# res = requests.get("http://www.jd.com/", allow_redirects=False)  # do not follow redirects
# print(res.history)      # []
# print(res.status_code)  # 302
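The video loop above iterates essentially byte by byte and still downloads the whole response up front; for large files a streamed download with an explicit chunk size is the usual pattern. A sketch, reusing the same (assumed still reachable) URL:

import requests

# stream=True defers the body download so it never sits fully in memory
res = requests.get("http://y.syasn.com/p/p95.mp4", stream=True)
with open("xiao.mp4", "wb") as f:
    for chunk in res.iter_content(chunk_size=8192):  # read ~8 KB at a time
        if chunk:  # skip keep-alive chunks
            f.write(chunk)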
5. Case study: logging in and scraping the GitHub homepage
# The site has anti-crawling measures: you cannot scrape the logged-in page
# directly. First fetch the /login page to obtain the dynamic token.
import requests
import re

session = requests.session()

# login request: the goal is to obtain the dynamic authenticity_token
res1 = session.get("https://github.com/login")
# NOTE: the original regex was lost in transcription; this pattern is a
# reconstruction that pulls the hidden authenticity_token field out of the form.
token = re.findall('name="authenticity_token" value="(.*?)"', res1.text, re.S)[0]
print(token)

res2 = session.post("https://github.com/session", data={
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": token,
    "login": "yuanchenqi0316@163.com",
    "password": "yuanchenqi0316"
})
# res = requests.get("https://github.com/settings/emails")
with open("github.html", "wb") as f:
    f.write(res2.content)
print(res2.history)
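The commented-out check above uses requests.get, which would start a fresh cookie jar and lose the login; reusing the session is the whole point. A small follow-up sketch to verify the login took effect:

# Reuse the same session so the login cookies travel with the request.
res3 = session.get("https://github.com/settings/emails")
# 200 means the settings page was reachable, i.e. the login stuck;
# an anonymous visitor would be redirected back to /login instead.
print(res3.status_code, res3.history)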
6. Key point: the request format
Request protocol format:
    request line
    request headers
    blank line
    request body

The Content-Type request header tells the server (browser ------------> server) how the body is encoded. Two points:
1. Only POST requests carry a request body.
2. Data can be posted to the server in these common forms:
   - an HTML form submit (urlencoded format): user=yuan, pwd=123
   - Ajax (urlencoded format): a=1, b=2

Request protocol format string, sending urlencoded data:
'''
request line
request headers
    Content-Type: application/x-www-form-urlencoded
blank line
request body    # user=yuan&pwd=123  (urlencoded format)
'''

Sending JSON data:
'''
request line
request headers
    Content-Type: application/json
blank line
request body    # {"user": "yuan", "pwd": 123}  (JSON format)
'''
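In requests, the data= and json= parameters map directly onto these two body formats; a small sketch that echoes back the Content-Type each one sets (httpbin simply reflects the request it received):

import requests

# data= sends an urlencoded body
res_form = requests.post("http://httpbin.org/post", data={"user": "yuan", "pwd": 123})
print(res_form.json()["headers"]["Content-Type"])  # application/x-www-form-urlencoded

# json= serializes the dict to a JSON body
res_json = requests.post("http://httpbin.org/post", json={"user": "yuan", "pwd": 123})
print(res_json.json()["headers"]["Content-Type"])  # application/json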