Scraping Butian's Public-Welfare SRC Vendors

  1. Getting the cid value
  2. Getting the target URL
  3. Source code

I wrote a Butian scraper script a while ago, but it turned out pretty poorly, so here is a rewrite.

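  • Environment: Python 2.7, Sublime Text
  • Dependencies: requests, bs4 (the code below uses the lxml parser, so lxml must be installed as well)
  • Approach:
    1. Get each vendor's cid value
    2. Use the cid to fetch that vendor's details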

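Why do it this way? When you click "submit vulnerability" for a vendor, the site redirects you to a vendor-specific information page (the only difference between vendors is the cid value in the URL), and only on that page can we get the information we want.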

That is the whole idea, so let's start writing!

Getting the cid value

In the project hall, click 公益SRC (public-welfare SRC) and look at the request it sends:
It submits its parameters via POST. Replaying the request returns this:

{"status":1,"info":"\u68d2\u68d2\u54d2","data":{"count":171,"current":1,"list":[{"company_id":"61539","company_name":"\u5317\u4eac\u5dc5\u5cf0\u6e05\u5f71\u5546\u8d38\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61535","company_name":"\u5409\u6797\u7701\u59d4\u7ec4\u7ec7\u90e8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61517","company_name":"\u6613\u65b9\u79d1\u8d38\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61514","company_name":"\u5409\u6797\u7701\u6167\u6d77\u79d1\u6280\u4fe1\u606f\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61509","company_name":"\u897f\u5b89\u6559\u80b2\u7535\u89c6\u53f0","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61506","company_name":"\u4e1c\u839e\u5e02\u77f3\u9f99\u6cf0\u5766\u7f51\u7edc\u7ecf\u8425\u90e8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61497","company_name":"\u5c71\u897f\u7701\u53d1\u5c55\u548c\u6539\u9769\u59d4\u5458\u4f1a","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61494","company_name":"\u9655\u897f\u7701\u7a0e\u52a1\u5c40","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61493","company_name":"\u8f66\u597d\u591a\u65e7\u673a\u52a8\u8f66\u7ecf\u7eaa\uff08\u5317\u4eac\uff09\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p0.qhimg.com\/t019b6f6338cec04620.png"},{"company_id":"61491","company_name":"\u5317\u4eac\u4eca\u59cb\u79d1\u6280\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61488","company_name":"\u6d59\u6c5f\u5357\u6e56\u91d1\u878d\u8d44\u4ea7\u4ea4\u6613\u4e2d\u5fc3\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61486","company_name":"\u5e7f\u5dde\u5e7f\u4e4b\u65c5\u56fd\u9645\u65c5\u884c\u793e\u80a1\u4efd\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61485","company_name":"\u73e0\u6d77\u5e02\u5353\u8f69\u79d1\u6280\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61484","company_name":"\u90d1\u5dde\u5e02\u57ce\u4e61\u5efa\u8bbe\u59d4\u5458\u4f1a","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61476","company_name":"\u5e7f\u4e1c\u9f99\u90a6\u7269\u6d41\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61471","company_name":"\u5317\u4eac\u534f\u548c\u533b\u9662","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61469","company_name":"\u8bfa\u8fbe\u6559\u80b2","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61468","company_name":"\u6cb3\u5357\u7701\u4fe1\u606f\u4e2d\u5fc3","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61466","company_name":"\u897f\u5b89\u91d1\u878d\u7535\u5b50\u7ed3\u7b97\u4e2d\u5fc3","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61465","company_name":"\u676d\u5dde\u8d1d\u8d2d\u79d1\u6280\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61463","company_name":"\u5408\u80a5\u5f7c\u5cb8\u4e92\u8054\u4fe1\u606f\u6280\u672f\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p0.qhimg.com\/t019d543937529f2155.jpg"},{"company_id":"61458","company_name":"\u82cf\u5dde\u5e02\u6c11\u5361\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61455","company_name":"\u4f9d\u6ce2\u7cbe\u54c1\uff08\u6df1\u5733\uff09\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61450","company_name":"\u6b66\u6c49\u6d77\u8baf\u79d1\u6280\u4f1a\u52a1\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61449","company_name":"\u5c71\u897f\u7701\u516c\u5171\u8d44\u6e90\u4ea4\u6613\u4e2d\u5fc3\uff08\u5c71\u897f\u7701\u653f\u52a1\u670d\u52a1\u4e2d\u5fc3\uff09","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61446","company_name":"\u6b66\u6c49\u5c14\u6e7e\u6587\u5316\u4f20\u64ad\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61445","company_name":"\u6210\u90fd\u5e02\u7b2c\u4e00\u4eba\u6c11\u533b\u9662","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61442","company_name":"\u6e29\u5dde\u5e02\u6c34\u5229\u5c40","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"},{"company_id":"61441","company_name":"\u5b9c\u5bb6\u7535\u5b50\u5546\u52a1\uff08\u4e2d\u56fd\uff09\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p0.qhimg.com\/t0114c4c991894ba083.png"},{"company_id":"61436","company_name":"\u5317\u4eac\u91cd\u8f7d\u667a\u5b50\u79d1\u6280\u6709\u9650\u516c\u53f8","avatar":"http:\/\/p1.qhmsg.com\/dm\/150_150_100\/t011655040b3ed000bf.jpg"}]}}

See it? The company_id value is exactly what we need.
Send the request:

import requests

url = "https://butian.360.cn/Reward/pub"
data = {
    "s": "1",
    "p": "1",  # p is the page number
    "token": ""
}

r = requests.post(url, data=data)
print(r.content)
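
The sample response above reports count = 171 and carries 30 entries for page 1. Assuming count is the total number of vendors and every page holds 30 of them (both are guesses from that one response, not documented behaviour), the number of pages to request can be worked out as in this small sketch. page_count is a name made up for the illustration.

```python
import math


def page_count(total, per_page):
    """Number of pages needed to list `total` items, `per_page` at a time."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # float() keeps the division exact on Python 2 as well as Python 3
    return int(math.ceil(total / float(per_page)))


# Numbers taken from the sample response: count=171, 30 entries on page 1.
pages = page_count(171, 30)
print(pages)  # 6
```

The result could then replace the hard-coded page number when looping over the "p" parameter.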

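How do we pull the company_id values out of the returned string? A regular expression would work; another option is to convert the str into a dict and work with that, which here is done with the eval() function: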

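print(type(r.content))
# <type 'str'>

data = eval(r.content)
print(type(data))
# <type 'dict'>

print(data.values())
# the values stored in the dict (a list in Python 2, so it can be indexed)

print(data.values()[2])
# the third value in this dict is itself another dict

company_list = data.values()[2]['list']
print(company_list)
# the 'list' entry of that inner dict, which is in turn a list

for i in range(len(company_list)):
    print(company_list[i]['company_id'])
# finally, this prints every company_id value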
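
The response is JSON, so the standard json module can do the same conversion without executing whatever text the server sends back, and it looks the list up by key instead of by position. The regular-expression route mentioned above is sketched next to it. This is only an illustration on a trimmed, partly made-up sample in the same shape as the real response; ids_from_json and ids_from_regex are hypothetical helper names.

```python
import json
import re

# Trimmed sample in the shape of the real response (names shortened for the demo).
raw = (r'{"status":1,"info":"ok","data":{"count":171,"current":1,"list":['
       r'{"company_id":"61539","company_name":"\u5317\u4eac"},'
       r'{"company_id":"61535","company_name":"\u5409\u6797"}]}}')


def ids_from_json(text):
    """Parse the JSON text and read company_id from data -> list."""
    payload = json.loads(text)
    return [item["company_id"] for item in payload["data"]["list"]]


def ids_from_regex(text):
    """Grab every quoted run of digits that follows "company_id":."""
    return re.findall(r'"company_id":"(\d+)"', text)


print(ids_from_json(raw))
print(ids_from_regex(raw))
```

Both functions return the same two ids here. The JSON version keeps working if the field order changes, while the regex version depends on the exact quoting shown in the sample.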

Getting the target URL

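Once we have the cid, we assemble the full URL, request it, and use bs4 to extract the details. Note that this page is only visible after logging in; to handle that, it is enough to send your logged-in cookie along with the request.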
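
One possible way to supply the cookie, assuming you copy the raw Cookie header from the browser's developer tools: split it into a dict, which requests accepts through its cookies= argument. parse_cookie_header is a hypothetical helper written for this sketch, and the cookie values in it are placeholders.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies


jar = parse_cookie_header("PHPSESSID=abc123; quCapStyle=4")
print(sorted(jar.items()))
# With requests installed, the dict could be used like this:
# requests.get(url, cookies=jar)
```

Passing the whole string in a "Cookie" header works just as well; the dict form mainly makes it easier to see and replace individual values.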

Redirect URL: https://butian.360.cn/Loo/submit?cid= + the cid value obtained above
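
As a small sketch of that concatenation, the helper below builds the page URL for each cid and rejects anything that is not purely digits. submit_url is a made-up name, and the digits-only check is an assumption based on the ids seen in the sample response.

```python
BASE = "https://butian.360.cn/Loo/submit"


def submit_url(cid):
    """Build the vendor page URL for one cid."""
    cid = str(cid)
    if not cid.isdigit():
        raise ValueError("unexpected cid: %r" % cid)
    return "%s?cid=%s" % (BASE, cid)


urls = [submit_url(cid) for cid in ("61539", "61535")]
print(urls[0])  # https://butian.360.cn/Loo/submit?cid=61539
```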

Get the site's URL and name:

import requests
from bs4 import BeautifulSoup

url = "https://butian.360.cn/Loo/submit?cid=61485"
headers = {
"Cookie":"Your_cookie"
}
r = requests.get(url,headers=headers)
# print(r.content)
soup = BeautifulSoup(r.content,"lxml")
src_url = soup.select("#tabs > form > div.tabs-con.tabs-con-loo > ul > li:nth-of-type(3) > input")[0]['value']
src_name = soup.select("#inputCompy")[0]['value']
print(src_url)
print(src_name)

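1. Put your own cookie in the headers.
2. The src_url and src_name lines take the first element of the list returned by select() and read its value attribute.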
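
select() returns a list, so when the cookie is missing or expired and the page has no such input, the list is empty and indexing it with [0] raises IndexError. A guarded lookup could look like the sketch below. first_attr is a hypothetical helper, and plain dicts stand in for bs4 Tag objects, which offer the same .get() method for attributes.

```python
def first_attr(matches, attr):
    """Return `attr` of the first match, or None when nothing matched."""
    if not matches:
        return None
    return matches[0].get(attr)


# Stand-ins for what soup.select(...) might return.
logged_in = [{"value": "www.example.com"}]
logged_out = []

print(first_attr(logged_in, "value"))   # www.example.com
print(first_attr(logged_out, "value"))  # None
```

A None result is then a hint to check the cookie before continuing.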

Source code

# -*- coding: utf-8 -*-
import requests
import threading
import time
from multiprocessing import Pool
from bs4 import BeautifulSoup


def get_company_info(url):
    # Fetch one vendor page and record its site URL and name.
    headers = {
        "Cookie":"test_cookie_enable=null; __huid=11lbR8R9YT3n9IMpOFraKisB76KmfY6jPjg39WPUhCLbo=; __guid=126328276.1383538542413199872.1547444498000.5032; PHPSESSID=vu90rg45n3nhrqgolcer67gli5; quCapStyle=4; quCryptCode=D6TqMQxFspm0aHYsu0xeLpzVHENvQ1SJ8aOBrdPNuz8W%252BmLucfp%252B%252B6D5n89lYe3R5hPKRwYEpvw%253D; Q=u%3D360H2606142586%26n%3D%26le%3D%26m%3DZGZ1WGWOWGWOWGWOWGWOWGWOAGH5%26qid%3D2606142586%26im%3D1_t01923d359dad425928%26src%3Dpcw_webscan%26t%3D1; T=s%3Db96c76c3d3adfbe12164bf6be707c0f1%26t%3D1547453424%26lm%3D%26lf%3D2%26sk%3D5d4b0f9c769e3b39326b07d65eaae9f2%26mt%3D1547453424%26rc%3D%26v%3D2.0%26a%3D1; _currentUrl_=%2FMessage; UM_distinctid=1684b69d4254cd-0ddc2da97cba7e-5d1f3b1c-144000-1684b69d426b72; wafenterurl=L0xvby9zdWJtaXQ/Y2lkPTYxNDMy; wafcookie=e2946f643a4db7d52214a7a49c3ea3b7; __utma=138613664.879611355.1547454590.1547454590.1547454590.1; __utmc=138613664; __utmz=138613664.1547454590.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); wafverify=8be7fa7539de848e41dd18181986605f; test_cookie_enable=null; __DC_sid=138613664.1481644363583867100.1547458857863.3704; __DC_monitor_count=33; __DC_gid=138613664.153314943.1547453244760.1547459088892.91; __q__=1547459252895"
    }
    r = requests.get(url,headers=headers)
    # print(r.content)
    soup = BeautifulSoup(r.content,"lxml")
    curl = soup.select("#tabs > form > div.tabs-con.tabs-con-loo > ul > li:nth-of-type(3) > input")[0]['value']
    cname = soup.select("#inputCompy")[0]['value']
    src_url.append(curl)
    src_name.append(cname)
    print("[+]" + curl + ":" + cname)


def get_company_id(url,page):
    # Request one page of the vendor list and collect the page URL of each vendor.
    data = {
        "s":"1",
        "p":str(page),
        "token":""
    }

    r = requests.post(url,data=data)
    company_list = eval(r.content).values()[2]['list']
    for i in range(len(company_list)):
        id_url.append("https://butian.360.cn/Loo/submit?cid=" + company_list[i]['company_id'])
    return id_url

if __name__ == '__main__':
    id_url = []
    src_url = []
    src_name = []
    thread_List =[]
    page = 1
    url = "https://butian.360.cn/Reward/pub"
    for p in range(1,page+1):
        get_company_id(url,p)
    print("finish get id_url")
    for url in id_url:
        get_company_info(url)

    # Multithreading
    # for i in range(len(id_url)):
    #     url = id_url[i]
    #     t = threading.Thread(target=get_company_info,args=(url,))
    #     thread_List.append(t)
    #     t.start()
    # for t in thread_List:
    #     t.join()
    # get_company_info("https://butian.360.cn/Loo/submit?cid=61485")
    # Multiprocessing
    # p = Pool(2)
    # for url in id_url:
    #     p.apply_async(get_company_info, (url,))
    # p.close()
    # p.join()
    # The with block closes the file on exit, so no explicit close() is needed.
    with open("src%s.txt" %time.time(),"w+") as f:
        for url,name in zip(src_url,src_name):
            f.write(str(url) + "\t" + str(name.encode("utf-8")) + "\n")
    # print(src_url)
    # print(src_name)

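I'm posting this version for now. For some reason I could never get the multiprocessing variant to work, and this version is not fast, but it will do for the moment.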
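
A likely reason for the multiprocessing trouble (an assumption, since the failing run is not shown): every worker process works on its own copy of src_url and src_name, so what a child appends never reaches the parent's lists, and apply_async keeps any exception hidden until .get() is called on its result. Returning the data from the worker and collecting it with map avoids both problems. The sketch below uses multiprocessing.dummy, the thread-backed Pool with the same API, which suits a network-bound job like this one. fetch_info is a stand-in that fabricates its result instead of doing the real request, and crawl is a hypothetical wrapper.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same interface


def fetch_info(url):
    """Stand-in for get_company_info: returns a result instead of appending to a global."""
    cid = url.rsplit("=", 1)[-1]
    return ("site-%s.example" % cid, "vendor-%s" % cid)


def crawl(urls, workers=4):
    """Run fetch_info over all urls concurrently and return the results in input order."""
    pool = Pool(workers)
    try:
        return pool.map(fetch_info, urls)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    urls = ["https://butian.360.cn/Loo/submit?cid=%s" % c for c in ("61539", "61535")]
    for site, name in crawl(urls):
        print(site + "\t" + name)
```

Because the worker returns its result, the same pattern should also work with the process-based multiprocessing.Pool, as long as fetch_info is defined at module level so it can be pickled.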


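Please credit the source when reposting. You are welcome to verify the sources cited in this article and to point out anything that is wrong or unclear. You can leave a comment in the comment section below or send an email to sher10cksec@foxmail.com

Article title: Scraping Butian's Public-Welfare SRC Vendors

Author: sher10ck

Published: 2019-01-14, 16:36:02

Last updated: 2020-01-13, 12:59:32

Original link: http://sherlocz.github.io/2019/01/14/butian-spider/

License: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and the author's name when reposting.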
