Writing a Script to Process a Large Batch of URLs

  1. Preparation
  2. Deduplicating the domains
  3. Checking whether the domains are reachable
  4. Getting each domain's title
  5. Wrapping up

Recently, while doing information gathering for a certain SRC program, I brute-forced subdomains and got about 200 from Layer and about 5,000 from subdomainbrute, so I decided to do a few things with them:
1. Deduplicate the domains
2. Check whether each domain is reachable
3. Grab each domain's title

Preparation

Put the domains brute-forced by Layer into layer.txt and the ones from subdomainbrute into brute.txt, then count how many each file contains:

# -*- coding:utf-8 -*-
# Count how many subdomains each file contains
with open("layer.txt", "r+") as f:
    print("layer.txt has {} subdomains".format(len(f.readlines())))

with open("brute.txt", "r+") as f:
    print("brute.txt has {} subdomains".format(len(f.readlines())))

result:
layer.txt has 187 subdomains
brute.txt has 5163 subdomains
[Finished in 0.9s]

Then another problem came up: subdomainbrute's output is in the form "domain + IP", but the IPs are not what we want.
[screenshot: brute_bug]
Method 1:
Patch the subdomainbrute source, at subdomainbrute.py line 190:

Original:   self.outfile.write(cur_sub_domain.ljust(30) + '\t' + ips + '\n')
Changed to: self.outfile.write(cur_sub_domain.ljust(30) + '\n')

Run it again after the change, and it works:
[screenshot: brute_bug]

Method 2:
Split the output with a Python script:

# -*- coding:utf-8 -*-
# Keep only the domain part (the field before the tab) of each subdomainbrute result line.
# The with blocks close both files automatically.
with open("eastmoney.com.txt", "r+") as f:
    with open("brute.txt", "w+") as ff:
        for url_ip in f.readlines():
            url = url_ip.split("\t")[0]
            ff.write(url.strip() + "\n")

The URLs are extracted successfully, so everything is ready.
[screenshot: brute_bug]
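
By the way, subdomainbrute pads the domain with spaces (ljust(30)) before the tab, so splitting on a hard-coded "\t" only works as long as that format never changes. A minimal sketch that splits on any whitespace instead, under the assumption that the same eastmoney.com.txt input is used:

# -*- coding:utf-8 -*-
# Sketch: split on arbitrary whitespace instead of a hard-coded "\t".
# File names follow the example above and are only illustrative.
with open("eastmoney.com.txt", "r") as f, open("brute.txt", "w") as ff:
    for line in f:
        line = line.strip()
        if not line:                          # skip blank lines
            continue
        ff.write(line.split()[0] + "\n")      # first whitespace-separated field is the domain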

Deduplicating the domains

Read the URLs from both files into url_list, filter the unique ones into new_url_list, and finally write them to urls.txt:

# -*- coding:utf-8 -*-

def write_in_file(file_name, url_list):  # deduplicate and write to a file
    new_url_list = []
    for url in url_list:
        if url not in new_url_list:
            new_url_list.append(url)
    print(len(new_url_list))
    with open(file_name, "w+") as f:
        for url in new_url_list:
            f.write(url.strip() + "\n")

def get_urls(file_name, url_list):  # read urls from a file
    with open(file_name, "r+") as f:
        for url in f.readlines():
            url_list.append(url.strip())
    return url_list


if __name__ == '__main__':
    url_list = []
    get_urls("layer.txt", url_list)
    # print(len(url_list))
    get_urls("brute.txt", url_list)
    # print(len(url_list))
    write_in_file("urls.txt", url_list)

result:
5371
[Finished in 0.6s]

After deduplication, 5,371 unique domains remain.
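
The "not in" check in write_in_file scans new_url_list once for every URL, which is O(n²); with a few thousand domains it still finishes fast, but a set or dict does the same thing in one pass. A minimal sketch using dict.fromkeys, which also keeps the original order (same layer.txt / brute.txt inputs, same urls.txt output):

# -*- coding:utf-8 -*-
# Sketch: order-preserving deduplication with dict.fromkeys (dicts keep insertion order in Python 3.7+).
def load_urls(file_name):
    with open(file_name, "r") as f:
        return [line.strip() for line in f if line.strip()]

urls = load_urls("layer.txt") + load_urls("brute.txt")
unique_urls = list(dict.fromkeys(urls))   # drops duplicates, keeps first-occurrence order
print(len(unique_urls))

with open("urls.txt", "w") as f:
    f.write("\n".join(unique_urls) + "\n")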

Checking whether the domains are reachable

# -*- coding:utf-8 -*-

import queue
import requests
import threading

def write_in_file(file_name, url_list):  # write the reachable urls to a file
    with open(file_name, "w+") as f:
        for url in url_list:
            f.write(url.strip() + "\n")

class Mythread(threading.Thread):
    def __init__(self, q):
        threading.Thread.__init__(self)
        self.q = q

    def run(self):  # override threading.Thread's run method
        while not self.q.empty():
            try:
                url = self.q.get_nowait()  # non-blocking get, so a thread never hangs if another one empties the queue first
            except queue.Empty:
                break
            target = "http://" + url
            try:
                r = requests.get(target, timeout=3, allow_redirects=False)
                if r.status_code == 200:
                    alive_urls.append(target)
                    print("Find url:{}".format(target))
            except requests.RequestException:
                pass


if __name__ == '__main__':
    q = queue.Queue()
    alive_urls = []  # reachable urls collected by the worker threads
    with open("urls.txt", "r+") as f:  # read the deduplicated urls
        for url in f.readlines():
            q.put(url.strip())
    print(q.qsize())

    threads = []
    for i in range(100):  # 100 worker threads
        thread1 = Mythread(q)
        thread1.start()
        threads.append(thread1)

    for t in threads:
        t.join()

    print(alive_urls)
    write_in_file("end.txt", alive_urls)

Another problem showed up here: some of the domains redirect to the same page when opened, so the requests are sent with allow_redirects=False to stop redirects from being followed. The reachable URLs are saved to end.txt, and this is all that is left:
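
If you also want to see where the redirecting hosts point instead of silently dropping everything that is not a 200, a single-threaded sketch like the one below can log the status code and the Location header for each host (status.txt is just an illustrative output name, not part of the original script):

# -*- coding:utf-8 -*-
# Sketch: record status codes and redirect targets instead of keeping only 200s.
# Single-threaded for clarity; reads the same urls.txt produced by the dedup step.
import requests

with open("urls.txt", "r") as f, open("status.txt", "w") as out:
    for line in f:
        target = "http://" + line.strip()
        try:
            r = requests.get(target, timeout=3, allow_redirects=False)
            location = r.headers.get("Location", "")   # where a 301/302 points, empty otherwise
            out.write("{}\t{}\t{}\n".format(target, r.status_code, location))
        except requests.RequestException:
            pass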

with open("end.txt","r+") as f:
print(len(f.readlines()))

result:
94
[Finished in 0.4s]

Getting each domain's title

# -*- coding:utf-8 -*-
# Grab the title of each reachable site

import requests
from bs4 import BeautifulSoup

with open("end.txt", "r+") as f:
    for url in f.readlines():
        url = url.replace("\n", "")
        try:
            r = requests.get(url, timeout=3)
            soup = BeautifulSoup(r.content, "lxml")
            print("[+]" + url + "| title: " + soup.title.text.strip())
        except:
            print("[-]" + url + "| connect failed")

Wrapping up

The idea is simple, but all kinds of unexpected problems come up while writing the scripts. The source code is on GitHub: https://github.com/SherLocZ/deal_with_urls


Author: sher10ck

Published: 2019-01-13, 15:31:40

Last updated: 2020-01-13, 13:05:40

Original link: http://sherlocz.github.io/2019/01/13/batch-urls/

License: CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike). Please keep the original link and author when reposting.
