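Notes on Domain Name Vectorization
Converting domain names to vectors requires solving two problems:
- mapping each character in a domain name to a unique number (a one-to-one mapping)
- handling domain names of different lengths
The steps below address these two problems in turn to complete the conversion.
Initial data
Let's first look at what the raw domain data looks like.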
>>> len(domains)
8
>>> domains
['jobhero-com-1.disqus.com', 'vstu.by', 'diamondpeak.com', 'netcollections.net', 'pfgc.com.s7a1.psmtp.com', 'dudeiwantthat.com', 'sascha-frank.com', 'milavia.net']

>>> # Printed one per line, the domains look like this
>>> for domain in domains:
...     print(domain)
...
jobhero-com-1.disqus.com
vstu.by
diamondpeak.com
netcollections.net
pfgc.com.s7a1.psmtp.com
dudeiwantthat.com
sascha-frank.com
milavia.net
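Building the mapping
The first step is to build a character-to-number mapping: collect every character that appears in any domain name, then assign each one a number.
Collecting the characters that appear amounts to deduplicating them, so it is enough to join all the domain names into one string and put that string into a set.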
>>> uni = set(''.join(domains))
>>> uni
{'a', 'r', 'o', 'b', '-', 'j', 'u', 'g', '7', 'c', 'y', '1', 'd', 'n', 'w', 'v', 'i', 't', 'm', 'p', 'h', 'k', 's', 'f', 'l', 'e', 'q', '.'}
>>> len(uni)
28
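This gives every character in use, 28 in total. Next, each character needs a number, and its position in the enumeration works. Numbering starts at 1, which leaves 0 free for padding later. Note that set iteration order is arbitrary, so the exact numbers can differ from run to run.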
>>> char_dict = {x: i+1 for i, x in enumerate(uni)}
>>> char_dict
{'a': 1, 'r': 2, 'o': 3, 'b': 4, '-': 5, 'j': 6, 'u': 7, 'g': 8, '7': 9, 'c': 10, 'y': 11, '1': 12, 'd': 13, 'n': 14, 'w': 15, 'v': 16, 'i': 17, 't': 18, 'm': 19, 'p': 20, 'h': 21, 'k': 22, 's': 23, 'f': 24, 'l': 25, 'e': 26, 'q': 27, '.': 28}
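Because set iteration order is not stable across runs, the numbering can change each time the script is executed. One possible way to make the mapping reproducible is to sort the characters first. The following is only a minimal sketch: the two-domain sample and the name `build_char_dict` are illustrative assumptions, not part of the original code.

```python
def build_char_dict(domains):
    """Map each distinct character to an index starting at 1 (0 is kept for padding)."""
    alphabet = sorted(set(''.join(domains)))  # sorting makes the order deterministic
    return {char: i + 1 for i, char in enumerate(alphabet)}


sample = ['vstu.by', 'pfgc.com']  # small illustrative sample
char_dict = build_char_dict(sample)
print(char_dict)
```

With a sorted alphabet, the same set of domains always yields the same numbers, which matters if the mapping has to be rebuilt later.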
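Each of the 28 characters now has its own number, which gives the character-to-number mapping. With it, every character in a domain name can be replaced by its number.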
>>> X = [[char_dict[char] for char in domain] for domain in domains]
>>> X
[[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19], [16, 23, 18, 7, 28, 4, 11], [13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19], [14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18], [20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19], [13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19], [23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19], [19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18]]
>>> len(domains)
8
>>> # The 8 domains have become 8 rows of data. The nested list above is hard to read, so print it row by row
>>> for row in X:
...     print(row)
...
[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19]
[16, 23, 18, 7, 28, 4, 11]
[13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19]
[14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18]
[20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19]
[13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19]
[23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19]
[19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18]
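A quick way to check that the encoding loses no information is to invert the mapping and decode each row back into text. The sketch below assumes a sorted alphabet and a small made-up sample; `encode`, `decode`, and `index_to_char` are names introduced here for illustration only.

```python
domains = ['vstu.by', 'milavia.net']  # small illustrative sample

# Character -> index (starting at 1) and its inverse, index -> character
char_dict = {c: i + 1 for i, c in enumerate(sorted(set(''.join(domains))))}
index_to_char = {i: c for c, i in char_dict.items()}


def encode(domain):
    """Replace every character of a domain by its index."""
    return [char_dict[c] for c in domain]


def decode(row):
    """Turn a row of indices back into text, skipping 0 (used as padding filler)."""
    return ''.join(index_to_char[i] for i in row if i != 0)


X = [encode(d) for d in domains]
roundtrip = [decode(row) for row in X]
print(roundtrip == domains)
```

Since no character is ever assigned index 0, `decode` also works on rows that have been padded with zeros.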
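This completes the conversion from domain names to numbers. The rows have different lengths, though, so each one needs to be padded.
First, find the number of elements in the longest row.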
>>> maxlen = max(len(row) for row in X)
>>> maxlen
24
Pad every shorter row with zeros at the end.
>>> for row in X:
...     for i in range(maxlen - len(row)):
...         row.append(0)
...
>>> X
[[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19], [16, 23, 18, 7, 28, 4, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0], [14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0], [20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19, 0], [13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0], [23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0], [19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
>>> # Print row by row for easier reading
>>> for row in X:
...     print(row)
...
[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19]
[16, 23, 18, 7, 28, 4, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0]
[20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19, 0]
[13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0]
[23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0]
[19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
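The padding loop modifies `X` in place. As an alternative sketch, the same end-padding can be written so that the input rows are left untouched. `pad_rows` is a hypothetical helper name, and the sample rows are made up for illustration.

```python
def pad_rows(rows, value=0):
    """Return new rows, each extended at the end with `value` to the longest length."""
    maxlen = max(len(row) for row in rows)
    return [row + [value] * (maxlen - len(row)) for row in rows]


rows = [[16, 23, 18], [6, 3, 4, 21, 26], [19]]
padded = pad_rows(rows)
for row in padded:
    print(row)
```

Keeping the original rows unchanged can be handy when the unpadded lengths are still needed afterwards.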
Every row of the two-dimensional list now has the same length. To make computation easier, convert the data from a list to an ndarray.
>>> import numpy as np
>>> train = np.array(X)
>>> train.shape
(8, 24)
>>> for row in train:
...     print(row)
...
[ 6  3  4 21 26  2  3  5 10  3 19  5 12 28 13 17 23 27  7 23 28 10  3 19]
[16 23 18  7 28  4 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
[13 17  1 19  3 14 13 20 26  1 22 28 10  3 19  0  0  0  0  0  0  0  0  0]
[14 26 18 10  3 25 25 26 10 18 17  3 14 23 28 14 26 18  0  0  0  0  0  0]
[20 24  8 10 28 10  3 19 28 23  9  1 12 28 20 23 19 18 20 28 10  3 19  0]
[13  7 13 26 17 15  1 14 18 18 21  1 18 28 10  3 19  0  0  0  0  0  0  0]
[23  1 23 10 21  1  5 24  2  1 14 22 28 10  3 19  0  0  0  0  0  0  0  0]
[19 17 25  1 16 17  1 28 14 26 18  0  0  0  0  0  0  0  0  0  0  0  0  0]
>>> # The data is now an ndarray; the output above shows the matrix
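Padding and conversion to an ndarray can both be done in one call with keras.preprocessing.sequence.pad_sequences(), which is considerably more efficient when processing large amounts of data. Note that it pads at the front by default; pass padding='post' to pad at the end as in the steps above.
The complete code follows; the padding step calls the Keras function.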
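import pickle
# Newer Keras releases also expose pad_sequences as keras.utils.pad_sequences
from keras.preprocessing import sequence

def domain2vector(domain_path, save_path):
    with open(domain_path, 'rb') as f:  # load the pickled list of domains
        domains = pickle.load(f)
    # Build the character-to-number mapping from the set of unique characters
    char_dict = {x: i+1 for i, x in enumerate(set(''.join(domains)))}
    X = [[char_dict[char] for char in domain] for domain in domains]  # convert domains to numbers
    maxlen = max(len(row) for row in X)
    # Pad with zeros at the end up to maxlen and convert to an ndarray
    # (pad_sequences pads at the front unless padding='post' is given)
    vector = sequence.pad_sequences(X, maxlen=maxlen, padding='post')
    with open(save_path, 'wb') as f:  # save the vectorized result
        pickle.dump(vector, f, pickle.HIGHEST_PROTOCOL)

    return vector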
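One detail worth checking when swapping the manual loop for `pad_sequences` is the padding side: by default Keras pads at the front (`padding='pre'`), while the walkthrough pads at the end. The pure-Python sketch below only mimics those two modes to show the difference. `pad` is a hypothetical stand-in, not the Keras implementation, and it ignores truncation of rows longer than `maxlen`.

```python
def pad(rows, maxlen, padding='pre', value=0):
    """Pad each row to `maxlen` at the front ('pre') or at the end ('post')."""
    if padding not in ('pre', 'post'):
        raise ValueError("padding must be 'pre' or 'post'")
    out = []
    for row in rows:
        fill = [value] * (maxlen - len(row))
        out.append(fill + row if padding == 'pre' else row + fill)
    return out


rows = [[16, 23, 18], [6, 3, 4, 21, 26]]
pre = pad(rows, maxlen=5)                   # zeros in front, like the Keras default
post = pad(rows, maxlen=5, padding='post')  # zeros at the end, like the manual loop
print(pre[0])   # [0, 0, 16, 23, 18]
print(post[0])  # [16, 23, 18, 0, 0]
```

Either side can work for a downstream model, as long as the same choice is used consistently for training and for later inputs.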