Domain Name Vectorization

Converting domain names (strings) into vector form

Notes on domain name vectorization

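Converting a domain name into a vector requires solving two problems:

  • Mapping each character of a domain name to a number, one-to-one
  • Handling domain names of different lengths

The steps below address these two problems in turn and complete the conversion from domain name to vector.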

Initial data
Let's first look at what the raw domain data looks like.

>>> len(domains)
8
>>> domains
['jobhero-com-1.disqus.com', 'vstu.by', 'diamondpeak.com', 'netcollections.net', 'pfgc.com.s7a1.psmtp.com', 'dudeiwantthat.com', 'sascha-frank.com', 'milavia.net']

>>> # Printed one domain per line, the list looks like this
>>> for i in range(len(domains)):
...     print(domains[i])
...
jobhero-com-1.disqus.com
vstu.by
diamondpeak.com
netcollections.net
pfgc.com.s7a1.psmtp.com
dudeiwantthat.com
sascha-frank.com
milavia.net

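Building the mapping
The first step is to build the character-to-number mapping: collect every character that appears in any of the domains, then assign a number to each of them.

  • Finding which characters appear amounts to deduplicating all characters, so it is enough to join all domains into one string and put that string into a set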
>>> uni = set(''.join(domains))
>>> uni
{'a', 'r', 'o', 'b', '-', 'j', 'u', 'g', '7', 'c', 'y', '1', 'd', 'n', 'w', 'v', 'i', 't', 'm', 'p', 'h', 'k', 's', 'f', 'l', 'e', 'q', '.'}
>>> len(uni)
28
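  • This gives every character in use, 28 in total. Next, each character needs a number; here we use its position in the enumeration plus 1, so that 0 stays free for padding later. Note that a set has no guaranteed iteration order, so the exact numbering can differ between runs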
>>> char_dict = {x:i+1 for i, x in enumerate(uni)}
>>> char_dict
{'a': 1, 'r': 2, 'o': 3, 'b': 4, '-': 5, 'j': 6, 'u': 7, 'g': 8, '7': 9, 'c': 10, 'y': 11, '1': 12, 'd': 13, 'n': 14, 'w': 15, 'v': 16, 'i': 17, 't': 18, 'm': 19, 'p': 20, 'h': 21, 'k': 22, 's': 23, 'f': 24, 'l': 25, 'e': 26, 'q': 27, '.': 28}
  • All 28 characters now have their own index, which gives us the character-to-number mapping. With this mapping, every character in a domain name can be replaced by a number.
>>> X = [[char_dict[char] for char in domain] for domain in domains]
>>> X
[[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19], [16, 23, 18, 7, 28, 4, 11], [13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19], [14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18], [20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19], [13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19], [23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19], [19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18]]
>>> len(domains)
8
>>> # The 8 domains have become 8 rows of numbers. The nested list above is hard to read, so we print it row by row
>>> for row in X:
...     print(row)
...
[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19]
[16, 23, 18, 7, 28, 4, 11]
[13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19]
[14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18]
[20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19]
[13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19]
[23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19]
[19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18]
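The mapping step above can be wrapped into small helpers. The following is only a sketch: the function names `build_char_dict`, `encode`, and `decode` are hypothetical, and it sorts the character set (an assumption not in the session above) so that the numbering is the same on every run. The round trip at the end checks that the mapping really is one-to-one.

```python
def build_char_dict(domains):
    """Map every distinct character to an index starting at 1 (0 is left for padding)."""
    # sorted() makes the numbering reproducible; a bare set has no guaranteed order
    chars = sorted(set(''.join(domains)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(domain, char_dict):
    """Replace each character of a domain with its index."""
    return [char_dict[ch] for ch in domain]


def decode(row, char_dict):
    """Invert encode(); index 0 is treated as padding and skipped."""
    index_to_char = {i: ch for ch, i in char_dict.items()}
    return ''.join(index_to_char[i] for i in row if i != 0)


domains = ['vstu.by', 'milavia.net', 'diamondpeak.com']
char_dict = build_char_dict(domains)
X = [encode(d, char_dict) for d in domains]

# The mapping is one-to-one, so decoding recovers the original strings
assert [decode(row, char_dict) for row in X] == domains
print(X[0])
```

Because `decode` skips zeros, it should also work on rows that have been padded with 0.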
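  • This completes the conversion from characters to numbers. The rows clearly have different lengths, so each row needs to be padded
  • First find how many elements the longest row has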
>>> maxlen = max([len(row) for row in X])
>>> maxlen
24
  • Pad the shorter lists with zeros at the end
>>> for row in X:
...     for i in range(maxlen-len(row)):
...         row.append(0)
...
>>> X
[[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19], [16, 23, 18, 7, 28, 4, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0], [14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0], [20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19, 0], [13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0], [23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0], [19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
>>> # Print row by row for easier inspection
>>> for row in X:
...     print(row)
...
[6, 3, 4, 21, 26, 2, 3, 5, 10, 3, 19, 5, 12, 28, 13, 17, 23, 27, 7, 23, 28, 10, 3, 19]
[16, 23, 18, 7, 28, 4, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[13, 17, 1, 19, 3, 14, 13, 20, 26, 1, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[14, 26, 18, 10, 3, 25, 25, 26, 10, 18, 17, 3, 14, 23, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0]
[20, 24, 8, 10, 28, 10, 3, 19, 28, 23, 9, 1, 12, 28, 20, 23, 19, 18, 20, 28, 10, 3, 19, 0]
[13, 7, 13, 26, 17, 15, 1, 14, 18, 18, 21, 1, 18, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0]
[23, 1, 23, 10, 21, 1, 5, 24, 2, 1, 14, 22, 28, 10, 3, 19, 0, 0, 0, 0, 0, 0, 0, 0]
[19, 17, 25, 1, 16, 17, 1, 28, 14, 26, 18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  • Every row of the 2-D list now has the same length. To make computation easier, convert the data from a list to an ndarray
>>> import numpy as np
>>> train = np.array(X)
>>> train.shape
(8, 24)
>>> for row in train:
...     print(row)
...
[ 6  3  4 21 26  2  3  5 10  3 19  5 12 28 13 17 23 27  7 23 28 10  3 19]
[16 23 18  7 28  4 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
[13 17  1 19  3 14 13 20 26  1 22 28 10  3 19  0  0  0  0  0  0  0  0  0]
[14 26 18 10  3 25 25 26 10 18 17  3 14 23 28 14 26 18  0  0  0  0  0  0]
[20 24  8 10 28 10  3 19 28 23  9  1 12 28 20 23 19 18 20 28 10  3 19  0]
[13  7 13 26 17 15  1 14 18 18 21  1 18 28 10  3 19  0  0  0  0  0  0  0]
[23  1 23 10 21  1  5 24  2  1 14 22 28 10  3 19  0  0  0  0  0  0  0  0]
[19 17 25  1 16 17  1 28 14 26 18  0  0  0  0  0  0  0  0  0  0  0  0  0]
>>> # The data is now an ndarray; the output above is the resulting matrix
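The append loop and the conversion to an ndarray can also be combined by preallocating a zero matrix and copying each row into it. The sketch below assumes NumPy is installed; `pad_rows` and its parameters are hypothetical names chosen for illustration. It pads at the end by default, as in the session above, and can optionally pad at the front instead.

```python
import numpy as np


def pad_rows(rows, maxlen=None, padding='post', value=0):
    """Pad variable-length integer rows into a 2-D ndarray.

    padding='post' puts the fill value after the data, 'pre' puts it before.
    Rows longer than maxlen are cut at the end (a choice made for this sketch).
    """
    if padding not in ('pre', 'post'):
        raise ValueError("padding must be 'pre' or 'post'")
    if maxlen is None:
        maxlen = max(len(row) for row in rows)
    out = np.full((len(rows), maxlen), value, dtype=np.int32)
    for i, row in enumerate(rows):
        row = row[:maxlen]
        if padding == 'post':
            out[i, :len(row)] = row
        else:
            out[i, maxlen - len(row):] = row
    return out


rows = [[16, 23, 18, 7, 28, 4, 11], [19, 17, 25, 1]]
post = pad_rows(rows)                 # zeros after the data
pre = pad_rows(rows, padding='pre')   # zeros before the data
print(post)
print(pre)
```

Unlike the in-place loop, this version leaves the original lists untouched, which may be convenient if the unpadded rows are still needed.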
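  • Padding and conversion to an ndarray can be done in one step with keras.preprocessing.sequence.pad_sequences(), which is typically much more efficient on large datasets. Note that pad_sequences pads at the front by default; pass padding='post' to pad at the end as was done above
  • The complete code follows; the padding step calls the Keras function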
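import pickle
from keras.preprocessing import sequence

def domain2vector(domain_path, save_path):
    """Load a pickled list of domains, vectorize it, and pickle the result."""
    with open(domain_path, 'rb') as f:  # load the pickled list of domains
        domains = pickle.load(f)
    # Map each distinct character to an index starting at 1 (0 is the padding value);
    # set() deduplicates, sorted() keeps the numbering the same across runs
    char_dict = {x: i+1 for i, x in enumerate(sorted(set(''.join(domains))))}
    X = [[char_dict[char] for char in domain] for domain in domains]  # domains to numbers
    maxlen = max(len(row) for row in X)
    # Pad with zeros at the end up to the longest row and convert to an ndarray
    vector = sequence.pad_sequences(X, maxlen=maxlen, padding='post')
    with open(save_path, 'wb') as f:  # save the vectorized result
        pickle.dump(vector, f, pickle.HIGHEST_PROTOCOL)

    return vector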
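One thing to keep in mind is that domain2vector builds char_dict inside the function and returns only the padded array, so the mapping itself is not available afterwards. If domains seen later have to be encoded in the same way, the mapping and maxlen would need to be kept and reused. The following is a rough sketch of that idea, not part of the code above: `vectorize_with_vocab` and `unknown_index` are hypothetical names, and mapping unseen characters to one extra index is just one possible convention.

```python
import numpy as np


def vectorize_with_vocab(domains, char_dict, maxlen, unknown_index=None):
    """Encode domains with an existing char_dict and pad or truncate to maxlen."""
    if unknown_index is None:
        # Hypothetical convention: one index past the largest known one
        unknown_index = max(char_dict.values()) + 1
    out = np.zeros((len(domains), maxlen), dtype=np.int32)
    for i, domain in enumerate(domains):
        # Characters missing from char_dict fall back to unknown_index;
        # domains longer than maxlen are cut at the end
        row = [char_dict.get(ch, unknown_index) for ch in domain][:maxlen]
        out[i, :len(row)] = row
    return out


# A tiny mapping for illustration only
char_dict = {'.': 1, 'a': 2, 'b': 3}
vec = vectorize_with_vocab(['ab.a', 'a.x', 'abababab'], char_dict, maxlen=5)
print(vec)
```

Here 'x' is not in the mapping and receives the extra index 4, while the last domain is truncated to five characters.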
