[Machine Learning] (5.4) Clustering: Density-Based Clustering (DBSCAN, MDCA)


1. Density-Based Clustering Methods

[Figure: overview of density-based clustering methods]

2. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical methods, it defines a cluster as the maximal set of density-connected points, so it can carve regions of sufficiently high density into clusters and discover clusters of arbitrary shape in noisy data.
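For reference, scikit-learn ships a ready-made implementation; a minimal sketch, with illustrative rather than tuned parameter values:

# Quick reference run with scikit-learn's built-in DBSCAN; eps / min_samples
# here are illustrative values, not tuned for any particular dataset.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)   # label -1 marks noise
print(np.unique(labels))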

2.1 Key Concepts in DBSCAN

[Figures: DBSCAN concept definitions (eps-neighborhood, core point, density-reachable, density-connected)]
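Expressed in code, these definitions boil down to neighbor counts within eps. The following is a small sketch of my own (eps and min_samples are placeholders) that labels every point as core, border, or noise:

# Label points as core / border / noise, following the definitions above.
import numpy as np

def label_points(data, eps, min_samples):
    # Pairwise Euclidean distances.
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    neighbors = d < eps                      # eps-neighborhood (includes self)
    is_core = neighbors.sum(axis=1) >= min_samples
    # Border: not core itself, but inside some core point's neighborhood.
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    is_noise = ~is_core & ~is_border
    return is_core, is_border, is_noise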

2.2 DBSCAN Algorithm Flow

[Figure: DBSCAN algorithm flow]
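In outline, the standard flow is (my summary, which may differ in detail from the original figure):

1. Compute every point's eps-neighborhood; points with at least min_samples neighbors are core points.
2. Pick an unprocessed core point, open a new cluster, and pull in everything density-reachable from it by expanding through neighboring core points.
3. Repeat step 2 until every core point belongs to a cluster.
4. Points that fall inside some core point's neighborhood become border points of that cluster; everything else is noise.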

The code uses a union-find (disjoint-set) structure; for a detailed, illustrated explanation see reference 1: 并查集详解 ——图文解说,简单易懂(转)_mjiansun的博客-CSDN博客.
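As a toy illustration of what union-find provides (independent of the DBSCAN code below):

# Toy union-find demo: merge pairs, then ask which set each element is in.
parent = list(range(5))                  # every element starts as its own root

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving keeps the trees flat
        x = parent[x]
    return x

def union(x, y):
    parent[find(y)] = find(x)            # hang y's root under x's root

union(0, 1)
union(1, 2)                              # now {0, 1, 2} form one set
print([find(i) for i in range(5)])       # -> [0, 0, 0, 3, 4]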

import math

import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets as ds
from sklearn.preprocessing import StandardScaler


def expand(a, b):
    """Widen the interval [a, b] by 10% on each side (for plot limits)."""
    d = (b - a) * 0.1
    return a - d, b + d


def euler_distance(point1: np.ndarray, point2: list) -> float:
    """Euclidean distance between two points of arbitrary dimension."""
    distance = 0.0
    for a, b in zip(point1, point2):
        distance += math.pow(a - b, 2)
    return math.sqrt(distance)


class UnionFindSet(object):
    """Union-find (disjoint-set) structure used to merge density-connected points."""

    def __init__(self, data_num=0):
        self.data_num = data_num
        self.prev = {}
        self.init_uf()

    def init_uf(self):
        # Initially every element is its own root.
        for i in range(self.data_num):
            self.prev[i] = i

    def union_data(self, x, y):
        x_root = self.find_data(x)
        y_root = self.find_data(y)
        if x_root != y_root:
            self.prev[y_root] = x_root

    def find_data(self, x):
        x_root = self.prev[x]
        while self.prev[x_root] != x_root:
            x_root = self.prev[x_root]
        # Path compression: optional, it only speeds up later queries.
        x_copy = x
        while x_copy != x_root:
            temp = self.prev[x_copy]
            self.prev[x_copy] = x_root
            x_copy = temp
        return x_root

    def get_uf_set(self):
        return self.prev


class DBSCAN(object):
    def __init__(self, eps, min_samples):
        self.eps = eps
        self.min_samples = min_samples

    def get_cluster(self, data):
        eps = self.eps
        min_samples = self.min_samples
        data_num, feature_num = data.shape

        # sim[i]: indices of all points within eps of point i (its eps-neighborhood).
        sim = [[] for _ in range(data_num)]
        for i in range(data_num):
            for j in range(data_num):
                d = euler_distance(data[i, :], data[j, :])
                if d < eps:
                    sim[i].append(j)

        # Union every core point with all points in its neighborhood.
        uf = UnionFindSet(data_num)
        for i in range(data_num):
            single_data = sim[i]
            if len(single_data) <= min_samples:
                continue  # not a core point
            for j in single_data:
                uf.union_data(i, j)

        # Number the connected components that contain at least one core point.
        cluster_cls = {}
        cls_num = 0
        for i in range(data_num):
            r = uf.find_data(i)
            if len(sim[i]) <= min_samples:
                continue
            if r not in cluster_cls:
                cls_num += 1
                cluster_cls[r] = cls_num

        # Label every point; points left at 0 are noise.
        data_cls = [0] * data_num
        for i in range(data_num):
            r = uf.find_data(i)
            if r != i or len(sim[i]) > min_samples:
                data_cls[i] = cluster_cls[r]
        return data_cls


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers,
                            cluster_std=[0.5, 0.25, 0.7, 0.5], random_state=0)
    data = StandardScaler().fit_transform(data)

    # Candidate (eps, min_samples) pairs; only the first is used here.
    params = ((0.19, 5), (0.2, 10), (0.2, 15), (0.3, 5), (0.3, 10), (0.3, 15))
    eps, min_samples = params[0]

    dbscan = DBSCAN(eps, min_samples)
    data_cls = dbscan.get_cluster(data)

    y_hat = np.array(data_cls)
    core_indices = np.zeros_like(y_hat, dtype=bool)
    core_indices[np.where(y_hat >= 1)] = True
    y_unique = np.unique(y_hat)
    n_clusters = y_unique.size - (1 if 0 in y_hat else 0)
    print(y_unique, 'number of clusters:', n_clusters)

    # plt.subplot(2, 3, i + 1)
    clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
    for k, clr in zip(y_unique, clrs):
        cur = (y_hat == k)
        if k == 0:
            plt.scatter(data[cur, 0], data[cur, 1], s=10, c='k')  # noise in black
            continue
        plt.scatter(data[cur, 0], data[cur, 1], s=15, c=clr, edgecolors='k')
        plt.scatter(data[cur & core_indices][:, 0],
                    data[cur & core_indices][:, 1],
                    s=30, c=clr, marker='o', edgecolors='k')
    x1_min, x2_min = np.min(data, axis=0)
    x1_max, x2_max = np.max(data, axis=0)
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.grid(True, ls=':', color='#808080')
    plt.title(r'$\epsilon$ = %.1f, m = %d, clusters: %d' % (eps, min_samples, n_clusters),
              fontsize=12)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
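The commented-out plt.subplot(2, 3, i + 1) suggests that the six parameter pairs were meant to be compared side by side; a sketch of that loop (my reconstruction, reusing DBSCAN, data, and params from the script above):

# Compare all six (eps, min_samples) pairs side by side in a 2x3 grid.
# Assumes DBSCAN, data, and params are already defined, as in the script above.
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(12, 8))
for idx, (eps, min_samples) in enumerate(params):
    y_hat = np.array(DBSCAN(eps, min_samples).get_cluster(data))
    plt.subplot(2, 3, idx + 1)
    for kk in np.unique(y_hat):
        cur = (y_hat == kk)
        if kk == 0:
            plt.scatter(data[cur, 0], data[cur, 1], s=10, c='k')  # noise
        else:
            plt.scatter(data[cur, 0], data[cur, 1], s=10)
    plt.title(r'$\epsilon$=%.2f, m=%d' % (eps, min_samples), fontsize=10)
plt.tight_layout()
plt.show()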

[Figure: DBSCAN clustering result on the four-blob dataset]

3. MDCA

MDCA (Maximum Density Clustering Application) brings the density idea into partition-based clustering: it uses density, rather than initial centroids, as the criterion for cluster membership, so it can determine the number of clusters automatically and discover clusters of arbitrary shape. MDCA generally keeps no separate noise set, which also avoids discarding large numbers of objects because of a poorly chosen threshold.

My implementation below is a reproduction of the method described in reference 2. I am not sure it is correct, because I am not entirely satisfied with the clustering results; they differ from what I expected.

3.1 Core Idea

1. Maximum-density point: density can be expressed as the reciprocal of the sum of the distances to the K nearest neighbors (a runnable sketch of steps 1 to 3 follows this list).

[Figure: density formula, p(x) = K / (sum of distances from x to its K nearest neighbors)]

2. Density curve: re-sort the dataset by the Euclidean distance of every object to x, the current maximum-density point.

[Figures: density curve, i.e. density plotted against distance to the maximum-density point; clusters appear as peaks separated by valleys]

3. Assign the points before the first valley of the density curve to one cluster, then remove them from the dataset.

4. Repeat steps 1 to 3 until the density of every remaining point falls below the threshold ρ0.

5. For two clusters Ci and Cj, use the distance between their closest samples as the inter-cluster distance.

[Figure: inter-cluster distance, d(Ci, Cj) = minimum of d(p, q) over all p in Ci and q in Cj]

6. If the inter-cluster distance is below the threshold d0, merge the two clusters.
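A minimal sketch of steps 1 to 3 (my own illustration; the choice of k and the valley test are simplified, not the reference implementation):

# Steps 1-3 in isolation: estimate densities, sort by distance to the
# maximum-density point, and cut the density curve at its first valley.
import numpy as np

def knn_density(data, k=3):
    # Density = k / (sum of distances to the k nearest neighbors).
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)
    return k / d_sorted[:, 1:k + 1].sum(axis=1)  # column 0 is self (distance 0)

def first_cluster(data, k=3):
    p = knn_density(data, k)
    center = np.argmax(p)                             # maximum-density point
    d_to_center = np.linalg.norm(data - data[center], axis=1)
    order = np.argsort(d_to_center)                   # re-sort by distance to it
    curve = p[order]                                  # the density curve
    # First valley: the first local minimum of the curve.
    valley = next((i for i in range(1, len(curve) - 1)
                   if curve[i] < curve[i - 1] and curve[i] <= curve[i + 1]),
                  len(curve))
    return order[:valley]                             # indices of the first cluster

Removing the returned indices from the dataset and repeating the call reproduces step 4; the merge rule of steps 5 and 6 is applied afterwards, as in the full script below.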

3.2 Hands-On Code

import numpy as np
import sklearn.datasets as ds
import matplotlib.pyplot as plt


def expand(a, b):
    """Widen the interval [a, b] by 10% on each side (for plot limits)."""
    d = (b - a) * 0.1
    return a - d, b + d


def find_cls(p, y_pred, distance, density_min, cls_num):
    """Recursively peel one cluster at a time off the unassigned points."""
    candidate_indices = np.where(y_pred == 0)[0]
    cur_p = p[candidate_indices]
    # Maximum-density point among the remaining candidates, and the distances
    # from it to every candidate.
    max_idx = candidate_indices[np.argmax(cur_p)]
    cur_distance = distance[max_idx, candidate_indices]
    # Sort candidates by distance to the maximum-density point: the densities
    # in that order form the density curve.
    cur_distance_sort = np.argsort(cur_distance)
    cur_curve = cur_p[cur_distance_sort]
    ori_indices = candidate_indices[cur_distance_sort]  # map back to original indices
    # plt.plot(cur_distance[cur_distance_sort], cur_curve)
    # plt.show()
    if np.max(cur_curve) > density_min:
        loc = np.where(cur_curve <= density_min)[0]  # positions below the threshold
        loc_length = len(loc)
        if loc_length >= 2:
            # The first run of nearly consecutive low-density positions marks
            # the first valley of the curve.
            for i in range(loc_length - 1):
                if loc[i + 1] - loc[i] <= 2:
                    break
            # Everything before the valley is the current cluster; recurse on the rest.
            y_pred[ori_indices[:loc[i + 1]]] = cls_num
            y_pred, cls_num = find_cls(p, y_pred, distance, density_min, cls_num + 1)
    return y_pred, cls_num


if __name__ == "__main__":
    N = 1000
    centers = [[1, 2], [-1, -1], [1, -1], [-1, 1]]
    data, y = ds.make_blobs(N, n_features=2, centers=centers,
                            cluster_std=[0.25, 0.25, 0.25, 0.25], random_state=0)
    # Min-max scale the whole dataset into [0, 1] (StandardScaler also works).
    max_data = np.max(data)
    min_data = np.min(data)
    data = (data - min_data) / (max_data - min_data)

    # Number of neighbors used when estimating density.
    k = 3
    # Minimum density threshold rho0. Density is the reciprocal of the sum of
    # k-nearest-neighbor (squared) distances; how to choose this value well is
    # still unclear to me.
    density_min = 10000
    # Minimum inter-cluster distance threshold d0 (in squared-distance units).
    distance_min = 0.008

    sample_num, feature_num = data.shape
    y_pred = np.zeros(sample_num)   # predicted labels; 0 means unassigned
    p = np.zeros(sample_num)        # density of each sample
    # distance[i, j]: squared Euclidean distance between samples i and j.
    distance = np.zeros((sample_num, sample_num))

    # A k-nearest-neighbor density works better here than counting points
    # inside a radius: it rarely yields many points with identical density,
    # while the radius-count definition can split one region into many classes.
    for i in range(sample_num):
        distance[i, :] = np.sum((data[i, :] - data) ** 2, axis=1)
        temp_sort = np.argsort(distance[i, :])
        # temp_sort[0] is the point itself (distance 0), so [:k+1] sums the
        # distances to the k nearest neighbors.
        p[i] = k / np.sum(distance[i, temp_sort[:k + 1]])

    # Extract the initial clusters.
    cls_num = 1
    y_pred, cls_num = find_cls(p, y_pred, distance, density_min, cls_num)

    # Merge clusters whose closest samples are within distance_min, rescanning
    # from scratch after every merge.
    merged = True
    while merged:
        merged = False
        for i in range(1, cls_num):
            for j in range(i + 1, cls_num):
                a = np.where(y_pred == i)[0]
                b = np.where(y_pred == j)[0]
                if np.min(distance[a, :][:, b]) <= distance_min:
                    y_pred[y_pred == j] = i        # merge cluster j into cluster i
                    y_pred[y_pred > j] -= 1        # renumber the later clusters
                    cls_num -= 1
                    merged = True
                    break
            if merged:
                break

    # Attach leftover scattered points to the nearest cluster when close
    # enough; points farther than distance_min from every cluster keep label 0.
    scatter_indices = np.where(y_pred == 0)[0]
    loc_indices = np.where(y_pred > 0)[0]
    scatter_cls_distance = distance[scatter_indices, :][:, loc_indices]
    scatter_cls_min_distance = np.min(scatter_cls_distance, axis=1)
    scatter_cls_min_indices = np.argmin(scatter_cls_distance, axis=1)
    y_pred[scatter_indices] = (y_pred[loc_indices[scatter_cls_min_indices]]
                               * (scatter_cls_min_distance <= distance_min))

    # Plot: label 0 in black, each cluster in its own color.
    y_hat = np.array(y_pred)
    core_indices = np.zeros_like(y_hat, dtype=bool)
    core_indices[np.where(y_hat >= 1)] = True
    y_unique = np.unique(y_hat)
    n_clusters = y_unique.size - (1 if 0 in y_hat else 0)
    print(y_unique, 'number of clusters:', n_clusters)
    clrs = plt.cm.Spectral(np.linspace(0, 0.8, y_unique.size))
    for k, clr in zip(y_unique, clrs):
        cur = (y_hat == k)
        if k == 0:
            plt.scatter(data[cur, 0], data[cur, 1], s=10, c='k')
            continue
        plt.scatter(data[cur, 0], data[cur, 1], s=15, c=clr, edgecolors='k')
        plt.scatter(data[cur & core_indices][:, 0],
                    data[cur & core_indices][:, 1],
                    s=30, c=clr, marker='o', edgecolors='k')
    x1_min, x2_min = np.min(data, axis=0)
    x1_max, x2_max = np.max(data, axis=0)
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.grid(True, ls=':', color='#808080')
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()
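The commented-out plt.plot inside find_cls hints at how density_min can be eyeballed; here is a small reconstruction of that debug plot, which draws the density curve so the valleys and the threshold can be read off directly. Note that distance holds squared Euclidean distances, so density_min and distance_min are both in squared-distance units:

# Plot the density curve for the whole dataset so density_min can be chosen
# by eye: clusters show up as peaks separated by valleys, and the threshold
# should sit above the valley floors. Assumes p, distance, and density_min
# are already defined, as in the script above.
import matplotlib.pyplot as plt
import numpy as np

center = np.argmax(p)                        # maximum-density point
order = np.argsort(distance[center, :])      # sort by squared distance to it
plt.plot(distance[center, order], p[order])
plt.axhline(density_min, ls=':', color='r')  # current threshold, for reference
plt.xlabel('squared distance to the maximum-density point')
plt.ylabel('density')
plt.show()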

3.3 Experimental Results

[Figure: MDCA clustering result on the four-blob dataset]

3.4 Performance Comparison (personally I find it hard to use; the parameters are very difficult to tune)

  • Advantages:
    • Insensitive to noise in the data
    • Does not depend on the choice of initial points
    • Can produce clusters of arbitrary shape
  • Disadvantages:
    • The algorithm is complex and clustering is relatively slow
    • The density threshold must be fixed before running
    • Distance metrics degrade on high-dimensional data
    • Ill-suited to datasets whose density varies greatly, or whose overall density is essentially uniform

References

1. 并查集详解 ——图文解说,简单易懂(转)_mjiansun的博客-CSDN博客

2. 密度最大值聚类(MDCA) | GitHub
