Python Algorithms Day 11 - Feature Extraction

Chap. II Machine Learning

http://img2.58codes.com/2024/20138527JSvpTArTcw.png

https://yourfreetemplates.com/free-machine-learning-diagram/

Part 2. Feature Engineering

2-3. Feature Extraction

Like feature selection, feature extraction comes in several flavors. Below we cover PCA, LDA, and Kernel PCA.

A. Principal Component Analysis (PCA)

PCA is unsupervised (it does not need Y) and assumes the data can be separated linearly.
It transforms the original data by projecting it onto a lower-dimensional feature space, compressing it while keeping as much information (variance) as possible along the new axes.
Because it is unsupervised, it generalizes well when applied to new data, and it is a tool large companies use routinely.

*In the figure below, x1 + x2 are transformed into PC1 + PC2.
http://img2.58codes.com/2024/201385273KNzfN4NfD.png
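As a minimal sketch of the idea behind the figure (not part of the wine example below; the toy array values are made up purely for illustration), PCA centers the data, eigen-decomposes its covariance matrix, and projects onto the leading eigenvector:

import numpy as np

# Toy 2-D data (hypothetical values, for illustration only)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Eigen-decompose the covariance matrix
cov = np.cov(X_centered.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eig_vals)[::-1]         # re-sort descending
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# 3. Project onto the top principal component (x1 + x2 -> PC1)
pc1 = X_centered.dot(eig_vecs[:, :1])
print(eig_vals / eig_vals.sum())   # explained variance ratio
print(pc1.ravel())                 # 1-D representation of the 2-D points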

The following uses the wine dataset as an example:

# 1. Datasets
import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']
df_wine.head()

http://img2.58codes.com/2024/20138527prjZINFHi3.png

# Define y
df_wine['Class label'].value_counts()
>>  2    71
    1    59
    3    48
    Name: Class label, dtype: int64

# 2. Clean Data
# Not needed for this dataset

# 3. Split Data
# Hold out 30% as test data
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

4-1. Standardization

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# There are 13 features
X_train_std.shape, X_test_std.shape
>>  ((124, 13), (54, 13))

Sklearn's built-in PCA

(See Supplement 1 for a from-scratch PCA.)

from sklearn.decomposition import PCA

pca1 = PCA()
X_train_pca = pca1.fit_transform(X_train_std)
eigen_vals = pca1.explained_variance_ratio_
eigen_vals
>>  array([0.36951469, 0.18434927, 0.11815159, 0.07334252, 0.06422108,
           0.05051724, 0.03954654, 0.02643918, 0.02389319, 0.01629614,
           0.01380021, 0.01172226, 0.00820609])
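From these ratios, the first two components together already explain roughly 55% of the variance (0.3695 + 0.1843). A quick way to inspect the running total (a small addition, not in the original post):

import numpy as np

# Cumulative explained variance of the fitted PCA
np.cumsum(pca1.explained_variance_ratio_)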

Keep the top 2 components as the projection weights (w has 2 columns):

pca2 = PCA(n_components=2)
X_train_pca = pca2.fit_transform(X_train_std)
X_test_pca = pca2.transform(X_test_std)
X_train_pca.shape, X_test_pca.shape
>>  ((124, 2), (54, 2))

Plot:

import matplotlib.pyplot as plt

plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()

http://img2.58codes.com/2024/20138527b0baL8xw3Q.png

You can also let PCA decide how many components are needed to reach 90% explained variance:

pca5 = PCA(0.9)
X_train_pca = pca5.fit_transform(X_train_std)
pca5.explained_variance_ratio_

PS. With this setting you can no longer draw the 2-D scatter plot above, since more than two components are kept.
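To check how many components were actually kept and how much variance they cover, something like the following should work (n_components_ is the attribute sklearn's PCA exposes after fitting):

import numpy as np

# Number of components selected to reach the 0.9 threshold,
# and the cumulative variance they explain
print(pca5.n_components_)
print(np.cumsum(pca5.explained_variance_ratio_))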

B. Linear Discriminant Analysis (LDA)

LDA is supervised (it needs Y) and also assumes linear separability.
Because it is supervised, it seeks to make the within-class scatter matrix (Sw) as small as possible and the between-class scatter matrix (Sb) as large as possible.
Like PCA, the resulting components are sorted in descending order of importance.

In the figure, the scatter inside each blue circle is Sw, and the scatter between the blue and yellow circles is Sb.
http://img2.58codes.com/2024/20138527SFI2QdjoYb.png
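As a rough sketch of the two matrices (a condensed version of the fuller derivation in Supplement 2, using the wine variables prepared above):

import numpy as np

# Within-class scatter S_w: spread of samples around their own class mean
# Between-class scatter S_b: spread of the class means around the overall mean
overall_mean = X_train_std.mean(axis=0)
d = X_train_std.shape[1]
S_w = np.zeros((d, d))
S_b = np.zeros((d, d))
for label in np.unique(y_train):
    X_c = X_train_std[y_train == label]
    mean_c = X_c.mean(axis=0)
    S_w += (X_c - mean_c).T.dot(X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(d, 1)
    S_b += X_c.shape[0] * diff.dot(diff.T)

# LDA keeps the directions that maximize S_b relative to S_w
print(S_w.shape, S_b.shape)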

We reuse the wine preprocessing from above and start directly with LDA.

Sklearn's built-in LDA

(See Supplement 2 for a from-scratch LDA.)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)

Next, pick an algorithm to score the result:

# Plot the training data in the LDA space
from sklearn.linear_model import LogisticRegression
# plot_decision_regions is a decision-region plotting helper
# (e.g. mlxtend.plotting.plot_decision_regions, or a custom implementation)

lr = LogisticRegression()
lr = lr.fit(X_train_lda, y_train)
plot_decision_regions(X_train_lda, y_train, classifier=lr)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()

http://img2.58codes.com/2024/20138527iodVWZi16e.png

The result is noticeably better than PCA (naturally, since LDA also takes Y into account).
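To put a number on that comparison, one option (not in the original post) is to fit the same logistic regression on the 2-component PCA features and on the LDA features, then compare test accuracy:

from sklearn.linear_model import LogisticRegression

# Re-create the 2-component PCA features so the comparison is apples-to-apples
X_train_pca2 = pca2.transform(X_train_std)
X_test_pca2 = pca2.transform(X_test_std)
X_test_lda = lda.transform(X_test_std)

lr_pca = LogisticRegression().fit(X_train_pca2, y_train)
lr_lda = LogisticRegression().fit(X_train_lda, y_train)

print('PCA (2 comps) + LogisticRegression:', lr_pca.score(X_test_pca2, y_test))
print('LDA (2 comps) + LogisticRegression:', lr_lda.score(X_test_lda, y_test))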

C. Kernel PCA (KPCA)

The following uses make_circles as the dataset:

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)
plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue', marker='o', alpha=0.5)
plt.tight_layout()
plt.show()

http://img2.58codes.com/2024/20138527ZqiKTJu7zP.png

Part I. Transform with ordinary PCA

from sklearn.decomposition import PCA
import numpy as np

scikit_pca = PCA(n_components=2)
X_spca = scikit_pca.fit_transform(X)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
ax[0].scatter(X_spca[y == 0, 0], X_spca[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_spca[y == 1, 0], X_spca[y == 1, 1],
              color='blue', marker='o', alpha=0.5)
ax[1].scatter(X_spca[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_spca[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)
ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')
plt.tight_layout()
plt.savefig('Pic/KernalPCA (numpy) Ans02-1.png', dpi=300)
plt.show()

http://img2.58codes.com/2024/20138527QqwHK52gEv.png

Part II. Sklearn's built-in KPCA

(See Supplement 3 for a from-scratch KPCA.)

A side note on gamma: it is the RBF kernel coefficient and controls how strongly nearby points are weighted.
The larger gamma is, the more the points close to the decision boundary shape it, which can lead to overfitting.
PS. There is a video on the KPCA gamma value here.

from sklearn.decomposition import KernelPCA

# kernel: 'linear' = ordinary PCA, 'poly' = polynomial, 'rbf' = radial basis function, 'sigmoid' = sigmoid (logistic)
# n_components: number of dimensions to keep
# gamma: kernel coefficient for the poly, rbf and sigmoid kernels
clf = KernelPCA(kernel='rbf', n_components=2, gamma=15)
X_kpca2 = clf.fit_transform(X)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
ax[0].scatter(X_kpca2[y == 0, 0], X_kpca2[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_kpca2[y == 1, 0], X_kpca2[y == 1, 1],
              color='blue', marker='o', alpha=0.5)
ax[1].scatter(X_kpca2[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_kpca2[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)
ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')
plt.tight_layout()
plt.show()

http://img2.58codes.com/2024/20138527ZioQxLc62O.png
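To see the gamma effect described above, a quick sweep over a few gamma values (hypothetical choices, just for illustration) can be plotted the same way:

from sklearn.decomposition import KernelPCA
import matplotlib.pyplot as plt

# Small, medium and large gamma: the separation of the two circles changes a lot
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 3))
for i, g in enumerate([1, 15, 100]):
    kpca = KernelPCA(kernel='rbf', n_components=2, gamma=g)
    X_g = kpca.fit_transform(X)
    ax[i].scatter(X_g[y == 0, 0], X_g[y == 0, 1], color='red', marker='^', alpha=0.5)
    ax[i].scatter(X_g[y == 1, 0], X_g[y == 1, 1], color='blue', marker='o', alpha=0.5)
    ax[i].set_title('gamma = %s' % g)
plt.tight_layout()
plt.show()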

Conclusion:

Of the three feature extraction methods, PCA is the most general and can handle most data transformations.
LDA is in a sense a strengthened PCA, but its requirements are stricter:
in practice the labels Y are often incomplete, and even when they exist, there may not be spare resources to annotate every single record.
Finally, KPCA can handle more complex, non-linearly separable data (polynomial, circular, sigmoid-like).
.
.
.
.
.
*Supplement 1:
A from-scratch PCA

# Get the eigenvalues & eigenvectors of X_train_std
import numpy as np

cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

# Sort the eigenvalues and express each as a proportion of the total
tot = sum(eigen_vals)

import matplotlib.pyplot as plt

# Bar chart: individual explained variance
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
plt.bar(range(1, 14), var_exp, alpha=0.5, align='center', label='individual explained variance')

# Step chart: cumulative explained variance
cum_var_exp = np.cumsum(var_exp)
plt.step(range(1, 14), cum_var_exp, where='mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.axhline(0.9, color='r', linestyle='--', linewidth=1)
plt.show()

http://img2.58codes.com/2024/20138527ZiNzZl6sMn.png

# Pair each eigenvalue with its eigenvector in one list
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]

# Sort the list from largest to smallest eigenvalue
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

Define the weights (projection matrix)!!!

# Add an axis to the two leading eigenvectors and stack them into a 13x2 matrix
w2 = np.hstack((eigen_pairs[0][1][:, np.newaxis],
                eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w2)
>>  Matrix W:
    [[-0.13724218  0.50303478]
     [ 0.24724326  0.16487119]
     [-0.02545159  0.24456476]
     [ 0.20694508 -0.11352904]
     [-0.15436582  0.28974518]
     [-0.39376952  0.05080104]
     [-0.41735106 -0.02287338]
     [ 0.30572896  0.09048885]
     [-0.30668347  0.00835233]
     [ 0.07554066  0.54977581]
     [-0.32613263 -0.20716433]
     [-0.36861022 -0.24902536]
     [-0.29669651  0.38022942]]

Transform the features from 13 dimensions down to 2:

X_train_pca = X_train_std.dot(w2)
X_train_std.shape, X_train_pca.shape
>>  ((124, 13), (124, 2))
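As a sanity check (not in the original post), the hand-rolled projection should agree with sklearn's 2-component PCA from earlier, up to a possible sign flip per component:

import numpy as np
from sklearn.decomposition import PCA

# Columns may differ only by sign, so compare absolute values
X_train_pca_sk = PCA(n_components=2).fit_transform(X_train_std)
print(np.allclose(np.abs(X_train_pca), np.abs(X_train_pca_sk)))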

Plot the result to see whether the 2 components can separate the 3 wine classes.

colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    # Use the first principal component as the x axis and the second as the y axis
    pc1 = X_train_pca[y_train == l, 0]
    pc2 = X_train_pca[y_train == l, 1]
    print(pc1.shape, pc2.shape)
    plt.scatter(pc1, pc2, c=c, label=l, marker=m)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()

*Supplement 2:
A from-scratch LDA

np.set_printoptions(precision=4)

# Mean vector of each class
mean_vecs = []
for label in range(1, 4):
    mean_vecs.append(np.mean(X_train_std[y_train == label], axis=0))
    print('MV %s: %s\n' % (label, mean_vecs[label - 1]))

# Within-class scatter matrix S_W
d = 13  # number of original features
S_W = np.zeros((d, d))
for label, mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.zeros((d, d))  # scatter matrix for each class
    for row in X_train_std[y_train == label]:
        row, mv = row.reshape(d, 1), mv.reshape(d, 1)  # make column vectors
        class_scatter += (row - mv).dot((row - mv).T)
    S_W += class_scatter  # sum class scatter matrices
print('Within-class scatter matrix: %sx%s' % (S_W.shape[0], S_W.shape[1]))
>>  Within-class scatter matrix: 13x13

# Scaled (covariance-based) within-class scatter matrix
d = 13  # number of original features
S_W = np.zeros((d, d))
for label, mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.cov(X_train_std[y_train == label].T)
    S_W += class_scatter
print('Scaled within-class scatter matrix: %sx%s' % (S_W.shape[0],
                                                     S_W.shape[1]))
>>  Scaled within-class scatter matrix: 13x13

# Between-class scatter matrix S_B
mean_overall = np.mean(X_train_std, axis=0)
d = 13  # number of original features
S_B = np.zeros((d, d))
for i, mean_vec in enumerate(mean_vecs):
    n = X_train[y_train == i + 1, :].shape[0]
    mean_vec = mean_vec.reshape(d, 1)  # make column vector
    mean_overall = mean_overall.reshape(d, 1)  # make column vector
    S_B += n * (mean_vec - mean_overall).dot((mean_vec - mean_overall).T)
print('Between-class scatter matrix: %sx%s' % (S_B.shape[0], S_B.shape[1]))
>>  Between-class scatter matrix: 13x13

# Select linear discriminants for the new feature subspace
eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))

# (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs = sorted(eigen_pairs, key=lambda k: k[0], reverse=True)

# Confirm visually that the list is sorted by decreasing eigenvalue
print('Eigenvalues in descending order:\n')
for eigen_val in eigen_pairs:
    print(eigen_val[0])
>>  Eigenvalues in descending order:
    349.617808905994
    172.76152218979385
    3.478228588635107e-14
    2.842170943040401e-14
    2.0792193804944213e-14
    2.0792193804944213e-14
    1.460811844224635e-14
    1.460811844224635e-14
    1.4555923097122117e-14
    7.813418013637288e-15
    7.813418013637288e-15
    6.314269790397111e-15
    6.314269790397111e-15

# Build the weight matrix
w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
               eigen_pairs[1][1][:, np.newaxis].real))
print('Matrix W:\n', w)
>>  Matrix W:
    [[-0.1481 -0.4092]
     [ 0.0908 -0.1577]
     [-0.0168 -0.3537]
     [ 0.1484  0.3223]
     [-0.0163 -0.0817]
     [ 0.1913  0.0842]
     [-0.7338  0.2823]
     [-0.075  -0.0102]
     [ 0.0018  0.0907]
     [ 0.294  -0.2152]
     [-0.0328  0.2747]
     [-0.3547 -0.0124]
     [-0.3915 -0.5958]]

# Plot
X_train_lda = X_train_std.dot(w)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_lda[y_train == l, 0],
                X_train_lda[y_train == l, 1] * (-1),
                c=c, label=l, marker=m)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()

*Supplement 3:
A from-scratch KPCA

from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]

    gamma: float
      Tuning parameter of the RBF kernel

    n_components: int
      Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
      Projected dataset
    """
    # Calculate the pairwise squared Euclidean distances in the MxN dataset
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert the pairwise distances into a square matrix
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix
    # (np.exp replaces the removed scipy.exp)
    K = np.exp(-gamma * mat_sq_dists)

    # Center the kernel matrix
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Obtain the eigenpairs of the centered kernel matrix
    # scipy.linalg.eigh returns them in ascending order
    eigvals, eigvecs = eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Collect the top k eigenvectors (projected samples)
    X_pc = np.column_stack([eigvecs[:, i]
                            for i in range(n_components)])
    return X_pc


X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
ax[0].scatter(X_kpca[y == 0, 0], X_kpca[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_kpca[y == 1, 0], X_kpca[y == 1, 1],
              color='blue', marker='o', alpha=0.5)
ax[1].scatter(X_kpca[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_kpca[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)
ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')
plt.tight_layout()
plt.show()

.
.
.
.
.

Homework Answer:

Try using sklearn's breast_cancer dataset to perform Feature Selection (by RandomForest).

# Datasets
from sklearn.datasets import load_breast_cancer
import pandas as pd

ds = load_breast_cancer()
df_X = pd.DataFrame(ds.data, columns=ds.feature_names)
df_y = pd.DataFrame(ds.target, columns=['Cancer or Not'])
df_X.head()

http://img2.58codes.com/2024/20138527WMyIc79Y8M.png

# Define y
df_y['Cancer or Not'].unique()
>>  array([0, 1])
# Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3)
X_train.shape, X_test.shape
>>  ((398, 30), (171, 30))
# Random forest
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rfc = RandomForestClassifier(n_estimators=500, random_state=1)
rfc.fit(X_train, y_train)

# List each feature's importance, sorted from largest to smallest
ipt = rfc.feature_importances_
ipt_sort = np.argsort(ipt)[::-1]
for i in range(X_train.shape[1]):
    print(f'{i+1:>2d}) {ds.feature_names[ipt_sort[i]]:<30s} {ipt[ipt_sort[i]]:.4f}')
>>   1) worst perimeter                0.1442
     2) worst radius                   0.1199
    ...
    30) concave points error           0.0035
# Keep only two features
X_train_2 = X_train[['worst radius', 'worst perimeter']]
X_test_2 = X_test[['worst radius', 'worst perimeter']]
X_train_2.shape, X_test_2.shape
>>  ((398, 2), (171, 2))
# Modeling (by LogisticRegression)
from sklearn.linear_model import LogisticRegression as lr

clf = lr(solver='liblinear')
clf.fit(X_train_2, y_train)
print(clf.score(X_test_2, y_test))
>>  0.9064327485380117
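For reference (an extra check, not part of the original answer), fitting the same model on all 30 features shows how little accuracy is given up by keeping only the top two:

# Same logistic regression on the full 30-feature data for comparison
clf_all = lr(solver='liblinear')
clf_all.fit(X_train, y_train.values.ravel())
print(clf_all.score(X_test, y_test))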

.
.
.
.
.

Homework:

Try using sklearn's breast_cancer dataset to perform Feature Extraction (by PCA).

