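Chap.O Programming Basics & Introduction:
Part 1. Languages commonly used for algorithm development:
1-1. Python (free, many packages, good system integration)
1-2. R (free, many packages, poor system integration)
1-3. Matlab (expensive, fewer packages but full-featured, good system integration)
Part 2. What can Python do?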
- Program development
- Website development, crawlers
- Statistics, mathematics
- A first programming language for beginners
- System management scripts
- Data science (focus on analyzing data)
- Data mining algorithms (focus on analyzing data)
- Deep learning: neural networks, CNN/RNN (focus on predicting data)
Part 3. So, which fields is AI applied in?
- Natural Language Understanding
- Computer Vision
- Speech Understanding (speech recognition)
- Robotic Application
- Intelligent Agent: chatbots, AlphaGo, etc.
- Self-driving Car
- Healthcare: MRI image processing, diagnosis, new drug development, etc.
- Smart manufacturing, smart agriculture, smart finance, etc.
With these capabilities in mind, let's get to the main topic.
Chap.I Theoretical Foundations:
With those capabilities and applications in mind, we start from the basic mathematical theory, which includes:
Part 1: Linear algebra
Part 2: Calculus (differential & integral)
Part 3: Vectors
Part 4: Statistics & Probability
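Chap.II Deep Learning and Model Optimization:
Every predictive model follows the 10 major steps shown in the figure below. This chapter explains how each step is applied, in order.
Introduction to sklearn: how to choose a suitable algorithm
Depending on the scenario, machine learning (deep learning included) falls roughly into three types: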
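Part 1. Supervised learning:
The data has been labeled, i.e. every sample comes with a correct answer.
Depending on the type of that answer, supervised learning splits into two kinds:
Classification:
The target takes one of a finite set of categories, and assigning samples to those categories is classification. Examples: Titanic survival, wine classification, etc.
Two examples follow:
A. Classification of "iris" flowers: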
    import pandas as pd
    import numpy as np
    from sklearn import datasets  # the datasets module of Scikit-Learn

    # 1. Data Set
    ds = datasets.load_iris()  # load_iris: a loader function in datasets
    print(ds.DESCR)            # DESCR: description of the loaded data
    X = pd.DataFrame(ds.data, columns=ds.feature_names)
    y = ds.target

    # 2. Data clean (missing value check)
    print(X.isna().sum())
    # >> sepal length (cm)    0
    #    sepal width (cm)     0
    #    petal length (cm)    0
    #    petal width (cm)     0
    #    dtype: int64

    # 3. Feature Engineering
    # No need

    # 4. Data Split (Training data & Test data)
    from sklearn.model_selection import train_test_split
    # test_size=0.2: 20% of the data is held out for testing
    # (the split is random, so the outputs below vary from run to run)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    print(X_train.shape, y_train.shape)
    # >> (120, 4) (120,)

    # 5. Define and train the KNN model
    from sklearn.neighbors import KNeighborsClassifier
    # n_neighbors: a hyperparameter
    clf = KNeighborsClassifier(n_neighbors=3)
    # Fit (train); regression, classification, dimensionality reduction, etc. all use fit()
    clf.fit(X_train, y_train)
    # score: feed in the test data and grade the predictions
    print(f'score={clf.score(X_test, y_test)}')
    # >> score=0.9

    # Check the answers: true labels first, predictions second
    print(' '.join(y_test.astype(str)))
    print(' '.join(clf.predict(X_test).astype(str)))
    # >> 1 2 0 0 0 2 1 1 1 0 1 2 2 2 0 2 1 1 1 0 1 1 2 2 1 1 0 2 2 2
    #    1 2 0 0 0 2 1 1 1 0 1 1 2 2 0 2 1 1 1 0 1 1 2 2 1 2 0 2 1 2

    # Look at the predicted probabilities for the first five test samples
    print(clf.predict_proba(X_test.head()))
    # >> [[0. 1. 0.]
    #     [0. 0. 1.]
    #     [1. 0. 0.]
    #     [1. 0. 0.]
    #     [1. 0. 0.]]
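To make the idea behind `KNeighborsClassifier` concrete, here is a minimal from-scratch sketch of k-nearest-neighbors voting. It assumes Euclidean distance and equal votes per neighbor; the helper name `knn_predict_proba` and the toy points are made up for illustration, and this is not how scikit-learn implements it internally. It also hints at why `predict_proba` above shows such round numbers: with 3 neighbors, each probability is simply the fraction of the 3 neighbors that vote for a class.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, X_query, k=3):
    """Return class-vote fractions for each query point (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    classes = np.unique(y_train)
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for every query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]  # labels of those neighbors, shape (n_query, k)
    # Fraction of the k neighbors that belong to each class
    proba = np.stack([(votes == c).mean(axis=1) for c in classes], axis=1)
    return classes, proba

# Toy data: two well-separated groups (made up for illustration)
X_toy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y_toy = [0, 0, 0, 1, 1, 1]

classes, proba = knn_predict_proba(X_toy, y_toy, [[0.5, 0.5], [5.5, 5.5]], k=3)
pred = classes[proba.argmax(axis=1)]  # the class with the most votes wins
print(proba)
print(pred)
```

The first query point sits among the class-0 points and the second among the class-1 points, so all three votes agree in both cases.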
B. Classification of "breast cancer":
    import pandas as pd
    import numpy as np
    from sklearn import datasets

    # 1. Dataset
    ds = datasets.load_breast_cancer()
    X = pd.DataFrame(ds.data, columns=ds.feature_names)
    y = ds.target

    # 2. Data clean
    # no need

    # 3. Feature Engineering
    # no need

    # 4. Split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # 5. Define and train the KNN model
    from sklearn.neighbors import KNeighborsClassifier
    clf = KNeighborsClassifier(n_neighbors=3)
    # Fit (train); regression, classification, dimensionality reduction, etc. all use fit(X_train, y_train)
    clf.fit(X_train, y_train)
    # score: feed in the test data and grade the predictions
    print(f'score={clf.score(X_test, y_test)}')
    # >> score=0.9210526315789473

    # Check the answers: true labels first, predictions second
    print(' '.join(y_test.astype(str)))
    print(' '.join(clf.predict(X_test).astype(str)))
    # >> 1 1 0 0 0 ... 0
    #    1 1 0 0 0 ... 0

    # Look at the predicted probabilities
    print(clf.predict_proba(X_test.head()))
    # >> [[0. 1.]
    #     [0. 1.]
    #     [1. 0.]
    #     [1. 0.]
    #     [1. 0.]]
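Because `train_test_split` shuffles at random, the score above changes from run to run. One common way to get a steadier estimate, sketched here under assumptions, is to fix `random_state` for a repeatable split and to average the accuracy over 5-fold cross-validation with `cross_val_score`. The seed 42, the `stratify=y` option, and the fold count are choices made for this sketch, not part of the original example.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

ds = datasets.load_breast_cancer()
X, y = ds.data, ds.target

# A fixed random_state makes the split (and so the score) repeatable;
# stratify=y keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
holdout_score = clf.score(X_test, y_test)

# 5-fold cross-validation: five different train/test partitions, five accuracies
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)

print(f'hold-out accuracy = {holdout_score:.3f}')
print(f'cv accuracy = {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')
```

The spread of the five fold scores gives a rough idea of how much a single hold-out score can move.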
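Regression:
The target is distributed continuously, and describing it with a fitted function (a straight line in the simplest case) is regression. Examples: house price prediction, tip prediction, etc.
This figure shows the principle of linear regression.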
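As a sketch of that principle: simple linear regression picks the slope b and intercept a that minimize the sum of squared vertical distances between the line y = a + b*x and the points. With one feature the minimizer has a closed form, b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and a = mean(y) - b * mean(x). The toy numbers below are made up for illustration, and `np.polyfit` is used only as a cross-check.

```python
import numpy as np

# Toy points lying roughly on y = 2x + 1 (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Closed-form least-squares solution for one feature
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

# Cross-check against NumPy's degree-1 polynomial fit (highest power first)
b_np, a_np = np.polyfit(x, y, 1)

print(f'slope={b:.3f}, intercept={a:.3f}')
print(f'polyfit: slope={b_np:.3f}, intercept={a_np:.3f}')
```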
Two examples follow:
A. Regression on "world population":
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # 1. DataSet
    year = list(range(1950, 2101))  # 1950, 1951, ..., 2100
    pop = [2.53, 2.57, 2.62, 2.67, 2.71, 2.76, 2.81, 2.86, 2.92, 2.97,
           3.03, 3.08, 3.14, 3.2, 3.26, 3.33, 3.4, 3.47, 3.54, 3.62,
           3.69, 3.77, 3.84, 3.92, 4.0, 4.07, 4.15, 4.22, 4.3, 4.37,
           4.45, 4.53, 4.61, 4.69, 4.78, 4.86, 4.95, 5.05, 5.14, 5.23,
           5.32, 5.41, 5.49, 5.58, 5.66, 5.74, 5.82, 5.9, 5.98, 6.05,
           6.13, 6.2, 6.28, 6.36, 6.44, 6.51, 6.59, 6.67, 6.75, 6.83,
           6.92, 7.0, 7.08, 7.16, 7.24, 7.32, 7.4, 7.48, 7.56, 7.64,
           7.72, 7.79, 7.87, 7.94, 8.01, 8.08, 8.15, 8.22, 8.29, 8.36,
           8.42, 8.49, 8.56, 8.62, 8.68, 8.74, 8.8, 8.86, 8.92, 8.98,
           9.04, 9.09, 9.15, 9.2, 9.26, 9.31, 9.36, 9.41, 9.46, 9.5,
           9.55, 9.6, 9.64, 9.68, 9.73, 9.77, 9.81, 9.85, 9.88, 9.92,
           9.96, 9.99, 10.03, 10.06, 10.09, 10.13, 10.16, 10.19, 10.22, 10.25,
           10.28, 10.31, 10.33, 10.36, 10.38, 10.41, 10.43, 10.46, 10.48, 10.5,
           10.52, 10.55, 10.57, 10.59, 10.61, 10.63, 10.65, 10.66, 10.68, 10.7,
           10.72, 10.73, 10.75, 10.77, 10.78, 10.79, 10.81, 10.82, 10.83, 10.84,
           10.85]
    df = pd.DataFrame({'year': year, 'pop': pop})
    x = df['year'].values  # the original used x and y without defining them
    y = df['pop'].values

    # 2. Mean-square error (MSE) of the degree-1 fit
    in_year = int(input('Please input a year from 1950 to 2100: '))
    fit1 = np.polyfit(x, y, 1)
    if 1950 <= in_year <= 2100:
        print('The actual pop is:', y[in_year - 1950])
        print('Predict pop is:', f'{np.poly1d(fit1)(in_year):.2f}')
        y1 = fit1[0] * x + fit1[1]
        print('MSE is:', f'{((y - y1) ** 2).mean():.2}')
    else:
        print('Wrong year!')

    # 3. Plot
    def ppf(x, y, order):
        fit = np.polyfit(x, y, order)  # least-squares fit of y = a + b*x + c*x^2 + ...
        p = np.poly1d(fit)             # turn the fitted coefficients into a callable polynomial
        t = np.linspace(1950, 2100, 2000)
        plt.plot(x, y, 'ro', t, p(t), 'b--')

    plt.figure(figsize=(18, 4))
    titles = ['fitting with 1', 'fitting with 3', 'fitting with 50']
    for i, o in enumerate([1, 3, 50]):
        plt.subplot(1, 3, i + 1)
        ppf(year, pop, o)
        plt.title(titles[i], fontsize=20)
    plt.show()
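The MSE printed above covers the degree-1 fit only. A hedged sketch of comparing several degrees on the same footing follows. It uses synthetic data and a hypothetical helper `poly_mse`, and it swaps `np.polyfit` for `numpy.polynomial.Polynomial.fit`, which rescales x internally and therefore stays better conditioned on raw year values. In exact arithmetic the training MSE cannot go up as the degree rises, which is why a low training error on its own (as in the degree-50 panel) does not prove a better model.

```python
import numpy as np
from numpy.polynomial import Polynomial

def poly_mse(x, y, degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    p = Polynomial.fit(x, y, degree)  # least-squares fit on a rescaled domain
    return float(np.mean((y - p(x)) ** 2))

# Synthetic S-shaped curve over 1950..2100 (made up, not the population data)
rng = np.random.default_rng(0)
x = np.arange(1950, 2101, dtype=float)
y = 10.0 / (1.0 + np.exp(-(x - 2000.0) / 30.0)) + rng.normal(0.0, 0.05, x.size)

mse = {d: poly_mse(x, y, d) for d in (1, 2, 3)}
for d, err in mse.items():
    print(f'degree {d}: MSE = {err:.4f}')
```

To judge whether a higher degree really helps, the same helper could be evaluated on held-out years instead of the training points.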
B. Regression on "Boston housing prices":
    import pandas as pd
    import numpy as np
    from sklearn import datasets

    # 1. Dataset
    # Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
    ds = datasets.load_boston()
    X = pd.DataFrame(ds.data, columns=ds.feature_names)
    y = ds.target

    # 2. Data clean
    print(X.isna().sum())

    # 3. Feature Engineering

    # 4. Split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    print(X_train.shape, y_train.shape)
    # >> (404, 13) (404,)

    # 5. Define and train the LinearRegression model
    from sklearn.linear_model import LinearRegression
    clf = LinearRegression()
    # Fit (train); regression, classification, dimensionality reduction, etc. all use fit(X_train, y_train)
    clf.fit(X_train, y_train)
    # score: feed in the test data and grade the predictions
    print(f'score={clf.score(X_test, y_test)}')
    # >> score=0.6008214413101689

    # Check the answers: true values first, predictions second
    print(list(y_test))
    b = [float(f'{i:.2}') for i in clf.predict(X_test)]
    print(b)
    # >> [30.3, 8.4, 17.4, 10.2, 12.8, ... 22.5]
    #    [32.0, 4.6, 22.0, 6.2, 13.0, ... 29.0]
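For a regressor, the `score` call above returns the coefficient of determination R^2 rather than an accuracy: R^2 = 1 - SS_res / SS_tot, that is, 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. A minimal sketch that computes it by hand and checks it against `score` and `r2_score` follows. It runs on synthetic data from `make_regression`, a stand-in chosen here because the housing data set is not assumed to be available.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (a stand-in, not the housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

ss_res = np.sum((y_test - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(f'manual R^2 = {r2_manual:.4f}')
print(f'score()    = {reg.score(X_test, y_test):.4f}')
print(f'r2_score   = {r2_score(y_test, y_pred):.4f}')
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible for a model that does worse than the mean.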
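Part 2. Unsupervised learning:
Some or all of the data is unlabeled, i.e. there is no correct answer to learn from.
2-1. Clustering
Grouping points with similar features together is called clustering. The idea is somewhat like classification, except that no labels are given. See the figure below:
The following uses the CLV data set as an example: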
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    ds = pd.read_csv('CLV.csv')
    print(ds.describe().T)
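A. Manual clustering
Split the data into 1 to 10 clusters and compute the within-cluster sum of squared errors (WCSS) for each. Because WCSS keeps shrinking as clusters are added, the elbow method does not take the smallest value; it picks the point where the curve bends and extra clusters stop helping much.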
    # There is no y (the data is unlabeled)
    X = ds.iloc[:, [0, 1]].values

    from sklearn.cluster import KMeans
    wcss = []
    for i in range(1, 11):
        km = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
        km.fit(X)
        wcss.append(km.inertia_)
    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('wcss')
    plt.show()
2, 4, or 10 clusters are reasonable choices here.
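The quantity collected in `wcss` is KMeans's `inertia_`: the sum of squared distances from every sample to the centroid of the cluster it was assigned to. The small sketch below recomputes it by hand. It runs on synthetic blobs from `make_blobs`, a stand-in chosen here because `CLV.csv` is not assumed to be available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data (a stand-in for the two CLV columns)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit(X)

# Look up, for each sample, the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]
# Sum of squared distances to those centroids
wcss_manual = np.sum((X - assigned_centroids) ** 2)

print(f'manual WCSS = {wcss_manual:.3f}')
print(f'km.inertia_ = {km.inertia_:.3f}')
```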
B. Automatic clustering
Use sklearn's built-in silhouette coefficient (Silhouette Coefficient) calculation.
    from sklearn.metrics import silhouette_score
    from sklearn.cluster import KMeans

    for n_cluster in range(2, 11):
        kmeans = KMeans(n_clusters=n_cluster).fit(X)
        label = kmeans.labels_
        sil_coeff = silhouette_score(X, label, metric='euclidean')
        print(f"n_clusters={n_cluster}, Silhouette Coefficient is {sil_coeff:.4}")
    # >> n_clusters=2, Silhouette Coefficient is 0.4401
    #    n_clusters=3, Silhouette Coefficient is 0.3596
    #    n_clusters=4, Silhouette Coefficient is 0.3721
    #    n_clusters=5, Silhouette Coefficient is 0.3617
    #    n_clusters=6, Silhouette Coefficient is 0.3632
    #    n_clusters=7, Silhouette Coefficient is 0.3629
    #    n_clusters=8, Silhouette Coefficient is 0.3538
    #    n_clusters=9, Silhouette Coefficient is 0.3441
    #    n_clusters=10, Silhouette Coefficient is 0.3477
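A higher silhouette coefficient means better-separated clusters, so by the output above 2 clusters (0.4401) gives the most distinct grouping, while 9 clusters (0.3441) scores lowest.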
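The silhouette coefficient ranges from -1 to 1, and larger values mean samples sit closer to their own cluster than to the nearest other one. The usual rule is therefore to take the `n_clusters` with the highest score, which can be done in code instead of by eye. A minimal sketch on synthetic blobs follows; `CLV.csv` is not assumed to be available, and the three centers are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated blobs (synthetic stand-in data)
centers = [[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')

# Pick the number of clusters with the highest silhouette coefficient
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f'n_clusters={k}, Silhouette Coefficient is {s:.4f}')
print(f'best k = {best_k}')
```

Fixing `random_state` in `KMeans`, as done here, also keeps the scores from shifting between runs.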
To visualize the clusters, see the following:
    # Fitting kmeans to the dataset
    km = KMeans(n_clusters=8, init='k-means++', max_iter=300, n_init=10, random_state=0)
    y_means = km.fit_predict(X)

    # Visualising the clusters for k=8
    colors = ['purple', 'blue', 'green', 'cyan', 'yellow', 'black', 'brown', 'red']
    for i, color in enumerate(colors):
        plt.scatter(X[y_means == i, 0], X[y_means == i, 1], s=50, c=color, label=f'Cluster{i + 1}')
    # Cluster centers, drawn as large squares
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=200, marker='s', c='red', alpha=0.7, label='Centroids')
    plt.title('Customer segments')
    plt.xlabel('Annual income of customer')
    plt.ylabel('Annual spend from customer on site')
    plt.legend()
    plt.show()
Note: customer analysis usually relies on RFM (Recency-Frequency-Monetary) analysis.
Building such features is step 3 of the machine learning workflow: Feature Engineering.
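As a rough illustration of what RFM features look like, the sketch below derives them from a purchase log with pandas. The table, its column names (`customer_id`, `order_date`, `amount`), and the snapshot date are all hypothetical; a real data set would need its own column mapping.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase
tx = pd.DataFrame({
    'customer_id': ['A', 'A', 'B', 'C', 'C', 'C'],
    'order_date': pd.to_datetime([
        '2023-01-05', '2023-03-01', '2023-02-10',
        '2023-01-20', '2023-02-25', '2023-03-10']),
    'amount': [120.0, 80.0, 300.0, 50.0, 40.0, 60.0],
})
snapshot = pd.Timestamp('2023-03-31')  # reference date for recency (assumed)

rfm = tx.groupby('customer_id').agg(
    last_order=('order_date', 'max'),   # most recent purchase date
    frequency=('order_date', 'count'),  # number of purchases
    monetary=('amount', 'sum'),         # total amount spent
)
# Recency: days between the snapshot date and the most recent purchase
rfm['recency'] = (snapshot - rfm['last_order']).dt.days
rfm = rfm[['recency', 'frequency', 'monetary']]
print(rfm)
```

The resulting three columns could then replace the raw columns as the input `X` for clustering.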
Part 3. Reinforcement learning:
The algorithm learns on its own how to react to its environment, guided by the feedback it receives.
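One of the simplest settings that shows this loop of acting, receiving a reward, and adjusting is the multi-armed bandit. The sketch below uses an epsilon-greedy agent; the win probabilities, the exploration rate, and the step count are all made up for illustration, and it is only meant to convey the idea, not a full reinforcement learning method.

```python
import numpy as np

rng = np.random.default_rng(0)
true_win_prob = np.array([0.2, 0.5, 0.8])  # hidden from the agent (made up)
n_arms = len(true_win_prob)

q = np.zeros(n_arms)       # the agent's estimated value of each action
counts = np.zeros(n_arms)  # how often each action has been tried
epsilon = 0.1              # exploration rate (assumed)

for step in range(5000):
    # Explore with probability epsilon, otherwise exploit the best estimate so far
    if rng.random() < epsilon:
        action = int(rng.integers(n_arms))
    else:
        action = int(np.argmax(q))
    # The environment answers with a reward of 1 or 0
    reward = float(rng.random() < true_win_prob[action])
    # Incremental mean: move the estimate toward the observed reward
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]

best_action = int(np.argmax(q))
print('estimated values:', np.round(q, 2))
print('best action:', best_action)
```

No labels are ever given; the agent finds the most rewarding action only from the rewards it observes.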
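Conclusion:
Since this is a beginner series, we will focus first on **"supervised learning" & "unsupervised learning"**.
That wraps up the introduction to programming basics; the next post starts with the theoretical foundations.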
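Homework (classification & regression):
Using built-in data sets, follow the steps above to complete a regression or a classification on each of the following:
1. Wine classification
Hint: ds = datasets.load_wine()
2. Diabetes regression
Hint: ds = datasets.load_diabetes()
3. Tips regression
Hint: sklearn has no load_tips(); the tips data ships with seaborn, so use import seaborn as sns and ds = sns.load_dataset('tips')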