Train Machine Learning Models with 16 Datasets Provided by Sklearn (Part 1)


ReganYue 2023-05-23 20:00:02


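Data is the fuel of machine learning algorithms, and scikit-learn (sklearn) provides high-quality datasets that are widely used by researchers, practitioners, and hobbyists. Scikit-learn is a Python machine learning module built on top of SciPy. It stands out for its large collection of algorithms, its ease of use, and its ability to integrate with other Python libraries.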

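What are "sklearn datasets"?

Sklearn datasets are part of the scikit-learn (sklearn) library, so they are installed together with it. We can therefore access and load them easily, without downloading them separately.

To use one of these datasets, import the corresponding loader function from the sklearn.datasets module and call it to load the data into your program.

These datasets are usually preprocessed and ready to use, which saves a lot of time and effort for data practitioners who need to experiment with different machine learning models and algorithms.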
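As a rough illustration of the import-and-call pattern described above, the sketch below loads one dataset in two ways. It assumes scikit-learn is installed and uses load_iris only as an example; the variable names are arbitrary.

```python
from sklearn.datasets import load_iris

# A loader returns a Bunch: a dict-like object whose keys are also attributes.
bunch = load_iris()
print(sorted(bunch.keys()))

# Features and labels are NumPy arrays.
print(bunch.data.shape, bunch.target.shape)

# return_X_y=True skips the Bunch and returns the (features, target) pair directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```

The Bunch form is convenient for exploring a dataset (description, feature names), while the (X, y) form is handy when the arrays go straight into a model.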

Preinstalled sklearn datasets

1. Iris

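This dataset contains measurements of sepal length, sepal width, petal length, and petal width for 150 iris flowers belonging to three species: setosa, versicolor, and virginica. It has 150 rows and 5 columns (four measurements plus the species label). By default load_iris returns the values as NumPy arrays; passing as_frame=True returns them as a pandas DataFrame.

Sepal.Length - the sepal length, in centimetres.
Sepal.Width - the sepal width, in centimetres.
Petal.Length - the petal length, in centimetres.
Petal.Width - the petal width, in centimetres.
Species - the iris species, with three possible values: setosa, versicolor, and virginica.

In sklearn the four measurement columns are named sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm), and the species is stored as the integer target 0, 1, or 2.

The iris dataset can be loaded directly from sklearn with the load_iris function of the sklearn.datasets module.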

# To install sklearn, run this in a shell (not in Python):
#   pip install scikit-learn

# To import sklearn
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Print the dataset description, stored in the DESCR attribute
print(iris.DESCR)

The code above loads the Iris dataset with sklearn.
Retrieved on March 27, 2023 from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
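To see the 150 x 5 table form of the data, one option is the as_frame=True argument of load_iris. The sketch below assumes pandas is installed; the species column is built here for illustration and is not part of what the loader returns.

```python
from sklearn.datasets import load_iris

# as_frame=True makes the loader return pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)

# iris.frame holds the four measurement columns plus an integer 'target' column.
df = iris.frame
print(df.shape)
print(list(df.columns))

# Map the integer labels 0, 1, 2 to the species names for readability.
species = df["target"].map(dict(enumerate(iris.target_names)))
print(species.value_counts())
```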

2. Diabetes

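This sklearn dataset contains information on 442 diabetes patients, including personal data and clinical measurements:

Age
Sex
Body mass index (BMI)
Average blood pressure
Six blood serum measurements (such as total cholesterol, low-density lipoprotein (LDL) cholesterol, and high-density lipoprotein (HDL) cholesterol).
A quantitative measure of disease progression one year after baseline, which serves as the regression target.

The diabetes dataset can be loaded with the load_diabetes() function of the sklearn.datasets module.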

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes()

# Print the dataset description, stored in the DESCR attribute
print(diabetes.DESCR)

The code above loads the diabetes dataset with sklearn. Retrieved on March 28, 2023 from https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset.
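The diabetes features are not raw clinical values: by default, load_diabetes returns each column mean-centred and scaled so that its sum of squares is 1. The following sketch inspects the feature names and checks that scaling numerically; the tolerance is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Ten feature columns and one continuous target per patient.
print(diabetes.feature_names)
print(X.shape, y.shape)

# Each feature column should have mean ~0 and sum of squares ~1.
column_means = X.mean(axis=0)
column_sq_sums = (X ** 2).sum(axis=0)
centred = np.allclose(column_means, 0.0, atol=1e-6)
unit_scaled = np.allclose(column_sq_sums, 1.0, atol=1e-6)
print(centred, unit_scaled)
```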

3. Digits

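This sklearn dataset is a collection of handwritten digits from 0 to 9, stored as grayscale images. It contains 1797 samples in total, each of which is a two-dimensional array of shape (8, 8). The Digits dataset has 64 variables (features), one for each of the 64 pixels in a digit image.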

from sklearn.datasets import load_digits

# Load the digits dataset
digits = load_digits()

# Print the features and target data
print(digits.data)
print(digits.target)

The code above loads the Digits dataset with sklearn. Retrieved on March 29, 2023 from https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset.
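The relation between the 8 x 8 images and the 64 features can be checked directly: the loader exposes both a flattened data array and a two-dimensional images array. A small sketch, with first_image as an illustrative name:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# data holds one flattened 64-pixel row per sample; images keeps the 8x8 layout.
print(digits.data.shape)
print(digits.images.shape)

# Reshaping a flattened row recovers the corresponding 8x8 image.
first_image = digits.data[0].reshape(8, 8)
same = np.array_equal(first_image, digits.images[0])
print(same, digits.target[0])
```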


4. Linnerud

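The Linnerud dataset contains exercise and physiological measurements of 20 middle-aged men at a fitness club.

The dataset includes the following variables:

Three exercise variables: chin-ups, sit-ups, and jumps.
Three physiological variables: weight, waist, and pulse.

Load the Linnerud dataset in Python with sklearn: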

from sklearn.datasets import load_linnerud
linnerud = load_linnerud()

The code above loads the Linnerud dataset with sklearn. Retrieved on March 27, 2023 from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud.
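Unlike most of the other datasets, Linnerud has several target columns, which makes it a multi-output regression dataset. The sketch below simply prints the variable names and array shapes to show how the six variables are split between data and target.

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# The exercise variables are the features (data).
print(linnerud.feature_names, linnerud.data.shape)

# The physiological variables are the targets, one column each.
print(linnerud.target_names, linnerud.target.shape)
```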

5. Wine

This sklearn dataset contains the results of a chemical analysis of wines grown in a specific region of Italy.
Some of the variables in the dataset:

Alcohol
Malic acid
Ash
Alkalinity of ash
Magnesium
Total phenols
Flavanoids

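These are all domain-specific terms, so I will leave them as they are. Anyone who needs this dataset probably understands them better than I do.

The wine dataset can be loaded with the load_wine() function of the sklearn.datasets module.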

from sklearn.datasets import load_wine

# Load the Wine dataset
wine_data = load_wine()

# Access the features and targets of the dataset
X = wine_data.data  # Features
y = wine_data.target  # Targets

# Access the feature names and target names of the dataset
feature_names = wine_data.feature_names
target_names = wine_data.target_names

The code above loads the Wine dataset with sklearn. Retrieved on March 28, 2023 from https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset.
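Since the point of these datasets is to train models, here is a minimal sketch of a classifier fitted on the wine data. The choice of model (standardization followed by logistic regression), the 25% test split, and the random seed are all assumptions made for illustration, not something the dataset prescribes.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Hold out 25% of the samples; stratify keeps the class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the features, then fit a multiclass logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Fraction of held-out wines assigned to the correct class.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Standardizing first matters here because the wine features are on very different scales, which can slow or distort the fit of a linear model.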

6. Breast Cancer Wisconsin Dataset

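This sklearn dataset consists of information about breast cancer tumors and was originally created by Dr. William H. Wolberg. It was created to help researchers and machine learning practitioners classify tumors as malignant (cancerous) or benign (non-cancerous).

Variables included in this dataset: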

ID number
Diagnosis (M = malignant, B = benign).
Radius (the mean of distances from the centre to points on the perimeter).
Texture (the standard deviation of gray-scale values).
Perimeter
Area
Smoothness (the local variation in radius lengths).
Compactness (the perimeter^2 / area - 1.0).
Concavity (the severity of concave portions of the contour).
Concave points (the number of concave portions of the contour).
Symmetry
Fractal dimension (“coastline approximation” - 1).

You can load the breast cancer dataset directly from sklearn with the load_breast_cancer function of the sklearn.datasets module.

from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()

# Print the dataset description, stored in the DESCR attribute
print(cancer.DESCR)
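In the sklearn version of this dataset, the diagnosis is exposed as the target rather than as a feature column, and the ID number is not included in the feature matrix. The following sketch counts the samples per class; the counts dictionary is an illustrative helper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# 30 numeric features per tumor sample.
print(cancer.data.shape)

# target is 0 or 1; target_names gives the matching diagnosis label.
counts = {
    str(name): int(n)
    for name, n in zip(cancer.target_names, np.bincount(cancer.target))
}
print(counts)
```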