Python Data Analysis: Building Models with scikit-learn
2023-05-26
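I. Processing data with sklearn transformers
sklearn provides the model_selection module for model selection, the preprocessing module for data preprocessing, and the decomposition module for feature decomposition. Together these three modules cover data preprocessing and the preparation steps before model building: standardization, binarization, dataset splitting, cross-validation, and PCA dimensionality reduction.
1. Loading datasets from the datasets module
The datasets module of sklearn bundles a number of classic data-analysis datasets that can be used to practice preprocessing and modeling.
Commonly used dataset loaders: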
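Dataset loader | Task type |
load_digits | classification |
load_wine | classification |
load_iris | classification, clustering |
load_breast_cancer | classification, clustering |
load_boston | regression (removed in scikit-learn 1.2) |
fetch_california_housing | regression |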
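A loaded dataset can be treated like a dictionary. Almost every sklearn dataset exposes data, target, feature_names, and DESCR, which hold the data, the labels, the feature names, and the description, respectively.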
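The dictionary-like behaviour can be sketched briefly. The example below is only an illustration and assumes the bundled load_wine dataset (chosen here because it ships with the library); it shows that key access and attribute access return the same arrays.

```python
from sklearn.datasets import load_wine

# load_wine is used purely for illustration; any bundled loader behaves the same way
wine = load_wine()

X = wine.data        # attribute-style access
y = wine['target']   # dictionary-style access to another field

print(X.shape)                          # (n_samples, n_features)
print(y.shape)                          # one label per sample
print(list(wine.feature_names)[:3])     # first few feature names
print(wine['data'] is wine.data)        # both access styles return the same object
```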
Taking load_breast_cancer as an example:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()  # assign the dataset to the variable cancer
print('Length of the breast_cancer dataset:', len(cancer))
print('Type of the breast_cancer dataset:', type(cancer))
# Length of the breast_cancer dataset: 6
# Type of the breast_cancer dataset: <class 'sklearn.utils.Bunch'>
# (both values depend on the installed scikit-learn version)

cancer_data = cancer['data']  # extract the feature matrix
print('Data of the breast_cancer dataset:', '\n', cancer_data)
# Data of the breast_cancer dataset:
# [[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
#  [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
#  [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
#  ...
#  [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
#  [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
#  [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]

cancer_target = cancer['target']  # extract the labels
print('Labels of the breast_cancer dataset:\n', cancer_target)
# Labels of the breast_cancer dataset:
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  ...
#  1 1 1 1 1 1 1 0 0 0 0 0 0 1]
# (569 labels of 0 or 1 are printed in full; the middle rows are omitted here)

cancer_names = cancer['feature_names']  # extract the feature names
print('Feature names of the breast_cancer dataset:\n', cancer_names)
# Feature names of the breast_cancer dataset:
# ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
#  'mean smoothness' 'mean compactness' 'mean concavity'
#  'mean concave points' 'mean symmetry' 'mean fractal dimension'
#  'radius error' 'texture error' 'perimeter error' 'area error'
#  'smoothness error' 'compactness error' 'concavity error'
#  'concave points error' 'symmetry error' 'fractal dimension error'
#  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
#  'worst smoothness' 'worst compactness' 'worst concavity'
#  'worst concave points' 'worst symmetry' 'worst fractal dimension']

cancer_desc = cancer['DESCR']  # extract the dataset description
print('Description of the breast_cancer dataset:\n', cancer_desc)
# Description of the breast_cancer dataset:
# .. _breast_cancer_dataset:
#
# Breast cancer wisconsin (diagnostic) dataset
# --------------------------------------------
#
# **Data Set Characteristics:**
#
#     :Number of Instances: 569
#
#     :Number of Attributes: 30 numeric, predictive attributes and the class
#
#     :Attribute Information:
#         - radius (mean of distances from center to points on the perimeter)
#         - texture (standard deviation of gray-scale values)
#         - perimeter
#         - area
#         - smoothness (local variation in radius lengths)
#         - compactness (perimeter^2 / area - 1.0)
#         - concavity (severity of concave portions of the contour)
#         - concave points (number of concave portions of the contour)
#         - symmetry
#         - fractal dimension ("coastline approximation" - 1)
#
#         The mean, standard error, and "worst" or largest (mean of the three
#         largest values) of these features were computed for each image,
#         resulting in 30 features. For instance, field 3 is Mean Radius, field
#         13 is Radius SE, field 23 is Worst Radius.
#
#         - class:
#                 - WDBC-Malignant
#                 - WDBC-Benign
#
#     :Summary Statistics:
#
#     ===================================== ====== ======
#                                            Min    Max
#     ===================================== ====== ======
#     radius (mean):                        6.981  28.11
#     texture (mean):                       9.71   39.28
#     perimeter (mean):                     43.79  188.5
#     area (mean):                          143.5  2501.0
#     smoothness (mean):                    0.053  0.163
#     compactness (mean):                   0.019  0.345
#     concavity (mean):                     0.0    0.427
#     concave points (mean):                0.0    0.201
#     symmetry (mean):                      0.106  0.304
#     fractal dimension (mean):             0.05   0.097
#     radius (standard error):              0.112  2.873
#     texture (standard error):             0.36   4.885
#     perimeter (standard error):           0.757  21.98
#     area (standard error):                6.802  542.2
#     smoothness (standard error):          0.002  0.031
#     compactness (standard error):         0.002  0.135
#     concavity (standard error):           0.0    0.396
#     concave points (standard error):      0.0    0.053
#     symmetry (standard error):            0.008  0.079
#     fractal dimension (standard error):   0.001  0.03
#     radius (worst):                       7.93   36.04
#     texture (worst):                      12.02  49.54
#     perimeter (worst):                    50.41  251.2
#     area (worst):                         185.2  4254.0
#     smoothness (worst):                   0.071  0.223
#     compactness (worst):                  0.027  1.058
#     concavity (worst):                    0.0    1.252
#     concave points (worst):               0.0    0.291
#     symmetry (worst):                     0.156  0.664
#     fractal dimension (worst):            0.055  0.208
#     ===================================== ====== ======
#
#     :Missing Attribute Values: None
#
#     :Class Distribution: 212 - Malignant, 357 - Benign
#
#     :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
#
#     :Donor: Nick Street
#
#     :Date: November, 1995
#
# This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
# https://goo.gl/U2Uwz2
#
# Features are computed from a digitized image of a fine needle
# aspirate (FNA) of a breast mass. They describe
# characteristics of the cell nuclei present in the image.
#
# Separating plane described above was obtained using
# Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
# Construction Via Linear Programming." Proceedings of the 4th
# Midwest Artificial Intelligence and Cognitive Science Society,
# pp. 97-101, 1992], a classification method which uses linear
# programming to construct a decision tree. Relevant features
# were selected using an exhaustive search in the space of 1-4
# features and 1-3 separating planes.
#
# The actual linear program used to obtain the separating plane
# in the 3-dimensional space is that described in:
# [K. P. Bennett and O. L. Mangasarian: "Robust Linear
# Programming Discrimination of Two Linearly Inseparable Sets",
# Optimization Methods and Software 1, 1992, 23-34].
#
# This database is also available through the UW CS ftp server:
#
# ftp ftp.cs.wisc.edu
# cd math-prog/cpo-dataset/machine-learn/WDBC/
#
# .. topic:: References
#
#    - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
#      for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
#      Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
#      San Jose, CA, 1993.
#    - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
#      prognosis via linear programming. Operations Research, 43(4), pages 570-577,
#      July-August 1995.
#    - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
#      to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
#      163-171.

2. Splitting the dataset: training and test sets
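During data analysis, to make sure the model performs as expected in a real system, the samples are usually divided into three independent parts: a training set, a validation set, and a test set.
Training set (50%): used to fit the model
Validation set (25%): used to choose the network structure or the parameters that control model complexity
Test set (25%): used to check the performance of the best model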
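The 50/25/25 split above can be approximated with two successive calls to train_test_split. This is a minimal sketch, not a prescribed recipe: the two-step approach, the variable names, and the random seed are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: keep about 50% of the samples for training
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=42)

# Step 2: split the remaining half evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print('train:', X_train.shape, 'validation:', X_val.shape, 'test:', X_test.shape)
```

Because 569 is odd, the three parts cannot be exactly 50%, 25%, and 25%; the sizes differ from those proportions by at most one sample.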
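When the total amount of data is small, the three-way split above is not appropriate. A common alternative is to hold out a small part as the test set and apply K-fold cross-validation to the remaining N samples:
Shuffle the samples and divide them evenly into K parts. In turn, use K-1 parts for training and the remaining part for validation, and compute the sum of squared prediction errors. Finally, use the mean of the K sums of squared prediction errors as the basis for choosing the best model structure.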
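The K-fold procedure just described can be sketched as follows. The synthetic data from make_regression, the LinearRegression model, and K = 5 are all assumptions chosen for illustration; the article itself does not prescribe them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical stand-in for the N remaining samples)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# shuffle=True shuffles the samples before dividing them into K = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sse = []
for train_idx, val_idx in kf.split(X):
    # Train on K-1 folds, validate on the remaining fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    residuals = y[val_idx] - model.predict(X[val_idx])
    fold_sse.append(float(np.sum(residuals ** 2)))  # sum of squared errors

# The mean over the K folds is the score used to compare model structures
mean_sse = float(np.mean(fold_sse))
print('SSE per fold:', fold_sse)
print('Mean SSE:', mean_sse)
```

To compare candidate models, the same loop would be repeated for each candidate and the one with the smallest mean error kept.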
sklearn.model_selection.train_test_split(*arrays,**options)
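Parameter | Description |
*arrays | Accepts one or more datasets: the data to be split. For classification or regression, pass the data and the labels; for clustering, pass the data only. |
test_size | Size of the test set. A float must lie between 0 and 1 and gives the fraction of all samples placed in the test set; an int gives the absolute number of test samples. Only one of test_size and train_size needs to be given. |
train_size | Size of the training set, specified in the same way as test_size. |
random_state | Accepts an int. The random seed; the same seed produces the same split. |
shuffle | Accepts a boolean. Whether to shuffle the data before splitting. If False, stratify must be None. |
stratify | Accepts an array or None. If not None, the data is split with stratified sampling based on the labels passed in. |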
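A short sketch of how shuffle and stratify behave, using the breast_cancer data. The variable names are illustrative assumptions; the parameters themselves are those listed in the table.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Stratified split: class proportions in y are preserved in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print('share of class 1 overall:', round(float(y.mean()), 3))
print('share of class 1 in train:', round(float(y_tr.mean()), 3))
print('share of class 1 in test:', round(float(y_te.mean()), 3))

# shuffle=False keeps the original order: the test set is simply the tail
# of the data (stratify must be left as None in this case)
X_tr_seq, X_te_seq = train_test_split(X, test_size=0.2, shuffle=False)
print(np.array_equal(X_te_seq, X[-len(X_te_seq):]))
```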
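print('Shape of the original data:', cancer_data.shape)
print('Shape of the original labels:', cancer_target.shape)
# Shape of the original data: (569, 30)
# Shape of the original labels: (569,)

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as the test set; random_state makes the split reproducible
cancer_data_train, cancer_data_test, cancer_target_train, cancer_target_test = train_test_split(
    cancer_data, cancer_target, test_size=0.2, random_state=42)
print('Shape of the training data:', cancer_data_train.shape)
print('Shape of the training labels:', cancer_target_train.shape)
print('Shape of the test data:', cancer_data_test.shape)
print('Shape of the test labels:', cancer_target_test.shape)
# Shape of the training data: (455, 30)
# Shape of the training labels: (455,)
# Shape of the test data: (114, 30)
# Shape of the test labels: (114,)

The function splits each array passed in into a training part and a test part. If one array is passed, the result is the randomly split training and test parts of that array, two arrays in total; if two arrays are passed, each is split into a training and a test part, four arrays in total. train_test_split is only the most commonly used way to split data; the model_selection module also offers other splitting utilities, such as the PredefinedSplit and ShuffleSplit classes.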
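The single-array case and one of the alternative splitters can be sketched as follows. The toy array and the choice of three repetitions are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# A toy array of 12 samples with 2 features (hypothetical data)
data = np.arange(24).reshape(12, 2)

# One input array gives two outputs: its training part and its test part
parts = train_test_split(data, test_size=0.25, random_state=0)
print(len(parts), parts[0].shape, parts[1].shape)

# ShuffleSplit yields index arrays for several independent random splits
splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
split_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in splitter.split(data)]
print(split_sizes)
```

Unlike train_test_split, ShuffleSplit returns indices, not the split arrays themselves, so the same splitter can be reused across several arrays or passed to cross-validation helpers.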
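3. Preprocessing and dimensionality reduction with sklearn transformers
sklearn wraps related functionality in transformers, which mainly expose three methods: fit, transform, and fit_transform.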
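A minimal sketch of the three methods, assuming StandardScaler for standardization and PCA for dimensionality reduction as the transformers (both are real sklearn classes, but choosing them here, and keeping 10 components, is an assumption). fit learns parameters from the training data, transform applies them, and fit_transform does both in one call.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit: learn the per-feature mean and standard deviation from the training set only
scaler = StandardScaler()
scaler.fit(X_train)

# transform: apply the learned parameters to both the training and the test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit_transform: fit and transform the training set in a single call
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)  # reuse the fitted projection

print('training set after PCA:', X_train_pca.shape)
print('test set after PCA:', X_test_pca.shape)
```

Fitting on the training set and only transforming the test set keeps information from the test data out of the preprocessing step.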
