The Data Mining Process
A data mining task consists of the following six steps:

1. Data preprocessing
2. Feature transformation
3. Feature selection
4. Model training
5. Model prediction
6. Evaluation of the predictions

Data Preparation

Here is a dataset of 20 records (named hobby.csv) describing the hobbies of people from different regions, of different sexes, heights, weights, and so on:
id,hobby,sex,address,age,height,weight
1,football,male,dalian,12,168,55
2,pingpang,female,yangzhou,21,163,60
3,football,male,dalian,,172,70
4,football,female,,13,167,58
5,pingpang,female,shanghai,63,170,64
6,football,male,dalian,30,177,76
7,basketball,male,shanghai,25,181,90
8,football,male,dalian,15,172,71
9,basketball,male,shanghai,25,179,80
10,pingpang,male,shanghai,55,175,72
11,football,male,dalian,13,169,55
12,pingpang,female,yangzhou,22,164,61
13,football,male,dalian,23,170,71
14,football,female,,12,164,55
15,pingpang,female,shanghai,64,169,63
16,football,male,dalian,30,177,76
17,basketball,male,shanghai,22,180,80
18,football,male,dalian,16,173,72
19,basketball,male,shanghai,23,176,73
20,pingpang,male,shanghai,56,171,71

Task Analysis

Predict a person's hobby from the five features sex, address, age, height, and weight.
Data Preprocessing

Before we can read the data we must create a Spark object.

Defining the Spark object

Use SparkSession.builder() to start the builder, set appName and master, and finish with getOrCreate():

// Define the spark object
val spark = SparkSession.builder()
  .appName("兴趣预测")
  .master("local[*]")
  .getOrCreate()

Reading the data

Use spark.read to load the data: specify the format as "csv", set header to true, and give the file path:

val df = spark.read.format("csv")
  .option("header", "true")
  .load("C:/Users/35369/Desktop/hobby.csv")

Use df.show() and df.printSchema() to inspect the data:

df.show()
df.printSchema()
spark.stop() // stop spark

Output:
-------------------------------------------
| id| hobby| sex| address| age|height|weight|
-------------------------------------------
| 1| football| male| dalian| 12| 168| 55|
| 2| pingpang|female|yangzhou| 21| 163| 60|
| 3| football| male| dalian|null| 172| 70|
| 4| football|female| null| 13| 167| 58|
| 5| pingpang|female|shanghai| 63| 170| 64|
| 6| football| male| dalian| 30| 177| 76|
| 7|basketball| male|shanghai| 25| 181| 90|
| 8| football| male| dalian| 15| 172| 71|
| 9|basketball| male|shanghai| 25| 179| 80|
| 10| pingpang| male|shanghai| 55| 175| 72|
| 11| football| male| dalian| 13| 169| 55|
| 12| pingpang|female|yangzhou| 22| 164| 61|
| 13| football| male| dalian| 23| 170| 71|
| 14| football|female| null| 12| 164| 55|
| 15| pingpang|female|shanghai| 64| 169| 63|
| 16| football| male| dalian| 30| 177| 76|
| 17|basketball| male|shanghai| 22| 180| 80|
| 18| football| male| dalian| 16| 173| 72|
| 19|basketball| male|shanghai| 23| 176| 73|
| 20| pingpang| male|shanghai| 56| 171| 71|
-------------------------------------------

root
 |-- id: string (nullable = true)
 |-- hobby: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: string (nullable = true)
 |-- height: string (nullable = true)
 |-- weight: string (nullable = true)

Filling Rows with Missing Ages
Filling a numeric column can be done in three steps:
1) Take the column with the null rows dropped.
2) Compute the mean of that column.
3) Fill the mean back into the original DataFrame.

1) Take the column with the null rows dropped:

val ageNaDF = df.select("age").na.drop()
ageNaDF.show()

---
|age|
---
| 12|
| 21|
| 13|
| 63|
| 30|
| 25|
| 15|
| 25|
| 55|
| 13|
| 22|
| 23|
| 12|
| 64|
| 30|
| 22|
| 16|
| 23|
| 56|
---

2) Compute the mean of that column
Inspect the basic statistics of ageNaDF:

ageNaDF.describe("age").show()

Output:
------------------------
|summary| age|
------------------------
| count| 19|
| mean|28.42105263157895|
| stddev|17.48432882286206|
| min| 12|
| max| 64|
------------------------

The mean is 28.42105263157895; we need to extract it:

val mean = ageNaDF.describe("age").select("age").collect()(1)(0).toString
print(mean) // 28.42105263157895

3) Fill the mean back into the original DataFrame.

Use df.na.fill() to fill the nulls; the target column is "age", so the second argument is List("age"):

val ageFilledDF = df.na.fill(mean, List("age"))
ageFilledDF.show()

Output:
--------------------------------------------------------
| id| hobby| sex| address| age|height|weight|
--------------------------------------------------------
| 1| football| male| dalian| 12| 168| 55|
| 2| pingpang|female|yangzhou| 21| 163| 60|
| 3| football| male| dalian|28.42105263157895| 172| 70|
| 4| football|female| null| 13| 167| 58|
| 5| pingpang|female|shanghai| 63| 170| 64|
| 6| football| male| dalian| 30| 177| 76|
| 7|basketball| male|shanghai| 25| 181| 90|
| 8| football| male| dalian| 15| 172| 71|
| 9|basketball| male|shanghai| 25| 179| 80|
| 10| pingpang| male|shanghai| 55| 175| 72|
| 11| football| male| dalian| 13| 169| 55|
| 12| pingpang|female|yangzhou| 22| 164| 61|
| 13| football| male| dalian| 23| 170| 71|
| 14| football|female| null| 12| 164| 55|
| 15| pingpang|female|shanghai| 64| 169| 63|
| 16| football| male| dalian| 30| 177| 76|
| 17|basketball| male|shanghai| 22| 180| 80|
| 18| football| male| dalian| 16| 173| 72|
| 19|basketball| male|shanghai| 23| 176| 73|
| 20| pingpang| male|shanghai| 56| 171| 71|
--------------------------------------------------------

As you can see, the nulls in the age column have been filled with the mean.
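As an aside, the three-step mean imputation above can also be collapsed into a single step with Spark ML's Imputer. This is a sketch, not part of the original post; it assumes the age column is first cast to a numeric type (Imputer requires numeric input, and the CSV loads everything as strings), and the output column name ageImputed is illustrative:

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.col

// Imputer needs a numeric column; the CSV loads age as string, so cast first.
val dfNum = df.withColumn("age", col("age").cast("double"))

// Replace nulls in age with the column mean in one step.
val imputed = new Imputer()
  .setInputCols(Array("age"))
  .setOutputCols(Array("ageImputed"))
  .setStrategy("mean")
  .fit(dfNum)
  .transform(dfNum)
```

Unlike na.fill with a stringified mean, this keeps the column numeric throughout, which avoids the cast back to double later.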
Dropping Rows with Missing Cities

There is no sensible value to fill in for a missing city, so any row with a null address is simply removed using the .na.drop() method:

val addressDf = ageFilledDF.na.drop()
addressDf.show()

Output:
--------------------------------------------------------
| id| hobby| sex| address| age|height|weight|
--------------------------------------------------------
| 1| football| male| dalian| 12| 168| 55|
| 2| pingpang|female|yangzhou| 21| 163| 60|
| 3| football| male| dalian|28.42105263157895| 172| 70|
| 5| pingpang|female|shanghai| 63| 170| 64|
| 6| football| male| dalian| 30| 177| 76|
| 7|basketball| male|shanghai| 25| 181| 90|
| 8| football| male| dalian| 15| 172| 71|
| 9|basketball| male|shanghai| 25| 179| 80|
| 10| pingpang| male|shanghai| 55| 175| 72|
| 11| football| male| dalian| 13| 169| 55|
| 12| pingpang|female|yangzhou| 22| 164| 61|
| 13| football| male| dalian| 23| 170| 71|
| 15| pingpang|female|shanghai| 64| 169| 63|
| 16| football| male| dalian| 30| 177| 76|
| 17|basketball| male|shanghai| 22| 180| 80|
| 18| football| male| dalian| 16| 173| 72|
| 19|basketball| male|shanghai| 23| 176| 73|
| 20| pingpang| male|shanghai| 56| 171| 71|
--------------------------------------------------------

Rows 4 and 14 have been removed.
Cast each column to an appropriate type:

// Adjust the schema of the DataFrame
val formatDF = addressDf.select(
  col("id").cast("int"),
  col("hobby").cast("string"),
  col("sex").cast("string"),
  col("address").cast("string"),
  col("age").cast("double"),
  col("height").cast("double"),
  col("weight").cast("double"))
formatDF.printSchema()

Output:
root
 |-- id: integer (nullable = true)
 |-- hobby: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: double (nullable = true)
 |-- height: double (nullable = true)
 |-- weight: double (nullable = true)

This completes the data preprocessing.
Feature Transformation

To make model training easier, we bucketize or encode the features age, weight, height, address, and sex.

Bucketizing Age

Buckets: under 18, 18-35, 35-60, over 60.

Use the Bucketizer class: set the input and output column names, pass the bucket boundaries defined above as the split points, and finally transform the DataFrame:

//2.1 Bucketize age
// Define an array as the bucket boundaries
val ageSplits = Array(Double.NegativeInfinity, 18, 35, 60, Double.PositiveInfinity)
val bucketizerDF = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("ageFeature")
  .setSplits(ageSplits)
  .transform(formatDF)
bucketizerDF.show()

The bucketized result:
------------------------------------------------------------------
| id| hobby| sex| address| age|height|weight|ageFeature|
------------------------------------------------------------------
| 1| football| male| dalian| 12.0| 168.0| 55.0| 0.0|
| 2| pingpang|female|yangzhou| 21.0| 163.0| 60.0| 1.0|
| 3| football| male| dalian|28.42105263157895| 172.0| 70.0| 1.0|
| 5| pingpang|female|shanghai| 63.0| 170.0| 64.0| 3.0|
| 6| football| male| dalian| 30.0| 177.0| 76.0| 1.0|
| 7|basketball| male|shanghai| 25.0| 181.0| 90.0| 1.0|
| 8| football| male| dalian| 15.0| 172.0| 71.0| 0.0|
| 9|basketball| male|shanghai| 25.0| 179.0| 80.0| 1.0|
| 10| pingpang| male|shanghai| 55.0| 175.0| 72.0| 2.0|
| 11| football| male| dalian| 13.0| 169.0| 55.0| 0.0|
| 12| pingpang|female|yangzhou| 22.0| 164.0| 61.0| 1.0|
| 13| football| male| dalian| 23.0| 170.0| 71.0| 1.0|
| 15| pingpang|female|shanghai| 64.0| 169.0| 63.0| 3.0|
| 16| football| male| dalian| 30.0| 177.0| 76.0| 1.0|
| 17|basketball| male|shanghai| 22.0| 180.0| 80.0| 1.0|
| 18| football| male| dalian| 16.0| 173.0| 72.0| 0.0|
| 19|basketball| male|shanghai| 23.0| 176.0| 73.0| 1.0|
| 20| pingpang| male|shanghai| 56.0| 171.0| 71.0| 2.0|
------------------------------------------------------------------

Binarizing Height
The threshold is 170. Use the Binarizer class:

//2.2 Binarize height
val heightDF = new Binarizer()
  .setInputCol("height")
  .setOutputCol("heightFeature")
  .setThreshold(170) // threshold
  .transform(bucketizerDF)
heightDF.show()

The result:
-------------------------------------------------------------------------------
| id| hobby| sex| address| age|height|weight|ageFeature|heightFeature|
-------------------------------------------------------------------------------
| 1| football| male| dalian| 12.0| 168.0| 55.0| 0.0| 0.0|
| 2| pingpang|female|yangzhou| 21.0| 163.0| 60.0| 1.0| 0.0|
| 3| football| male| dalian|28.42105263157895| 172.0| 70.0| 1.0| 1.0|
| 5| pingpang|female|shanghai| 63.0| 170.0| 64.0| 3.0| 0.0|
| 6| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0|
| 7|basketball| male|shanghai| 25.0| 181.0| 90.0| 1.0| 1.0|
| 8| football| male| dalian| 15.0| 172.0| 71.0| 0.0| 1.0|
| 9|basketball| male|shanghai| 25.0| 179.0| 80.0| 1.0| 1.0|
| 10| pingpang| male|shanghai| 55.0| 175.0| 72.0| 2.0| 1.0|
| 11| football| male| dalian| 13.0| 169.0| 55.0| 0.0| 0.0|
| 12| pingpang|female|yangzhou| 22.0| 164.0| 61.0| 1.0| 0.0|
| 13| football| male| dalian| 23.0| 170.0| 71.0| 1.0| 0.0|
| 15| pingpang|female|shanghai| 64.0| 169.0| 63.0| 3.0| 0.0|
| 16| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0|
| 17|basketball| male|shanghai| 22.0| 180.0| 80.0| 1.0| 1.0|
| 18| football| male| dalian| 16.0| 173.0| 72.0| 0.0| 1.0|
| 19|basketball| male|shanghai| 23.0| 176.0| 73.0| 1.0| 1.0|
| 20| pingpang| male|shanghai| 56.0| 171.0| 71.0| 2.0| 1.0|
-------------------------------------------------------------------------------

Binarizing Weight
The threshold is set to 65:

//2.3 Binarize weight
val weightDF = new Binarizer()
  .setInputCol("weight")
  .setOutputCol("weightFeature")
  .setThreshold(65)
  .transform(heightDF)
weightDF.show()

Handling the Sex, City, and Hobby Fields
These three fields are strings, and raw strings are not suitable for machine learning, so they also need to be encoded:

//2.4 Label-encode sex
val sexIndex = new StringIndexer()
  .setInputCol("sex")
  .setOutputCol("sexIndex")
  .fit(weightDF)
  .transform(weightDF)

//2.5 Label-encode the home address
val addIndex = new StringIndexer()
  .setInputCol("address")
  .setOutputCol("addIndex")
  .fit(sexIndex)
  .transform(sexIndex)

//2.6 One-hot encode the address
val addOneHot = new OneHotEncoder()
  .setInputCol("addIndex")
  .setOutputCol("addOneHot")
  .fit(addIndex)
  .transform(addIndex)

//2.7 Label-encode the hobby field
val hobbyIndexDF = new StringIndexer()
  .setInputCol("hobby")
  .setOutputCol("hobbyIndex")
  .fit(addOneHot)
  .transform(addOneHot)
hobbyIndexDF.show()

Note the extra one-hot encoding applied to the address.
Rename the hobbyIndex column to label, since hobby serves as the label during model training:

//2.8 Rename the column
val resultDF = hobbyIndexDF.withColumnRenamed("hobbyIndex", "label")
resultDF.show()

The final result of the feature transformation:
------------------------------------------------------------------------------------------------------------------------------
| id| hobby| sex| address| age|height|weight|ageFeature|heightFeature|weightFeature|sexIndex|addIndex| addOneHot|label|
------------------------------------------------------------------------------------------------------------------------------
| 1| football| male| dalian| 12.0| 168.0| 55.0| 0.0| 0.0| 0.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 2| pingpang|female|yangzhou| 21.0| 163.0| 60.0| 1.0| 0.0| 0.0| 1.0| 2.0| (2,[],[])| 1.0|
| 3| football| male| dalian|28.42105263157895| 172.0| 70.0| 1.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 5| pingpang|female|shanghai| 63.0| 170.0| 64.0| 3.0| 0.0| 0.0| 1.0| 1.0|(2,[1],[1.0])| 1.0|
| 6| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 7|basketball| male|shanghai| 25.0| 181.0| 90.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 8| football| male| dalian| 15.0| 172.0| 71.0| 0.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 9|basketball| male|shanghai| 25.0| 179.0| 80.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 10| pingpang| male|shanghai| 55.0| 175.0| 72.0| 2.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 1.0|
| 11| football| male| dalian| 13.0| 169.0| 55.0| 0.0| 0.0| 0.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 12| pingpang|female|yangzhou| 22.0| 164.0| 61.0| 1.0| 0.0| 0.0| 1.0| 2.0| (2,[],[])| 1.0|
| 13| football| male| dalian| 23.0| 170.0| 71.0| 1.0| 0.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 15| pingpang|female|shanghai| 64.0| 169.0| 63.0| 3.0| 0.0| 0.0| 1.0| 1.0|(2,[1],[1.0])| 1.0|
| 16| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 17|basketball| male|shanghai| 22.0| 180.0| 80.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 18| football| male| dalian| 16.0| 173.0| 72.0| 0.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 19|basketball| male|shanghai| 23.0| 176.0| 73.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 20| pingpang| male|shanghai| 56.0| 171.0| 71.0| 2.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 1.0|
------------------------------------------------------------------------------------------------------------------------------

Feature Selection
The transformed result has many columns, but not all of them should be fed into model training; feature selection picks out the columns that will actually be used for machine learning.
Selecting Features

Use VectorAssembler to assemble the desired columns. (Note: the label column must not be included in the features; the one-hot-encoded address is used instead, matching the full source below.)

//3.1 Select features
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("ageFeature", "heightFeature", "weightFeature", "sexIndex", "addOneHot"))
  .setOutputCol("features")

Standardize the features:

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("featureScaler")
  .setWithStd(true)   // scale to unit standard deviation
  .setWithMean(false) // do not center with the mean

Feature filtering:

// Feature filtering using the chi-squared test
val selector = new ChiSqSelector()
  .setFeaturesCol("featureScaler")
  .setLabelCol("label")
  .setOutputCol("featuresSelector")

Build the logistic regression model and the pipeline:

// Logistic regression model
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("featuresSelector")

// Build the pipeline
val pipeline = new Pipeline()
  .setStages(Array(vectorAssembler, scaler, selector, lr))

Set up a grid search over the best parameters:

// Grid search for the best parameters
val params = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))            // regularization parameter
  .addGrid(selector.numTopFeatures, Array(5, 10, 5)) // number of top features kept by the chi-squared selector
  .build()

Set up cross-validation:

// Cross-validation
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator()) // note: the label has three classes; a MulticlassClassificationEvaluator would be a better match
  .setEstimatorParamMaps(params)
  .setNumFolds(5)

Model Training and Prediction
Before training, the data must be split into a training set and a test set.
val Array(trainDF, testDF) = resultDF.randomSplit(Array(0.8, 0.2))

The randomSplit method performs the split.
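One detail worth noting: randomSplit is nondeterministic from run to run. Passing a seed (the value 42 here is just an illustrative choice, not from the original post) makes the split reproducible:

```scala
// Fix the seed so the train/test split is identical across runs
val Array(trainDF, testDF) = resultDF.randomSplit(Array(0.8, 0.2), seed = 42L)
```

This matters for a dataset this small (18 remaining rows), where different splits can produce very different evaluation results.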
Start training and prediction:

val model = cv.fit(trainDF)
// Model prediction
val prediction = model.bestModel.transform(testDF)
prediction.show()

Error: Help Wanted
The call to cv.fit(trainDF) throws an error, and I could not find anything about it online:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/trees/BinaryLike
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.ml.stat.SummaryBuilderImpl.summary(Summarizer.scala:251)
    at org.apache.spark.ml.stat.SummaryBuilder.summary(Summarizer.scala:54)
    at org.apache.spark.ml.feature.StandardScaler.fit(StandardScaler.scala:112)
    at org.apache.spark.ml.feature.StandardScaler.fit(StandardScaler.scala:84)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
    at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
    at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
    at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
    at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
    at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
    at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
    at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:93)
    at org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
    at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$7(CrossValidator.scala:174)
    at scala.runtime.java8.JFunction0$mcD$sp.apply(JFunction0$mcD$sp.java:23)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at org.sparkproject.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
    at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete(Promise.scala:372)
    at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete$(Promise.scala:371)
    at scala.concurrent.impl.Promise$KeptPromise$Successful.onComplete(Promise.scala:379)
    at scala.concurrent.impl.Promise.transform(Promise.scala:33)
    at scala.concurrent.impl.Promise.transform$(Promise.scala:31)
    at scala.concurrent.impl.Promise$KeptPromise$Successful.transform(Promise.scala:379)
    at scala.concurrent.Future.map(Future.scala:292)
    at scala.concurrent.Future.map$(Future.scala:292)
    at scala.concurrent.impl.Promise$KeptPromise$Successful.map(Promise.scala:379)
    at scala.concurrent.Future$.apply(Future.scala:659)
    at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$6(CrossValidator.scala:182)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$4(CrossValidator.scala:172)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$1(CrossValidator.scala:166)
    at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
    at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:137)
    at org.example.SparkML.SparkMl01$.main(SparkMl01.scala:147)
    at org.example.SparkML.SparkMl01.main(SparkMl01.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.trees.BinaryLike
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Full Source Code and pom File
package org.example.SparkML

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{Binarizer, Bucketizer, ChiSqSelector, OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

/**
 * The data mining process:
 * 1. Data preprocessing
 * 2. Feature transformation (encoding, ...)
 * 3. Feature selection
 * 4. Model training
 * 5. Model prediction
 * 6. Evaluation of the predictions
 */
object SparkMl01 {
  def main(args: Array[String]): Unit = {
    // Define the spark object
    val spark = SparkSession.builder().appName("兴趣预测").master("local").getOrCreate()
    import spark.implicits._

    val df = spark.read.format("csv").option("header", "true").load("C:/Users/35369/Desktop/hobby.csv")

    // 1. Data preprocessing: fill the missing ages
    val ageNaDF = df.select("age").na.drop()
    val mean = ageNaDF.describe("age").select("age").collect()(1)(0).toString
    val ageFilledDF = df.na.fill(mean, List("age"))

    // Drop rows where address is null
    val addressDf = ageFilledDF.na.drop()

    // Adjust the schema of the DataFrame
    val formatDF = addressDf.select(
      col("id").cast("int"),
      col("hobby").cast("string"),
      col("sex").cast("string"),
      col("address").cast("string"),
      col("age").cast("double"),
      col("height").cast("double"),
      col("weight").cast("double"))

    // 2. Feature transformation
    //2.1 Bucketize age
    // Define an array as the bucket boundaries
    val ageSplits = Array(Double.NegativeInfinity, 18, 35, 60, Double.PositiveInfinity)
    val bucketizerDF = new Bucketizer().setInputCol("age").setOutputCol("ageFeature")
      .setSplits(ageSplits).transform(formatDF)

    //2.2 Binarize height
    val heightDF = new Binarizer().setInputCol("height").setOutputCol("heightFeature")
      .setThreshold(170) // threshold
      .transform(bucketizerDF)

    //2.3 Binarize weight
    val weightDF = new Binarizer().setInputCol("weight").setOutputCol("weightFeature")
      .setThreshold(65).transform(heightDF)

    //2.4 Label-encode sex
    val sexIndex = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
      .fit(weightDF).transform(weightDF)

    //2.5 Label-encode the home address
    val addIndex = new StringIndexer().setInputCol("address").setOutputCol("addIndex")
      .fit(sexIndex).transform(sexIndex)

    //2.6 One-hot encode the address
    val addOneHot = new OneHotEncoder().setInputCol("addIndex").setOutputCol("addOneHot")
      .fit(addIndex).transform(addIndex)

    //2.7 Label-encode the hobby field
    val hobbyIndexDF = new StringIndexer().setInputCol("hobby").setOutputCol("hobbyIndex")
      .fit(addOneHot).transform(addOneHot)

    //2.8 Rename the column
    val resultDF = hobbyIndexDF.withColumnRenamed("hobbyIndex", "label")

    // 3. Feature selection
    //3.1 Select features
    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("ageFeature", "heightFeature", "weightFeature", "sexIndex", "addOneHot"))
      .setOutputCol("features")

    //3.2 Standardize the features
    val scaler = new StandardScaler().setInputCol("features").setOutputCol("featureScaler")
      .setWithStd(true)   // scale to unit standard deviation
      .setWithMean(false) // do not center with the mean

    // Feature filtering using the chi-squared test
    val selector = new ChiSqSelector().setFeaturesCol("featureScaler").setLabelCol("label")
      .setOutputCol("featuresSelector")

    // Logistic regression model
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("featuresSelector")

    // Build the pipeline
    val pipeline = new Pipeline().setStages(Array(vectorAssembler, scaler, selector, lr))

    // Grid search for the best parameters
    val params = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01))            // regularization parameter
      .addGrid(selector.numTopFeatures, Array(5, 10, 5)) // number of top features kept by the chi-squared selector
      .build()

    // Cross-validation
    val cv = new CrossValidator().setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(params).setNumFolds(5)

    // Model training
    val Array(trainDF, testDF) = resultDF.randomSplit(Array(0.8, 0.2))
    trainDF.show()
    testDF.show()
    val model = cv.fit(trainDF) // build the model
//    val model = pipeline.fit(trainDF)
//    val prediction = model.transform(testDF)
//    prediction.show()
    // Model prediction
//    val prediction = model.bestModel.transform(testDF)
//    prediction.show()
    spark.stop()
  }
}
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>untitled</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.18</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0-preview2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>3.1.2</version>
<!--            <scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.0.0-preview2</version>
<!--            <scope>compile</scope>-->
        </dependency>
<!--        <dependency>-->
<!--            <groupId>mysql</groupId>-->
<!--            <artifactId>mysql-connector-java</artifactId>-->
<!--            <version>8.0.16</version>-->
<!--        </dependency>-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.5.0</version>
<!--            <scope>compile</scope>-->
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.xxg.Main</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
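An editorial observation on the reported error (a guess, not verified): the pom above mixes Spark artifact versions (spark-core and spark-sql at 3.0.0-preview2, spark-hive at 3.1.2, spark-mllib at 3.5.0). The missing class org.apache.spark.sql.catalyst.trees.BinaryLike lives in the catalyst module of newer Spark releases, so spark-mllib 3.5.0 cannot find it in the much older spark-sql on the classpath. Aligning every Spark artifact to a single version may resolve the NoClassDefFoundError; a sketch of such an alignment:

```xml
<properties>
    <spark.version>3.5.0</spark.version>
    <scala.binary.version>2.12</scala.binary.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
```

Running mvn dependency:tree can confirm which versions of the Spark modules actually end up on the classpath.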