The Data Mining Process

A data mining task breaks down into six steps:

1. Data preprocessing
2. Feature transformation
3. Feature selection
4. Model training
5. Model prediction
6. Evaluating the predictions

Data Preparation

Here we prepare a dataset of 20 people from different regions, of different sexes, heights, weights, and so on, together with each person's hobby (saved as hobby.csv):

id,hobby,sex,address,age,height,weight
1,football,male,dalian,12,168,55
2,pingpang,female,yangzhou,21,163,60
3,football,male,dalian,,172,70
4,football,female,,13,167,58
5,pingpang,female,shanghai,63,170,64
6,football,male,dalian,30,177,76
7,basketball,male,shanghai,25,181,90
8,football,male,dalian,15,172,71
9,basketball,male,shanghai,25,179,80
10,pingpang,male,shanghai,55,175,72
11,football,male,dalian,13,169,55
12,pingpang,female,yangzhou,22,164,61
13,football,male,dalian,23,170,71
14,football,female,,12,164,55
15,pingpang,female,shanghai,64,169,63
16,football,male,dalian,30,177,76
17,basketball,male,shanghai,22,180,80
18,football,male,dalian,16,173,72
19,basketball,male,shanghai,23,176,73
20,pingpang,male,shanghai,56,171,71

Task Analysis

The goal is to predict a person's hobby from the five features sex, address, age, height, and weight.

Data Preprocessing

Creating the Spark session

Before any data can be read, a Spark session object is needed. Build it with SparkSession.builder(), set appName and master, and finish with getOrCreate():

// create the spark session
val spark = SparkSession.builder()
  .appName("兴趣预测")
  .master("local[*]")
  .getOrCreate()

Reading the data

Read the file with spark.read: set the format to "csv", treat the first row as the header, and pass the file path:

val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("C:/Users/35369/Desktop/hobby.csv")

Inspect the data with df.show() and df.printSchema():

df.show()
df.printSchema()

Output:

-------------------------------------------
| id| hobby| sex| address| age|height|weight|
-------------------------------------------
| 1| football| male| dalian| 12| 168| 55|
| 2| pingpang|female|yangzhou| 21| 163| 60|
| 3| football| male| dalian|null| 172| 70|
| 4| football|female| null| 13| 167| 58|
| 5| pingpang|female|shanghai| 63| 170| 64|
| 6| football| male| dalian| 30| 177| 76|
| 7|basketball| male|shanghai| 25| 181| 90|
| 8| football| male| dalian| 15| 172| 71|
| 9|basketball| male|shanghai| 25| 179| 80|
| 10| pingpang| male|shanghai| 55| 175| 72|
| 11| football| male| dalian| 13| 169| 55|
| 12| pingpang|female|yangzhou| 22| 164| 61|
| 13| football| male| dalian| 23| 170| 71|
| 14| football|female| null| 12| 164| 55|
| 15| pingpang|female|shanghai| 64| 169| 63|
| 16| football| male| dalian| 30| 177| 76|
| 17|basketball| male|shanghai| 22| 180| 80|
| 18| football| male| dalian| 16| 173| 72|
| 19|basketball| male|shanghai| 23| 176| 73|
| 20| pingpang| male|shanghai| 56| 171| 71|
-------------------------------------------

root
 |-- id: string (nullable = true)
 |-- hobby: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: string (nullable = true)
 |-- height: string (nullable = true)
 |-- weight: string (nullable = true)

Filling the rows with a missing age

Filling a numeric column takes three steps:

1. Select the column with the null rows dropped.
2. Compute the mean of that column.
3. Fill the mean back into the original table.

Step 1: select the column with nulls dropped:

val ageNaDF = df.select("age").na.drop()
ageNaDF.show()

---
|age|
---
| 12|
| 21|
| 13|
| 63|
| 30|
| 25|
| 15|
| 25|
| 55|
| 13|
| 22|
| 23|
| 12|
| 64|
| 30|
| 22|
| 16|
| 23|
| 56|
---

Step 2: compute the mean of that column. Look at the summary statistics of ageNaDF:

ageNaDF.describe("age").show()

Output:

------------------------
|summary|              age|
------------------------
|  count|               19|
|   mean|28.42105263157895|
| stddev|17.48432882286206|
|    min|               12|
|    max|               64|
------------------------

The mean is 28.42105263157895; pull it out of the describe() result:

val mean = ageNaDF.describe("age").select("age").collect()(1)(0).toString
print(mean)  // 28.42105263157895

Step 3: fill the mean into the original table. The df.na.fill() method fills null values; the second argument List("age") restricts the fill to the age column:

val ageFilledDF = df.na.fill(mean, List("age"))
ageFilledDF.show()

Output:

--------------------------------------------------------
| id| hobby| sex| address| age|height|weight|
--------------------------------------------------------
| 1| football| male| dalian| 12| 168| 55|
| 2| pingpang|female|yangzhou| 21| 163| 60|
| 3| football| male| dalian|28.42105263157895| 172| 70|
| 4| football|female| null| 13| 167| 58|
| 5| pingpang|female|shanghai| 63| 170| 64|
| 6| football| male| dalian| 30| 177| 76|
| 7|basketball| male|shanghai| 25| 181| 90|
| 8| football| male| dalian| 15| 172| 71|
| 9|basketball| male|shanghai| 25| 179| 80|
| 10| pingpang| male|shanghai| 55| 175| 72|
| 11| football| male| dalian| 13| 169| 55|
| 12| pingpang|female|yangzhou| 22| 164| 61|
| 13| football| male| dalian| 23| 170| 71|
| 14| football|female| null| 12| 164| 55|
| 15| pingpang|female|shanghai| 64| 169| 63|
| 16| football| male| dalian| 30| 177| 76|
| 17|basketball| male|shanghai| 22| 180| 80|
| 18| football| male| dalian| 16| 173| 72|
| 19|basketball| male|shanghai| 23| 176| 73|
| 20| pingpang| male|shanghai| 56| 171| 71|
--------------------------------------------------------

The nulls in the age column have been filled with the mean.
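As an aside, Spark ML also ships an Imputer transformer that collapses these three steps into one. A minimal sketch, not part of the original post; the name ageFilled is chosen here for illustration, and Imputer requires a numeric input column, so age is cast first:

import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.col

// Imputer only works on numeric columns; "age" is a string at this point, so cast it.
val numericDF = df.withColumn("age", col("age").cast("double"))

// Fill nulls in "age" with the column mean (Imputer's default strategy).
val imputedDF = new Imputer()
  .setInputCols(Array("age"))
  .setOutputCols(Array("ageFilled"))  // hypothetical output column name
  .setStrategy("mean")
  .fit(numericDF)
  .transform(numericDF)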
Dropping the rows with a missing address

There is no reasonable value to fill in for a missing city, so rows where address is null are simply dropped, using the .na.drop() method:

val addressDf = ageFilledDF.na.drop()
addressDf.show()

Output:

--------------------------------------------------------
| id| hobby| sex| address| age|height|weight|
--------------------------------------------------------
| 1| football| male| dalian| 12| 168| 55|
| 2| pingpang|female|yangzhou| 21| 163| 60|
| 3| football| male| dalian|28.42105263157895| 172| 70|
| 5| pingpang|female|shanghai| 63| 170| 64|
| 6| football| male| dalian| 30| 177| 76|
| 7|basketball| male|shanghai| 25| 181| 90|
| 8| football| male| dalian| 15| 172| 71|
| 9|basketball| male|shanghai| 25| 179| 80|
| 10| pingpang| male|shanghai| 55| 175| 72|
| 11| football| male| dalian| 13| 169| 55|
| 12| pingpang|female|yangzhou| 22| 164| 61|
| 13| football| male| dalian| 23| 170| 71|
| 15| pingpang|female|shanghai| 64| 169| 63|
| 16| football| male| dalian| 30| 177| 76|
| 17|basketball| male|shanghai| 22| 180| 80|
| 18| football| male| dalian| 16| 173| 72|
| 19|basketball| male|shanghai| 23| 176| 73|
| 20| pingpang| male|shanghai| 56| 171| 71|
--------------------------------------------------------

Rows 4 and 14 have been dropped.

Casting each column to a sensible type

// adjust the schema of the DataFrame
val formatDF = addressDf.select(
  col("id").cast("int"),
  col("hobby").cast("String"),
  col("sex").cast("String"),
  col("address").cast("String"),
  col("age").cast("Double"),
  col("height").cast("Double"),
  col("weight").cast("Double"))
formatDF.printSchema()

Output:

root
 |-- id: integer (nullable = true)
 |-- hobby: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: double (nullable = true)
 |-- height: double (nullable = true)
 |-- weight: double (nullable = true)

This completes the data preprocessing stage.

Feature Transformation

To make model training easier, the feature transformation stage buckets or binarizes age, weight, and height and encodes address and sex.

Bucketing age

Buckets: under 18, 18-35, 35-60, and over 60.

The Bucketizer class does the bucketing: set the input and output column names, hand it the bucket boundaries defined above as its splits, and finally give it the DataFrame to transform:

//2.1 bucket the age column
// define an array with the bucket boundaries
val ageSplits = Array(Double.NegativeInfinity, 18, 35, 60, Double.PositiveInfinity)
val bucketizerDF = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("ageFeature")
  .setSplits(ageSplits)
  .transform(formatDF)
bucketizerDF.show()

The bucketing result:

------------------------------------------------------------------
| id| hobby| sex| address| age|height|weight|ageFeature|
------------------------------------------------------------------
| 1| football| male| dalian| 12.0| 168.0| 55.0| 0.0|
| 2| pingpang|female|yangzhou| 21.0| 163.0| 60.0| 1.0|
| 3| football| male| dalian|28.42105263157895| 172.0| 70.0| 1.0|
| 5| pingpang|female|shanghai| 63.0| 170.0| 64.0| 3.0|
| 6| football| male| dalian| 30.0| 177.0| 76.0| 1.0|
| 7|basketball| male|shanghai| 25.0| 181.0| 90.0| 1.0|
| 8| football| male| dalian| 15.0| 172.0| 71.0| 0.0|
| 9|basketball| male|shanghai| 25.0| 179.0| 80.0| 1.0|
| 10| pingpang| male|shanghai| 55.0| 175.0| 72.0| 2.0|
| 11| football| male| dalian| 13.0| 169.0| 55.0| 0.0|
| 12| pingpang|female|yangzhou| 22.0| 164.0| 61.0| 1.0|
| 13| football| male| dalian| 23.0| 170.0| 71.0| 1.0|
| 15| pingpang|female|shanghai| 64.0| 169.0| 63.0| 3.0|
| 16| football| male| dalian| 30.0| 177.0| 76.0| 1.0|
| 17|basketball| male|shanghai| 22.0| 180.0| 80.0| 1.0|
| 18| football| male| dalian| 16.0| 173.0| 72.0| 0.0|
| 19|basketball| male|shanghai| 23.0| 176.0| 73.0| 1.0|
| 20| pingpang| male|shanghai| 56.0| 171.0| 71.0| 2.0|
------------------------------------------------------------------
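If you want a quicker sanity check than reading the whole table, counting the rows per bucket works too. A one-line sketch, not in the original post:

// How many rows landed in each age bucket?
bucketizerDF.groupBy("ageFeature").count().orderBy("ageFeature").show()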
Binarizing height

The threshold is 170, using the Binarizer class:

//2.2 binarize the height column
val heightDF = new Binarizer()
  .setInputCol("height")
  .setOutputCol("heightFeature")
  .setThreshold(170)  // threshold
  .transform(bucketizerDF)
heightDF.show()

The result:

-------------------------------------------------------------------------------
| id| hobby| sex| address| age|height|weight|ageFeature|heightFeature|
-------------------------------------------------------------------------------
| 1| football| male| dalian| 12.0| 168.0| 55.0| 0.0| 0.0|
| 2| pingpang|female|yangzhou| 21.0| 163.0| 60.0| 1.0| 0.0|
| 3| football| male| dalian|28.42105263157895| 172.0| 70.0| 1.0| 1.0|
| 5| pingpang|female|shanghai| 63.0| 170.0| 64.0| 3.0| 0.0|
| 6| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0|
| 7|basketball| male|shanghai| 25.0| 181.0| 90.0| 1.0| 1.0|
| 8| football| male| dalian| 15.0| 172.0| 71.0| 0.0| 1.0|
| 9|basketball| male|shanghai| 25.0| 179.0| 80.0| 1.0| 1.0|
| 10| pingpang| male|shanghai| 55.0| 175.0| 72.0| 2.0| 1.0|
| 11| football| male| dalian| 13.0| 169.0| 55.0| 0.0| 0.0|
| 12| pingpang|female|yangzhou| 22.0| 164.0| 61.0| 1.0| 0.0|
| 13| football| male| dalian| 23.0| 170.0| 71.0| 1.0| 0.0|
| 15| pingpang|female|shanghai| 64.0| 169.0| 63.0| 3.0| 0.0|
| 16| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0|
| 17|basketball| male|shanghai| 22.0| 180.0| 80.0| 1.0| 1.0|
| 18| football| male| dalian| 16.0| 173.0| 72.0| 0.0| 1.0|
| 19|basketball| male|shanghai| 23.0| 176.0| 73.0| 1.0| 1.0|
| 20| pingpang| male|shanghai| 56.0| 171.0| 71.0| 2.0| 1.0|
-------------------------------------------------------------------------------

Binarizing weight

The threshold is set to 65:

//2.3 binarize the weight column
val weightDF = new Binarizer()
  .setInputCol("weight")
  .setOutputCol("weightFeature")
  .setThreshold(65)
  .transform(heightDF)
weightDF.show()

Encoding the sex, address, and hobby fields

These three fields are strings, and strings are not a form that machine learning algorithms can analyze directly, so they also need to be encoded during feature transformation:

//2.4 label-encode sex
val sexIndex = new StringIndexer()
  .setInputCol("sex")
  .setOutputCol("sexIndex")
  .fit(weightDF)
  .transform(weightDF)
//2.5 label-encode the home address
val addIndex = new StringIndexer()
  .setInputCol("address")
  .setOutputCol("addIndex")
  .fit(sexIndex)
  .transform(sexIndex)
//2.6 one-hot encode the address
val addOneHot = new OneHotEncoder()
  .setInputCol("addIndex")
  .setOutputCol("addOneHot")
  .fit(addIndex)
  .transform(addIndex)
//2.7 label-encode the hobby field
val hobbyIndexDF = new StringIndexer()
  .setInputCol("hobby")
  .setOutputCol("hobbyIndex")
  .fit(addOneHot)
  .transform(addOneHot)
hobbyIndexDF.show()

Note that address additionally gets a one-hot encoding on top of the label encoding.

Rename the hobbyIndex column to label, since hobby serves as the label during model training:

//2.8 rename the column
val resultDF = hobbyIndexDF.withColumnRenamed("hobbyIndex", "label")
resultDF.show()

The final result of the feature transformation:

------------------------------------------------------------------------------------------------------------------------------
| id| hobby| sex| address| age|height|weight|ageFeature|heightFeature|weightFeature|sexIndex|addIndex| addOneHot|label|
------------------------------------------------------------------------------------------------------------------------------
| 1| football| male| dalian| 12.0| 168.0| 55.0| 0.0| 0.0| 0.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 2| pingpang|female|yangzhou| 21.0| 163.0| 60.0| 1.0| 0.0| 0.0| 1.0| 2.0| (2,[],[])| 1.0|
| 3| football| male| dalian|28.42105263157895| 172.0| 70.0| 1.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 5| pingpang|female|shanghai| 63.0| 170.0| 64.0| 3.0| 0.0| 0.0| 1.0| 1.0|(2,[1],[1.0])| 1.0|
| 6| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 7|basketball| male|shanghai| 25.0| 181.0| 90.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 8| football| male| dalian| 15.0| 172.0| 71.0| 0.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 9|basketball| male|shanghai| 25.0| 179.0| 80.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 10| pingpang| male|shanghai| 55.0| 175.0| 72.0| 2.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 1.0|
| 11| football| male| dalian| 13.0| 169.0| 55.0| 0.0| 0.0| 0.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 12| pingpang|female|yangzhou| 22.0| 164.0| 61.0| 1.0| 0.0| 0.0| 1.0| 2.0| (2,[],[])| 1.0|
| 13| football| male| dalian| 23.0| 170.0| 71.0| 1.0| 0.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 15| pingpang|female|shanghai| 64.0| 169.0| 63.0| 3.0| 0.0| 0.0| 1.0| 1.0|(2,[1],[1.0])| 1.0|
| 16| football| male| dalian| 30.0| 177.0| 76.0| 1.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 17|basketball| male|shanghai| 22.0| 180.0| 80.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 18| football| male| dalian| 16.0| 173.0| 72.0| 0.0| 1.0| 1.0| 0.0| 0.0|(2,[0],[1.0])| 0.0|
| 19|basketball| male|shanghai| 23.0| 176.0| 73.0| 1.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 2.0|
| 20| pingpang| male|shanghai| 56.0| 171.0| 71.0| 2.0| 1.0| 1.0| 0.0| 1.0|(2,[1],[1.0])| 1.0|
------------------------------------------------------------------------------------------------------------------------------
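Chaining fit/transform by hand like this works, but the same encoders can also be declared once, unfitted, and composed with a Pipeline, which then fits and applies them in order (this mirrors what the training pipeline below does for the model stages). A sketch, not part of the original post:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Declare the encoders without fitting them; Pipeline.fit() fits each stage in order.
val sexIndexer   = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
val addIndexer   = new StringIndexer().setInputCol("address").setOutputCol("addIndex")
val addEncoder   = new OneHotEncoder().setInputCol("addIndex").setOutputCol("addOneHot")
val hobbyIndexer = new StringIndexer().setInputCol("hobby").setOutputCol("hobbyIndex")

// weightDF is the DataFrame produced by the binarizer steps above.
val encodedDF = new Pipeline()
  .setStages(Array(sexIndexer, addIndexer, addEncoder, hobbyIndexer))
  .fit(weightDF)
  .transform(weightDF)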
Feature Selection

Feature transformation leaves us with many columns, but not all of them should be fed to the model. Feature selection picks out the columns that will actually be used for machine learning.

Selecting the features

VectorAssembler gathers the chosen columns into a single feature vector:

//3.1 select the features
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("ageFeature", "heightFeature", "weightFeature", "sexIndex", "addOneHot"))
  .setOutputCol("features")

Standardizing the features

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("featureScaler")
  .setWithStd(true)   // scale to unit standard deviation
  .setWithMean(false) // do not center with the mean

Filtering the features

// feature filtering, using a chi-squared test
val selector = new ChiSqSelector()
  .setFeaturesCol("featureScaler")
  .setLabelCol("label")
  .setOutputCol("featuresSelector")

Building the logistic regression model and the pipeline

// logistic regression model
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("featuresSelector")
// build the pipeline
val pipeline = new Pipeline()
  .setStages(Array(vectorAssembler, scaler, selector, lr))

Setting up a grid search for the best parameters

// grid search over the parameters
val params = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))            // regularization parameter
  .addGrid(selector.numTopFeatures, Array(5, 10, 5)) // number of features kept by the chi-squared selector
  .build()

Setting up cross-validation

// cross-validation
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(params)
  .setNumFolds(5)
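One thing worth flagging: label has three classes here (football, pingpang, basketball), while BinaryClassificationEvaluator assumes a binary label, so the cross-validation metric would be questionable even once the program runs. A hedged alternative, not in the original code, using the multiclass evaluator instead:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Accuracy over all three hobby classes.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

// Same CrossValidator as above, but with the multiclass evaluator plugged in.
val cvMulti = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(params)
  .setNumFolds(5)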
Model Training and Prediction

Before training, the data needs to be split into a training set and a test set; the randomSplit method does the split:

val Array(trainDF, testDF) = resultDF.randomSplit(Array(0.8, 0.2))

Train and predict:

val model = cv.fit(trainDF)
// model prediction
val prediction = model.bestModel.transform(testDF)
prediction.show()

An error I'm asking for help with

The program crashes at cv.fit(trainDF), and I couldn't find anything about this message online:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/trees/BinaryLike
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.spark.ml.stat.SummaryBuilderImpl.summary(Summarizer.scala:251)
	at org.apache.spark.ml.stat.SummaryBuilder.summary(Summarizer.scala:54)
	at org.apache.spark.ml.feature.StandardScaler.fit(StandardScaler.scala:112)
	at org.apache.spark.ml.feature.StandardScaler.fit(StandardScaler.scala:84)
	at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
	at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
	at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
	at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
	at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
	at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
	at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:93)
	at org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
	at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$7(CrossValidator.scala:174)
	at scala.runtime.java8.JFunction0$mcD$sp.apply(JFunction0$mcD$sp.java:23)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at org.sparkproject.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
	at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete(Promise.scala:372)
	at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete$(Promise.scala:371)
	at scala.concurrent.impl.Promise$KeptPromise$Successful.onComplete(Promise.scala:379)
	at scala.concurrent.impl.Promise.transform(Promise.scala:33)
	at scala.concurrent.impl.Promise.transform$(Promise.scala:31)
	at scala.concurrent.impl.Promise$KeptPromise$Successful.transform(Promise.scala:379)
	at scala.concurrent.Future.map(Future.scala:292)
	at scala.concurrent.Future.map$(Future.scala:292)
	at scala.concurrent.impl.Promise$KeptPromise$Successful.map(Promise.scala:379)
	at scala.concurrent.Future$.apply(Future.scala:659)
	at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$6(CrossValidator.scala:182)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$4(CrossValidator.scala:172)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at org.apache.spark.ml.tuning.CrossValidator.$anonfun$fit$1(CrossValidator.scala:166)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:137)
	at org.example.SparkML.SparkMl01$.main(SparkMl01.scala:147)
	at org.example.SparkML.SparkMl01.main(SparkMl01.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.trees.BinaryLike
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Full Source Code and pom File

package org.example.SparkML

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{Binarizer, Bucketizer, ChiSqSelector, OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

/**
 * The data mining process
 * 1. data preprocessing
 * 2. feature transformation (encoding, ...)
 * 3. feature selection
 * 4. model training
 * 5. model prediction
 * 6. evaluating the predictions
 */
object SparkMl01 {
  def main(args: Array[String]): Unit = {
    // create the spark session
    val spark = SparkSession.builder()
      .appName("兴趣预测")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val df = spark.read.format("csv").option("header", "true")
      .load("C:/Users/35369/Desktop/hobby.csv")

    // 1. data preprocessing: fill in the missing ages
    val ageNaDF = df.select("age").na.drop()
    val mean = ageNaDF.describe("age").select("age").collect()(1)(0).toString
    val ageFilledDF = df.na.fill(mean, List("age"))
    // drop the rows where address is null
    val addressDf = ageFilledDF.na.drop()
    // adjust the schema of the DataFrame
    val formatDF = addressDf.select(
      col("id").cast("int"), col("hobby").cast("String"), col("sex").cast("String"),
      col("address").cast("String"), col("age").cast("Double"),
      col("height").cast("Double"), col("weight").cast("Double"))

    // 2. feature transformation
    //2.1 bucket the age column
    // define an array with the bucket boundaries
    val ageSplits = Array(Double.NegativeInfinity, 18, 35, 60, Double.PositiveInfinity)
    val bucketizerDF = new Bucketizer().setInputCol("age").setOutputCol("ageFeature")
      .setSplits(ageSplits).transform(formatDF)
    //2.2 binarize the height column
    val heightDF = new Binarizer().setInputCol("height").setOutputCol("heightFeature")
      .setThreshold(170) // threshold
      .transform(bucketizerDF)
    //2.3 binarize the weight column
    val weightDF = new Binarizer().setInputCol("weight").setOutputCol("weightFeature")
      .setThreshold(65).transform(heightDF)
    //2.4 label-encode sex
    val sexIndex = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
      .fit(weightDF).transform(weightDF)
    //2.5 label-encode the home address
    val addIndex = new StringIndexer().setInputCol("address").setOutputCol("addIndex")
      .fit(sexIndex).transform(sexIndex)
    //2.6 one-hot encode the address
    val addOneHot = new OneHotEncoder().setInputCol("addIndex").setOutputCol("addOneHot")
      .fit(addIndex).transform(addIndex)
    //2.7 label-encode the hobby field
    val hobbyIndexDF = new StringIndexer().setInputCol("hobby").setOutputCol("hobbyIndex")
      .fit(addOneHot).transform(addOneHot)
    //2.8 rename the column
    val resultDF = hobbyIndexDF.withColumnRenamed("hobbyIndex", "label")

    // 3. feature selection
    //3.1 select the features
    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("ageFeature", "heightFeature", "weightFeature", "sexIndex", "addOneHot"))
      .setOutputCol("features")
    //3.2 standardize the features
    val scaler = new StandardScaler().setInputCol("features").setOutputCol("featureScaler")
      .setWithStd(true)   // scale to unit standard deviation
      .setWithMean(false) // do not center with the mean
    // feature filtering, using a chi-squared test
    val selector = new ChiSqSelector().setFeaturesCol("featureScaler")
      .setLabelCol("label").setOutputCol("featuresSelector")
    // logistic regression model
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("featuresSelector")
    // build the pipeline
    val pipeline = new Pipeline().setStages(Array(vectorAssembler, scaler, selector, lr))
    // grid search over the parameters
    val params = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01))            // regularization parameter
      .addGrid(selector.numTopFeatures, Array(5, 10, 5)) // number of features kept by the chi-squared selector
      .build()
    // cross-validation
    val cv = new CrossValidator().setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(params).setNumFolds(5)

    // model training
    val Array(trainDF, testDF) = resultDF.randomSplit(Array(0.8, 0.2))
    trainDF.show()
    testDF.show()
    val model = cv.fit(trainDF) // train the model
//    val model = pipeline.fit(trainDF)
//    val prediction = model.transform(testDF)
//    prediction.show()

    // model prediction
//    val prediction = model.bestModel.transform(testDF)
//    prediction.show()

    spark.stop()
  }
}

pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>untitled</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.18</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0-preview2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>3.1.2</version>
<!--            <scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.0.0-preview2</version>
<!--            <scope>compile</scope>-->
        </dependency>
<!--        <dependency>-->
<!--            <groupId>mysql</groupId>-->
<!--            <artifactId>mysql-connector-java</artifactId>-->
<!--            <version>8.0.16</version>-->
<!--        </dependency>-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.5.0</version>
<!--            <scope>compile</scope>-->
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.xxg.Main</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
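A note on the error, judging from the trace and the pom above: org.apache.spark.sql.catalyst.trees.BinaryLike only exists in the Spark SQL/Catalyst jars from Spark 3.2 onward, and spark-mllib 3.5.0 is compiled against it, while this pom pins spark-core and spark-sql to 3.0.0-preview2 (and spark-hive to 3.1.2), so the older SQL jar on the classpath is missing the class MLlib needs. The likely fix is to align every Spark artifact to a single version. A sketch of the dependency section, assuming Spark 3.5.0 and Scala 2.12 (untested against this exact project, so adjust to your environment):

<properties>
    <!-- one shared version for all Spark modules -->
    <spark.version>3.5.0</spark.version>
    <scala.binary.version>2.12</scala.binary.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>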