Implementing Some Simple Features in Spark

1、Renaming columns in batch
import org.apache.spark.sql.DataFrame

val frame = spark.createDataFrame(Seq(("zhangsan", 18, "nan"), ("lisi", 20, "nv"))).toDF("name", "age", "sex")
// Rename mapping in the form "oldName:newName"; multiple pairs are separated by commas
val str = "name:name1"
if (str == "") {
  frame.show()
} else {
  val map: Map[String, String] = str.split(",").map(data => {
    val arr = data.split(":")
    (arr(0), arr(1))
  }).toMap
  import org.apache.spark.sql.functions._
  // Columns found in the map are renamed; all others keep their original name
  val frame1: DataFrame = frame.select(frame.columns.map(c => col(c).as(map.getOrElse(c, c))): _*)
  frame1.show()
}
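An equivalent approach is to fold the rename map over the DataFrame with withColumnRenamed; a minimal sketch, assuming the same frame and map as above:

// Apply one withColumnRenamed call per (old, new) pair; unmapped columns are untouched
val renamed = map.foldLeft(frame) { case (df, (oldName, newName)) =>
  df.withColumnRenamed(oldName, newName)
}
renamed.show()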
2、Adding an auto-increment column

1) Using zipWithIndex on the underlying RDD

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val frame = spark.createDataFrame(Seq(("zhangsan", 18, "nan"), ("lisi", 20, "nv"))).toDF("name", "age", "sex")
val schema: StructType = frame.schema.add(StructField("SortId", LongType))
val dfRDD: RDD[(Row, Long)] = frame.rdd.zipWithIndex()
// Append the (1-based) index to each row
val rowRDD = dfRDD.map(tp => Row.merge(tp._1, Row(tp._2 + 1)))
spark.createDataFrame(rowRDD, schema).show()

2) Using monotonically_increasing_id

import org.apache.spark.sql.functions._
// The generated ids are unique and increasing, but not guaranteed to be consecutive across partitions
frame.withColumn("id", monotonically_increasing_id() + 1)
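If strictly consecutive ids are required, row_number over a window is another option. A minimal sketch (ordering by "name" is arbitrary here); note that an unpartitioned window pulls all rows onto a single partition, so it only suits small data:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// row_number starts at 1 and is consecutive within the window
val w = Window.orderBy("name")
frame.withColumn("id", row_number().over(w)).show()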
3、Another way to read a CSV file (the first row can be used as the column names)

spark.read
  // infer the data types
  .option("inferSchema", "true")
  // field delimiter, defaults to ","
  //.option("delimiter", ",")
  // value to treat as null
  //.option("nullValue", "?")
  // the file has a header row; use false if it does not
  .option("header", true)
  // file path
  .csv("D:\\software\\idea\\WorkSpace\\com.program.UserBehaviorAnalysis\\LogAnalysis\\src\\main\\resources\\UserBehavior.csv")
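When the column types are already known, an explicit schema avoids the extra pass over the data that inferSchema costs. A minimal sketch; the field names, types and path below are only placeholders:

import org.apache.spark.sql.types._

// Hypothetical schema; replace the fields with the real CSV layout
val csvSchema = StructType(Seq(
  StructField("userId", LongType),
  StructField("itemId", LongType),
  StructField("behavior", StringType),
  StructField("timestamp", LongType)
))

spark.read
  .schema(csvSchema)          // no type inference needed
  .option("header", true)
  .csv("path/to/UserBehavior.csv")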
4、Taking the first n rows of a DataFrame

// limit(10) keeps only the first 10 rows
val frame = SourceFrame.limit(10).toDF()
frame.show()
5、Splitting a DataFrame into two DataFrames by specified columns

import scala.collection.mutable.ArrayBuffer

val selectArray: ArrayBuffer[String] = new ArrayBuffer[String]()
selectArray.+=("ClickCount").+=("aid")
// First DataFrame: only the specified columns
val WriteFrame1 = SourceFrame.selectExpr(selectArray: _*)
WriteFrame1.show()
// Second DataFrame: all remaining columns
val array: Array[String] = SourceFrame.columns
val strings: Array[String] = array.diff(selectArray)
val writeFrame2 = SourceFrame.selectExpr(strings: _*)
writeFrame2.show()
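For the second DataFrame, drop expresses the same thing more directly; a minimal sketch reusing selectArray:

// Equivalent to selecting everything except the listed columns
val remaining = SourceFrame.drop(selectArray: _*)
remaining.show()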
6、Arithmetic aggregations, mainly variance

import org.apache.spark.sql.functions._

val Frame = SourceFrame.groupBy()
  .agg(
    max("ct").alias("maxnum"),
    min("ct").alias("minnum"),
    avg("ct").alias("avgnum"),
    var_pop("ct").alias("variance") // population variance
  )
Frame.show()
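Besides var_pop, Spark also exposes the sample variance and both standard deviations; a minimal sketch on the same "ct" column:

SourceFrame.groupBy()
  .agg(
    var_samp("ct").alias("sample_variance"),    // divides by n - 1 instead of n
    stddev_pop("ct").alias("stddev_population"),
    stddev_samp("ct").alias("stddev_sample")
  )
  .show()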
7、Reading from a database with Spark

val frame = spark.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver") // for MySQL driver 8.0+ use com.mysql.cj.jdbc.Driver
  //.option("driver", "oracle.jdbc.driver.OracleDriver") // Oracle
  //.option("driver", "org.postgresql.Driver") // PostgreSQL
  .option("url", "jdbc:mysql://192.168.216.201:3306/nshop") // newer MySQL versions need ?serverTimezone=UTC appended to the url
  .option("dbtable", "orders")
  .option("user", "root")
  .option("password", "123456")
  .load()
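Writing back over JDBC follows the same pattern; a minimal sketch assuming the same MySQL connection details and a hypothetical target table orders_copy:

import org.apache.spark.sql.SaveMode

frame.write
  .format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", "jdbc:mysql://192.168.216.201:3306/nshop")
  .option("dbtable", "orders_copy")   // hypothetical target table
  .option("user", "root")
  .option("password", "123456")
  .mode(SaveMode.Append)              // or SaveMode.Overwrite, depending on the use case
  .save()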
8、Calling a Spark jar from Java

import java.io.IOException;
import org.apache.spark.deploy.SparkSubmit;

public class UseJar {
    public static void main(String[] args) throws IOException {
        String[] arr = new String[]{
            "--master", "spark://192.168.216.201:7077", // master ip and port
            "--deploy-mode", "client",
            "--name", "spark submit test",
            "--class", className, // fully qualified name of the main class to run
            "--executor-memory", "2G",
            "--total-executor-cores", "10",
            "--executor-cores", "2",
            "D:\\software\\idea\\WorkSpace\\JarTestProject\\target\\JarTestProject-1.0-SNAPSHOT.jar", // path to the jar
            "LR", "20180817111111", "66" // the last three entries are arguments passed to the jar
        };
        SparkSubmit.main(arr);
    }
}
// Note: the Spark version declared in the IDEA pom must match the Spark version on the VM or cluster,
// otherwise deserialization problems occur. This cost me a lot of time, so keep it in mind.
9、pom file contents for Maven packaging

<build>
  <plugins>
    <!-- compiles the Java code -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.2</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <encoding>UTF-8</encoding>
      </configuration>
      <executions>
        <execution>
          <phase>compile</phase>
          <goals>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <!-- compiles the Scala code -->
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <id>scala-compile-first</id>
          <phase>process-resources</phase>
          <goals>
            <goal>add-source</goal>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
10、Writing a DataFrame out as a CSV file with a header

frame
  .coalesce(1) // produce a single output file
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .csv("hdfs://192.168.216.201:9000/output")
11、Creating a directory on HDFS from Scala or Java code

1) Java API

// required imports
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public static void main(String[] args) throws IOException {
    // the user that the Hadoop operations run as
    System.setProperty("HADOOP_USER_NAME", "hdfs");
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://192.168.216.201:9000");
    FileSystem hdfs = FileSystem.get(conf);
    boolean mk = hdfs.mkdirs(new Path("hdfs:/output/mk"));
    boolean delete = hdfs.deleteOnExit(new Path("hdfs:/output/mk"));
    boolean rename = hdfs.rename(new Path("hdfs:/output/mk"), new Path("hdfs:/output/mkdir"));
    hdfs.close();
}
2) Scala implementation

// required imports
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def main(args: Array[String]): Unit = {
  val conf = new Configuration
  conf.set("fs.defaultFS", "hdfs://Hadoop001:9000")
  val hdfs = FileSystem.get(conf)
  val isok = hdfs.mkdirs(new Path("hdfs:/input/data1"))
  val delete = hdfs.deleteOnExit(new Path("hdfs:/output/mk"))
  val reName = hdfs.rename(new Path("hdfs:/output/mk"), new Path("hdfs:/output/mkdir"))
  hdfs.close()
}
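A couple of other commonly used FileSystem calls follow the same pattern; a small sketch assuming the same hdfs handle as above:

import org.apache.hadoop.fs.FileStatus

// check whether a path exists before creating or deleting it
val exists: Boolean = hdfs.exists(new Path("hdfs:/output/mkdir"))

// list the entries directly under a directory
val statuses: Array[FileStatus] = hdfs.listStatus(new Path("hdfs:/output"))
statuses.foreach(status => println(status.getPath))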
12、Reading files of the same type from multiple paths with Spark

// CSV is used as the example; other file types work the same way
// Looking at the source of the .json/.text/.csv methods on spark.read, e.g.
// def text(path: String): DataFrame = { text(Seq(path): _*) }
// shows that the paths are collected into a Seq, so a varargs call reads every path in the Seq
def ReadCsv(spark: SparkSession, inpath: ArrayBuffer[String], outpath: String): Unit = {
  val frame = spark.read.csv(inpath: _*)
  frame.coalesce(1).write.mode(SaveMode.Overwrite).json(outpath)
  spark.close()
}
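Calling it then looks like the sketch below; the input and output paths are hypothetical placeholders:

import scala.collection.mutable.ArrayBuffer

// hypothetical directories that contain CSV files with the same layout
val inputs = ArrayBuffer("hdfs:/input/csv/day1", "hdfs:/input/csv/day2")
ReadCsv(spark, inputs, "hdfs:/output/merged")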
13、Throwing exceptions from Scala's main method differs from Java

def main(args: Array[String]): Unit = {
  throw new Exception("This is an Exception")
}

In Java, the main method declares the exceptions it throws:

public static void main(String[] args) throws IOException {
}
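For Java interop, Scala's @throws annotation puts the same "throws" declaration on the generated method; a minimal sketch:

import java.io.IOException

// Scala has no checked exceptions; @throws only adds the declaration to the bytecode
// so that Java callers see "throws IOException"
@throws[IOException]
def main(args: Array[String]): Unit = {
  throw new IOException("This is an IOException")
}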
14、Executing a SQL script from Scala (queries do not display results; all other table and data operations are supported)

object sqlOprate {
  def main(args: Array[String]): Unit = {
    try {
      val conn = getMysqlConnection
      val runner = new ScriptRunner(conn)
      Resources.setCharset(Charset.forName("utf-8"))
      runner.setLogWriter(null)
      // val reader = new FileReader(new File("/sql/test.sql"))
      val reader = Resources.getResourceAsReader("sql/test.sql")
      runner.runScript(reader)
      runner.closeConnection()
      conn.close()
      println("sql script executed successfully")
    } catch {
      case e: Exception =>
        println("sql script execution failed")
        e.printStackTrace()
    }
  }

  def getMysqlConnection: Connection = {
    Class.forName("com.mysql.jdbc.Driver")
    val prop = Resources.getResourceAsProperties("db.properties")
    val url = prop.getProperty("url")
    val user = prop.getProperty("user")
    val password = prop.getProperty("password")
    DriverManager.getConnection(url, user, password)
  }
}

// contents of db.properties
url = jdbc:mysql://localhost:3306/selftest
user = root
password = 123456

// dependencies required in the pom
<dependency>
  <groupId>org.mybatis</groupId>
  <artifactId>mybatis</artifactId>
  <version>3.5.0</version>
</dependency>
<dependency>
  <groupId>org.springframework</groupId>
  <artifactId>spring-jdbc</artifactId>
  <version>3.2.18.RELEASE</version>
</dependency>

// required imports
import java.nio.charset.Charset
import java.sql.{Connection, DriverManager}
import org.apache.ibatis.io.Resources
import org.apache.ibatis.jdbc.ScriptRunner
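ScriptRunner can also be tuned before runScript is called; a small sketch, assuming these setters exist in the MyBatis version declared above (3.5.0), so double-check against the actual API:

import java.io.PrintWriter

// stop at the first failing statement instead of continuing with the rest of the script
runner.setStopOnError(true)
// keep error output visible even though normal log output was disabled with setLogWriter(null)
runner.setErrorLogWriter(new PrintWriter(System.err))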
15、Window operations on a Spark DataFrame

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// frame that covers every row of the partition
val w1 = Window.partitionBy("tenant").orderBy("price").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
// total number of rows in the whole partition
df.withColumn("total_cnt", count("price").over(w1))

// frame that covers everything up to the current row
val w2 = Window.partitionBy("tenant").orderBy("price").rowsBetween(Window.unboundedPreceding, Window.currentRow)
// running count of rows up to and including the current row
df.withColumn("running_cnt", count("price").over(w2))
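Other frame boundaries work the same way; for example, a sliding frame over the previous, current and next row gives a simple moving average. A minimal sketch on the same df:

// moving average of "price" over a 3-row sliding frame within each tenant
val w3 = Window.partitionBy("tenant").orderBy("price").rowsBetween(-1, 1)
df.withColumn("moving_avg", avg("price").over(w3)).show()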
16、Reading MySQL data with Spark and writing it to ES

// required imports (EsSparkSQL comes from the elasticsearch-spark / ES-Hadoop connector)
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("asd").setMaster("local[*]")
  conf.set("es.nodes", "192.168.216.201")
  conf.set("es.port", "9200")
  val spark = SparkSession.builder().config(conf).getOrCreate()

  // read the source table from MySQL
  val frame = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://192.168.216.201:3306/nshop")
    .option("dbtable", "customer")
    .option("user", "root")
    .option("password", "123456")
    .load()

  // write the DataFrame to the customer/test index
  EsSparkSQL.saveToEs(frame, "customer/test")

  // read it back from ES to verify
  spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "192.168.216.201")
    .option("es.port", "9200")
    .load("customer/test")
    .show()

  spark.close()
}