
Technology, 2024-06-18

SAS/ACCESS Interface to Hadoop

SAS/ACCESS Interface to Hadoop provides the ability to access data sets stored in Hadoop natively from SAS. With SAS/ACCESS Interface to Hadoop:

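- The LIBNAME statement can be used to make Hive tables look like SAS data sets, on which SAS procedures and SAS DATA steps can be run.
- The PROC SQL procedure provides the ability to run direct Hive SQL commands on a Hortonworks Data Platform (HDP) cluster.
- PROC HADOOP provides the ability to submit MapReduce, Apache Pig, and HDFS commands directly from the SAS execution environment to the HDP cluster.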

Objectives

The key objectives of validating and testing BASE SAS and SAS/ACCESS Interface to Hadoop with an HDP cluster running on IBM POWER8 processor-based servers are:

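- Configure SAS/ACCESS Interface to Hadoop to work with HDP running on Red Hat Enterprise Linux (RHEL) on IBM® Power Systems™.
- Verify that BASE SAS can connect to HDP through SAS/ACCESS Interface to Hadoop.
- Demonstrate that BASE SAS can access and analyze data stored in HDP using the interfaces described above.
- Demonstrate that traditional SAS applications running on IBM AIX® on Power can take advantage of the distributed processing architecture and scalability of HDP running on IBM Power® servers.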

Test environment

The high-level components of the test environment include:

BASE SAS and SAS/ACCESS Interface to Hadoop

- BASE SAS 9.4 (TS1M4)
- SAS/ACCESS Interface to Hadoop 9.4 (TS1M4)
- AIX 7.2 (7200-00-01-1543)
- Minimum resources: two virtual processors, 4 GB memory, 35 GB disk space
- IBM PowerVM®
- IBM POWER8® processor-based server

Hortonworks Data Platform

- Hortonworks Data Platform (HDP) version 2.5 (pre-GA version)
- RHEL 7.2
- Minimum resources: 16 virtual processors, 48 GB memory, 50 GB disk space
- IBM PowerKVM™
- IBM POWER8 processor-based server

Deployment architecture

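Figure 1 describes the deployment and high-level architecture used to validate SAS software with HDP running on Power. The SAS software, consisting of BASE SAS and SAS/ACCESS Interface to Hadoop, was installed and configured on a virtual machine [or logical partition (LPAR) in Power Systems terminology] running the IBM AIX 7.2 OS on an IBM POWER8 processor-based server. A single-node HDP cluster was installed and configured on a virtual machine running RHEL 7.2 on a second IBM POWER8 processor-based server.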

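SAS/ACCESS Interface to Hadoop is configured to interact with HDP and access data on HDP. Note that the installation and configuration of BASE SAS and SAS/ACCESS Interface to Hadoop does not depend on the number of nodes in the HDP cluster.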

Figure 1. Architecture of BASE SAS and SAS/ACCESS Interface to Hadoop with HDP running on IBM Power servers

Installation and configuration

This section describes the installation and configuration of the HDP cluster and the SAS software (BASE SAS and SAS/ACCESS Interface to Hadoop).

Installing the HDP cluster

High-level steps to install and configure the HDP cluster:

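1. Follow the installation guide for HDP on Power Systems (listed under Resources) to install and configure the HDP cluster.
2. Log in to the Ambari server and make sure that all services are running.
3. Monitor and manage the HDP cluster, Hadoop, and related services through Ambari.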

Setting up test data and Hive tables

Download the MovieLens and driver test data, copy the data to HDFS, and create Hive tables.

1. Download the MovieLens data set (see the citation under Resources).
2. Follow the referenced instructions to copy the MovieLens data to HDFS and set up Hive external tables. Use the hive user ID for this.
3. Download the driver data files.
4. Copy the driver data to HDFS.

    # su - hive
    # hadoop fs -mkdir -p /user/hive/dataset/drivers
    # hadoop fs -copyFromLocal /home/np/u0014213/Data/truck_event_text_partition.csv /user/hive/dataset/drivers
    # hadoop fs -copyFromLocal /home/np/u0014213/Data/drivers.csv /user/hive/dataset/drivers
    # hadoop fs -ls /user/hive/dataset/drivers
    Found 2 items
    -rw-r--r--   3 hive hdfs       2043 2017-05-21 06:30 /user/hive/dataset/drivers/drivers.csv
    -rw-r--r--   3 hive hdfs    2272077 2017-05-21 06:30 /user/hive/dataset/drivers/truck_event_text_partition.csv

5. Create Hive tables for the driver data.

    # su - hive
    # hive
    hive> create database trucks;
    hive> use trucks;
    hive> create table drivers
            (driverId int, name string, ssn bigint, location string,
             certified string, wageplan string)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          STORED AS TEXTFILE
          TBLPROPERTIES("skip.header.line.count"="1");
    hive> create table truck_events
            (driverId int, truckId int, eventTime string, eventType string,
             longitude double, latitude double, eventKey string,
             correlationId bigint, driverName string, routeId int, routeName string)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          STORED AS TEXTFILE
          TBLPROPERTIES("skip.header.line.count"="1");
    hive> show tables;
    OK
    drivers
    truck_events

6. Load the data from the files in HDFS into the tables.

    hive> LOAD DATA INPATH '/user/hive/dataset/drivers/truck_event_text_partition.csv' overwrite into table truck_events;
    hive> LOAD DATA INPATH '/user/hive/dataset/drivers/drivers.csv' overwrite into table drivers;

7. Cross-check the tables by running queries on them to make sure that the data is present.
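Before comparing against a Hive row count, it can help to know how many data rows the local CSV file should yield. The following Python sketch is not part of the original procedure. It assumes the same conventions as the table definitions above (comma-delimited fields, one header line skipped), and the function name and sample rows are hypothetical.

```python
def check_delimited_rows(text, expected_columns, header_lines=1, delimiter=","):
    """Return (data_row_count, bad_line_numbers) for delimited text.

    Mirrors the table definition: the first header_lines lines are skipped
    and every remaining line is split on the plain delimiter (no quoting).
    Line numbers in the result are 1-based positions in the file.
    """
    lines = text.splitlines()[header_lines:]
    bad = [
        index + header_lines + 1
        for index, line in enumerate(lines)
        if len(line.split(delimiter)) != expected_columns
    ]
    return len(lines), bad


# Hypothetical sample shaped like drivers.csv (6 columns, 1 header line).
SAMPLE = (
    "driverId,name,ssn,location,certified,wageplan\n"
    "1,Driver One,100000001,Town A,Y,hours\n"
    "2,Driver Two,100000002,Town B\n"
)

if __name__ == "__main__":
    rows, bad_lines = check_delimited_rows(SAMPLE, expected_columns=6)
    print("data rows:", rows, "lines with unexpected column count:", bad_lines)
```

The data row count is the number you would expect from a count(*) query on the loaded table, and any flagged lines are candidates for NULL or shifted columns.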

Installing BASE SAS and SAS/ACCESS Interface to Hadoop

Perform the following steps to install BASE SAS and SAS/ACCESS Interface to Hadoop on AIX 7.2:

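1. Obtain the software and dependencies from SAS. The software order contains all the details on how to download the software, along with the license file.
2. Install, configure, and start a VNC (Virtual Network Computing) server on the AIX LPAR, and then connect to the VNC session.
3. Download the SAS Download Manager for AIX.
4. Launch the SAS Download Manager and provide the order number and installation key received from SAS. The download manager downloads all the software tagged with the order number.
5. Install the software:
   - Launch the SAS Deployment Wizard by running setup.sh from the software download directory.
   - Select SASHome. This is the location where the SAS software is installed.
   - Select the necessary options wherever required, and complete the installation.
   - Note that $SASHOME/SASFoundation/9.4/sas is the executable used to run SAS code.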

Figure 2. Installing BASE SAS and SAS/ACCESS Interface to Hadoop using the SAS Deployment Wizard

Figure 3. SASHome after the software installation is complete

Configuring BASE SAS and SAS/ACCESS Interface to Hadoop

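After installing BASE SAS and SAS/ACCESS Interface to Hadoop, follow the SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS for instructions on how to configure SAS/ACCESS Interface to Hadoop to connect to the HDP cluster.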

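1. Verify the Hadoop environment as described in Chapter 1 of the guide.
2. Follow Chapter 4 of the guide, Configuring SAS/ACCESS for Hadoop, for detailed instructions.
3. To use SAS/ACCESS with a Hadoop server, the SAS client system must have a set of Hadoop JAR and configuration files. For details, see the section on making Hadoop JAR and configuration files available to the SAS client machine.
4. Launch the SAS Deployment Manager by running the following command:

    $SASHOME/SASDeploymentManager/9.4/sasdm.sh

5. In the wizard, select AMBARI as the cluster manager and provide the Ambari server and access details.
6. Provide the Hadoop cluster details, along with access information in the form of a Secure Shell (SSH) private key for accessing the cluster nodes.
7. Specify the locations on the SAS client system for the Hadoop configuration files (for example, /hadoop/conf) and the Hadoop JAR files (for example, /hadoop/lib).
8. Select the Add environment variables option, which updates the SAS configuration file sasv9.cfg with the path names of the Hadoop configuration and JAR files. The sasv9.cfg file is updated with the SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH environment variables.
9. Provide the Hive service details along with the Hive user name and password.
10. Follow the wizard to complete the configuration.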

Figure 4. SAS Deployment Manager, used to configure SAS/ACCESS Interface to Hadoop

Figure 5. Hadoop configuration and JAR file path snippet after the configuration is complete

Figure 6. SAS configuration file snippet with the Hadoop configuration and JAR file paths
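Because the configuration step depends on sasv9.cfg containing both SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH, a quick check of the file can catch a missed option before the first connection attempt. The Python sketch below is an illustration only. It assumes the variables appear on lines of the form -SET NAME "value"; that line shape, the helper names, and the sample text are assumptions, so compare them with your own file before relying on the result.

```python
import re

REQUIRED_VARS = ("SAS_HADOOP_JAR_PATH", "SAS_HADOOP_CONFIG_PATH")

# Assumed line shape: -SET NAME "value" (quotes optional). Adjust if your file differs.
SET_LINE = re.compile(
    r'^[ \t]*-SET[ \t]+(\w+)[ \t]+"?([^"\n]*?)"?[ \t]*$',
    re.IGNORECASE | re.MULTILINE,
)


def hadoop_paths(cfg_text):
    """Return {variable: path} for the required Hadoop variables found in cfg_text."""
    found = {name.upper(): value for name, value in SET_LINE.findall(cfg_text)}
    return {var: found[var] for var in REQUIRED_VARS if var in found}


def missing_vars(cfg_text):
    """Return the required variables that are not set in cfg_text."""
    present = hadoop_paths(cfg_text)
    return [var for var in REQUIRED_VARS if var not in present]


# Hypothetical configuration text using the example paths from the steps above.
SAMPLE_CFG = (
    '-SET SAS_HADOOP_JAR_PATH "/hadoop/lib"\n'
    '-SET SAS_HADOOP_CONFIG_PATH "/hadoop/conf"\n'
)

if __name__ == "__main__":
    print(hadoop_paths(SAMPLE_CFG))
    print("missing:", missing_vars(SAMPLE_CFG))
```

To check a real installation, read the text of your sasv9.cfg file and pass it to missing_vars; an empty list means both variables were found.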

Accessing and analyzing data stored on HDP using BASE SAS

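After completing the installation and configuration instructions above, use BASE SAS to access and analyze data stored on HDP. Because this article covers the validation of SAS/ACCESS Interface to Hadoop with HDP, we tried several of the important interfaces to access data and performed basic analysis. You can explore the examples described in the following sections.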

Accessing data on HDFS

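Use the SAS PROC HADOOP procedure and its HDFS statement to access data stored in HDFS from BASE SAS. The following sample SAS code shows how to connect to Hadoop, write data to HDFS, and access data stored on HDFS. You can run the SAS code using the SAS command-line interface (CLI) on the SAS client system.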

    # cat proc-hadoop-test.sas
    proc hadoop username='hdfs' password='xxxxxxxx' verbose;
       hdfs mkdir='/user/hdfs/sasdata';

       hdfs delete='/user/hdfs/test1';

       hdfs copytolocal='/user/hdfs/test/fromHDP.txt'
            out='/home/np/SASCode/fromHDP.txt';

       hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt'
            out='/user/hdfs/sasdata/fromSAS.txt';
    run;

    # /SASHome/SASFoundation/9.4/sas proc-hadoop-test.sas

After running the code, check the log file for the execution status.

    # cat proc-hadoop-test.log
    1    The SAS System    00:26 Saturday, April 22, 2017

    NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
    NOTE: SAS (r) Proprietary Software 9.4 (TS1M4)
          Licensed to IBM CORP - FOR DEMO PURPOSES ONLY, Site 70219019.
    NOTE: This session is executing on the AIX 7.2 (AIX 64) platform.

    NOTE: Additional host information:
     IBM AIX AIX 64 2 7 00F9C48F4C00

    You are running SAS 9. Some SAS 8 files will be automatically converted
    by the V9 engine; others are incompatible. Please see
    http://support.sas.com/rnd/migration/planning/platform/64bit.html

    PROC MIGRATE will preserve current SAS file attributes and is
    recommended for converting all your SAS libraries from any
    SAS 8 release to SAS 9. For details and examples, please see
    http://support.sas.com/rnd/migration/index.html

    This message is contained in the SAS news file, and is presented upon
    initialization. Edit the file "news" in the "misc/base" directory to
    display site-specific news and information in the program log.
    The command line option "-nonews" will prevent this display.

    NOTE: SAS initialization used:
          real time           0.03 seconds
          cpu time            0.00 seconds

    1    proc hadoop username='hdfs' password=XXXXXXXXXX verbose;
    2       hdfs mkdir='/user/hdfs/sasdata';
    3
    4       hdfs delete='/user/hdfs/test1';
    5
    6       hdfs copytolocal='/user/hdfs/test/fromHDP.txt'
    7            out='/home/np/SASCode/fromHDP.txt';
    8
    9       hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt'
    10           out='/user/hdfs/sasdata/fromSAS.txt';
    11   run;

    NOTE: PROCEDURE HADOOP used (Total process time):
          real time           4.49 seconds
          cpu time            0.05 seconds

    NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
    NOTE: The SAS System used:
          real time           4.53 seconds
          cpu time            0.05 seconds

Check the contents of the local directory and HDFS to confirm that the SAS code ran successfully.

    # hadoop fs -ls /user/hdfs/sasdata/
    Found 1 items
    -rw-r--r--   3 hdfs hdfs         30 2017-04-22 01:28 /user/hdfs/sasdata/fromSAS.txt

    [hdfs@hdpnode1 ~]$ hadoop fs -cat /user/hdfs/sasdata/fromSAS.txt
    Hello HDP user! Good morning!

    # ls -ltr
    total 136
    -rw-r--r--    1 root     system          333 Apr 22 00:25 proc-hadoop-test.sas
    -rw-r--r--    1 root     system           23 Apr 22 00:26 fromHDP.txt
    -rw-r--r--    1 root     system         2042 Apr 22 00:26 proc-hadoop-test.log

    l48fvp038_pub[/home/np/SASCode] > cat fromHDP.txt
    Hello SAS user! Howdy?

Running MapReduce programs on HDP from BASE SAS

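You can run MapReduce programs against data stored on HDP from BASE SAS. MapReduce programs are submitted from BASE SAS using the PROC HADOOP procedure. For example, if you have a large data set that needs some preprocessing before the actual analysis with SAS, you can write a MapReduce program to perform the preprocessing on the HDP cluster.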

The following sample SAS code copies a text file to HDFS and runs the WordCount MapReduce program.

    $ cat word-count-mr-ex.sas
    proc hadoop username='hdfs' password='xxxxxxx' verbose;

       /* Copy the text file from the client machine to HDFS */
       hdfs copyfromlocal='/home/np/SASCode/wordcount-input.txt'
            out='/user/hdfs/sasdata/wordcount-input.txt';

       /* MAPREDUCE statement to run the WordCount program */
       mapreduce input='/user/hdfs/sasdata/wordcount-input.txt'
            output='/user/hdfs/sasdata/wcout'
            jar='/hadoop/lib/hadoop-mapreduce-examples-2.7.3.2.5.0.0-1245.jar'
            outputkey='org.apache.hadoop.io.Text'
            outputvalue='org.apache.hadoop.io.IntWritable'
            reduce='org.apache.hadoop.examples.WordCount$IntSumReducer'
            combine='org.apache.hadoop.examples.WordCount$IntSumReducer'
            map='org.apache.hadoop.examples.WordCount$TokenizerMapper';
    run;

    # /SASHome/SASFoundation/9.4/sas word-count-mr-ex.sas

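Run the sample code and check the SAS log file to make sure that the code ran successfully. Also, check the contents of the HDFS directory to verify the contents of the input file and the output of the WordCount MapReduce program.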

    # hadoop fs -ls /user/hdfs/sasdata/
    drwxr-xr-x   - hdfs hdfs          0 2017-04-22 01:47 /user/hdfs/sasdata/wcout
    -rw-r--r--   3 hdfs hdfs        268 2017-04-22 01:46 /user/hdfs/sasdata/wordcount-input.txt

    # hadoop fs -cat /user/hdfs/sasdata/wordcount-input.txt
    This PROC HADOOP example submits a MapReduce program to a Hadoop server. The example uses the Hadoop MapReduce application WordCount that reads a text input file, breaks each line into words, counts the words, and then writes the word counts to the output text file.

    # hadoop fs -ls /user/hdfs/sasdata/wcout
    Found 2 items
    -rw-r--r--   3 hdfs hdfs          0 2017-04-22 01:47 /user/hdfs/sasdata/wcout/_SUCCESS
    -rw-r--r--   3 hdfs hdfs        270 2017-04-22 01:47 /user/hdfs/sasdata/wcout/part-r-00000

    # hadoop fs -cat /user/hdfs/sasdata/wcout/part-r-00000
    HADOOP       1
    Hadoop       2
    MapReduce    2
    PROC         1
    The          1
    This         1
    WordCount    1
    a            3
    and          1
    application  1
    breaks       1
    counts       2
    each         1
    example      2
    file,        1
    file.        1
    input        1
    into         1
    line         1
    output       1
    program      1
    reads        1
    server.      1
    submits      1
    text         2
    that         1
    the          4
    then         1
    to           2
    uses         1
    word         1
    words,       2
    writes       1
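The counts in part-r-00000 can be cross-checked locally without the cluster. The Python sketch below is an illustration, not the Hadoop WordCount implementation. It assumes that tokens are split on whitespace only, with case and punctuation preserved, which is consistent with entries such as "file," and "file." in the output above. The function names are hypothetical.

```python
from collections import Counter

# Input text copied from the wordcount-input.txt listing above.
SAMPLE_TEXT = (
    "This PROC HADOOP example submits a MapReduce program to a Hadoop server. "
    "The example uses the Hadoop MapReduce application WordCount that reads a "
    "text input file, breaks each line into words, counts the words, and then "
    "writes the word counts to the output text file."
)


def word_count(text):
    """Count whitespace-separated tokens, keeping case and punctuation."""
    return dict(Counter(text.split()))


def format_like_part_file(counts):
    """Render counts as 'token<TAB>count' lines sorted by token."""
    return "\n".join(f"{word}\t{count}" for word, count in sorted(counts.items()))


if __name__ == "__main__":
    print(format_like_part_file(word_count(SAMPLE_TEXT)))
```

Comparing this output with the HDFS listing is a quick way to confirm that the job processed the intended input file.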

Figure 7. WordCount MapReduce job history from the HDP Job History server

Accessing Hive tables as SAS data sets

Use the LIBNAME statement to access Hive tables as SAS data sets, and use the PROC SQL procedure to run Hive queries against the HDP cluster through JDBC.

Refer to the following sample SAS code, which uses LIBNAME and PROC SQL to access Hive tables and to run queries and analysis from BASE SAS.

    $ cat sas-hdp-hive-access.sas
    /* LIBNAME statement to connect to HiveServer2 on the HDP cluster */
    libname myhdp hadoop server='hdpnode1.dal-ebis.ihost.com' schema='default' user=hive;

    /* List the tables */
    proc datasets lib=myhdp details;
    run;

    /* Get the schema for all tables */
    proc contents data=myhdp._all_;
    run;

    /* Run queries and analysis using the PROC SQL procedure */
    proc sql;
       select count(*) from MYHDP.TRUCK_EVENTS;
    quit;

    /* Run the PROC FREQ statistical procedure against HDFS data
       that is not available as a SAS data set */
    proc freq data=MYHDP.TRUCK_EVENTS;
       tables eventtype;
    run;

Check the SAS log file for the table listing. Refer to the following log file excerpt, which lists the tables in the default Hive schema.

    3    proc datasets lib=myhdp details;

    Libref         MYHDP
    Engine         HADOOP
    Physical Name  jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default
    Schema/Owner   default

    #  Name          Type  or Indexes  Vars  Label
    1  DRIVERS       DATA  .              6
    2  EXPORT_TABLE  DATA  .              2
    3  EXTENSION     DATA  .              3
    4  HOSTS         DATA  .             16
    5  HOSTS1        DATA  .             16
    6  HOSTS2        DATA  .             16
    7  HOSTS3        DATA  .             16
    8  TEST2         DATA  .              2
    9  TRUCK_EVENTS  DATA  .             11

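Check the listing file (with the .lst extension) for the output of the procedures. For example, the following excerpt from the listing file shows all the table names and the details of the truck_events table.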

    The SAS System    02:19 Saturday, April 22, 2017    1

    The CONTENTS Procedure

    Libref         MYHDP
    Engine         HADOOP
    Physical Name  jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default
    Schema/Owner   default

    #  Name          Type
    1  DRIVERS       DATA
    2  EXPORT_TABLE  DATA
    3  EXTENSION     DATA
    4  HOSTS         DATA
    5  HOSTS1        DATA
    6  HOSTS2        DATA
    7  HOSTS3        DATA
    8  TEST2         DATA
    9  TRUCK_EVENTS  DATA

    The CONTENTS Procedure

    Data Set Name        MYHDP.TRUCK_EVENTS    Observations           .
    Member Type          DATA                  Variables              11
    Engine               HADOOP                Indexes                0
    Created              .                     Observation Length     0
    Last Modified        .                     Deleted Observations   0
    Protection                                 Compressed             NO
    Data Set Type                              Sorted                 NO
    Label
    Data Representation  Default
    Encoding             Default

    Alphabetic List of Variables and Attributes

     #  Variable       Type  Len    Format   Informat  Label
     8  correlationid  Num   8      20.      20.       correlationid
     1  driverid       Num   8      11.      11.       driverid
     9  drivername     Char  32767  $32767.  $32767.   drivername
     7  eventkey       Char  32767  $32767.  $32767.   eventkey
     3  eventtime      Char  32767  $32767.  $32767.   eventtime
     4  eventtype      Char  32767  $32767.  $32767.   eventtype
     6  latitude       Num   8                         latitude
     5  longitude      Num   8                         longitude
    10  routeid        Num   8      11.      11.       routeid
    11  routename      Char  32767  $32767.  $32767.   routename
     2  truckid        Num   8      11.      11.       truckid

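The listing file also shows the output of the statistical procedure PROC FREQ on the eventtype column of the truck_events table. Refer to the following excerpt from the listing file.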

    The FREQ Procedure

    eventtype

                                                      Cumulative  Cumulative
    eventtype                   Frequency   Percent    Frequency     Percent
    --------------------------------------------------------------------------
    Lane Departure                     11      0.06           11        0.06
    Normal                          17041     99.80        17052       99.87
    Overspeed                           9      0.05        17061       99.92
    Unsafe following distance           7      0.04        17068       99.96
    Unsafe tail distance                7      0.04        17075      100.00
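The percent and cumulative columns in the listing follow directly from the frequencies. The Python sketch below recomputes them from the counts shown above, as a way to read the table; it is not how PROC FREQ is implemented, and the function name is hypothetical.

```python
# Frequencies copied from the PROC FREQ listing above.
EVENT_COUNTS = [
    ("Lane Departure", 11),
    ("Normal", 17041),
    ("Overspeed", 9),
    ("Unsafe following distance", 7),
    ("Unsafe tail distance", 7),
]


def freq_table(counts):
    """Return rows of (value, frequency, percent, cumulative frequency, cumulative percent)."""
    total = sum(n for _, n in counts)
    rows = []
    running = 0
    for value, n in counts:
        running += n
        rows.append(
            (value, n, round(100 * n / total, 2), running, round(100 * running / total, 2))
        )
    return rows


if __name__ == "__main__":
    for value, n, pct, cum_n, cum_pct in freq_table(EVENT_COUNTS):
        print(f"{value:<28}{n:>8}{pct:>9.2f}{cum_n:>10}{cum_pct:>10.2f}")
```

The final cumulative frequency, 17075, is the total number of rows that PROC FREQ read from the truck_events table.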

Accessing Hive tables using explicit pass-through

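Use the PROC SQL procedure together with the CONNECT TO HADOOP and EXECUTE statements to run Hive queries in explicit pass-through mode. This bypasses JDBC and is therefore faster. This interface is helpful for streaming reads from Hive/HDFS.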

The following sample code demonstrates this interface.

    # cat sas-hdp-hive-access-explicit-passthru.sas
    options dbidirectexec;
    options nodbidirectexec;

    proc sql;
       /* Connect to Hadoop/HiveServer2 */
       connect to hadoop (server='xxxxx.xxxxxxxx.com' user=hive subprotocol=hive2);

       /* Execute a HiveQL query to create an external Hive table */
       execute (
          CREATE EXTERNAL TABLE movie_ratings (
             userid INT,
             movieid INT,
             rating INT,
             tstamp STRING
          )
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '#'
          STORED AS TEXTFILE
          LOCATION '/user/hive/sasdata/movie_ratings') by hadoop;

       disconnect from hadoop;
    quit;

    proc hadoop username='hive' password='xxxxxx' verbose;
       /* Copy data from the client to the HDFS location of the movie_ratings table */
       hdfs copyfromlocal='/home/np/SASCode/ratings.txt'
            out='/user/hive/sasdata/movie_ratings';
    run;

Log in to Hive on the HDP cluster and verify that the movie_ratings table was created and contains data.

    # su - hive
    # hive
    hive> show tables;
    OK
    movie_ratings
    hive> desc movie_ratings;
    OK
    userid      int
    movieid     int
    rating      int
    tstamp      string
    Time taken: 0.465 seconds, Fetched: 4 row(s)
    hive> select count(*) from movie_ratings;
    Query ID = hive_20170422042813_e9e49803-144a-48e9-b0f6-f5cd8595d254
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1492505822201_0034)
    Map 1: -/-      Reducer 2: 0/1
    Map 1: 0/2      Reducer 2: 0/1
    Map 1: 0/2      Reducer 2: 0/1
    Map 1: 0(+1)/2  Reducer 2: 0/1
    Map 1: 0(+2)/2  Reducer 2: 0/1
    Map 1: 0(+2)/2  Reducer 2: 0/1
    Map 1: 2/2      Reducer 2: 0(+1)/1
    Map 1: 2/2      Reducer 2: 1/1
    OK
    1000209
    Time taken: 14.931 seconds, Fetched: 1 row(s)

Translated from: https://www.ibm.com/developerworks/aix/library/l-sas-hdp/index.html

