sqoop数据采集

    技术2022-07-10  98

    什么是sqoop sqoop的全称是SQL-To-Hadoop,它是一个关系型数据库和Hadoop之间数据交换的工具,它从sql导入到hadoop的过程叫做import,从hadoop导出sql的过程叫做export。import和export都是基于mapreduce的 安装Sqoop sqoop的安装非常节点,只需要把它的包解压后,再配置一下环境变量就可以了 如果采集Oracle中的数据,需要把Oracle安装包中lib的jar放入sqoop的lib目录中

    我的虚拟机中oracle的jdbc包ojdbc14.ja放入sqoop的lib中 操作sqoop 我们使用sqoop help命令查看一下sqoop的操作命令有哪些 Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/…/…/hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/…/…/hcatalog does not exist! HCatalog jobs will fail. Please set $HCAT_HOME to the root of your HCatalog installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/…/…/accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/…/…/zookeeper does not exist! Accumulo imports will fail. Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation. 2020-06-30 22:46:08,300 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5 usage: sqoop COMMAND [ARGS] 在开始的提示当中有几个警告,说是hbase不存在,Hbase的导入将来会失败,需要设置一下Hbase的目录 第二个警告是说HCatalog不存在,Hcatalog将来可能失败 其他导入可能会失败而已

    Available commands: codegen Generate code to interact with database records create-hive-table Import a table definition into Hive eval Evaluate a SQL statement and display the results export Export an HDFS directory to a database table help List available commands import Import a table from a database to HDFS import-all-tables Import tables from a database to HDFS job Work with saved jobs list-databases List available databases on a server list-tables List available tables in a database merge Merge results of incremental imports metastore Run a standalone Sqoop metastore version Display version information

    以上的是sqoop中可以使用的命令,藏用的命令有 version Display version information 查看sqoop系统的版本 list-databases List available databases on a server 如果是Oracle:列出的数据库中所有的用户名称 如果是MySQL: 列出的所有的数据库名字 list-tables List available tables in a database 列出所有的额表 import Import a table from a database to HDFS 从数据库中导入一张表到Hdfs中 import-all-tables Import tables from a database to HDFS 从数据库中导入所有表到Hdfs上 codegen Generate code to interact with database records create-hive-table Import a table definition into Hive eval Evaluate a SQL statement and display the results export Export an HDFS directory to a database table help List available commands 在使用这些命令的时候,如果想查看这些命令还有那些参数可以使用help进行查询 比如 sqoop help list-databases Common arguments: –connect Specify JDBC connect string –connection-manager Specify connection manager class name –connection-param-file Specify connection parameters file –driver Manually specify JDBC driver class to use –hadoop-home Override $HADOOP_MAPRED_HOME_ARG –hadoop-mapred-home

    Override $HADOOP_MAPRED_HOME_ARG –help Print usage instructions -P Read password from console –password Set authentication password –password-file Set authentication password file path –relaxed-isolation Use read-uncommitted isolation for imports –skip-dist-cache Skip copying jars to distributed cache –username Set authentication username –verbose Print more information while working 示例(注意:操作Oracle,大写:用户名、表名): sqoop list-databases --connect jdbc:oracle:thin:@192.168.112.130:1521:orcl –username SYSTEM --password password 在执行这条命令的时候出现了这样的报错

    Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../hcatalog does not exist! HCatalog jobs will fail. Please set $HCAT_HOME to the root of your HCatalog installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../zookeeper does not exist! Accumulo imports will fail. Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation. 2020-06-30 23:28:40,880 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5 2020-06-30 23:28:40,927 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 2020-06-30 23:28:41,101 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled. 2020-06-30 23:28:41,131 INFO manager.SqlManager: Using default fetchSize of 1000 2020-06-30 23:28:41,415 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException java.lang.NullPointerException at org.apache.sqoop.manager.OracleManager.listDatabases(OracleManager.java:695) at org.apache.sqoop.tool.ListDatabasesTool.run(ListDatabasesTool.java:49) at org.apache.sqoop.Sqoop.run(Sqoop.java:143) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227) at org.apache.sqoop.Sqoop.main(Sqoop.java:236) [root@bigdata111 lib]# sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException

    在我查询这个ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException java.lang.NullPointerException错误大多数人说是数据的jar没有导入,但是我的导入了,最后去测试数据库原来是没有启动 。 使用lsnrctl status 只运行了一个进程 在我使用超级管理用户,修改系统注册的时候老是报错(ORA-01034: ORACLE not available)不知道为什么 最后在网上看到在用超级用户进入后执行一下startup就可以了,然后我oracle系统就正常启动了 参考的文章就是这个 https://community.oracle.com/message/15081041 然后我们再执行那个命令就正常了 然后我在测试列出oracle下某个用户的表 列出某个用户下的表 sqoop list-tables --connect jdbc:oracle:thin:@192.168.112.130:1521:orcl –username SCOTT --password tiger

    eval 执行一条sql语句并展示它的结果

    sqoop eval --connect jdbc:oracle:thin:@192.168.112.130:1521:orcl \ --username SCOTT --password tiger \ --query "select * from emp where deptno=10"

    codegen 根据表的结构生成java的相关文件文件

    [root@bigdata111 ~]# sqoop codegen --connect jdbc:oracle:thin:@192.168.112.130:1rcl --username SCOTT --password tiger --table EMP521:orcl --username SCOTT --password tiger --table EMP Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../hcatalog does not exist! HCatalog jobs will fail. Please set $HCAT_HOME to the root of your HCatalog installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation. Warning: /root/training/sqoop-1.4.5.bin__hadoop-0.23/bin/../../zookeeper does not exist! Accumulo imports will fail. Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation. 2020-07-03 00:52:29,688 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5 2020-07-03 00:52:29,754 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 2020-07-03 00:52:29,967 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled. 2020-07-03 00:52:30,002 INFO manager.SqlManager: Using default fetchSize of 1000 2020-07-03 00:52:30,002 INFO tool.CodeGenTool: Beginning code generation 2020-07-03 00:52:37,114 INFO manager.OracleManager: Time zone has been set to GMT 2020-07-03 00:52:37,243 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM EMP t WHERE 1=0 2020-07-03 00:52:38,217 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /root/training/hadoop-3.1.2 Note: /tmp/sqoop-root/compile/4b473e773e258d71210d1b7d65e64f72/EMP.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 2020-07-03 00:52:46,994 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/4b473e773e258d71210d1b7d65e64f72/EMP.jar [root@bigdata111 ~]# ls /tmp/sqoop-root/compile/4b473e773e258d71210d1b7d65e64f72/ EMP.class EMP.jar

    我可以从打印的日志上可以看到生成到linux的tmp目录下,并且在执行这条命令的目录上生成了一个java文件

    create-hive-table 根据关系型数据库表结构创建hive的表 sqoop create-hive-table --connect jdbc:oracle:thin:@192.168.112.130:1521:orcl –username SCOTT --password tiger --table EMP --hive-table emphive 我们开启hive命令行模式,查看一下刚才的表 hive show tables; describe emphive; import Import a table from a database to HDFS 导入一张表到hdfs中

    sqoop import --connect jdbc:oracle:thin:@192.168.112.130:1521:orcl \ --username SCOTT --password tiger --table EMP --target-dir /sqoopemp

    使用这张表我们已经导入到hdfs中 import-all-tables Import tables from a database to HDFS 采集某个用户下的所有表到hdfs中 sqoop import-all-tables --connect jdbc:oracle:thin:@192.168.112.130:1521:orcl --username SCOTT --password tiger -m 1 这里-m 1 表示使用表中的第一列作为主键,因为oracle中有一张表中没有主键 导入到了hdfs的user/root的目录下 export Export an HDFS directory to a database table 从hdfs导出数据到关系数据库中,但是值得注意的是在导入之前,这样表必须提前存在于关系型数据库中 sqoop export --connect jdbc:oracle:thin:@192.168.112.130:1521:orcl --username SCOTT --password tiger --table STUDENTS --export-dir /students

    Processed: 0.052, SQL: 9