The source database schema lives on SQL Server; it contains several tables with the following kinds of primary keys:
> a single varchar column
> composite: two varchar columns, or one varchar and one int column, or two int columns. There is also one big table with three columns in the PK: one int and two varchar columns. According to the Sqoop documentation:
Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
The first question is: what does 'manually choose a splitting column' expect me to do? How can I sacrifice the PK and use just one of its columns, or am I missing a concept?
The SQL Server table is (just two columns, which form a composite primary key):
ChassiNo    varchar(8)      Unchecked
ECU_Name    nvarchar(15)    Unchecked
I went ahead with the import; the source table has 7909097 records:
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|' --table ChassiECU --num-mappers 8 --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose
Worrying warnings, and incorrect mapper input and record counts:
16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=1168400
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1128
                HDFS: Number of bytes written=209961941
                HDFS: Number of read operations=32
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=16
        Job Counters
                Launched map tasks=8
                Other local map tasks=8
                Total time spent by all maps in occupied slots (ms)=62785
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=62785
                Total vcore-seconds taken by all map tasks=62785
                Total megabyte-seconds taken by all map tasks=128583680
        Map-Reduce Framework
                Map input records=15818167
                Map output records=15818167
                Input split bytes=1128
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=780
                CPU time spent (ms)=45280
                Physical memory (bytes) snapshot=2219433984
                Virtual memory (bytes) snapshot=20014182400
                Total committed heap usage (bytes)=9394716672
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=209961941
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.
Creating the table:
CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8), `ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';
A dreadful result (with no errors). PROBLEM: 15818167 vs 7909097 (SQL Server) records:
> select count(1) from ChassiECU;
Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     14         14        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.12 s
--------------------------------------------------------------------------------
OK
_c0
15818167
Surprisingly, if the composite key included an int column (which was then used for splitting), I would get either an exact count or a mismatch of fewer than 10 records, but even those records worry me!
How should I proceed?
Specify the split column manually. The split column does not have to be part of the PK: you can have a composite PK and still split on some other int column. You can specify any integer column, or even a simple function (something simple such as substring or cast, not an aggregate or analytic function). Ideally, the split column is an evenly distributed integer. For example, if your split column contains 10 rows with the value -1 and 10M rows with values between 10000 and 10000000, and num-mappers=8, then Sqoop will divide the dataset unevenly across the mappers (as sketched after this list):
> the 1st mapper will get the few rows with value -1,
> the 2nd through 7th mappers will get 0 rows,
> the 8th mapper will get nearly all 10M rows, which causes data skew: the 8th mapper will run forever or even fail.
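For intuition, here is a sketch of how Sqoop's integer splitter carves up the range. The queries are paraphrased rather than actual Sqoop output, and split_col and some_table are placeholder names:

-- Sqoop first issues a bounding query on the split column:
SELECT MIN(split_col), MAX(split_col) FROM some_table;
-- With MIN = -1, MAX = 10000000 and --num-mappers 8, the value range is cut
-- into eight equal-width slices (a stride of roughly 1250000), one per mapper:
--   mapper 1: WHERE split_col >= -1      AND split_col <  1250000
--   mapper 2: WHERE split_col >= 1250000 AND split_col <  2500000
--   ...
--   mapper 8: WHERE split_col >= 8750000 AND split_col <= 10000000
-- The slices are equal in width, not in row count: when the values cluster
-- near one end of the range, one mapper ends up with nearly all the rows.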
I also got duplicate records when using a non-integer split column with MS-SQL, so use an integer split column. In your case, for a table with only two varchar columns, you can do one of the following (sketched after this list):
(1) add a surrogate int PK and use it as the split column, or
(2) split the data manually with custom queries using WHERE clauses, running sqoop several times with num-mappers=1, or
(3) apply a deterministic, non-aggregate integer function to a varchar column, for example cast(substr(...) as int), second(timestamp_col), or datepart(second, date), as the split column.
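Below is a rough sketch of what the three options could look like for the ChassiECU table. It is illustrative rather than a tested recipe: the SurrogateId column name, the 'M' boundary for ChassiNo, and the substring positions are assumptions, and an expression-based --split-by is not guaranteed to work on every connector:

# (1) Add a surrogate integer key on the SQL Server side and split on it.
#     Run once in SQL Server (SurrogateId is a hypothetical name):
#     ALTER TABLE ChassiECU ADD SurrogateId INT IDENTITY(1,1);
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass \
    --table ChassiECU --split-by SurrogateId --num-mappers 8 \
    --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016

# (2) Split the data manually: one single-mapper run per WHERE range,
#     each writing to its own target dir. Sqoop requires the literal
#     $CONDITIONS token in every free-form query.
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass \
    --query "SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE ChassiNo < 'M' AND \$CONDITIONS" \
    --num-mappers 1 \
    --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU_part1

# (3) Derive an integer split key from the varchar column with a
#     deterministic expression (the substring positions are made up here).
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass \
    --query "SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE \$CONDITIONS" \
    --split-by "CAST(SUBSTRING(ChassiNo, 3, 6) AS INT)" \
    --num-mappers 8 \
    --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU

With option (2), the WHERE ranges must be disjoint and together cover the whole ChassiNo domain, so that the union of the target directories holds each source row exactly once.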