
Mapr + Streamsets: What kinds of data could make streamsets' TSV parse fail?

Question asked by reedv on Feb 9, 2018
Latest reply on Feb 27, 2018 by maprcommunity

I am using StreamSets to move TSV data (with a header line) between MapR-FS locations, in cluster batch mode with Hadoop impersonation, in case any of that is relevant. Some records run through successfully, but after the pipeline has run for a bit it fails, tries to restart, and keeps failing at the same point.

** If it helps, note that I am getting the flat files in question by using sqoop to pull from a database (resulting in parquet files by default) and then using drill (with the line "set store.format='tsv'") to convert the parquet files into TSV.

Looking at the logs, I see the error:

....
Diagnostics : Task failed task_1517001433399_0014_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
   
2018-01-26 15:52:52,404     ingest2sa2tenant_demodata_batch_002/ingest2sa2tenantdemodatabatch002e19a9d7a-14a2-4c5e-b4a6-965dd34c43ee     ERROR     Error in Slave Runner:     ClusterRunner     *admin          runner-pool-2-thread-23

java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:530)
at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:540)
....

Looking at the error message, it seems that the parser is having trouble with line 2 of one of the TSV files being processed (there are 4 separate files in the origin location). With some information redacted, the line-2 records from each of the TSV files are shown below, first as raw lines and then as column-by-column tables:

XXXXXX Puncture wound with foreign body of left thumb without damage to nail null INJURIES TO THE WRIST, HAND AND FINGERS null XXXXXX Y Y null null null null null null null null null null 2 null null null null null null N N null null null null null null null S61.042 2 null S61.042 XXXX null null null 2017-10-06 00:02:49 XXX 2018-02-05 16:08:59
XXXXXXXXXXX Complex care coordination null null null null N N null null null null null null null null null null 1 null null null null null null null null null null null null null null 2 null null V65.49 Z71.89 XXXX null null null 2017-10-06 08:32:47 null 2018-02-05 16:08:59
XXXXXX Unspecified occupant of three-wheeled motor vehicle injured in collision with heavy transport vehicle or bus in traffic accident, initial encounter null OCCUPANT OF THREE-WHEELED MOTOR VEHICLE INJURED IN TRANSPORT ACCIDENT null XXXXXX Y Y null Motor vehicle collision victim null null null null null null null null 2 null null null null null null N N null null null null null null null V34.9XXA 2 null V34.9XXA XXXXXX XXXXXXX null null 2017-10-05 23:48:49 300 2018-02-05 16:08:59
XXXXXXX Wedge compression fracture of unspecified lumbar vertebra, subsequent encounter for fracture with nonunion null null null null N null null Compression fracture of lumbar vertebra null null null null null null null null 1 null null null null null null null null null null null null null null 2 null null 733.82 S32.000K XXXXXX null null null 2017-10-06 06:27:38 null 2018-02-05 16:08:59

    | column_0  |                                column_1                                | column_2  |                 column_3                 | column_4  | column_5  | column_6  | column_7  | column_8  | column_9  | column_10  | column_11  | column_12  | column_13  | column_14  | column_15  | column_16  | column_17  | column_18  | column_19  | column_20  | column_21  | column_22  | column_23  | column_24  | column_25  | column_26  | column_27  | column_28  | column_29  | column_30  | column_31  | column_32  | column_33  | column_34  | column_35  | column_36  | column_37  | column_38  | column_39  | column_40  | column_41  |      column_42       | column_43  |      column_44       |
    +-----------+------------------------------------------------------------------------+-----------+------------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------------------+------------+----------------------+
    | XXXXXX    | Puncture wound with foreign body of left thumb without damage to nail  | null      | INJURIES TO THE WRIST, HAND AND FINGERS  | null      | XXXXXX    | Y         | Y         | null      | null      | null       | null       | null       | null       | null       | null       | null       | null       | 2          | null       | null       | null       | null       | null       | null       | N          | N          | null       | null       | null       | null       | null       | null       | null       | S61.042    | 2          | null       | S61.042    | XXXXXXXX   | null       | null       | null       | 2017-10-06 00:02:49  | XXX        | 2018-02-05 16:08:59  |
    +-----------+------------------------------------------------------------------------+-----------+------------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------------------+------------+----------------------+
   
    |   column_0   |          column_1          | column_2  | column_3  | column_4  | column_5  | column_6  | column_7  | column_8  | column_9  | column_10  | column_11  | column_12  | column_13  | column_14  | column_15  | column_16  | column_17  | column_18  | column_19  | column_20  | column_21  | column_22  | column_23  | column_24  | column_25  | column_26  | column_27  | column_28  | column_29  | column_30  | column_31  | column_32  | column_33  | column_34  | column_35  | column_36  | column_37  | column_38  | column_39  | column_40  | column_41  |      column_42       | column_43  |      column_44       |
    +--------------+----------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------------------+------------+----------------------+
    | XXXXXXXXXXX  | Complex care coordination  | null      | null      | null      | null      | N         | N         | null      | null      | null       | null       | null       | null       | null       | null       | null       | null       | 1          | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | 2          | null       | null       | V65.49     | Z71.89     | XXXXXXXX   | null       | null       | null       | 2017-10-06 08:32:47  | null       | 2018-02-05 16:08:59  |
    +--------------+----------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------------------+------------+----------------------+
   
    | column_0 | column_1                                                                                                                                                     | column_2  | column_3                                                              | column_4  | column_5  | column_6  | column_7  | column_8  | column_9                          | column_10 | column_11 | column_12 | column_13 | column_14 | column_15 | column_16 | column_17 | column_18 | column_19 | column_20 | column_21 | column_22 | column_23 | column_24 | column_25 | column_26 | column_27 | column_28 | column_29 | column_30 | column_31 | column_32 | column_33 | column_34 | column_35 | column_36 | column_37 | column_38 | column_39 | column_40 | column_41 | column_42           | column_43 | column_44 |
    +----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
    | XXXXXXXX | Unspecified occupant of three-wheeled motor vehicle injured in collision with heavy transport vehicle or bus in traffic accident, initial encounter          | null      | OCCUPANT OF THREE-WHEELED MOTOR VEHICLE INJURED IN TRANSPORT ACCIDENT | null      | XXXXXX    | Y         | Y         | null      | Motor vehicle collision victim    | null      | null      | null      | null      | null      | null      | null      | null      | 2         | null      | null      | null      | null      | null      | null      | N         | N         | null      | null      | null      | null      | null      | null      | null      | V34.9XXA  | 2         | null      | V34.9XXA  | XXXXXX    | XXXXXXX   | null      | null      | 2017-10-05 23:48:49 | XXX       | 2018-02-05 16:08:59 |
    +----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
   
    | column_0  |                                                  column_1                                                   | column_2  | column_3  | column_4  | column_5  | column_6  | column_7  | column_8  |                 column_9                 | column_10  | column_11  | column_12  | column_13  | column_14  | column_15  | column_16  | column_17  | column_18  | column_19  | column_20  | column_21  | column_22  | column_23  | column_24  | column_25  | column_26  | column_27  | column_28  | column_29  | column_30  | column_31  | column_32  | column_33  | column_34  | column_35  | column_36  | column_37  | column_38  | column_39  | column_40  | column_41  |      column_42       | column_43  |      column_44       |
    +-----------+-------------------------------------------------------------------------------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------------------------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------------------+------------+----------------------+
    | XXXXXXX   | Wedge compression fracture of unspecified lumbar vertebra, subsequent encounter for fracture with nonunion  | null      | null      | null      | null      | N         | null      | null      | Compression fracture of lumbar vertebra  | null       | null       | null       | null       | null       | null       | null       | null       | 1          | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | null       | 2          | null       | null       | 733.82     | S32.000K   | XXXXXX     | null       | null       | null       | 2017-10-06 06:27:38  | null       | 2018-02-05 16:08:59  |
    +-----------+-------------------------------------------------------------------------------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------------------------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+----------------------+------------+----------------------+

 

The data format configuration for the origin stage I am using is:

data format: delimited
delimiter format type: tab separated values
header line: with header line
allow extra columns: false
max record length (chars): 100,000
root field type: list-map
lines to skip: 0
parse nulls: false
charset: utf-8
ignore control characters: false

Checking the length of the longest line across all of the origin files with

[mapr@mapr002]$ awk '{ if ( length > L ) { L=length} }END{ print L}' ./*
652

we see that the longest record is 652 characters, well below the max record length param.
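
The one-liner above reports a single maximum across all files. As a minimal sketch (the helper name longest_line is made up for illustration), the same idea can report the longest line per file together with its line number, which makes it easier to pull up that record. Note that awk's length counts characters or bytes depending on the awk implementation and locale, so this only approximates the parser's character count.

```shell
# Hypothetical helper: print "<file> <line number> <length>" for the
# longest line of each file given as an argument.
longest_line() {
  for f in "$@"; do
    awk -v name="$f" '
      length($0) > max { max = length($0); at = NR }   # remember the longest line so far
      END { print name, at + 0, max + 0 }              # "+ 0" prints 0 for an empty file
    ' "$f"
  done
}

# Example use (not run here): longest_line 1_*.tsv
```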

Checking that every row in each of the files being parsed has the same number of columns with

[mapr@mapr001]$ ls
1_0_0.tsv 1_1_0.tsv 1_2_0.tsv 1_3_0.tsv
[mapr@mapr001]$ awk -F'\t' '{print NF}' 1_0_0.tsv | sort -nu | wc -l
1
[mapr@mapr001]$ awk -F'\t' '{print NF}' 1_1_0.tsv | sort -nu | wc -l
1
[mapr@mapr001]$ awk -F'\t' '{print NF}' 1_2_0.tsv | sort -nu | wc -l
1
[mapr@mapr001]$ awk -F'\t' '{print NF}' 1_3_0.tsv | sort -nu | wc -l
1

shows that this appears to be OK for all files.
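
The checks above show that each file is internally consistent, but not what the column count actually is or whether it matches across files. A minimal sketch (field_counts is a hypothetical helper name) that prints the counts themselves is below. Keep in mind that awk splits on every raw tab and knows nothing about quoting, so it can report consistent counts even where a quote-aware parser would see something different.

```shell
# Hypothetical helper: for each file, print
# "<file> <field count> <number of lines with that count>".
field_counts() {
  for f in "$@"; do
    awk -F'\t' -v name="$f" '
      { n[NF]++ }                                   # tally lines per field count
      END { for (k in n) print name, k, n[k] }      # one output line per distinct count
    ' "$f"
  done
}

# Example use (not run here): field_counts 1_*.tsv
```

If every file prints exactly one line and the field count is the same in all of them, the files agree on the number of tab-separated columns.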

Checking for non-ASCII characters with

[mapr@mapr001]$ grep --color='auto' -P -n "[\x80-\xFF]" 1_0_0.tsv
82206:...    Iris bombé, right    ....
95293:...    Glaucoma with iris bombé, right, moderate stage ....
105933:...   Fleischer-Strümpell ring, unspecified laterality        ....
Notice that some of these names contain non-ASCII characters (e.g. bombé and Strümpell).
I don't know enough about how StreamSets parses the data to know whether this would be a problem or not.
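
The grep above only looks at bytes in the range 0x80-0xFF. Since the origin is configured with "ignore control characters: false", a complementary check is to look for control characters other than the tab delimiter (for example carriage returns). This is only a sketch, find_control_chars is a made-up helper name, and nothing here shows that control characters are the cause of the error.

```shell
# Hypothetical helper: print "<line number>:<line>" for every line of one file
# that contains a control character other than tab. Tabs are deleted first so
# that only unexpected control bytes match; newlines separate the lines and
# are therefore never matched.
find_control_chars() {
  tr -d '\t' < "$1" | LC_ALL=C grep -n '[[:cntrl:]]'
}

# Example use (not run here): find_control_chars 1_0_0.tsv | cut -d: -f1 | head
```

grep exits with a non-zero status when nothing matches, which here means no such characters were found.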

(At all destination stages the data lands as json).

Removing non-ASCII characters from all files with

[mapr@mapr001]$ perl -pi -e 's/[^[:ascii:]]//g' 1_0_0.tsv
[mapr@mapr001]$ perl -pi -e 's/[^[:ascii:]]//g' 1_1_0.tsv
[mapr@mapr001]$ perl -pi -e 's/[^[:ascii:]]//g' 1_2_0.tsv
[mapr@mapr001]$ perl -pi -e 's/[^[:ascii:]]//g' 1_3_0.tsv

then checking again for non-ASCII characters and rerunning the pipeline showed that the parsing errors still occurred.

 

Checking if perhaps the headers of the files were different in some way, I ran

[root@mapr001]# comm -12 1_0_0.tsv 1_2_0.tsv

comparing each of the other TSV files to the "initial" one (1_0_0.tsv). For each comparison, only the header line was returned. This implies that at least the header rows are all exactly the same.
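
One caveat: comm expects both inputs to be sorted, so its output on unsorted data files is not a dependable comparison. A more direct way to compare only the header rows is to compare the first line of each file, as in this minimal sketch (same_header is a hypothetical helper name):

```shell
# Hypothetical helper: succeed (exit status 0) only if every file has the
# same first line as the first file given.
same_header() {
  ref=$(head -n 1 "$1")                      # header of the reference file
  for f in "$@"; do
    if [ "$(head -n 1 "$f")" != "$ref" ]; then
      echo "header differs: $f"
      return 1
    fi
  done
  echo "all headers match"
}

# Example use (not run here): same_header 1_0_0.tsv 1_1_0.tsv 1_2_0.tsv 1_3_0.tsv
```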

 

Looking at the logs of the YARN job for this pipeline, the logs that I see around the first instance of the error popping up are

2018-02-16 10:58:38,836 [user:] [pipeline:] [thread:Socket Reader #1 for port 42644]  INFO  ServiceAuthorizationManager - Authorization successful for job_1518652467538_0001 (auth:TOKEN) for protocol=interface org.apache.hadoop.mapred.TaskUmbilicalProtocol
2018-02-16 10:58:38,837 [user:] [pipeline:] [thread:IPC Server handler 0 on 42644]  INFO  TaskAttemptListenerImpl - JVM with ID : jvm_1518652467538_0001_m_8796093022213 asked for a task
2018-02-16 10:58:38,837 [user:] [pipeline:] [thread:IPC Server handler 0 on 42644]  INFO  TaskAttemptListenerImpl - JVM with ID: jvm_1518652467538_0001_m_8796093022213 given task: attempt_1518652467538_0001_m_000002_0
2018-02-16 10:58:47,283 [user:] [pipeline:] [thread:IPC Server handler 2 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000003_0 is : 1.0
2018-02-16 10:58:48,137 [user:] [pipeline:] [thread:IPC Server handler 11 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000000_0 is : 0.0014552695
2018-02-16 10:58:48,207 [user:] [pipeline:] [thread:IPC Server handler 2 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000001_0 is : 0.002064415
2018-02-16 10:58:48,239 [user:] [pipeline:] [thread:IPC Server handler 3 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000002_0 is : 0.004178892
2018-02-16 10:58:49,669 [user:] [pipeline:] [thread:IPC Server handler 6 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000003_0 is : 1.0
2018-02-16 10:58:49,700 [user:] [pipeline:] [thread:IPC Server handler 1 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000003_0 is : 1.0
2018-02-16 10:58:49,704 [user:] [pipeline:] [thread:IPC Server handler 16 on 42644]  INFO  TaskAttemptListenerImpl - Done acknowledgement from attempt_1518652467538_0001_m_000003_0
2018-02-16 10:58:49,706 [user:] [pipeline:] [thread:AsyncDispatcher event handler]  INFO  TaskAttemptImpl - attempt_1518652467538_0001_m_000003_0 TaskAttempt Transitioned from RUNNING to SUCCESS_CONTAINER_CLEANUP
2018-02-16 10:58:49,707 [user:] [pipeline:] [thread:ContainerLauncher #4]  INFO  ContainerLauncherImpl - Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e08_1518652467538_0001_01_000003 taskAttempt attempt_1518652467538_0001_m_000003_0
2018-02-16 10:58:49,707 [user:] [pipeline:] [thread:ContainerLauncher #4]  INFO  ContainerLauncherImpl - KILLING attempt_1518652467538_0001_m_000003_0
2018-02-16 10:58:49,707 [user:] [pipeline:] [thread:ContainerLauncher #4]  INFO  ContainerManagementProtocolProxy - Opening proxy : mapr006.ucera.local:8099
2018-02-16 10:58:49,734 [user:] [pipeline:] [thread:AsyncDispatcher event handler]  INFO  TaskAttemptImpl - attempt_1518652467538_0001_m_000003_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2018-02-16 10:58:49,759 [user:] [pipeline:] [thread:AsyncDispatcher event handler]  INFO  TaskImpl - Task succeeded with attempt attempt_1518652467538_0001_m_000003_0
2018-02-16 10:58:49,761 [user:] [pipeline:] [thread:AsyncDispatcher event handler]  INFO  TaskImpl - task_1518652467538_0001_m_000003 Task Transitioned from RUNNING to SUCCEEDED
2018-02-16 10:58:49,764 [user:] [pipeline:] [thread:AsyncDispatcher event handler]  INFO  JobImpl - Num completed Tasks: 1
2018-02-16 10:58:50,160 [user:] [pipeline:] [thread:RMCommunicator Allocator]  INFO  RMContainerAllocator - Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:4 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:4 ContRel:0 HostLocal:3 RackLocal:1
2018-02-16 10:58:51,172 [user:] [pipeline:] [thread:RMCommunicator Allocator]  INFO  RMContainerAllocator - Received completed container container_e08_1518652467538_0001_01_000003
2018-02-16 10:58:51,173 [user:] [pipeline:] [thread:RMCommunicator Allocator]  INFO  RMContainerAllocator - After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:3 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:4 ContRel:0 HostLocal:3 RackLocal:1
2018-02-16 10:58:51,173 [user:] [pipeline:] [thread:AsyncDispatcher event handler]  INFO  TaskAttemptImpl - Diagnostics report from attempt_1518652467538_0001_m_000003_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

2018-02-16 10:58:57,163 [user:] [pipeline:] [thread:IPC Server handler 3 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000000_0 is : 0.02398264
2018-02-16 10:58:57,254 [user:] [pipeline:] [thread:IPC Server handler 16 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000001_0 is : 0.03516187
2018-02-16 10:58:57,279 [user:] [pipeline:] [thread:IPC Server handler 10 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000002_0 is : 0.1161685
2018-02-16 10:58:57,631 [user:] [pipeline:] [thread:IPC Server handler 12 on 42644]  INFO  TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1518652467538_0001_m_000001_0 is : 0.03516187
2018-02-16 10:58:57,643 [user:] [pipeline:] [thread:IPC Server handler 8 on 42644]  FATAL TaskAttemptListenerImpl - Task: attempt_1518652467538_0001_m_000001_0 - exited : java.lang.RuntimeException: Error invoking map function: java.lang.RuntimeException:
com.streamsets.pipeline.cluster.ConsumerRuntimeException: Consumer encountered error: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter
     at com.streamsets.pipeline.cluster.Producer.waitForCommit(Producer.java:107)
     at com.streamsets.pipeline.stage.origin.hdfs.cluster.ClusterHdfsSource.completeBatch(ClusterHdfsSource.java:792)

and I don't see much of anything that looks useful there.

 

Here I look at the distribution of characters (case insensitive) after sqooping in the files and converting them to PSV (as opposed to TSV, which is what I normally do, so expect a lot of "|" chars):

see https://stackoverflow.com/a/3966916/8236733

[mapr@mapr002 ingest_scripts]$ awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' ../data_flat/CLARITY_EDG_export/1_0_0.tbl
4
- 4
. 2
0 11
1 14
2 7
3 3
4 4
5 5
6 3
7 5
8 4
9 7
: 4
_ 93
a 23
b 3
c 60
d 56
e 58
f 18
g 10
h 12
i 41
l 90
m 17
n 67
o 36
p 13
r 39
s 14
t 37
u 40
v 2
w 3
x 19
y 12
z 1
| 88
[mapr@mapr002 ingest_scripts]$ awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' ../data_flat/CLARITY_EDG_export/1_1_0.tbl
á 1
ä 4
ç 1
è 12
5254044
é 23
" 50
# 2
ë 1
% 2083
& 1953
' 14236
( 77313
) 77291
* 48896
+ 306
ô 14
, 346180
- 2148693
ö 3
. 1479865
÷ 2
/ 32781
0 5505654
1 5955213
2 5980609
3 2456612
ü 11
4 1946013
5 2926722
6 2055325
7 2465386
8 2290921
9 2009795
: 2081212
; 690
< 352
= 130
> 819
? 83
  1
[ 108
] 108
^ 8
_ 93
` 1
a 2508634
b 540790
c 1481318
° 18
d 1214875
e 3470106
f 1033343
g 583753
h 873089
i 2829342
j 120278
k 186434
l 30981486
m 882733
n 17896336
o 2684675
p 849124
q 60882
r 2375072
s 2012187
t 2349108
u 15886720
v 392661
w 266808
x 167765
y 1462782
z 47600
| 22874456
[mapr@mapr002 ingest_scripts]$ awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' ../data_flat/CLARITY_EDG_export/1_2_0.tbl
è 35
9399844
é 66
" 36
% 490
& 147
' 14554
( 97579
) 97579
* 67125
+ 238
, 1484849
- 2958675
ö 6
. 1603112
/ 10346
0 7501439
1 7567462
2 7615690
3 2626703
ü 6
4 2395415
5 3437774
6 3329127
7 3192766
8 2859960
9 2416935
: 2892114
; 2501
< 312
= 41
> 330
  5
[ 53
] 53
_ 93
a 3761635
b 976319
c 2416469
° 18
d 1859627
e 6381245
f 2044485
g 924707
h 1393531
i 4744658
j 129025
k 308269
l 50253377
m 1106774
n 28840831
o 4274576
p 1394092
q 339033
r 3883684
s 3333582
t 4506864
u 26535672
v 490873
w 591825
x 362728
y 1141966
z 39478
{ 28
| 31709832
} 28
[mapr@mapr002 ingest_scripts]$ awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' ../data_flat/CLARITY_EDG_export/1_3_0.tbl
á 3
è 1
3418739
é 13
" 34
$ 1
% 542
& 83
' 2693
( 42102
) 42102
* 29221
ó 1
+ 34
, 395822
- 1071956
ö 2
. 565183
/ 3876
0 2498867
1 2903993
2 2480938
3 916764
ü 1
4 880966
5 1105371
6 1218524
7 1281666
8 1005814
9 860635
: 1030583
; 483
< 28
> 3061
[ 20
] 20
_ 93
a 1341181
b 318157
c 895337
d 642904
e 2278393
f 742898
g 299663
h 549640
i 1597221
j 41344
k 77483
l 18300974
m 370204
n 10405874
o 1513568
p 498273
q 125673
r 1331235
s 1212166
t 1583633
u 9668565
v 177549
w 215143
x 120315
y 373901
z 12558
| 11336292

Do any of these characters look like red flags? Thinking that some of these characters looked strange, I ran the command 

[mapr@mapr002 ingest_scripts]$ file -i ../data_flat/file_locations/1_*.tbl
../data_flat/CLARITY_EDG_export/1_0_0.tbl: text/plain; charset=us-ascii
../data_flat/CLARITY_EDG_export/1_1_0.tbl: text/plain; charset=us-ascii
../data_flat/CLARITY_EDG_export/1_2_0.tbl: text/plain; charset=us-ascii
../data_flat/CLARITY_EDG_export/1_3_0.tbl: text/plain; charset=us-ascii

Yet after switching the pipeline charset from the default utf-8 to us-ascii, the pipeline still ran into the same invalid char problem.
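
The us-ascii result sits oddly next to the accented characters in the distributions above. As far as I know, file only inspects a leading portion of each file, so its charset guess may not reflect bytes that appear further in. A minimal sketch (count_non_ascii is a hypothetical helper name) that scans the whole file and counts the bytes outside the 7-bit ASCII range:

```shell
# Hypothetical helper: print "<file> <count>" where count is the number of
# bytes with the high bit set (i.e. outside 7-bit ASCII) in the whole file.
count_non_ascii() {
  for f in "$@"; do
    # Delete every ASCII byte (octal 000-177) and count what is left.
    n=$(LC_ALL=C tr -d '\000-\177' < "$f" | wc -c)
    echo "$f $((n))"      # $((n)) strips any padding that wc adds
  done
}

# Example use (not run here): count_non_ascii ../data_flat/CLARITY_EDG_export/1_*.tbl
```

A count of 0 means the file really is pure ASCII; note that a UTF-8 encoded character such as é contributes two bytes to the count.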

 

This question has also been asked by other StreamSets users before (on Google Groups, and in the JIRA issue [SDC-4753] Problematic Rows in Delimited File in S3 Import Are Reported as Stage Errors and Block Subsequent Rows).

 

So it seems like StreamSets is failing to parse certain data features in a record in one of the TSV files. My question is: what kind of data could be causing this, i.e. what are some specific things that the parser would not like? The error message does not seem to show exactly what caused the parsing error. Could there be some other reason that the pipeline fails part-way through with this error each time, i.e. am I misinterpreting the error? Thanks.
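
One possibility worth ruling out, going by the wording of the error: as far as I can tell from the Apache Commons CSV lexer, "invalid char between encapsulated token and delimiter" is raised when a field starts with the quote character and its closing quote is followed by something (apart from whitespace) that is neither the delimiter nor a line ending. The character counts above do show double quotes in three of the four files. Assuming the tab separated format type keeps the double quote as its quote character (an assumption, not confirmed for StreamSets here), a minimal sketch to locate candidate fields is below. suspect_quotes is a made-up helper name, and the check deliberately over-reports: it also flags validly escaped fields that contain doubled quotes.

```shell
# Hypothetical helper: print "<file>:<line>:<field index>" for every field that
# starts with a double quote but is not a plain "..." value without inner quotes.
# First argument is the delimiter, the remaining arguments are files.
suspect_quotes() {
  delim=$1; shift
  for f in "$@"; do
    awk -F"$delim" -v name="$f" '
      {
        for (i = 1; i <= NF; i++)
          if ($i ~ /^"/ && $i !~ /^"[^"]*"$/)   # opens with a quote, not cleanly closed
            print name ":" NR ":" i
      }
    ' "$f"
  done
}

# Example use (not run here):
#   suspect_quotes "$(printf '\t')" 1_*.tsv     # tab-separated files
#   suspect_quotes '|' 1_*.tbl                  # pipe-separated files
```

Any line reported near the top of a file split would be a candidate for the "(line 2)" in the error, since each cluster task may count lines from the start of its own chunk rather than from the start of the file (again an assumption on my part).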
