Report files with problems instead of failing queries when dealing with thousands or millions of records.
Sorry, while I initially upvoted this, I've since thought a bit more about it...
I know that there's an Apache project that does something similar. In that project, the idea was to provide a timeout for a query; if the query doesn't return in time, the results gathered so far are returned as a possible result. (And the timeout can be set on a per-query basis.)
While this may work for some queries, it's really only a small subset of the types of queries out there. And it has the potential to be misused.
Imagine trying to balance the books based on orders or transactions. Think of the different isolation levels and their effects on RDBMS queries. You now don't know the accuracy of your query, or the percentage of completion before the error. (What does 80% complete mean if you don't know the size of the result set? It's only an estimate.)
This is different from loading a file into a database, where some rows may be rejected: if a certain percentage of the rows fail, the job fails; otherwise it runs to completion with an error log file containing the rejected rows.
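(For contrast, a minimal sketch of that bulk-load pattern in Python; the `insert_row` callback and the 5% threshold are assumptions for illustration, not anything Drill provides.)

```python
import csv

MAX_ERROR_RATE = 0.05  # illustrative threshold: abort if more than 5% of rows fail

def load_with_reject_log(path, insert_row):
    """Load a CSV, diverting bad rows to a reject list instead of failing outright."""
    total, rejects = 0, []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            total += 1
            try:
                insert_row(row)  # hypothetical callback that writes one row to the target table
            except ValueError as exc:
                rejects.append((row, str(exc)))
    if total and len(rejects) / total > MAX_ERROR_RATE:
        raise RuntimeError(f"{len(rejects)}/{total} rows failed; aborting load")
    return rejects  # fix and re-process these rows later
```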
Drill has to have zero tolerance for failure. You have to trust that when you run a query against a static data set, you get the same results every time.
Product owners have to be careful when thinking about an enhancement and how it impacts 99+% of the use cases. It may be a good idea, but if it causes confusion or issues with normal use, it should be avoided.
Interesting observation, and I agree that this enhancement must be used carefully. I think it should default to 100% (no tolerated failures) but be configurable, and the error-tolerance % selected should be shown in the query result. That way any reader can verify the tolerance used and replicate the result. What do you think?
I would have to recommend ditching this feature. No bueno.
I can see this as a feature in an ETL tool, where it's possible to have one or two bad rows out of millions. You want to process the entire file, so reporting and holding the bad rows is one thing; saying "abort after n errors" would also be OK. You can then fix and re-process the bad data at a later date.
However, Drill is a query tool.
You have to walk through all of the available data.
Can you share any use cases where you think this is a good fit?
I have a cluster with multiple data sources being ingested daily (JSON, CSV, and Parquet being the most common). Most source files' schemas rarely change. However, I have Python workflows built on top of the raw data to restructure the JSON logs, and I use Drill to query those as well for analytics.
Ideally, Python workflows are built once and never changed, but the reality is that we need to iterate to process the data in ways that better extract value. It's a never-ending process of tweaking, testing, and re-tweaking. Across terabytes of data, I frequently find unforeseen scenarios while building my ETL workflows. Given Drill's error descriptions, it is sometimes very hard to understand what happened, and detecting one single failure does not help me find patterns to troubleshoot. Moreover, a handful of failures in my scenario is not very troublesome. It would be great to have a feature that lets Drill help in the debugging process: tolerate some limited number of failures and report back the entire list of errors with their source files, making the process more efficient.
So, for this use case, the feature is "muy bueno".
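(In the meantime, a per-file wrapper can approximate this behavior outside of Drill. A rough sketch against Drill's REST endpoint; the host/port, the file paths, and the query template are assumptions for illustration.)

```python
import requests

DRILL_URL = "http://localhost:8047/query.json"  # assumed default Drillbit REST endpoint

def query_each_file(files, query_template):
    """Run one Drill query per file, collecting failures instead of aborting the whole batch."""
    failures = []
    for path in files:
        sql = query_template.format(path=path)
        resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
        if resp.status_code != 200:
            failures.append((path, resp.text))  # keep the file name next to its error
    return failures

# Usage: one pass that reports every file Drill failed to read.
bad = query_each_file(["/data/logs/day1.json", "/data/logs/day2.json"],
                      "SELECT COUNT(*) FROM dfs.`{path}`")
for path, err in bad:
    print(path, "->", err[:200])
```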
So... you're writing your ETL in Python, and because of errors it introduces, you're having issues when you use Drill to query the data?
If I understand your use case, the issue isn't with Drill, but your ETL process. Correct?
Correct. It's because my ETL is changing frequently as we develop & change the data structure.
OK, so this goes back to my point... if we look back at the queries, you want exact answers. Suppose you're doing a sum, and the query fails at 90% completion: you get one answer. However, if you rerun the query and it again fails at or around the 90% mark, you can get a different answer, because a different 90% of the rows may have been read. Which one is correct?
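(A toy illustration of that nondeterminism; the values and the 90% cutoff are made up.)

```python
rows = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Two runs that each "fail" after reading 9 of 10 rows (90% complete),
# but happen to scan the data in a different order:
run1 = sum(sorted(rows)[:9])                # 450
run2 = sum(sorted(rows, reverse=True)[:9])  # 540

print(run1, run2)  # same data, same 90% completion, different answers
```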
The point is that percentage completion doesn't necessarily equate to any sort of confidence for most queries.
It sounds like the root issue is in your ETL process. If I understand your underlying problem... you have to deal with schema changes and still reconcile the data?
I think the difference is that you consider it an absolute must, in all cases, to have 100% of the data at all times. As I explained before, I don't see it that way. In my case, the report of the missed data is more valuable than the query result.
Drill is a piece of my ETL puzzle, as it does some transformation inside the query as well, and it would help if it could skip the files that fail and return a full report afterward. I am not analyzing yet; I am just building my data flow, trying to understand how the data behaves (especially when testing new Drill functions). By adding this flexibility, Drill would be an even more valuable asset.
That is my point: have it be a configurable feature, off by default, and transparent in the report. That is my "Dear Santa" wish for the Drill community.