Unfortunately, data quality can cause problems. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or have the entire job fail (which is annoying when working with 4TB of data ;-)).
So my advice is to run a few minimal checks [on the edge node] before data ingestion.
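
As an illustration, here is a minimal sketch of such a check in Python. It assumes plain delimited text with no quoting or escaping, and a known number of columns. The function name `find_bad_rows` and the sample rows are hypothetical, not part of any Hive tooling. It only flags rows whose field count differs from what the table expects, which is the kind of row that tends to end up with values in the wrong column.

```python
def find_bad_rows(lines, expected_columns, delimiter="\t"):
    """Return (line_number, field_count) for every row with an unexpected field count.

    Assumption: fields are separated by a single delimiter character and
    are never quoted or escaped.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        # Drop the line terminator only, so trailing empty fields are still counted.
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) != expected_columns:
            problems.append((line_number, len(fields)))
    return problems


# Hypothetical sample: the table is expected to have 3 columns.
sample = ["1\talice\t30\n", "2\tbob\n", "3\tcarol\t41\textra\n"]
print(find_bad_rows(sample, expected_columns=3))  # [(2, 2), (3, 4)]
```

On a real file you could pass the open file object directly, for example `find_bad_rows(open(path, encoding="utf-8"), 3)`, since it yields one line at a time and does not load the whole file into memory.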