2. Data Engineering
Data Cleaning — Quiz
Test your understanding of data cleaning with 5 practice questions.
Practice Questions
Question 1
A dataset contains a column 'Customer_ID' with some entries that are 'NULL' or 'N/A'. If 'Customer_ID' is a critical unique identifier and these missing values cannot be reliably imputed, what is the most appropriate initial action for these entries?
Question 2
Which of the following data cleaning techniques is most appropriate for handling inconsistencies in textual data, such as variations in spelling or capitalization for the same category?
Question 3
A dataset contains a column 'TransactionAmount' with values ranging from 10 to 10,000,000. To prevent this column from disproportionately influencing a machine learning model that uses distance-based algorithms, which data cleaning strategy would be most appropriate?
Question 4
In a dataset, a column 'Product_ID' contains entries like 'P-101', 'P_102', and 'Product103'. What data cleaning step is primarily needed to ensure these identifiers are consistent and usable for merging with other datasets?
Question 5
When preparing a dataset for time-series analysis, why is it critical to ensure that all date and time entries are in a consistent, machine-readable format (e.g., 'YYYY-MM-DD HH:MM:SS')?
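For reference, the 'YYYY-MM-DD HH:MM:SS' format named in Question 5 corresponds to the Python pattern "%Y-%m-%d %H:%M:%S". The snippet below is a minimal sketch of bringing mixed date strings into that one format using only the standard library. The list of input formats, the sample values, and the function name normalize_timestamp are assumptions made for illustration; a real dataset may need a different list.

```python
from datetime import datetime

# Hypothetical input formats, assumed for illustration only.
CANDIDATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y/%m/%d"]

# The target format from the question: 'YYYY-MM-DD HH:MM:SS'.
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def normalize_timestamp(raw):
    """Return raw re-expressed in TARGET_FORMAT, or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue  # try the next candidate format
    return None  # leave unparseable entries for manual review


samples = ["2024-03-05 14:30:00", "05/03/2024 14:30", "2024/03/05", "not a date"]
normalized = [normalize_timestamp(s) for s in samples]
```

Entries that match none of the assumed formats come back as None rather than being guessed at, so they can be inspected separately.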
