2. Data Engineering

Data Cleaning — Quiz

Test your understanding of data cleaning with 5 practice questions.

Practice Questions

Question 1

A dataset contains a column 'Customer_ID' with some entries that are 'NULL' or 'N/A'. If 'Customer_ID' is a critical unique identifier and these missing values cannot be reliably imputed, what is the most appropriate initial action for these entries?

Question 2

Which data cleaning technique is most appropriate for handling inconsistencies in textual data, such as variations in spelling or capitalization for the same category?

Question 3

A dataset contains a column 'TransactionAmount' with values ranging from $10 to $10,000,000. To prevent this column from disproportionately influencing a machine learning model that uses distance-based algorithms, which data cleaning strategy would be most appropriate?

Question 4

In a dataset, a column 'Product_ID' contains entries like 'P-101', 'P_102', and 'Product103'. What data cleaning step is primarily needed to ensure these identifiers are consistent and usable for merging with other datasets?

Question 5

When preparing a dataset for time-series analysis, why is it critical to ensure that all date and time entries are in a consistent, machine-readable format (e.g., 'YYYY-MM-DD HH:MM:SS')?
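As an illustration of the 'YYYY-MM-DD HH:MM:SS' format mentioned in Question 5, the following minimal sketch converts a few mixed-format timestamps into that single representation using only Python's standard library. The sample strings, the list of accepted input formats, and the helper name to_canonical are hypothetical and chosen for illustration; a real dataset would need its own list of formats, and the month-name format assumes an English locale.

```python
from datetime import datetime

# Hypothetical input formats; a real dataset may need a different list.
KNOWN_FORMATS = (
    "%Y-%m-%d %H:%M:%S",   # already canonical
    "%d/%m/%Y %H:%M",      # day-first, no seconds (an assumption about the data)
    "%b %d, %Y %I:%M %p",  # e.g. "Mar 1, 2024 8:05 PM" (English locale assumed)
)

CANONICAL = "%Y-%m-%d %H:%M:%S"


def to_canonical(raw: str) -> str:
    """Return the timestamp as 'YYYY-MM-DD HH:MM:SS', or raise ValueError."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.strftime(CANONICAL)
    raise ValueError(f"unrecognized timestamp: {raw!r}")


# Hypothetical raw entries, deliberately in three different formats.
raw_entries = ["03/02/2024 14:30", "2024-01-15 09:00:00", "Mar 1, 2024 8:05 PM"]

# Once every entry shares the canonical format, plain string sorting
# matches chronological order.
canonical = sorted(to_canonical(entry) for entry in raw_entries)
print(canonical)
```

Raising an error on unrecognized entries, rather than guessing, is a deliberate choice in this sketch: an ambiguous value such as '03/02/2024' could be read as day-first or month-first, so the accepted formats should be stated explicitly for each data source.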