A common workflow in product/services for business is to be able to import existing data and export them later on.
A simple strategy to import data is to use CSV/XLSX files, however, this also has some tradeoffs.
Let's see what can go wrong when handling data importing using CSV/XLSX files approach.
Header/Columns Templates
Your software can provide an import template to make sure the user fill the columns with the correct data. What can go wrong with this:
Users can change the header of your import file, so your importing strategy will fail.
You could enforce the column number, but users will change this order
You could enforce the column header label, but users will change them
You could try to match them using a heuristic that this will also fail
You can let the user tells you what is inside each column, making it easier for them and harder for your processing algorithm.
Validations
You need to make sure the import data has no errors and it is valid. You should not import invalid data for your product.
Your validations can be global or local.
If you are importing users, you need a global validation to make sure that one row does not conflict (same primary keys) with another in the same import.
You need local validation per row, to make sure all required columns are defined.
Each column can have a data type (string, number, date, ID, boolean, and more).
You also need to make sure all columns have the proper type.
If you have an ID reference to some database data, you need to make sure the ID is valid, and it is stored in the database.
Space
Users won't be careful when creating the importing file. They will add space before and after data without notice. They can't see space, as most editors don't show them.
Before processing any data you need to trim space from the beginning and end of any value.
Dates/Times
CSV/XLSX files do not ensure the user will use the data formatting that you asked them for.
You can use the date format in XLSX, but the user can change the column to remove this formatting, or pasting the wrong date format altogether.
You can also handle dates as a string or let your XLSX parser them to date.
The most common problem is that some countries using dd/mm/yyyy and other uses mm/dd/yyyy. You can use some heuristic to try to correctly parse them, as some dates only work in a given format. You could also use the date format of the locale of the user, but it won't tell much about the language of the file the user used to edit the import data.
You also need to handle timezone possible issues as always.
Booleans
You can have a predefined set of strings to represent truth and false values.
It can be in English and also in the language of your users.
You should also lowercase them to make comparisons more stable, as users can behave and act differently.
Conditional Fields
One column/field can affect other fields.
One field can make another field required
One field can be exclusive against other (you can't have both)
Error Handling
Instead of a stack trace, we need to provide a file trace of error that happened during the processing.
We can have 3 types of errors: global, per row, or column.
Global: it happens when 2 or more rows of conflicts
Per row: it happens when all columns are valid individually but an error happens when they are used together.
Per column: it happens when a given column is wrong, and break the processing of a given row.
Idempotency
We want to make sure we don't duplicate data when handling imports.
To make sure this won't happen, you need to make sure each row has one or more idempotency keys or primary keys.
These idempotency keys are used to check if the data already exists, so you perform an update instead of an insert.
Showing Results
You need to show what rows were imported successfully.
You also need to show which rows were not processed correctly and why, so the user can fix them in another import.
Below is an example of a data import report
it shows the number of items successful imported
it shows the error per row
it shows the specific column with problem
I hope this gave you a good overview of many edge cases that you need to consider when implementing data import workflow in your product. You can discuss more on the comments section:
Also, share with friends that will need to implement data import or want to improve their product current workflow
Importing data using CSV/XLSX
Very interesting problem to work on that would benefit greatly from testing. Ideally the user would interact with the machine to fix their import and keep working on it until they get it right. Would be cool to go over some code for this stuff too, do you know any OSS projects doing this?