csv_file_validator
Python CSV file validation tool
This tool is written with type hints introduced in Python 3.5
- What this tool can do
- Validation schema
- Validation schema for a file with a header
- Validation schema for a file without a header
- Validation rules
- How to install & run
- Arguments needed
- How to add a custom column validation rule
What this tool can do:
The purpose of this tool is to validate comma separated value files. This tool needs the user to provide a validation schema as a json file and a file path of the file to be validated, or a folder path to validate multiple files in one run against the provided validation schema.
Validation schema:
Validation schema is a json file. Let’s have a closer look at a real life example file.
    {
        "file_metadata": {
            "file_value_separator": ",",
            "file_value_quote_char": "\"",
            "file_row_terminator": "\n",
            "file_has_header": true
        },
        "file_validation_rules": {
            "file_name_file_mask": "SalesJ",
            "file_extension": "csv",
            "file_size_range": [0, 1],
            "file_row_count_range": [0, 1000],
            "file_header_column_names": [
                "Transaction_date",
                "Product",
                "Price",
                "Payment_Type",
                "Name",
                "City",
                "State",
                "Country",
                "Account_Created",
                "Last_Login",
                "Latitude",
                "Longitude"
            ]
        },
        "column_validation_rules": {
            "Transaction_date": {
                "allow_data_type": "datetime.%M/%d/%y %H:%S"
            },
            "Country": {
                "allow_fixed_value_list": ["Norway", "United States"],
                "allow_regex": "[a-zA-Z].+",
                "allow_substring": "sub",
                "allow_data_type": "str"
            },
            "Price": {
                "allow_int_value_range": [0, 100000],
                "allow_fixed_value": 1000,
                "allow_data_type": "int"
            },
            "Latitude": {
                "allow_float_value_range": [-42.5, 90.5]
            }
        }
    }
Mandatory objects in the validation schema json are:
- the file_metadata object containing these 4 keys:

    "file_metadata": {
        "file_value_separator": ",",
        "file_value_quote_char": "\"",
        "file_row_terminator": "\n",
        "file_has_header": true
    },

- at least one defined rule
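The mandatory-object check described above can be sketched roughly as follows. This is an illustration only, not the tool's own code: the helper name `check_mandatory_objects` and its messages are hypothetical, and "at least one defined rule" is assumed here to mean at least one entry in either `file_validation_rules` or `column_validation_rules`.

```python
import json

# The four keys listed under "file_metadata" in the schema.
REQUIRED_METADATA_KEYS = {
    "file_value_separator",
    "file_value_quote_char",
    "file_row_terminator",
    "file_has_header",
}


def check_mandatory_objects(schema: dict) -> list:
    """Return a list of problems found in a schema (hypothetical helper)."""
    problems = []
    metadata = schema.get("file_metadata")
    if not isinstance(metadata, dict):
        problems.append("missing object: file_metadata")
    else:
        for key in sorted(REQUIRED_METADATA_KEYS - set(metadata)):
            problems.append(f"missing file_metadata key: {key}")
    # Assumption: a rule in either section counts as "at least one defined rule".
    rule_count = len(schema.get("file_validation_rules", {})) + len(
        schema.get("column_validation_rules", {})
    )
    if rule_count == 0:
        problems.append("no validation rule defined")
    return problems


schema = json.loads(
    '{"file_metadata": {"file_value_separator": ",", "file_value_quote_char": "\\"",'
    ' "file_row_terminator": "\\n", "file_has_header": true},'
    ' "file_validation_rules": {"file_extension": "csv"}}'
)
print(check_mandatory_objects(schema))  # prints []
```

Returning a list of problems rather than raising on the first one lets a caller report every schema issue in a single run.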
Validation schema for a file with a header:
If validating a file that has a header, we have to set the file_has_header key to true and define the column names in the column validation rules.
    {
        "file_metadata": {
            "file_value_separator": ",",
            "file_value_quote_char": "\"",
            "file_row_terminator": "\n",
            "file_has_header": true
        },
        "file_validation_rules": {
            "file_name_file_mask": "SalesJ",
            "file_extension": "csv",
            "file_size_range": [0, 1],
            "file_row_count_range": [0, 1000],
            "file_header_column_names": [
                "Transaction_date",
                "Product",
                "Price",
                "Payment_Type",
                "Name",
                "City",
                "State",
                "Country",
                "Account_Created",
                "Last_Login",
                "Latitude",
                "Longitude"
            ]
        },
        "column_validation_rules": {
            "Transaction_date": {
                "allow_data_type": "datetime.%M/%d/%y %H:%S"
            },
            "Country": {
                "allow_fixed_value_list": ["Norway", "United States"],
                "allow_regex": "[a-zA-Z].+",
                "allow_substring": "Norwayz",
                "allow_data_type": "str",
                "allow_fixed_value": "value"
            },
            "Price": {
                "allow_int_value_range": [0, 100000],
                "allow_fixed_value": 1000,
                "allow_data_type": "int"
            },
            "Latitude": {
                "allow_float_value_range": [-42.5, 90.5]
            }
        }
    }
Validation schema for a file without a header:
If validating a file that has no header, we have to set the file_has_header key to false and define the column indexes in the column validation rules, starting from 0 for the first column.
    {
        "file_metadata": {
            "file_value_separator": ",",
            "file_value_quote_char": "\"",
            "file_row_terminator": "\n",
            "file_has_header": false
        },
        "file_validation_rules": {
            "file_name_file_mask": "SalesJ",
            "file_extension": "csv",
            "file_size_range": [0, 1],
            "file_row_count_range": [0, 1000]
        },
        "column_validation_rules": {
            "0": {
                "allow_data_type": "datetime.%M/%d/%y %H:%S"
            },
            "1": {
                "allow_fixed_value_list": ["Norway", "United States"],
                "allow_regex": "[a-zA-Z].+",
                "allow_substring": "xzy",
                "allow_data_type": "str"
            },
            "2": {
                "allow_int_value_range": [0, 100000],
                "allow_fixed_value": 1000,
                "allow_data_type": "int"
            },
            "10": {
                "allow_float_value_range": [-42.5, 90.5]
            }
        }
    }
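To make the difference between the two schema styles concrete, here is a minimal sketch of how the keys of column_validation_rules could be turned into 0-based column positions. The function `resolve_column_indexes` is hypothetical and is not taken from the tool; it only illustrates that header names and string indexes end up addressing the same columns.

```python
HEADER = [
    "Transaction_date", "Product", "Price", "Payment_Type", "Name", "City",
    "State", "Country", "Account_Created", "Last_Login", "Latitude", "Longitude",
]


def resolve_column_indexes(column_rules, file_has_header, header_row=None):
    """Map each key of column_validation_rules to a 0-based column position."""
    if file_has_header:
        # Keys are column names: look them up in the header row.
        positions = {name: index for index, name in enumerate(header_row)}
        return {key: positions[key] for key in column_rules}
    # Keys are string indexes such as "0" or "10": convert them to integers.
    return {key: int(key) for key in column_rules}


print(resolve_column_indexes({"Price": {}, "Latitude": {}}, True, HEADER))
print(resolve_column_indexes({"2": {}, "10": {}}, False))
```

In this sketch a column name that is missing from the header raises a KeyError, and a non-numeric key in a headerless schema raises a ValueError; a real implementation would probably report these as schema errors instead.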
Validation rules:
- File level validation rules:
- file_name_file_mask : checks file name matches the file mask regex pattern
- file_extension : checks file extension is an exact match with the provided value
- file_size_range : checks file size in MB is in the range of the provided values
- file_row_count_range : checks file row count is in the range of the provided values
- file_header_column_names : checks file header is an exact match with the provided value
- Column level validation rules:
- allow_data_type : checks column values are of the allowed data type (allowed options: str, int, float, datetime, datetime.<<format>>)
- allow_int_value_range : checks integer column values are in the range of the provided values
- allow_float_value_range : checks float column values are in the range of the provided values
- allow_fixed_value_list : checks column values are in the provided value list
- allow_regex : checks column values match the provided regex pattern
- allow_substring : checks column values are a substring of the provided value
- allow_fixed_value : checks column values are an exact match with the provided value
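As an illustration of what a data type rule involves, the following is a minimal sketch of an allow_data_type style check. It is not the tool's implementation: the function name `is_allowed_data_type` is hypothetical, and treating the bare `datetime` option as ISO 8601 is an assumption made only for this example.

```python
from datetime import datetime


def is_allowed_data_type(column_value: str, validation_value: str) -> bool:
    """Return True when column_value can be read as the type named by validation_value."""
    if validation_value == "str":
        return isinstance(column_value, str)
    if validation_value in ("int", "float"):
        converter = int if validation_value == "int" else float
        try:
            converter(column_value)
            return True
        except ValueError:
            return False
    if validation_value == "datetime":
        # Assumption: without an explicit format, accept ISO 8601 strings.
        try:
            datetime.fromisoformat(column_value)
            return True
        except ValueError:
            return False
    if validation_value.startswith("datetime."):
        # Everything after "datetime." is passed to strptime as the format.
        date_format = validation_value[len("datetime."):]
        try:
            datetime.strptime(column_value, date_format)
            return True
        except ValueError:
            return False
    raise ValueError(f"unsupported data type option: {validation_value}")


print(is_allowed_data_type("1000", "int"))
print(is_allowed_data_type("1/2/09 6:17", "datetime.%m/%d/%y %H:%M"))
```

CSV values always arrive as strings, so a type check of this kind tests whether the string can be converted rather than inspecting its Python type.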
How to install & run:
- Ideally, create and activate a virtual environment or pipenv in order to safely install dependencies from requirements.txt using pip install -r requirements.txt
- Set PYTHONPATH, from Windows CMD for example:

    set PYTHONPATH=%PYTHONPATH%;C:\csv_file_validator

- Run using a command, for example:

    python C:\csv_file_validator\csv_file_validator -fl C:\csv_file_validator\tests\files\csv\with_header\SalesJan2009_with_header_correct_file.csv -cfg C:\csv_file_validator\tests\files\configs\config_with_header.json

- In the settings.conf file:
  - you can set the variable RAISE_EXCEPTION_AND_HALT_ON_FAILED_VALIDATION to True or False; this variable controls whether the tool stops validating after it hits a failed validation
  - you can set the variable SKIP_COLUMN_VALIDATIONS_ON_EMPTY_FILE to True or False; this variable controls whether the tool bypasses the column level validations on a file that has no rows
Arguments needed:
- -fl <string: mandatory> single file absolute path or absolute folder location (in case you need to validate multiple files from a directory in one app run)
- -cfg <string: mandatory> configuration json file location absolute path
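For readers who want to see how two mandatory arguments like these are typically declared, here is a small argparse sketch. It is an assumption about how such a command line could be built, not the tool's actual parser, and the paths in the usage line are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two mandatory arguments described above."""
    parser = argparse.ArgumentParser(prog="csv_file_validator")
    parser.add_argument(
        "-fl",
        required=True,
        help="absolute path of a single file, or of a folder with multiple files",
    )
    parser.add_argument(
        "-cfg",
        required=True,
        help="absolute path of the configuration json file",
    )
    return parser


# Hypothetical paths, passed explicitly so the example runs without a real command line.
args = build_parser().parse_args(["-fl", "/data/sales.csv", "-cfg", "/data/config.json"])
print(args.fl, args.cfg)
```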
How to add a custom column validation rule:
Column validation rule interface:
- The keyword argument validation_value is the value in the config.json file, describing the allowed values for the validation rule
- The keyword argument column_value is the value in the corresponding column in the .csv file being validated

Steps:
- Create a function in the /csv_file_validator/validation_functions.py module and decorate it with @logging_decorator like this:

    @logging_decorator
    def my_new_validation_function(kwargs):
        # example condition that validates the exact match of a validation_value and a column_value:
        if kwargs.get('validation_value') == kwargs.get('column_value'):
            # your validation condition success returns 0
            return 0
        # your validation condition fail returns 1
        return 1

- Add your function name to the _ATTRIBUTE_FUNC_MAP dictionary, which maps registered validation names (keys) to functions. This dictionary is located at the end of the file /csv_file_validator/validation_functions.py.

    "my_new_validation_function": my_new_validation_function

- For a column you wish to evaluate using this new validation rule, set up a validation function to validation value mapping in config.json:

    "my_column_name": { "my_new_validation_function": "some_validation_value" }

- If you need to define regex patterns in regex validation rules, check https://regex101.com/
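The steps above can be tried end to end with the self-contained sketch below. The `logging_decorator` here is a stand-in written for this example (the tool's real decorator is not shown in this document), and `validate_column_value` is a hypothetical dispatcher that only illustrates how a name in config.json could be mapped to a function through _ATTRIBUTE_FUNC_MAP.

```python
import functools
import logging


def logging_decorator(func):
    """Stand-in decorator (assumption): logs each call and its result."""
    @functools.wraps(func)
    def wrapper(kwargs):
        result = func(kwargs)
        logging.debug("%s(%r) -> %r", func.__name__, kwargs, result)
        return result
    return wrapper


@logging_decorator
def my_new_validation_function(kwargs):
    # 0 means the validation passed, 1 means it failed.
    if kwargs.get("validation_value") == kwargs.get("column_value"):
        return 0
    return 1


# Registered validation name -> function, as described in the steps above.
_ATTRIBUTE_FUNC_MAP = {"my_new_validation_function": my_new_validation_function}


def validate_column_value(column_rules, column_value):
    """Run every rule configured for a column and return the names of failed rules."""
    failed = []
    for rule_name, validation_value in column_rules.items():
        func = _ATTRIBUTE_FUNC_MAP[rule_name]
        kwargs = {"validation_value": validation_value, "column_value": column_value}
        if func(kwargs) != 0:
            failed.append(rule_name)
    return failed


rules = {"my_new_validation_function": "some_validation_value"}
print(validate_column_value(rules, "some_validation_value"))
print(validate_column_value(rules, "another_value"))
```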
I have a csv file in this format
"emails"
"foo.bar@foo.com"
"bar.foo@foo.com"
"foobar@foo.com"
If a file is not in the above format, I want to throw a file format error. How to do this?
cnu
asked Feb 2, 2011 at 8:59
Just use the csv module to read the file: check that the first row matches the header you expect, and use a regex (re module) to check the email addresses in subsequent lines. Throw the appropriate exception or terminate the program if these checks fail.
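A minimal sketch of that approach, using only the standard library, might look like this. The exception name `FileFormatError`, the function name `validate_email_csv`, and the deliberately loose email pattern are choices made for this example, not part of the original answer.

```python
import csv
import io
import re

# Loose pattern (assumption): something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


class FileFormatError(Exception):
    """Raised when the CSV does not match the expected format."""


def validate_email_csv(lines):
    """Check for an 'emails' header followed by one email address per row."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header != ["emails"]:
        raise FileFormatError(f"unexpected header: {header!r}")
    for row in reader:
        if len(row) != 1 or not EMAIL_PATTERN.match(row[0]):
            raise FileFormatError(f"line {reader.line_num}: invalid row {row!r}")


sample = io.StringIO('"emails"\n"foo.bar@foo.com"\n"bar.foo@foo.com"\n"foobar@foo.com"\n')
validate_email_csv(sample)  # returns without raising for the sample above
```

For a real file, pass a file object opened with `newline=''` instead of the in-memory sample. Note that csv.reader strips the quotes while parsing, so this sketch does not verify that every value was quoted in the file.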
answered Feb 2, 2011 at 9:10
Just try cutplace; it validates whether your CSV conforms to a specified format.
pbaranski
answered May 24, 2013 at 13:33
This module provides some simple utilities for validating data contained in CSV
files, or other similar data sources.
The source code for this module lives at:
https://github.com/alimanfoo/csvvalidator
Please report any bugs or feature requests via the issue tracker there.
Installation
This module is registered with the Python package index, so you can do:
$ easy_install csvvalidator
… or download from http://pypi.python.org/pypi/csvvalidator and
install in the usual way:
$ python setup.py install
If you want the bleeding edge, clone the source code repository:
$ git clone https://github.com/alimanfoo/csvvalidator.git
$ cd csvvalidator
$ python setup.py install
Usage
The CSVValidator class is the foundation for all validator objects that are
capable of validating CSV data.
You can use the CSVValidator class to dynamically construct a validator, e.g.:
    import sys
    import csv
    from csvvalidator import *

    field_names = (
        'study_id',
        'patient_id',
        'gender',
        'age_years',
        'age_months',
        'date_inclusion'
    )

    validator = CSVValidator(field_names)

    # basic header and record length checks
    validator.add_header_check('EX1', 'bad header')
    validator.add_record_length_check('EX2', 'unexpected record length')

    # some simple value checks
    validator.add_value_check('study_id', int,
                              'EX3', 'study id must be an integer')
    validator.add_value_check('patient_id', int,
                              'EX4', 'patient id must be an integer')
    validator.add_value_check('gender', enumeration('M', 'F'),
                              'EX5', 'invalid gender')
    validator.add_value_check('age_years', number_range_inclusive(0, 120, int),
                              'EX6', 'invalid age in years')
    validator.add_value_check('date_inclusion', datetime_string('%Y-%m-%d'),
                              'EX7', 'invalid date')

    # a more complicated record check
    def check_age_variables(r):
        age_years = int(r['age_years'])
        age_months = int(r['age_months'])
        # the age in months must fall within the stated year of age
        # (written without a modulo so that age_years == 0 cannot divide by zero)
        valid = age_years * 12 <= age_months < (age_years + 1) * 12
        if not valid:
            raise RecordError('EX8', 'invalid age variables')
    validator.add_record_check(check_age_variables)

    # validate the data and write problems to stdout
    # csv.reader needs an open file (or other iterable of lines), not a path
    with open('/path/to/data.csv', newline='') as f:
        data = csv.reader(f, delimiter='\t')
        problems = validator.validate(data)
        write_problems(problems, sys.stdout)
For more complex use cases you can also sub-class CSVValidator to define
re-usable validator classes for specific data sources.
For a complete account of all of the functionality available from this module,
see the example.py and tests.py modules in the source code repository.
Notes
Note that the csvvalidator module is intended to be used in combination with
the standard Python csv module. The csvvalidator module will not
validate the syntax of a CSV file. Rather, the csvvalidator module can be
used to validate any source of row-oriented data, such as is provided by a
csv.reader object.
I.e., if you want to validate data from a CSV file, you have to first construct
a CSV reader using the standard Python csv module, specifying the appropriate
dialect, and then pass the CSV reader as the source of data to either the
CSVValidator.validate or the CSVValidator.ivalidate method.
by python9293
I have a CSV file with hundreds of rows. I have seen a few ways on GitHub, but they are all out of date. Are there any simple examples I can refer to for validating that each column has the correct type of data, e.g. string, number, date?
In this tutorial we will see how to validate a CSV file in Python, so if you want a complete solution with code, read this article to the end.
CSV is a file format used by spreadsheet programs such as Excel, so if you work with CSV files in Python, you need to validate them before processing.
To validate a CSV file in Python I will use the pandas library, which is very powerful when it comes to data analysis, so let's look at the code that does this.
    import pandas as pd

    try:
        # read_csv raises an exception if the file is missing or cannot be parsed
        df = pd.read_csv('test.csv')
        print("CSV file is valid")
    except Exception as e:
        print(f"CSV file not valid: {e}")
Above is the Python code for validating a CSV file. As you can see, it is a small program that you can test on your computer or with an online compiler.
The program requires the pandas library, which you can install by pasting the command below into your terminal.
pip install pandas
The code is very simple: as you can see, I wrapped it in a try/except statement. The try block attempts to read the CSV file, and if any error occurs while reading, control passes to the except block, where I print that the file is not valid along with the error.
So, this was a simple tutorial on validating a CSV file in Python. I hope you found this program useful.