Checking a CSV file for errors in Python

csv_file_validator

Python CSV file validation tool

This tool is written with type hints introduced in Python 3.5

  • What this tool can do
  • Validation schema
    • Validation schema for a file with a header
    • Validation schema for a file without a header
  • Validation rules
  • How to install & run
    • Arguments needed
  • How to add a custom column validation rule

What this tool can do:

The purpose of this tool is to validate comma separated value files. This tool needs the user to provide a validation schema as a json file and a file path of the file to be validated, or a folder path to validate multiple files in one run against the provided validation schema.

Validation schema:

Validation schema is a json file. Let’s have a closer look at a real life example file.

{
   "file_metadata":{
      "file_value_separator":",",
      "file_value_quote_char": "\"",
      "file_row_terminator":"\n",
      "file_has_header":true
   },
   "file_validation_rules":{
      "file_name_file_mask":"SalesJ",
      "file_extension":"csv",
      "file_size_range":[0,1],
      "file_row_count_range":[0,1000],
      "file_header_column_names":[
         "Transaction_date",
         "Product",
         "Price",
         "Payment_Type",
         "Name",
         "City",
         "State",
         "Country",
         "Account_Created",
         "Last_Login",
         "Latitude",
         "Longitude"
      ]
   },
   "column_validation_rules":{
      "Transaction_date":{
         "allow_data_type": "datetime.%M/%d/%y %H:%S"
      },
      "Country":{
         "allow_fixed_value_list":[
            "Norway",
            "United States"
         ],
         "allow_regex":"[a-zA-Z].+",
         "allow_substring":"sub",
         "allow_data_type":"str"
      },
      "Price":{
         "allow_int_value_range":[0, 100000],
         "allow_fixed_value":1000,
         "allow_data_type":"int"
      },
      "Latitude":{
        "allow_float_value_range": [-42.5, 90.5]
      }
   }
}

Mandatory objects in the validation schema json are:

  • file_metadata object containing these 4 keys:
   "file_metadata":{
      "file_value_separator":",",
      "file_value_quote_char": "\"",
      "file_row_terminator":"\n",
      "file_has_header":true
   },
  • at least one defined rule
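
The two mandatory parts listed above can be checked before any file is read. The following is a minimal sketch, not the tool's own code: the function name `check_schema` and the wording of the problem messages are assumptions made for illustration, and "at least one defined rule" is interpreted here as a non-empty `file_validation_rules` or `column_validation_rules` object.

```python
import json

# The four keys the schema description lists as mandatory in file_metadata.
REQUIRED_METADATA_KEYS = {
    "file_value_separator",
    "file_value_quote_char",
    "file_row_terminator",
    "file_has_header",
}


def check_schema(schema: dict) -> list:
    """Return a list of problems found in a parsed validation schema (hypothetical helper)."""
    problems = []
    metadata = schema.get("file_metadata")
    if not isinstance(metadata, dict):
        problems.append("missing file_metadata object")
    else:
        missing = sorted(REQUIRED_METADATA_KEYS - metadata.keys())
        if missing:
            problems.append("file_metadata is missing keys: " + ", ".join(missing))
    # Assumption: a rule counts as "defined" if either rule object is non-empty.
    has_rule = bool(schema.get("file_validation_rules")) or bool(
        schema.get("column_validation_rules")
    )
    if not has_rule:
        problems.append("no validation rule defined")
    return problems


# A raw string keeps the JSON escapes (\" and \n) intact for json.loads.
schema = json.loads(r'''
{
   "file_metadata": {
      "file_value_separator": ",",
      "file_value_quote_char": "\"",
      "file_row_terminator": "\n",
      "file_has_header": true
   },
   "file_validation_rules": {"file_extension": "csv"}
}
''')
print(check_schema(schema))
```

An empty list means the schema has the mandatory parts; the real tool may report such problems differently.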

Validation schema for a file with a header:

If validating a file that has a header, we have to set the file_has_header key to true and define the column names in the column validation rules.

{
   "file_metadata":{
      "file_value_separator":",",
      "file_value_quote_char": "\"",
      "file_row_terminator":"\n",
      "file_has_header":true
   },
   "file_validation_rules":{
      "file_name_file_mask":"SalesJ",
      "file_extension":"csv",
      "file_size_range":[0,1],
      "file_row_count_range":[0,1000],
      "file_header_column_names":[
         "Transaction_date",
         "Product",
         "Price",
         "Payment_Type",
         "Name",
         "City",
         "State",
         "Country",
         "Account_Created",
         "Last_Login",
         "Latitude",
         "Longitude"
      ]
   },
   "column_validation_rules":{
      "Transaction_date":{
         "allow_data_type": "datetime.%M/%d/%y %H:%S"
      },
      "Country":{
         "allow_fixed_value_list":[
            "Norway",
            "United States"
         ],
         "allow_regex":"[a-zA-Z].+",
         "allow_substring":"Norwayz",
         "allow_data_type":"str",
         "allow_fixed_value":"value"
      },
      "Price":{
         "allow_int_value_range":[0, 100000],
         "allow_fixed_value":1000,
         "allow_data_type":"int"
      },
      "Latitude":{
        "allow_float_value_range": [-42.5, 90.5]
      }
   }
}

Validation schema for a file without a header:

If validating a file that has no header, we have to set the file_has_header key to false and use column indexes as the keys in the column validation rules. Indexes start from 0 for the first column.

{
   "file_metadata":{
      "file_value_separator":",",
      "file_value_quote_char": "\"",
      "file_row_terminator":"\n",
      "file_has_header":false
   },
   "file_validation_rules":{
      "file_name_file_mask":"SalesJ",
      "file_extension":"csv",
      "file_size_range":[0,1],
      "file_row_count_range":[0,1000]
   },
   "column_validation_rules":{
      "0":{
         "allow_data_type": "datetime.%M/%d/%y %H:%S"
      },
      "1":{
         "allow_fixed_value_list":[
            "Norway",
            "United States"
         ],
         "allow_regex":"[a-zA-Z].+",
         "allow_substring":"xzy",
         "allow_data_type":"str"
      },
      "2":{
         "allow_int_value_range":[0, 100000],
         "allow_fixed_value":1000,
         "allow_data_type":"int"
      },
      "10":{
        "allow_float_value_range": [-42.5, 90.5]
      }
   }
}
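
The difference between the two schema variants comes down to how a column key is turned into a position in each row. Here is a rough sketch of that lookup using only the standard csv module; the helper name `column_values` and the sample data are assumptions for illustration and are not taken from the tool.

```python
import csv
import io


def column_values(rows, column_key, has_header):
    """Yield the values of one column, addressed by name or by 0-based index."""
    rows = iter(rows)
    if has_header:
        header = next(rows)
        index = header.index(column_key)  # raises ValueError for an unknown name
    else:
        index = int(column_key)  # schema keys are strings such as "0" or "10"
    for row in rows:
        yield row[index]


# Made-up sample data, once with and once without a header row.
with_header = io.StringIO("Product,Price\nProduct1,1200\nProduct2,3600\n")
without_header = io.StringIO("Product1,1200\nProduct2,3600\n")

print(list(column_values(csv.reader(with_header), "Price", has_header=True)))
print(list(column_values(csv.reader(without_header), "1", has_header=False)))
```

Both calls select the same column, once by its header name and once by its index.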

Validation rules:

  • File level validation rules:
    • file_name_file_mask : checks file name matches the file mask regex pattern
    • file_extension : checks file extension is an exact match with the provided value
    • file_size_range : checks file size in MB is in the range of the provided values
    • file_row_count_range : checks file row count is in the range of the provided values
    • file_header_column_names : checks file header is an exact match with the provided value
  • Column level validation rules:
    • allow_data_type : checks column values are of the allowed data type ( allowed options: str , int , float, datetime, datetime.<<format>>)
    • allow_int_value_range : checks integer column values are in the range of the provided values
    • allow_float_value_range : checks float column values are in the range of the provided values
    • allow_fixed_value_list : checks column values are in the provided value list
    • allow_regex : checks column values match the provided regex pattern
    • allow_substring : checks column values are a substring of the provided value
    • allow_fixed_value : checks column values are an exact match with the provided value
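
To make a few of the column level rules concrete, here is a small sketch of how such checks could be written. It is not the tool's implementation: the function signatures are hypothetical, the range bounds are assumed to be inclusive, the regex is assumed to have to match the whole value, and the plain datetime option (without a format) is left out.

```python
import re
from datetime import datetime


def allow_data_type(expected, value):
    """Return True if the string value can be read as the expected type."""
    if expected == "str":
        return True
    try:
        if expected == "int":
            int(value)
        elif expected == "float":
            float(value)
        elif expected.startswith("datetime."):
            # Everything after "datetime." is treated as a strptime format.
            datetime.strptime(value, expected[len("datetime."):])
        else:
            raise NotImplementedError(f"type not covered by this sketch: {expected}")
    except ValueError:
        return False
    return True


def allow_int_value_range(bounds, value):
    """Return True if value is an integer inside [low, high] (inclusive: an assumption)."""
    low, high = bounds
    try:
        return low <= int(value) <= high
    except ValueError:
        return False


def allow_regex(pattern, value):
    """Return True if the whole value matches the pattern (full match: an assumption)."""
    return re.fullmatch(pattern, value) is not None


print(allow_data_type("int", "1200"))
print(allow_data_type("datetime.%m/%d/%y %H:%M", "01/02/09 06:17"))
print(allow_int_value_range([0, 100000], "1200"))
print(allow_regex("[a-zA-Z].+", "Norway"))
```

CSV values always arrive as strings, so every check here starts by trying to convert the text rather than inspecting a Python type.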

How to install & run:

  • ideally create and activate a virtual environment or pipenv in order to safely install dependencies from requirements.txt using pip install -r requirements.txt

  • Set PYTHONPATH, for example from Windows CMD: set PYTHONPATH=%PYTHONPATH%;C:\csv_file_validator

  • run using a command, for example: python C:\csv_file_validator\csv_file_validator -fl C:\csv_file_validator\tests\files\csv\with_header\SalesJan2009_with_header_correct_file.csv -cfg C:\csv_file_validator\tests\files\configs\config_with_header.json

  • in settings.conf file

    • you can set the variable RAISE_EXCEPTION_AND_HALT_ON_FAILED_VALIDATION to True or False; it controls whether the tool stops after the first failed validation or continues with the remaining validations
    • you can set the variable SKIP_COLUMN_VALIDATIONS_ON_EMPTY_FILE to True or False; it controls whether the tool skips the column level validations for a file that has no rows
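
The effect of the first setting can be pictured with a short sketch. This is only an illustration of the halt-or-continue behaviour described above: the exception class `FailedValidation`, the function `run_checks`, and the sample checks are all hypothetical and do not come from the tool.

```python
class FailedValidation(Exception):
    """Hypothetical exception raised when halting on the first failed check."""


def run_checks(checks, halt_on_failure):
    """Run (name, check) pairs; halt on the first failure or collect all failures."""
    failures = []
    for name, check in checks:
        if check():
            continue
        if halt_on_failure:
            raise FailedValidation(name)
        failures.append(name)
    return failures


# Made-up checks: each callable returns True when the validation passes.
checks = [
    ("file_extension", lambda: True),
    ("file_row_count_range", lambda: False),
    ("file_header_column_names", lambda: False),
]

print(run_checks(checks, halt_on_failure=False))
try:
    run_checks(checks, halt_on_failure=True)
except FailedValidation as error:
    print("halted on:", error)
```

Collecting failures is usually more helpful when fixing a file by hand, while halting early suits pipelines that should reject a bad file as soon as possible.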

Arguments needed:

  • -fl <string: mandatory> single file absolute path or absolute folder location (in case you need to validate multiple files from a directory in one app run)
  • -cfg <string: mandatory> configuration json file location absolute path
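
For readers who want to build a similar command line, the two mandatory arguments could be declared with argparse roughly as follows. This is a sketch under the assumption that both are plain string options; the help texts and the sample paths are made up, and the tool's real argument parsing may differ.

```python
import argparse


def build_parser():
    """Build a parser with the two mandatory options described above."""
    parser = argparse.ArgumentParser(description="CSV file validator (sketch)")
    parser.add_argument(
        "-fl", required=True,
        help="absolute path of a single file or of a folder to validate",
    )
    parser.add_argument(
        "-cfg", required=True,
        help="absolute path of the JSON validation schema",
    )
    return parser


# Hypothetical paths, passed explicitly so the sketch runs without a real command line.
args = build_parser().parse_args(["-fl", "/data/sales.csv", "-cfg", "/data/config.json"])
print(args.fl, args.cfg)
```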

How to add a custom column validation rule:

Column validation rule interface:

The keyword argument validation_value is the value in the config.json file, describing the allowed values for the validation rule

The keyword argument column_value is the value in the corresponding column in the .csv file being validated

  • Create a function in /csv_file_validator/validation_functions.py module and decorate it with @logging_decorator like this:

    @logging_decorator 
    def my_new_validation_function(kwargs):
        # example condition that validates the exact match of a validation_value and a column_value:
        if kwargs.get('validation_value') == kwargs.get('column_value'): 
            # your validation condition success returns 0
            return 0
        # your validation condition fail returns 1     
        return 1
  • Add your function name to the registered validation name key — function mapping dictionary _ATTRIBUTE_FUNC_MAP.
    This dictionary is located at the end of the file /csv_file_validator/validation_functions.py.

    "my_new_validation_function": my_new_validation_function
  • For a column you wish to evaluate using this new validation rule, setup a validation function to validation value mapping in config.json

    "my_column_name": {
        "my_new_validation_function": "some_validation_value"
    }
  • If you need to define regex patterns in regex validation rules, check https://regex101.com/
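
The three steps above can be tried out in isolation with the sketch below. It leaves out @logging_decorator, and the dictionary `RULES` and the helper `failed_rule_count` are hypothetical stand-ins for the tool's own mapping and dispatch code, so treat it as an illustration of the 0/1 return convention rather than as the tool's internals.

```python
def my_new_validation_function(kwargs):
    """Return 0 when the column value equals the validation value, otherwise 1."""
    if kwargs.get("validation_value") == kwargs.get("column_value"):
        return 0
    return 1


# Hypothetical stand-in for the rule name to function mapping.
RULES = {"my_new_validation_function": my_new_validation_function}


def failed_rule_count(column_rules, column_value):
    """Apply every rule configured for a column and count the failures."""
    return sum(
        RULES[rule_name](
            {"validation_value": validation_value, "column_value": column_value}
        )
        for rule_name, validation_value in column_rules.items()
    )


# Mirrors the config.json fragment for "my_column_name".
column_rules = {"my_new_validation_function": "some_validation_value"}
print(failed_rule_count(column_rules, "some_validation_value"))
print(failed_rule_count(column_rules, "another value"))
```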

I have a csv file in this format

"emails"
"foo.bar@foo.com"
"bar.foo@foo.com"
"foobar@foo.com"

If a file is not in the above format, I want to throw a file format error. How to do this?

asked Feb 2, 2011 at 8:59 by silence_ghost (edited by cnu)

Just use the csv module to read the file — check the first row matches the header you expect, and use a regex (re module) to check the email addresses in subsequent lines… Throw the appropriate exception or terminate the program if these measures fail.

answered Feb 2, 2011 at 9:10 by Matt Billenstein
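
A minimal sketch of what this answer describes, for the file format shown in the question. The exception name `FileFormatError` and the email pattern are assumptions: the pattern is deliberately loose (something@something.something) and is not a full email address validator.

```python
import csv
import io
import re

# Loose pattern: one "@", no whitespace, at least one dot in the domain part.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


class FileFormatError(Exception):
    """Raised when the file does not have the expected layout (hypothetical name)."""


def validate_emails_csv(lines):
    """Check for an 'emails' header followed by one email address per row."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header != ["emails"]:
        raise FileFormatError(f"unexpected header: {header!r}")
    for row in reader:
        if len(row) != 1 or not EMAIL_RE.fullmatch(row[0]):
            raise FileFormatError(f"line {reader.line_num}: invalid row {row!r}")


good = io.StringIO('"emails"\n"foo.bar@foo.com"\n"bar.foo@foo.com"\n"foobar@foo.com"\n')
validate_emails_csv(good)  # returns without raising

bad = io.StringIO('"emails"\n"foo.bar@foo.com"\n"not-an-email"\n')
try:
    validate_emails_csv(bad)
except FileFormatError as error:
    print(error)
```

For a real file, pass an open file object instead of the StringIO, for example `open(path, newline="")`.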

Just try cutplace; it validates whether your CSV conforms to a specified format.

answered May 24, 2013 at 13:33 by Pavan G (edited by pbaranski)

This module provides some simple utilities for validating data contained in CSV
files, or other similar data sources.

The source code for this module lives at:

https://github.com/alimanfoo/csvvalidator

Please report any bugs or feature requests via the issue tracker there.

Installation

This module is registered with the Python package index, so you can do:

$ easy_install csvvalidator

… or download from http://pypi.python.org/pypi/csvvalidator and
install in the usual way:

$ python setup.py install

If you want the bleeding edge, clone the source code repository:

$ git clone https://github.com/alimanfoo/csvvalidator.git
$ cd csvvalidator
$ python setup.py install

Usage

The CSVValidator class is the foundation for all validator objects that are
capable of validating CSV data.

You can use the CSVValidator class to dynamically construct a validator, e.g.:

import sys
import csv
from csvvalidator import *

field_names = (
               'study_id',
               'patient_id',
               'gender',
               'age_years',
               'age_months',
               'date_inclusion'
               )

validator = CSVValidator(field_names)

# basic header and record length checks
validator.add_header_check('EX1', 'bad header')
validator.add_record_length_check('EX2', 'unexpected record length')

# some simple value checks
validator.add_value_check('study_id', int,
                          'EX3', 'study id must be an integer')
validator.add_value_check('patient_id', int,
                          'EX4', 'patient id must be an integer')
validator.add_value_check('gender', enumeration('M', 'F'),
                          'EX5', 'invalid gender')
validator.add_value_check('age_years', number_range_inclusive(0, 120, int),
                          'EX6', 'invalid age in years')
validator.add_value_check('date_inclusion', datetime_string('%Y-%m-%d'),
                          'EX7', 'invalid date')

# a more complicated record check
def check_age_variables(r):
    age_years = int(r['age_years'])
    age_months = int(r['age_months'])
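    # age in months must fall within the stated year of age
    # (this form also avoids dividing by zero when age_years is 0)
    valid = (age_years * 12 <= age_months <
             (age_years + 1) * 12)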
    if not valid:
        raise RecordError('EX8', 'invalid age variables')
validator.add_record_check(check_age_variables)

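# validate the data and write problems to stdout
# csv.reader needs an open file (or another iterable of lines), not a path
with open('/path/to/data.csv', newline='') as f:
    data = csv.reader(f, delimiter='\t')
    problems = validator.validate(data)
    write_problems(problems, sys.stdout)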

For more complex use cases you can also sub-class CSVValidator to define
re-usable validator classes for specific data sources.

For a complete account of all of the functionality available from this module,
see the example.py and tests.py modules in the source code repository.

Notes

Note that the csvvalidator module is intended to be used in combination with
the standard Python csv module. The csvvalidator module will not
validate the syntax of a CSV file. Rather, the csvvalidator module can be
used to validate any source of row-oriented data, such as is provided by a
csv.reader object.

I.e., if you want to validate data from a CSV file, you have to first construct
a CSV reader using the standard Python csv module, specifying the appropriate
dialect, and then pass the CSV reader as the source of data to either the
CSVValidator.validate or the CSVValidator.ivalidate method.
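
To illustrate the point about row-oriented data, here is a short standard-library sketch: the csv module handles the file syntax and dialect, and whatever consumes the reader only sees rows as lists of strings. The function `record_length_problems` is a hypothetical stand-in for a validator and is not part of the csvvalidator API; the sample data is made up.

```python
import csv
import io


def record_length_problems(rows, expected_length):
    """Yield (row_number, actual_length) for rows of unexpected length."""
    for row_number, row in enumerate(rows, start=1):
        if len(row) != expected_length:
            yield (row_number, len(row))


# Tab-separated sample data; the last record is missing a field.
source = io.StringIO("study_id\tpatient_id\tgender\n1\t100\tM\n2\t101\n")

# The dialect is chosen when the reader is built; "excel-tab" ships with the csv module.
reader = csv.reader(source, dialect="excel-tab")

print(list(record_length_problems(reader, expected_length=3)))
```

Because the check only needs an iterable of rows, the same function would work on rows that come from a database cursor or a list built in memory.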


Posted by python9293 on r/Python:

I have a CSV file with hundreds of rows. I have seen a few approaches on GitHub, but they are all out of date. Are there any simple examples I can refer to for validating that each column has the correct type of data, e.g. string, number, date?
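
One simple approach that needs only the standard library is to map each column name to a small check function and report every value that fails. The sketch below assumes a file with a header row; the column names, the date format `%Y-%m-%d`, and the sample data are made up for illustration.

```python
import csv
import io
from datetime import datetime


def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def is_date(text):
    """Return True if the text is a valid date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Hypothetical columns: one check function per column name.
COLUMN_CHECKS = {
    "name": lambda text: text.strip() != "",
    "amount": is_number,
    "created": is_date,
}


def find_problems(lines):
    """Return (line_number, column, value) for every value that fails its check."""
    problems = []
    reader = csv.DictReader(lines)
    for row in reader:
        for column, check in COLUMN_CHECKS.items():
            value = row.get(column)
            if value is None or not check(value):
                problems.append((reader.line_num, column, value))
    return problems


sample = io.StringIO(
    "name,amount,created\n"
    "Anna,12.5,2023-01-31\n"
    "Bob,abc,2023-02-30\n"
)
for problem in find_problems(sample):
    print(problem)
```

For a real file, replace the StringIO with `open(path, newline="")`. The second data row is reported twice, once for the non-numeric amount and once for the impossible date.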

Validate a CSV file in Python

In this tutorial we will see how to validate a CSV file in Python. If you want a complete solution with code, read this article to the end.

CSV is a plain-text tabular format that spreadsheet programs such as Excel can open and save, so if you work with CSV files in Python, you should validate them before processing.

To validate a CSV file in Python I will use the pandas library, which is very powerful when it comes to data analysis. Let's look at the code that does this.


import pandas as pd

try:
    df = pd.read_csv('test.csv')
    print("CSV file is valid")
except Exception as e:
    print(f"CSV file not valid: {e}")

Above is the Python code for validating a CSV file. As you can see, it is a small program that you can test on your own computer or with an online compiler.

The program requires the pandas library, which you can install by pasting the command below into your terminal.


pip install pandas

The code is very simple: it is wrapped in a try/except statement. The try block attempts to read the CSV file; if reading raises any error, the except block prints that the file is not valid, together with the error. Note that this only confirms that pandas can parse the file. It does not check that the values in each column are correct.

So, this was a simple tutorial on validating a CSV file in Python. I hope you found this program useful; share it with anyone who may need it.

Thanks for reading, and have a nice day 🙂
