validatex package

Module contents

ValidateX - A powerful data quality validation framework.

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas and PySpark DataFrames.

Usage:
>>> import validatex as vx
>>> suite = vx.ExpectationSuite("my_suite")
>>> suite.add("expect_column_to_not_be_null", column="user_id")
>>> suite.add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
>>> result = vx.validate(df, suite)
>>> result.to_html("report.html")

class validatex.ColumnHealthSummary(column: str, checks: int = 0, passed: int = 0, failed: int = 0, errors: int = 0, null_count: int | None = None, null_percent: float | None = None, unique_count: int | None = None, unique_percent: float | None = None, total_rows: int | None = None)

Bases: object

Aggregated health metrics for a single column.

checks: int = 0
column: str
errors: int = 0
failed: int = 0
property health_score: float
null_count: int | None = None
null_percent: float | None = None
passed: int = 0
to_dict() → Dict[str, Any]
total_rows: int | None = None
unique_count: int | None = None
unique_percent: float | None = None

class validatex.DataProfiler

Bases: object

Analyse a Pandas DataFrame and produce a DataProfile.

Usage

>>> profiler = DataProfiler()
>>> profile = profiler.profile(df)
>>> print(profile.summary())
>>> suite = profiler.suggest_expectations(df, suite_name="auto_suite")

profile(df: DataFrame) → DataProfile

Profile every column in df.

Return type:

DataProfile

suggest_expectations(df: DataFrame, suite_name: str = 'auto_generated_suite') → ExpectationSuite

Auto-generate an ExpectationSuite based on the data profile.

Heuristics

  • If a column has zero nulls → expect_column_to_not_be_null

  • If a column is fully unique → expect_column_values_to_be_unique

  • For numeric columns → expect_column_values_to_be_between with observed min/max

  • For string columns with few distinct values → expect_column_values_to_be_in_set

  • For string columns → expect_column_value_lengths_to_be_between
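The heuristics above can be sketched in plain Python without the library. This is an illustrative approximation, not the ValidateX implementation: the function name `suggest`, the `max_categories` cut-off for "few distinct values", the `value_set` keyword, and the use of `min_value`/`max_value` for string lengths are all assumptions, and columns are modelled as plain lists with `None` standing in for nulls.

```python
from typing import Any, Dict, List, Tuple

# (expectation_type, column, kwargs) triples; a hypothetical stand-in for suite entries.
Suggestion = Tuple[str, str, Dict[str, Any]]


def suggest(columns: Dict[str, List[Any]], max_categories: int = 10) -> List[Suggestion]:
    """Apply the documented heuristics to columns given as plain Python lists."""
    out: List[Suggestion] = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present:
            continue  # nothing to learn from an all-null column
        if len(present) == len(values):
            out.append(("expect_column_to_not_be_null", name, {}))
        if len(set(present)) == len(values):
            out.append(("expect_column_values_to_be_unique", name, {}))
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            bounds = {"min_value": min(present), "max_value": max(present)}
            out.append(("expect_column_values_to_be_between", name, bounds))
        elif all(isinstance(v, str) for v in present):
            distinct = sorted(set(present))
            if len(distinct) <= max_categories:  # assumed threshold
                out.append(
                    ("expect_column_values_to_be_in_set", name, {"value_set": distinct})
                )
            lengths = [len(v) for v in present]
            out.append((
                "expect_column_value_lengths_to_be_between",
                name,
                {"min_value": min(lengths), "max_value": max(lengths)},
            ))
    return out


suggestions = suggest({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "age": [30, None, 41],
})
```

Because the bounds and value sets come from the observed data, a suite generated this way describes the sample it was built from; it is usually worth reviewing before treating it as a contract.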

class validatex.DriftDetector(psi_threshold: float = 0.2, bins: int = 10)

Bases: object

Detects data drift between a baseline and a current Pandas DataFrame. Calculates Population Stability Index (PSI) to detect statistical shifts in distributions.
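For reference, PSI over binned distributions is commonly computed as the sum over bins of (c − b) · ln(c / b), where b and c are the baseline and current fractions in each bin. The following is a minimal pure-Python sketch of that formula, not the code DriftDetector runs: the equal-width binning over the baseline range, the clipping of out-of-range values into the edge bins, the `eps` floor for empty bins, and the function names are assumptions. Only the defaults `bins=10` and the 0.2 threshold come from the signature above.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Fraction of values in each equal-width bin over [lo, hi]; outliers go to edge bins."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    return [c / len(values) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `baseline`."""
    lo, hi = min(baseline), max(baseline)
    p_base = bin_fractions(baseline, lo, hi, bins)
    p_cur = bin_fractions(current, lo, hi, bins)
    total = 0.0
    for b, c in zip(p_base, p_cur):
        b = max(b, eps)  # avoid log(0) and division by zero for empty bins
        c = max(c, eps)
        total += (c - b) * math.log(c / b)
    return total


baseline = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in baseline]

same_psi = psi(baseline, baseline)
shifted_psi = psi(baseline, shifted)
drifted = shifted_psi > 0.2  # 0.2 mirrors the default psi_threshold
```

Identical distributions give a PSI of zero, and the value grows as mass moves between bins, which is why a single threshold can flag a column as drifted.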

compare(df_base: DataFrame, df_current: DataFrame) → DriftReport

Run schema and statistical drift comparison between two DataFrames.
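The schema half of such a comparison can be illustrated with two column-to-dtype mappings. This is a sketch under assumptions rather than the library's logic: the function name `schema_diff` and the `"from"`/`"to"` keys inside the type-change entries are hypothetical, and the three return values only loosely mirror the `schema_added_columns`, `schema_removed_columns`, and `schema_type_changes` fields of DriftReport.

```python
from typing import Dict, List, Tuple


def schema_diff(
    base: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, Dict[str, str]]]:
    """Compare two {column: dtype} mappings and report what changed."""
    added = sorted(col for col in current if col not in base)
    removed = sorted(col for col in base if col not in current)
    type_changes = {
        col: {"from": base[col], "to": current[col]}  # key names are assumed
        for col in base
        if col in current and base[col] != current[col]
    }
    return added, removed, type_changes


added, removed, type_changes = schema_diff(
    {"user_id": "int64", "age": "int64", "email": "object"},
    {"user_id": "int64", "age": "float64", "country": "object"},
)
```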

class validatex.DriftReport(schema_added_columns: List[str], schema_removed_columns: List[str], schema_type_changes: Dict[str, Dict[str, str]], column_drifts: Dict[str, ColumnDriftResult])

Bases: object

Represents a full data drift comparison report.

column_drifts: Dict[str, ColumnDriftResult]
schema_added_columns: List[str]
schema_removed_columns: List[str]
schema_type_changes: Dict[str, Dict[str, str]]
summary() → str

Return a human-readable summary of the drift report.

to_dict() → Dict[str, Any]

Convert the drift report to a dictionary.

to_json(indent: int = 2) → str

Convert the report to a JSON string.

class validatex.Expectation(column: str | None = None, kwargs: Dict[str, Any] = <factory>, meta: Dict[str, Any] = <factory>)

Bases: ABC

Abstract base class for all expectations.

Subclasses must:
  1. Set the class attribute expectation_type (a unique string id).

  2. Implement _validate_pandas() and/or _validate_spark().

column: str | None = None
expectation_type: str = 'base_expectation'
classmethod from_dict(d: Dict[str, Any]) → Expectation

Deserialize from a dictionary.

kwargs: Dict[str, Any]
meta: Dict[str, Any]
to_dict() → Dict[str, Any]

Serialize to a plain dictionary (for YAML / JSON configs).

validate(data: Any, engine: str = 'pandas') → ExpectationResult

Run this expectation against data using the specified engine.

Parameters:
  • data (Any) – The dataset (pd.DataFrame or pyspark.sql.DataFrame).

  • engine (str) – "pandas" or "spark".

Return type:

ExpectationResult

class validatex.ExpectationResult(expectation_type: str, success: bool, column: str | None = None, observed_value: Any = None, element_count: int = 0, unexpected_count: int = 0, unexpected_percent: float = 0.0, unexpected_values: List[Any] = <factory>, details: Dict[str, Any] = <factory>, exception_info: str | None = None, meta: Dict[str, Any] = <factory>)

Bases: object

Result of a single expectation evaluation.

column: str | None = None
details: Dict[str, Any]
element_count: int = 0
exception_info: str | None = None
expectation_type: str
property human_observed: str

Return a human-readable string for the observed value.

Converts raw dicts and technical strings into plain-language text for non-technical readers.

meta: Dict[str, Any]
observed_value: Any = None
property severity: str

Return severity level for this expectation.

property severity_icon: str
property status: str
property status_icon: str
success: bool
to_dict() → Dict[str, Any]
unexpected_count: int = 0
unexpected_percent: float = 0.0
unexpected_values: List[Any]

class validatex.ExpectationSuite(name: str, expectations: List[Expectation] = <factory>, meta: Dict[str, Any] = <factory>)

Bases: object

A named collection of expectations.

Examples

>>> suite = ExpectationSuite("user_data_quality")
>>> suite.add("expect_column_to_not_be_null", column="user_id")
>>> suite.add("expect_column_values_to_be_between",
...           column="age", min_value=0, max_value=150)

add(expectation_type: str, column: str | None = None, meta: Dict[str, Any] | None = None, **kwargs: Any) → ExpectationSuite

Add an expectation to this suite.

Parameters:
  • expectation_type (str) – The registered name of the expectation (e.g. "expect_column_to_not_be_null").

  • column (str, optional) – Target column name.

  • meta (dict, optional) – Arbitrary metadata to attach.

  • **kwargs – Additional arguments forwarded to the expectation (e.g. min_value, regex).

Returns:

self for fluent chaining.

Return type:

ExpectationSuite
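Returning self is what allows several add() calls to be written as one expression. The sketch below shows the pattern with a hypothetical stand-in class, `MiniSuite`, which is not part of ValidateX and only imitates the chaining behaviour described above; it stores plain dictionaries instead of Expectation objects.

```python
from typing import Any, Dict, List, Optional


class MiniSuite:
    """Hypothetical stand-in that mimics only the chaining behaviour of add()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.expectations: List[Dict[str, Any]] = []

    def add(
        self, expectation_type: str, column: Optional[str] = None, **kwargs: Any
    ) -> "MiniSuite":
        self.expectations.append(
            {"expectation_type": expectation_type, "column": column, "kwargs": kwargs}
        )
        return self  # returning self is what makes chaining possible


suite = (
    MiniSuite("user_data_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
)
```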

add_expectation(expectation: Expectation) → ExpectationSuite

Add a pre-built Expectation instance.

clear() → ExpectationSuite

Remove all expectations.

expectations: List[Expectation]
classmethod from_dict(data: Dict[str, Any]) → ExpectationSuite

Create a suite from a plain dictionary.

classmethod load(filepath: str) → ExpectationSuite

Load from a YAML or JSON file.

meta: Dict[str, Any]
name: str
remove(index: int) → ExpectationSuite

Remove an expectation by index.

save(filepath: str) → None

Save to YAML or JSON based on file extension.

to_dict() → Dict[str, Any]
to_json(indent: int = 2) → str
to_yaml() → str

class validatex.ValidationResult(suite_name: str, results: List[ExpectationResult] = <factory>, run_time: datetime | None = None, run_duration_seconds: float = 0.0, data_source: str | None = None, engine: str = 'pandas', statistics: Dict[str, Any] = <factory>)

Bases: object

Aggregate result of running an entire expectation suite.

column_health() → List[ColumnHealthSummary]

Aggregate expectation results by column.

Extracts null % and unique % from specific expectation types when present.
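The per-column aggregation can be pictured as a simple group-and-count over the results. The sketch below is an approximation under stated assumptions, not the library's code: results are modelled as plain dictionaries using the ExpectationResult field names `column`, `success`, and `exception_info`; a result with `exception_info` set is counted as an error rather than a failure; and results without a column are skipped. The function name `aggregate_by_column` is hypothetical, and the null and unique percentages are left out.

```python
from collections import defaultdict
from typing import Any, Dict, List


def aggregate_by_column(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Count checks, passes, failures, and errors for each column."""
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"checks": 0, "passed": 0, "failed": 0, "errors": 0}
    )
    for result in results:
        column = result.get("column")
        if column is None:
            continue  # assumption: table-level results are not attributed to a column
        entry = summary[column]
        entry["checks"] += 1
        if result.get("exception_info"):
            entry["errors"] += 1  # assumption: an exception is an error, not a failure
        elif result["success"]:
            entry["passed"] += 1
        else:
            entry["failed"] += 1
    return dict(summary)


health = aggregate_by_column([
    {"column": "age", "success": True, "exception_info": None},
    {"column": "age", "success": False, "exception_info": None},
    {"column": "user_id", "success": False, "exception_info": "KeyError: 'user_id'"},
    {"column": None, "success": True, "exception_info": None},
])
```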

compute_quality_score() → float

Compute a weighted data quality score (0–100).

Severity weights:
  • Critical: ×3

  • Warning: ×2

  • Info: ×1

Score = 100 × (weighted_passed / weighted_total)
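As a worked illustration of the formula, the weighting can be reproduced in a few lines of plain Python. This is a sketch, not the method's actual implementation: the function name `weighted_quality_score`, the (severity, passed) tuple representation, the lowercase severity labels, and the choice to return 100.0 for an empty input are assumptions.

```python
from typing import Dict, Iterable, Tuple

# Severity weights as listed in the documentation above.
SEVERITY_WEIGHTS: Dict[str, int] = {"critical": 3, "warning": 2, "info": 1}


def weighted_quality_score(results: Iterable[Tuple[str, bool]]) -> float:
    """Return 100 * weighted_passed / weighted_total for (severity, passed) pairs."""
    weighted_passed = 0
    weighted_total = 0
    for severity, passed in results:
        weight = SEVERITY_WEIGHTS[severity]
        weighted_total += weight
        if passed:
            weighted_passed += weight
    if weighted_total == 0:
        return 100.0  # assumption: no expectations means nothing failed
    return 100.0 * weighted_passed / weighted_total


# One critical pass (3), one warning failure (2), one info pass (1):
# weighted_passed = 4, weighted_total = 6.
score = weighted_quality_score(
    [("critical", True), ("warning", False), ("info", True)]
)
```

With these inputs the score is 100 × 4 / 6, roughly 66.7, whereas an unweighted pass rate over the same three checks would also be about 66.7 only by coincidence; failing the critical check instead of the warning would drop the weighted score to 50.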

compute_statistics() → Dict[str, Any]

Compute summary statistics and store them.

data_source: str | None = None
engine: str = 'pandas'
property errored_expectations: int
property failed_expectations: int
results: List[ExpectationResult]
run_duration_seconds: float = 0.0
run_time: datetime | None = None
statistics: Dict[str, Any]
property success: bool

True only if every expectation passed.

property success_percent: float
property successful_expectations: int
suite_name: str
summary() → str

Return a human-readable summary string.

to_dict() → Dict[str, Any]
to_html(filepath: str) → None

Generate a rich HTML report and write to filepath.

to_json(indent: int = 2) → str

Serialize the full result to a JSON string.

to_json_file(filepath: str) → None

Write the validation result to a JSON file.

property total_expectations: int

class validatex.Validator(suite: ExpectationSuite, engine: str = 'pandas')

Bases: object

Runs an ExpectationSuite against a dataset.

Parameters:
  • suite (ExpectationSuite) – The suite of expectations to evaluate.

  • engine (str) – "pandas" or "spark".

run(data: Any, data_source: str | None = None) → ValidationResult

Execute every expectation in the suite against data.

Parameters:
  • data (pd.DataFrame | pyspark.sql.DataFrame) – The dataset to validate.

  • data_source (str, optional) – A label describing where the data came from.

Return type:

ValidationResult

validatex.validate(data: Any, suite: ExpectationSuite, engine: str = 'pandas', data_source: str | None = None) → ValidationResult

Convenience function to validate data against a suite.

Parameters:
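  • data (pd.DataFrame | pyspark.sql.DataFrame) – The dataset to validate.

  • suite (ExpectationSuite) – The suite of expectations to evaluate.

  • engine (str) – "pandas" or "spark".

  • data_source (str, optional) – A label describing where the data came from.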

Return type:

ValidationResult