validatex package

Module contents

ValidateX - A powerful data quality validation framework.

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas and PySpark DataFrames.

Usage:

>>> import pandas as pd
>>> import validatex as vx
>>> df = pd.DataFrame({"user_id": [1, 2, 3], "age": [34, 27, 45]})  # sample data
>>> suite = vx.ExpectationSuite("my_suite")
>>> suite.add("expect_column_to_not_be_null", column="user_id")
>>> suite.add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
>>> result = vx.validate(df, suite)
>>> result.to_html("report.html")

class validatex.ColumnHealthSummary(column: str, checks: int = 0, passed: int = 0, failed: int = 0, errors: int = 0, null_count: int | None = None, null_percent: float | None = None, unique_count: int | None = None, unique_percent: float | None = None, total_rows: int | None = None)[source]

Bases: object

Aggregated health metrics for a single column.

checks: int = 0

column: str

errors: int = 0

failed: int = 0

property health_score: float

null_count: int | None = None

null_percent: float | None = None

passed: int = 0

to_dict() → Dict[str, Any][source]

total_rows: int | None = None

unique_count: int | None = None

unique_percent: float | None = None

class validatex.DataProfiler[source]

Bases: object

Analyse a Pandas DataFrame and produce a DataProfile.

Usage:

>>> from validatex import DataProfiler
>>> profiler = DataProfiler()
>>> profile = profiler.profile(df)  # df is a Pandas DataFrame
>>> print(profile.summary())
>>> suite = profiler.suggest_expectations(df, suite_name="auto_suite")

profile(df: DataFrame) → DataProfile[source]

Profile every column in df.

Return type:

DataProfile

suggest_expectations(df: DataFrame, suite_name: str = 'auto_generated_suite') → ExpectationSuite[source]

Auto-generate an ExpectationSuite based on the data profile.

Heuristics

If a column has zero nulls → expect_column_to_not_be_null
If a column is fully unique → expect_column_values_to_be_unique
For numeric columns → expect_column_values_to_be_between with observed min/max
For string columns with few distinct values → expect_column_values_to_be_in_set
For string columns → expect_column_value_lengths_to_be_between
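
The heuristics above can be pictured with a small standalone sketch. It is not the library's implementation: it works on a plain Python list instead of a DataFrame, the cutoff MAX_SET_SIZE for "few distinct values" is a hypothetical value the documentation does not state, and the keyword names value_set and the length bounds are assumptions.

```python
from numbers import Number
from typing import Any, Dict, List, Tuple

# Hypothetical cutoff for "few distinct values"; the docs do not give one.
MAX_SET_SIZE = 10

Suggestion = Tuple[str, str, Dict[str, Any]]


def suggest_for_column(name: str, values: List[Any]) -> List[Suggestion]:
    """Apply the five documented heuristics to one column given as a list."""
    suggestions: List[Suggestion] = []
    present = [v for v in values if v is not None]

    # Zero nulls -> not-null expectation.
    if values and len(present) == len(values):
        suggestions.append(("expect_column_to_not_be_null", name, {}))

    # Fully unique -> uniqueness expectation.
    if values and len(set(present)) == len(values):
        suggestions.append(("expect_column_values_to_be_unique", name, {}))

    is_numeric = bool(present) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )
    is_string = bool(present) and all(isinstance(v, str) for v in present)

    if is_numeric:
        # Numeric -> range check using the observed min and max.
        bounds = {"min_value": min(present), "max_value": max(present)}
        suggestions.append(("expect_column_values_to_be_between", name, bounds))
    elif is_string:
        distinct = sorted(set(present))
        if len(distinct) <= MAX_SET_SIZE:
            # "value_set" is an assumed keyword name.
            suggestions.append(
                ("expect_column_values_to_be_in_set", name, {"value_set": distinct})
            )
        lengths = [len(v) for v in present]
        # The keyword names for the length bounds are assumed as well.
        suggestions.append(
            (
                "expect_column_value_lengths_to_be_between",
                name,
                {"min_value": min(lengths), "max_value": max(lengths)},
            )
        )
    return suggestions
```

Each tuple maps directly onto a call of the form suite.add(expectation_type, column=..., **kwargs).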

class validatex.DriftDetector(psi_threshold: float = 0.2, bins: int = 10)[source]

Bases: object

Detects data drift between a baseline and a current Pandas DataFrame. Calculates the Population Stability Index (PSI) to detect statistical shifts in column distributions.

compare(df_base: DataFrame, df_current: DataFrame) → DriftReport[source]: Run schema and statistical drift comparison between two DataFrames.
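
For orientation, PSI for one numeric column is commonly computed as the sum over bins of (actual − expected) × ln(actual / expected), where expected and actual are the per-bin proportions in the baseline and current data. The sketch below is a minimal pure-Python version under assumptions the documentation does not confirm: equal-width bins derived from the baseline range, out-of-range values clamped into the edge bins, and a small epsilon to avoid log(0). The defaults mirror the constructor's bins=10 and psi_threshold=0.2.

```python
import math
from typing import List, Sequence


def psi(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index with equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    # Fall back to a width of 1.0 when the baseline column is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def has_drifted(baseline, current, psi_threshold: float = 0.2) -> bool:
    """Flag drift when PSI exceeds the threshold (assumed comparison rule)."""
    return psi(baseline, current) > psi_threshold
```

Identical distributions give a PSI of 0, and the value grows as the two histograms diverge.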

class validatex.DriftReport(schema_added_columns: List[str], schema_removed_columns: List[str], schema_type_changes: Dict[str, Dict[str, str]], column_drifts: Dict[str, ColumnDriftResult])[source]

Bases: object

Represents a full data drift comparison report.

column_drifts: Dict[str, ColumnDriftResult]

schema_added_columns: List[str]

schema_removed_columns: List[str]

schema_type_changes: Dict[str, Dict[str, str]]

summary() → str[source]: Return a human-readable summary of the drift report.

to_dict() → Dict[str, Any][source]: Convert the drift report to a dictionary.

to_json(indent: int = 2) → str[source]: Convert the report to a JSON string.
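
The three schema_* fields can be read as a dictionary diff between the two frames' column-to-dtype mappings. The following sketch illustrates that idea on plain dictionaries; the inner keys "baseline" and "current" in the type-change entries are assumptions, since the documentation only gives the type Dict[str, Dict[str, str]].

```python
from typing import Dict, List, Tuple


def schema_diff(
    base: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, Dict[str, str]]]:
    """Compare two column -> dtype-name mappings."""
    added = [col for col in current if col not in base]
    removed = [col for col in base if col not in current]
    type_changes = {
        col: {"baseline": base[col], "current": current[col]}  # assumed key names
        for col in base
        if col in current and base[col] != current[col]
    }
    return added, removed, type_changes


added, removed, changed = schema_diff(
    {"id": "int64", "age": "int64", "name": "object"},
    {"id": "int64", "age": "float64", "email": "object"},
)
```

With pandas, mappings of this shape can be built with {col: str(dtype) for col, dtype in df.dtypes.items()}.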

class validatex.Expectation(column: str | None = None, kwargs: Dict[str, Any] = <factory>, meta: Dict[str, Any] = <factory>)[source]

Bases: ABC

Abstract base class for all expectations.

Subclasses must:

Set the class attribute expectation_type (a unique string id).
Implement _validate_pandas() and/or _validate_spark().
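
The subclassing contract can be illustrated with a stand-in base class. This is a sketch of the pattern only, not the library's code: SketchExpectation, the dispatch inside validate(), the expectation id, and the dictionary result are all hypothetical, and the data is a plain dict of lists so the example runs without pandas.

```python
from abc import ABC
from typing import Any, Dict, List, Optional


class SketchExpectation(ABC):
    """Stand-in for the documented base class (hypothetical)."""

    expectation_type: str = "base_expectation"

    def __init__(self, column: Optional[str] = None, **kwargs: Any) -> None:
        self.column = column
        self.kwargs = kwargs

    def validate(self, data: Any, engine: str = "pandas") -> Dict[str, Any]:
        # Look up the engine-specific hook; a subclass may implement one or both.
        hook = getattr(self, f"_validate_{engine}", None)
        if hook is None:
            raise NotImplementedError(
                f"{self.expectation_type} has no {engine} implementation"
            )
        return hook(data)


class ExpectColumnValuesToBePositive(SketchExpectation):
    # A unique string id, as the contract requires (this id is made up).
    expectation_type = "expect_column_values_to_be_positive"

    def _validate_pandas(self, data: Dict[str, List[float]]) -> Dict[str, Any]:
        values = data[self.column]
        unexpected = [v for v in values if v <= 0]
        return {
            "success": not unexpected,
            "element_count": len(values),
            "unexpected_count": len(unexpected),
        }


outcome = ExpectColumnValuesToBePositive(column="amount").validate(
    {"amount": [5, -1, 3]}
)
```

A real subclass would derive from validatex.Expectation and return an ExpectationResult instead of a dictionary.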

column: str | None = None

expectation_type: str = 'base_expectation'

classmethod from_dict(d: Dict[str, Any]) → Expectation[source]: Deserialize from a dictionary.

kwargs: Dict[str, Any]

meta: Dict[str, Any]

to_dict() → Dict[str, Any][source]: Serialize to a plain dictionary (for YAML / JSON configs).

validate(data: Any, engine: str = 'pandas') → ExpectationResult[source]

Run this expectation against data using the specified engine.

Parameters:

data (Any) – The dataset (pd.DataFrame or pyspark.sql.DataFrame).
engine (str) – "pandas" or "spark".

Return type:

ExpectationResult

class validatex.ExpectationResult(expectation_type: str, success: bool, column: str | None = None, observed_value: Any = None, element_count: int = 0, unexpected_count: int = 0, unexpected_percent: float = 0.0, unexpected_values: List[Any] = <factory>, details: Dict[str, Any] = <factory>, exception_info: str | None = None, meta: Dict[str, Any] = <factory>)[source]

Bases: object

Result of a single expectation evaluation.

column: str | None = None

details: Dict[str, Any]

element_count: int = 0

exception_info: str | None = None

expectation_type: str

property human_observed: str

Return a human-readable string for the observed value.

Converts raw dicts and technical strings into plain-language text suitable for reports.

meta: Dict[str, Any]

observed_value: Any = None

property severity: str: Return severity level for this expectation.

property severity_icon: str

property status: str

property status_icon: str

success: bool

to_dict() → Dict[str, Any][source]

unexpected_count: int = 0

unexpected_percent: float = 0.0

unexpected_values: List[Any]

class validatex.ExpectationSuite(name: str, expectations: List[Expectation] = <factory>, meta: Dict[str, Any] = <factory>)[source]

Bases: object

A named collection of expectations.

Examples

>>> from validatex import ExpectationSuite
>>> suite = ExpectationSuite("user_data_quality")
>>> suite.add("expect_column_to_not_be_null", column="user_id")
>>> suite.add("expect_column_values_to_be_between",
...           column="age", min_value=0, max_value=150)

add(expectation_type: str, column: str | None = None, meta: Dict[str, Any] | None = None, **kwargs: Any) → ExpectationSuite[source]

Add an expectation to this suite.

Parameters:

expectation_type (str) – The registered name of the expectation (e.g. "expect_column_to_not_be_null").
column (str, optional) – Target column name.
meta (dict, optional) – Arbitrary metadata to attach.
**kwargs – Additional arguments forwarded to the expectation (e.g. min_value, regex).

Returns:

self for fluent chaining.

Return type:

ExpectationSuite
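
Because add() returns self, calls can be chained. The sketch below shows the pattern with a hypothetical stand-in class (SketchSuite) that stores each expectation as a plain dictionary; how the real suite stores or resolves registered expectation names is not shown here.

```python
from typing import Any, Dict, List, Optional


class SketchSuite:
    """Hypothetical stand-in that mimics the fluent add() signature."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.expectations: List[Dict[str, Any]] = []

    def add(
        self,
        expectation_type: str,
        column: Optional[str] = None,
        meta: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> "SketchSuite":
        self.expectations.append(
            {
                "expectation_type": expectation_type,
                "column": column,
                "meta": meta or {},
                "kwargs": kwargs,
            }
        )
        return self  # returning self is what enables chaining


suite = (
    SketchSuite("user_data_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
)
```

The same chained form should work with ExpectationSuite itself, given the documented return value.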

add_expectation(expectation: Expectation) → ExpectationSuite[source]: Add a pre-built Expectation instance.

clear() → ExpectationSuite[source]: Remove all expectations.

expectations: List[Expectation]

classmethod from_dict(data: Dict[str, Any]) → ExpectationSuite[source]: Create a suite from a plain dictionary.

classmethod load(filepath: str) → ExpectationSuite[source]: Load from a YAML or JSON file.

meta: Dict[str, Any]

name: str

remove(index: int) → ExpectationSuite[source]: Remove an expectation by index.

save(filepath: str) → None[source]: Save to YAML or JSON based on file extension.

to_dict() → Dict[str, Any][source]

to_json(indent: int = 2) → str[source]

to_yaml() → str[source]

class validatex.ValidationResult(suite_name: str, results: List[ExpectationResult] = <factory>, run_time: datetime | None = None, run_duration_seconds: float = 0.0, data_source: str | None = None, engine: str = 'pandas', statistics: Dict[str, Any] = <factory>)[source]

Bases: object

Aggregate result of running an entire expectation suite.

column_health() → List[ColumnHealthSummary][source]

Aggregate expectation results by column.

Populates null_percent and unique_percent from specific expectation types when those are present in the results.
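
The per-column aggregation can be pictured as a single pass over the results. The sketch below uses plain dictionaries in place of ExpectationResult and ColumnHealthSummary; skipping results without a column and counting a result as an error whenever exception_info is set are assumptions about behavior the documentation does not spell out.

```python
from typing import Any, Dict, List


def column_health(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group expectation results by column and count outcomes."""
    summaries: Dict[str, Dict[str, Any]] = {}
    for result in results:
        column = result.get("column")
        if column is None:
            continue  # table-level expectations are skipped (assumption)
        summary = summaries.setdefault(
            column,
            {"column": column, "checks": 0, "passed": 0, "failed": 0, "errors": 0},
        )
        summary["checks"] += 1
        if result.get("exception_info"):
            summary["errors"] += 1
        elif result["success"]:
            summary["passed"] += 1
        else:
            summary["failed"] += 1
    return list(summaries.values())


health = column_health(
    [
        {"column": "age", "success": True},
        {"column": "age", "success": False},
        {"column": "user_id", "success": False, "exception_info": "KeyError('user_id')"},
        {"column": None, "success": True},
    ]
)
```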

compute_quality_score() → float[source]

Compute a weighted data quality score (0–100).

Severity weights:

Critical: ×3
Warning: ×2
Info: ×1

Score = 100 × (weighted_passed / weighted_total)
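
As a worked example of the formula, the sketch below computes the score from (severity, success) pairs. The lowercase severity strings and the value returned for an empty result list are assumptions.

```python
from typing import Iterable, Tuple

# Weights from the documentation; the lowercase keys are an assumption.
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


def quality_score(results: Iterable[Tuple[str, bool]]) -> float:
    """Return 100 * weighted_passed / weighted_total."""
    results = list(results)
    weighted_total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in results)
    if weighted_total == 0:
        return 100.0  # no expectations to fail (assumed convention)
    weighted_passed = sum(
        SEVERITY_WEIGHTS[severity] for severity, passed in results if passed
    )
    return 100.0 * weighted_passed / weighted_total


# One critical pass, one warning failure, one info pass:
# weighted_passed = 3 + 1 = 4, weighted_total = 3 + 2 + 1 = 6.
score = quality_score([("critical", True), ("warning", False), ("info", True)])
```

Here the score is 100 × 4 / 6, roughly 66.7, whereas an unweighted pass rate would give the same figure only by coincidence of the counts; failing the critical check instead of the warning would drop the score to 50.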

compute_statistics() → Dict[str, Any][source]: Compute summary statistics and store them.

data_source: str | None = None

engine: str = 'pandas'

property errored_expectations: int

property failed_expectations: int

results: List[ExpectationResult]

run_duration_seconds: float = 0.0

run_time: datetime | None = None

statistics: Dict[str, Any]

property success: bool: True only if every expectation passed.

property success_percent: float

property successful_expectations: int

suite_name: str

summary() → str[source]: Return a human-readable summary string.

to_dict() → Dict[str, Any][source]

to_html(filepath: str) → None[source]: Generate a rich HTML report and write to filepath.

to_json(indent: int = 2) → str[source]: Serialize the full result to a JSON string.

to_json_file(filepath: str) → None[source]: Write the validation result to a JSON file.

property total_expectations: int

class validatex.Validator(suite: ExpectationSuite, engine: str = 'pandas')[source]

Bases: object

Runs an ExpectationSuite against a dataset.

Parameters:

suite (ExpectationSuite) – The suite of expectations to evaluate.
engine (str) – "pandas" or "spark".

run(data: Any, data_source: str | None = None) → ValidationResult[source]

Execute every expectation in the suite against data.

Parameters:

data (pd.DataFrame | pyspark.sql.DataFrame) – The dataset to validate.
data_source (str, optional) – A label describing where the data came from.

Return type:

ValidationResult
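
The shape of such a run can be sketched as a loop that evaluates each check, times the whole pass, and records exceptions instead of raising them. This is an illustration, not the library's implementation: the checks are plain callables, the result is a dictionary, and the choice to catch exceptions is an assumption suggested by the exception_info and errored_expectations fields.

```python
import time
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional, Tuple

Check = Tuple[str, Callable[[Any], bool]]


def run_checks(
    checks: List[Check], data: Any, data_source: Optional[str] = None
) -> Dict[str, Any]:
    """Run every check, recording failures and exceptions without stopping."""
    run_time = datetime.now(timezone.utc)
    started = time.perf_counter()
    results = []
    for name, check in checks:
        try:
            outcome = {
                "expectation_type": name,
                "success": bool(check(data)),
                "exception_info": None,
            }
        except Exception as exc:  # an errored check is reported, not raised
            outcome = {
                "expectation_type": name,
                "success": False,
                "exception_info": repr(exc),
            }
        results.append(outcome)
    return {
        "results": results,
        "run_time": run_time,
        "run_duration_seconds": time.perf_counter() - started,
        "data_source": data_source,
    }


report = run_checks(
    [("has_rows", lambda d: len(d["a"]) > 0), ("broken", lambda d: d["missing"])],
    {"a": [1, 2, 3]},
    data_source="example",
)
```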

validatex.validate(data: Any, suite: ExpectationSuite, engine: str = 'pandas', data_source: str | None = None) → ValidationResult[source]

Convenience function to validate data against a suite.

Parameters:

data (pd.DataFrame | pyspark.sql.DataFrame)
suite (ExpectationSuite)
engine (str) – "pandas" or "spark".
data_source (str, optional)

Return type:

ValidationResult