Class initialiser #7
Merged
Changes from all commits (23 commits)
25fe6ec  clean-up install requirements (Jday7879)
abdc200  updating requirements and testing skipping tests (Jday7879)
740e3ff  remove old test, adding int test for pyspark validator (Jday7879)
00bfe4e  adding pytest to workflows for testing (Jday7879)
350c302  temp adding runing workflows on PR (Jday7879)
939d79f  updating logic for selecting pandera pyspark (Jday7879)
b8f999e  update python versions (Jday7879)
5059e4c  dont run pyspark tests on windows (Jday7879)
6d71576  restructure workflows (Jday7879)
5baf1cf  restructure pyspark test to avoid import errors (Jday7879)
a4c08e9  remove setup method (Jday7879)
99ca73a  expanding test coverage (Jday7879)
c271a40  fix failing unit test (Jday7879)
8ff8433  new test to check pyspark log (Jday7879)
c6f1c8d  update to test failing check (Jday7879)
87ceb61  update to handle pyspark return validation (Jday7879)
cb32648  Update schema to convert types to pyspark (Jday7879)
69fa594  replace error message for pyspark validation fails (Jday7879)
cfbe6da  update tests (Jday7879)
cb02cf2  handle no pyspark validator errors (Jday7879)
74b815e  remove old code (Jday7879)
297885b  update unit test to expected fail due to bug (Jday7879)
e49c997  remove print and add support for dates. (Jday7879)
@@ -1,4 +1,6 @@
import pandas as pd
import re

import pyspark.sql.types as T
from pyspark.sql import functions as F

from datachecker.data_checkers.general_validator import Validator
@@ -14,8 +16,53 @@ def __init__(
        hard_check: bool = True,
        custom_checks: dict = None,
    ):
        # raise NotImplementedError("PySpark support is not implemented yet")
        super().__init__(schema, data, file, format, hard_check, custom_checks)
        self._convert_schema_dtypes()

    def validate(self):
        super().validate()
        self._convert_pyspark_error_messages()

    def _convert_pyspark_error_messages(self):
        message = "Pyspark does not return cases or index"
        for i in range(1, len(self.log) - 1):
            entry = self.log[i]
            if (
                entry["failing_ids"] is None
                or len(entry["failing_ids"]) == 0
                or not isinstance(entry["failing_ids"][0], str)
            ):
                continue
            # "<Schema Column ...>" is the message when a check fails for pyspark;
            # replace it with a blanket statement. Other important errors should
            # still be passed back to the user if they are not related to
            # pyspark checks.
            elif re.search(r"<Schema Column", entry["failing_ids"][0]) is not None:
                entry["failing_ids"][0] = message
            else:
                continue

    def _convert_schema_dtypes(self):
        mapping_dtypes = {
            "int": T.IntegerType(),
            "float": T.FloatType(),
            "string": T.StringType(),
            "str": T.StringType(),
            "bool": T.BooleanType(),
Collaborator: Anything for timestamps?
Author (Jday7879): Nice spot, will add now.
            "date": T.DateType(),
            "datetime": T.DateType(),
            "timestamp": T.TimestampType(),
        }
        for col in self.schema.get("columns", {}):
            input_type = self.schema["columns"][col].get("type")
            if input_type not in mapping_dtypes and input_type not in mapping_dtypes.values():
                raise ValueError(
                    f"Unsupported data type '{input_type}' for column '{col}' in schema. "
                    f"Supported types are: {list(mapping_dtypes.keys())}"
                )
            self.schema["columns"][col]["type"] = mapping_dtypes.get(
                self.schema["columns"][col]["type"], self.schema["columns"][col]["type"]
            )

    def _check_duplicates(self):
        # Check for duplicate rows in the dataframe
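The error-message rewrite in `_convert_pyspark_error_messages` can be exercised on its own. The sketch below is a standalone, pyspark-free version of that loop; the shape of the `log` entries (dicts with a `failing_ids` list) is inferred from the diff, not copied verbatim from the project.

```python
import re

# Blanket message from the diff above, substituted when pyspark's raw
# "<Schema Column ...>" failure text appears in a log entry.
MESSAGE = "Pyspark does not return cases or index"


def convert_pyspark_error_messages(log):
    # The PR's loop runs range(1, len(log) - 1), i.e. it skips the first
    # and last log entries; log[1:-1] does the same.
    for entry in log[1:-1]:
        ids = entry.get("failing_ids")
        if not ids or not isinstance(ids[0], str):
            continue
        if re.search(r"<Schema Column", ids[0]):
            ids[0] = MESSAGE


log = [
    {"failing_ids": None},  # first entry, skipped
    {"failing_ids": ["<Schema Column(name=id)> failed validation"]},
    {"failing_ids": ["missing column: value"]},  # unrelated error, kept as-is
    {"failing_ids": None},  # last entry, skipped
]
convert_pyspark_error_messages(log)
print(log[1]["failing_ids"][0])  # Pyspark does not return cases or index
print(log[2]["failing_ids"][0])  # missing column: value
```

Note that only the pyspark-shaped message is replaced; unrelated errors pass through to the user unchanged, which matches the comment in the diff.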
@@ -70,32 +117,3 @@ def _check_completeness(self):
                outcome=result,
                entry_type="error",
            )

Removed in this change (the old example-usage block):

if __name__ == "__main__":
    # Example usage (pandas)

    data = pd.DataFrame(
        [
            (1, "A"),
            (2, "B"),
            (1, "A"),  # Duplicate row
            (3, "C"),
        ],
        columns=["id", "value"],
    )

    schema = {
        "check_duplicates": True,
        "check_completeness": True,
        "completeness_columns": ["id", "value"],
        "columns": {
            "id": {"type": "integer", "check_duplicates": True},
            "value": {"type": "string", "check_duplicates": True},
        },
    }

    validator = PySparkValidator(schema, data, "datafile.csv", "csv")
    validator.run_checks()
    for entry in validator.qa_report:
        print(entry)
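The removed example fed the validator a frame whose third row repeats the first, so a duplicate check should flag exactly one row. That expectation can be verified without pandas or the validator itself; `rows` and `duplicates` below are illustrative names, not part of the PR.

```python
from collections import Counter

# Same data as the removed __main__ example: (1, "A") appears twice.
rows = [(1, "A"), (2, "B"), (1, "A"), (3, "C")]

# Count each full row; any row seen more than once is a duplicate.
counts = Counter(rows)
duplicates = [row for row, n in counts.items() if n > 1]
print(duplicates)  # [(1, 'A')]
```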
Collaborator: Wondering if this is necessary? Without this function, when supplying "int", "str", etc., the types still get coerced to PySpark types, which I assume pandera is doing for us.

Author (Jday7879): I found that the schema was being formatted correctly, but without converting the strings into pyspark dtypes it would not actually trigger the type checks or the other checks. I can look into it more in the future and will add a backlog ticket to review.
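The discussion above centres on `_convert_schema_dtypes`. One property worth noting is that the `mapping_dtypes.get(type, type)` fallback makes the conversion safe to run on an already-converted schema. The sketch below demonstrates that shape of logic without a pyspark dependency; the `_SparkType` sentinels stand in for the real `T.IntegerType()` instances and are assumptions, not project code.

```python
class _SparkType:
    """Stand-in for a PySpark DataType instance (e.g. T.IntegerType())."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


MAPPING = {"int": _SparkType("IntegerType"), "str": _SparkType("StringType")}


def convert_schema_dtypes(schema):
    for col, spec in schema.get("columns", {}).items():
        input_type = spec.get("type")
        # Reject types that are neither a known string alias nor an
        # already-converted type object.
        if input_type not in MAPPING and input_type not in MAPPING.values():
            raise ValueError(
                f"Unsupported data type '{input_type}' for column '{col}' in schema. "
                f"Supported types are: {list(MAPPING)}"
            )
        # .get() leaves already-converted type objects untouched, so calling
        # this twice on the same schema is harmless.
        spec["type"] = MAPPING.get(input_type, input_type)


schema = {"columns": {"id": {"type": "int"}, "value": {"type": "str"}}}
convert_schema_dtypes(schema)
print(schema["columns"]["id"]["type"])  # IntegerType
convert_schema_dtypes(schema)  # second pass is a no-op, no error raised
```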