GitHub - Celaton-Ltd/ai-test: This is a take-home test to evaluate a candidate's AI/ML skills

Technical Exercise: Generative Synthetic Data Augmentation with Privacy & Security

Background & Scenario

You are provided with a simulated dataset containing customer reviews. Note: This dataset also includes personally identifiable information (PII) such as names, email addresses, and possibly other sensitive data. The goal is twofold:

Data Augmentation:
Build a generative AI model that creates synthetic customer reviews. These synthetic reviews will be used to augment the original dataset for training a sentiment analysis classifier.
Data Privacy & Security:
Throughout the process, ensure that all sensitive information is properly anonymized and that data security best practices are in place. You must demonstrate awareness of potential privacy pitfalls in generative modeling (e.g., inadvertent memorization of PII) and describe methods to mitigate these risks.

Exercise Tasks

1. Data Ingestion and Anonymization (20 points)

Task:
- Load the provided dataset.
- Identify and anonymize or pseudonymize all sensitive information. For example, remove or obfuscate names, emails, addresses, etc.
- Document the steps and reasoning behind your anonymization approach.
Deliverable:
- A script/notebook that loads and sanitizes the data.
- A short write-up (can be part of your README) explaining the chosen anonymization techniques.
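As an illustration of the pseudonymization step, here is a minimal sketch using only the Python standard library. The column names (`name`, `email`, `review`), the `SECRET_KEY` value, and the helper names are assumptions made for this example; they are not taken from the provided dataset. The regular expressions only catch emails and phone-like numbers, so names or addresses that appear inside the free-text review would need an additional approach (for example, a named-entity recognizer).

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration; in practice load it from a secrets store, not source code.
SECRET_KEY = b"replace-with-a-secret-key"

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def pseudonymize(value: str, prefix: str) -> str:
    """Map a sensitive value to a stable keyed token, e.g. 'EMAIL_' plus 8 hex characters."""
    digest = hmac.new(SECRET_KEY, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:8]}"


def scrub_text(text: str) -> str:
    """Replace emails and phone-like numbers found in free text."""
    # Emails first, so digits inside an address are not mistaken for a phone number.
    text = EMAIL_RE.sub(lambda m: pseudonymize(m.group(0), "EMAIL"), text)
    return PHONE_RE.sub("[PHONE]", text)


def sanitize_row(row: dict) -> dict:
    """Sanitize one record. The column names are assumed, not read from the dataset."""
    return {
        # The name column is dropped; the email becomes a stable pseudonymous ID.
        "customer_id": pseudonymize(row["email"], "CUST"),
        "review": scrub_text(row["review"]),
    }
```

A keyed hash (HMAC) is used rather than a plain hash so that tokens cannot be reversed by hashing a list of known email addresses without the key. The same input always maps to the same token, which keeps repeat customers linkable after sanitization.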

2. Generative Model Implementation (20 points)

Task:
- Implement a generative model for text. You may choose an approach such as fine-tuning a pre-trained language model (e.g., GPT-2) or training a Variational Autoencoder (VAE) for text generation.
- The model should be capable of generating realistic synthetic customer reviews.
- Incorporate techniques (or provide a discussion) that help prevent the leakage of sensitive information (e.g., differential privacy mechanisms during training, careful filtering of generated content).
Deliverable:
- Code implementing the generative model.
- Example outputs of synthetic reviews.
- A brief explanation of how your model avoids memorizing or leaking sensitive data.
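One way to approach the "careful filtering of generated content" mentioned above is a post-generation filter that rejects outputs containing PII-like patterns or long verbatim overlaps with the training reviews. The sketch below is an assumption-laden illustration, not a prescribed method: the function names, the 5-gram window, and the 0.5 overlap threshold are all hypothetical choices, and the filter complements rather than replaces training-time protections such as differential privacy.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_training_index(training_reviews, n: int = 5) -> set:
    """Collect every n-gram seen in the (sanitized) training reviews."""
    index = set()
    for review in training_reviews:
        index |= ngrams(review, n)
    return index


def is_safe(candidate: str, training_index: set, n: int = 5, max_overlap: float = 0.5) -> bool:
    """Reject outputs that contain an email or mostly repeat training n-grams."""
    if EMAIL_RE.search(candidate):
        return False
    candidate_ngrams = ngrams(candidate, n)
    if not candidate_ngrams:
        # Too short to measure overlap; only the pattern check applies.
        return True
    overlap = len(candidate_ngrams & training_index) / len(candidate_ngrams)
    return overlap <= max_overlap
```

Generated reviews would be passed through `is_safe` before being added to the augmented training set. A high rejection rate is itself a useful signal that the model may be memorizing its training data.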

3. Sentiment Classifier and Data Augmentation (20 points)

Task:
- Train a baseline sentiment classifier on the original (sanitized) dataset.
- Augment the training data with the synthetic reviews generated by your model.
- Retrain the classifier on the augmented dataset.
- Evaluate and compare the performance (accuracy, F1-score, etc.) of the classifier before and after augmentation.
Deliverable:
- Code for training and evaluating the sentiment classifier.
- A report (or detailed README section) containing evaluation metrics and a discussion on the impact of data augmentation.
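The before/after comparison can be sketched as follows, assuming binary sentiment labels and that both classifiers are scored on the same held-out set of real (sanitized) reviews, with synthetic reviews used for training only. The function names and the `"positive"` label are assumptions for this example; any classifier that yields a list of predicted labels would fit.

```python
def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="positive") -> float:
    """Binary F1 for the given positive label; 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compare(y_true, baseline_pred, augmented_pred) -> dict:
    """Score both models on the same test labels and report the difference."""
    report = {}
    for name, pred in (("baseline", baseline_pred), ("augmented", augmented_pred)):
        report[name] = {"accuracy": accuracy(y_true, pred), "f1": f1_score(y_true, pred)}
    report["delta"] = {
        metric: report["augmented"][metric] - report["baseline"][metric]
        for metric in ("accuracy", "f1")
    }
    return report
```

Keeping the test set free of synthetic reviews matters: if generated text leaks into evaluation, the measured improvement may reflect how well the classifier fits the generator rather than real customer reviews.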

4. Data Security & Privacy Controls (20 points)

Task:
- Describe and implement (where applicable) measures to secure data during storage, processing, and transmission.
- This may include encryption at rest/in transit, access controls, logging, and the application of differential privacy in model training.
- Explain how these measures ensure that data security and privacy requirements are met.
Deliverable:
- Code snippets or configurations that illustrate your security measures (e.g., use of libraries for encryption or secure storage).
- A document section that details your security and privacy strategy, including any trade-offs or challenges encountered.
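As a small illustration of storage-side controls, the sketch below writes a sanitized file with owner-only permissions, logs the write, and verifies a SHA-256 checksum to detect tampering. It uses only the standard library and assumes a POSIX file system; the function names and logger name are hypothetical. Note that this demonstrates access restriction, logging, and integrity checking only. It is not encryption: encryption at rest would typically rely on a vetted cryptography library or on disk or volume-level encryption.

```python
import hashlib
import hmac
import logging
import os

logger = logging.getLogger("data_access")


def write_restricted(path: str, data: bytes) -> str:
    """Write data to a file created with mode 0o600 and return its SHA-256 checksum."""
    # Creating the file with 0o600 avoids a window in which it is readable by others.
    # If the file already exists, its current permissions are left unchanged.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    checksum = hashlib.sha256(data).hexdigest()
    # Log metadata only; never log the data itself.
    logger.info("wrote %s bytes=%d sha256=%s", os.path.basename(path), len(data), checksum)
    return checksum


def verify_checksum(path: str, expected: str) -> bool:
    """Return True if the file's SHA-256 checksum matches the expected value."""
    with open(path, "rb") as handle:
        actual = hashlib.sha256(handle.read()).hexdigest()
    return hmac.compare_digest(actual, expected)
```

The checksum returned by `write_restricted` would be stored separately from the file and checked with `verify_checksum` before the data is loaded for training.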

5. Code Quality, Documentation & Presentation (20 points)

Task:
- Organize your code in a clear and maintainable way (e.g., modular code, proper function definitions, and comments).
- Provide a comprehensive README that explains:
  - How to set up and run your code.
  - Your overall approach and design decisions.
  - Any challenges you encountered and how you addressed them.
- Ensure that your documentation is clear, concise, and professional.
Deliverable:
- A well-organized repository (e.g., GitHub) containing your complete codebase.
- A README file and any additional documentation or reports (max 2 pages for the summary report).

Submission Requirements

Code Repository: Provide access to a public (or shared private) repository containing:
- All source code and notebooks/scripts.
- A README with instructions for setup and execution.
- Documentation of your design decisions, data privacy, and security measures.
Written Report: A short report (no longer than 2 pages) summarizing:
- Your approach to data anonymization.
- Details of your generative model implementation and security considerations.
- Performance comparison of the sentiment classifier with and without augmentation.
- Key lessons learned or challenges faced.

Scoring Rubric (Total 100 Points)

Criteria	Points
Data Ingestion & Anonymization	20
- Correct loading and sanitizing of data
- Clarity and justification of anonymization
Generative Model Implementation	20
- Correct and effective model implementation
- Measures to prevent leakage of sensitive info
Sentiment Classifier & Augmentation	20
- Baseline and augmented model training
- Clear evaluation and performance improvement
Data Security & Privacy Controls	20
- Implementation and explanation of security measures
Code Quality, Documentation & Presentation	20
- Code clarity, structure, and modularity
- Comprehensive README and clear presentation
Total	100

Evaluation

Your submission will be evaluated based on:

Technical correctness and completeness: Did you implement all required parts of the exercise?
Innovation and understanding: How effectively did you apply generative AI techniques and ensure data privacy/security?
Practical performance: Does the augmented dataset improve the sentiment classifier, and are the improvements well-documented?
Clarity and professionalism: Is your code well-documented, and does your write-up clearly explain your design decisions?

Good luck, and we look forward to seeing how you tackle this challenge!

Repository files

- README.md
- simulated_customer_reviews.csv
