You are provided with a simulated dataset containing customer reviews. Note: This dataset also includes personally identifiable information (PII) such as names, email addresses, and possibly other sensitive data. The goal is twofold:
- Data Augmentation: Build a generative AI model that creates synthetic customer reviews. These synthetic reviews will be used to augment the original dataset for training a sentiment analysis classifier.
- Data Privacy & Security: Throughout the process, ensure that all sensitive information is properly anonymized and that data security best practices are in place. You must demonstrate awareness of potential privacy pitfalls in generative modeling (e.g., inadvertent memorization of PII) and describe methods to mitigate these risks.
- Task:
  - Load the provided dataset.
  - Identify and anonymize or pseudonymize all sensitive information. For example, remove or obfuscate names, emails, addresses, etc.
  - Document the steps and reasoning behind your anonymization approach.
- Deliverable:
  - A script/notebook that loads and sanitizes the data.
  - A short write-up (can be part of your README) explaining the chosen anonymization techniques.
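
The anonymization step above could be sketched roughly as follows, using only the Python standard library. The brief does not specify the dataset schema, so the field names (`name`, `email`, `review`, `rating`), the placeholder key, and the helper names are assumptions made for illustration. Regular expressions only catch well-formed emails and phone numbers plus names already present in a structured column; names that appear only in free text would likely need a named-entity recognizer on top of this.

```python
import hashlib
import hmac
import re

# Assumption: in practice the key would come from a secrets manager or an
# environment variable. This literal is a placeholder for illustration only.
PSEUDONYM_KEY = b"replace-with-a-secret-key"

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def pseudonymize(value: str, prefix: str) -> str:
    """Map a value to a stable token with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(PSEUDONYM_KEY, value.lower().encode("utf-8"), hashlib.sha256)
    return f"{prefix}_{digest.hexdigest()[:10]}"


def scrub_text(text: str, known_names: list[str]) -> str:
    """Replace emails, phone numbers, and known names in free text."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    for name in known_names:
        if name:  # skip empty strings so the pattern never matches everywhere
            text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text


def sanitize_record(record: dict) -> dict:
    """Drop direct identifiers and keep a pseudonymous customer ID."""
    name = record.get("name", "")
    email = record.get("email", "")
    # Scrub the full name first, then each part on its own.
    names = [name] + name.split()
    return {
        "customer_id": pseudonymize(email or name, "cust"),
        "review": scrub_text(record.get("review", ""), names),
        "rating": record.get("rating"),
    }
```

A keyed hash keeps the same customer mapped to the same token across records, which preserves per-customer grouping without storing the identifier. Unlike a plain hash, it cannot be reversed by hashing a list of guessed emails unless the key also leaks.
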
- Task:
  - Implement a generative model for text. You may choose an approach such as fine-tuning a pre-trained language model (e.g., GPT-2) or training a Variational Autoencoder (VAE) for text generation.
  - The model should be capable of generating realistic synthetic customer reviews.
  - Incorporate techniques (or provide a discussion) that help prevent the leakage of sensitive information (e.g., differential privacy mechanisms during training, careful filtering of generated content).
- Deliverable:
  - Code implementing the generative model.
  - Example outputs of synthetic reviews.
  - A brief explanation of how your model avoids memorizing or leaking sensitive data.
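
One way to approach the "careful filtering of generated content" mentioned above is a post-generation filter, sketched below under stated assumptions. It rejects a candidate review if it contains an email-like or phone-like pattern, or if it shares any run of `n` consecutive words with a training review, which is a rough proxy for verbatim memorization. The function names and the default `n = 6` are illustrative choices, not requirements of the exercise, and the filter complements training-time defenses such as differential privacy rather than replacing them.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_training_index(training_reviews: list, n: int = 6) -> set:
    """Collect every word n-gram that occurs in the training reviews."""
    index = set()
    for review in training_reviews:
        index |= word_ngrams(review, n)
    return index


def is_safe(candidate: str, training_index: set, n: int = 6) -> bool:
    """Reject text with PII-like patterns or n-gram overlap with training data."""
    if EMAIL_RE.search(candidate) or PHONE_RE.search(candidate):
        return False
    return word_ngrams(candidate, n).isdisjoint(training_index)


def filter_generated(candidates: list, training_reviews: list, n: int = 6) -> list:
    """Keep only the generated reviews that pass both checks."""
    index = build_training_index(training_reviews, n)
    return [c for c in candidates if is_safe(c, index, n)]
```

A smaller `n` catches shorter copied phrases but also rejects more harmless, common wording, so the threshold is a trade-off worth reporting alongside the rejection rate.
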
- Task:
  - Train a baseline sentiment classifier on the original (sanitized) dataset.
  - Augment the training data with the synthetic reviews generated by your model.
  - Retrain the classifier on the augmented dataset.
  - Evaluate and compare the performance (accuracy, F1-score, etc.) of the classifier before and after augmentation.
- Deliverable:
  - Code for training and evaluating the sentiment classifier.
  - A report (or detailed README section) containing evaluation metrics and a discussion on the impact of data augmentation.
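
The before-and-after comparison above can be sketched without committing to a particular classifier. The helpers below compute accuracy and macro-averaged F1 in plain Python and report the change between two sets of predictions. They assume both classifiers are scored on the same held-out set of real (non-synthetic) reviews, so the comparison measures the effect of augmentation rather than a change in test data. The function names are hypothetical; a library such as scikit-learn provides equivalent metrics.

```python
def f1_for_label(y_true: list, y_pred: list, label) -> float:
    """F1 score for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(y_true: list, y_pred: list) -> dict:
    """Accuracy and macro-averaged F1 over the labels present in y_true."""
    labels = sorted(set(y_true))
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    macro_f1 = sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)
    return {"accuracy": accuracy, "macro_f1": macro_f1}


def compare(y_true: list, baseline_pred: list, augmented_pred: list) -> dict:
    """Side-by-side metrics for the baseline and augmented classifiers."""
    base = evaluate(y_true, baseline_pred)
    aug = evaluate(y_true, augmented_pred)
    return {
        metric: {
            "baseline": base[metric],
            "augmented": aug[metric],
            "delta": aug[metric] - base[metric],
        }
        for metric in base
    }
```

Macro-averaging weights each sentiment class equally, which matters if the dataset is imbalanced. A negative `delta` is a legitimate result and worth discussing in the report.
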
- Task:
  - Describe and implement (where applicable) measures to secure data during storage, processing, and transmission.
  - This may include encryption at rest/in transit, access controls, logging, and the application of differential privacy in model training.
  - Explain how these measures ensure that data security and privacy requirements are met.
- Deliverable:
  - Code snippets or configurations that illustrate your security measures (e.g., use of libraries for encryption or secure storage).
  - A document section that details your security and privacy strategy, including any trade-offs or challenges encountered.
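
As one small illustration of the access-control and logging measures listed above, the sketch below writes the sanitized data to a file that only the owning user can read, records an HMAC-SHA256 tag so later tampering can be detected, and logs each access. It does not encrypt anything: encryption at rest would need a third-party library or platform feature, which is outside this standard-library sketch. The function names and logger name are hypothetical, and the `0o600` mode is only enforced on POSIX systems.

```python
import hashlib
import hmac
import logging
import os

logger = logging.getLogger("data_access")  # hypothetical audit logger name


def write_private(path: str, data: bytes, key: bytes) -> str:
    """Write data with owner-only permissions and return its integrity tag."""
    # The mode applies when the file is created; it is largely ignored on Windows.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    tag = hmac.new(key, data, hashlib.sha256).hexdigest()
    logger.info("wrote %d bytes to %s", len(data), path)
    return tag


def read_verified(path: str, key: bytes, expected_tag: str) -> bytes:
    """Read the file and raise ValueError if its contents were modified."""
    with open(path, "rb") as handle:
        data = handle.read()
    tag = hmac.new(key, data, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time to avoid leaking the tag via timing.
    if not hmac.compare_digest(tag, expected_tag):
        logger.warning("integrity check failed for %s", path)
        raise ValueError("integrity check failed")
    logger.info("read %d bytes from %s", len(data), path)
    return data
```

The trade-off is that the key and the tag must be stored separately from the data file; keeping them alongside it would let anyone who can modify the file also forge a matching tag.
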
- Task:
  - Organize your code in a clear and maintainable way (e.g., modular code, proper function definitions, and comments).
  - Provide a comprehensive README that explains:
    - How to set up and run your code.
    - Your overall approach and design decisions.
    - Any challenges you encountered and how you addressed them.
  - Ensure that your documentation is clear, concise, and professional.
- Deliverable:
  - A well-organized repository (e.g., GitHub) containing your complete codebase.
  - A README file and any additional documentation or reports (max 2 pages for the summary report).
- Code Repository: Provide access to a public (or shared private) repository containing:
  - All source code and notebooks/scripts.
  - A README with instructions for setup and execution.
  - Documentation of your design decisions, data privacy, and security measures.
- Written Report: A short report (no longer than 2 pages) summarizing:
  - Your approach to data anonymization.
  - Details of your generative model implementation and security considerations.
  - Performance comparison of the sentiment classifier with and without augmentation.
  - Key lessons learned or challenges faced.
| Criteria | Points |
|---|---|
| Data Ingestion & Anonymization | 20 |
| - Correct loading and sanitizing of data | |
| - Clarity and justification of anonymization | |
| Generative Model Implementation | 20 |
| - Correct and effective model implementation | |
| - Measures to prevent leakage of sensitive info | |
| Sentiment Classifier & Augmentation | 20 |
| - Baseline and augmented model training | |
| - Clear evaluation and performance improvement | |
| Data Security & Privacy Controls | 20 |
| - Implementation and explanation of security measures | |
| Code Quality, Documentation & Presentation | 20 |
| - Code clarity, structure, and modularity | |
| - Comprehensive README and clear presentation | |
| Total | 100 |
Your submission will be evaluated based on:
- Technical correctness and completeness: Did you implement all required parts of the exercise?
- Innovation and understanding: How effectively did you apply generative AI techniques and ensure data privacy/security?
- Practical performance: Does the augmented dataset improve the sentiment classifier, and are the improvements well-documented?
- Clarity and professionalism: Is your code well-documented, and does your write-up clearly explain your design decisions?
Good luck, and we look forward to seeing how you tackle this challenge!