diff --git a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb index 873f5985..49028837 100644 --- a/02_activities/assignments/a1_sampling_and_reproducibility.ipynb +++ b/02_activities/assignments/a1_sampling_and_reproducibility.ipynb @@ -2,62 +2,69 @@ "cells": [ { "cell_type": "markdown", - "id": "ed39f379", "metadata": {}, "source": [ "# Assignment 1: Sampling and Reproducibility\n", "\n", "The code at the end of this file explores contact tracing data about an outbreak of the flu, and demonstrates the dangers of incomplete and non-random samples. This assignment is modified from [Contact tracing can give a biased sample of COVID-19 cases](https://andrewwhitby.com/2020/11/24/contact-tracing-biased/) by Andrew Whitby.\n", "\n", - "Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved. \n" + "1. Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved.\n", + "\n", + "2. Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results.\n", + "\n", + "3. Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The script needs to produce the same output when run multiple times." 
] }, { "cell_type": "markdown", - "id": "4ea73db3", - "metadata": {}, - "source": [] - }, - { - "cell_type": "markdown", - "id": "3d9b2ccc", "metadata": {}, "source": [ - "Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results." + "## Identifying Sampling Stages\n", + "\n", + "## There are 4 sampling stages in this simulation.\n", + "\n", + "## Stage 1: Defining the sampling frame: The population is constructed as a fixed list of 1,000 individuals: 200 wedding attendees and 800 brunch attendees.\n", + "\n", + "## Stage 2: Infection sampling: Using `np.random.choice()`, a simple random sample of 100 individuals is drawn without replacement from the 1,000-person index (`size = int(1000 × ATTACK_RATE) = 100`). Each person has equal probability of selection.\n", + "\n", + "## Stage 3: Primary contact tracing: Each of the 100 infected individuals is independently assigned a traced status using `np.random.rand()`, with each person traced with probability `TRACE_SUCCESS = 0.20`. This is Bernoulli sampling applied independently to each infected individual, yielding an expected ~20 traced cases. The number traced follows a binomial distribution, Binomial(100, 0.20).\n", + "\n", + "## Stage 4: Secondary contact tracing: Any event for which at least `SECONDARY_TRACE_THRESHOLD = 2` attendees were traced in Stage 3 triggers tracing of ALL infected attendees at that event. This step introduces systematic bias: larger events (weddings, n = 200) are more likely than smaller ones to exceed the threshold by chance, so wedding-linked infections are over-represented in the final traced sample relative to their true share of infections." 
] }, { "cell_type": "markdown", - "id": "4cf5d993", - "metadata": {}, - "source": [] - }, - { - "cell_type": "markdown", - "id": "32603ce7", "metadata": {}, "source": [ - "Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The script needs to produce the same output when run multiple times." + "## Reproducibility at 10 and 100 Repetitions\n", + "\n", + "## When the number of repetitions is reduced to 10, the output histograms vary substantially across runs. The distributions are sparse and unstable: their shape, centre, and spread change meaningfully every time the script is executed. No consistent pattern is discernible.\n", + "\n", + "## With 100 repetitions, the histograms are closer in shape, but still differ between runs. The relative positions and heights of distributions shift across iterations.\n", + "\n", + "## In all cases (10, 100, or 1,000 repetitions), re-running the script produces a different graph each time. This is because NumPy's pseudo-random number generator (`np.random`) is not given a fixed seed, so it starts from a different, unpredictable state on each execution and produces a different sequence of random numbers. The results are therefore not reproducible." ] }, { "cell_type": "markdown", - "id": "77613cc3", - "metadata": {}, - "source": [] - }, - { - "cell_type": "markdown", - "id": "30b4a74f", "metadata": {}, "source": [ - "## Code" + "## Making the Script Reproducible\n", + "\n", + "## A single line was added near the top of the script, after the imports and before the model constants:\n", + "\n", + "## ```python\n", + "## np.random.seed(42)\n", + "## ```\n", + "\n", + "## This sets NumPy's pseudo-random number generator to a fixed, deterministic initial state. All subsequent calls to `np.random.choice()` and `np.random.rand()` within `simulate_event()` draw from a sequence that is entirely determined by this seed value. 
As a result, the script produces identical output (the same histogram with the same bar heights) every time it is run in the same environment.\n", + "\n", + "## The value `42` is arbitrary; any fixed integer produces the same reproducibility guarantee. Without this line, the random number generator is initialised from an unpredictable state, making results irreproducible across runs. With the random seed set to 42, the simulation is fully reproducible." ] }, { "cell_type": "code", "execution_count": null, - "id": "ab8587a0", "metadata": {}, "outputs": [], "source": [ @@ -75,6 +82,11 @@ "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", + "# Set random seed for reproducibility\n", + "# This ensures that np.random.choice() and np.random.rand() produce the same\n", + "# sequence of random numbers on every run, making results fully reproducible.\n", + "np.random.seed(42)\n", + "\n", "# Constants representing the parameters of the model\n", "ATTACK_RATE = 0.10\n", "TRACE_SUCCESS = 0.20\n", @@ -107,13 +119,17 @@ " ppl['traced'] = ppl['traced'].astype(pd.BooleanDtype())\n", "\n", " # Infect a random subset of people\n", + " # Sampling stage 2: simple random sample without replacement, size = 100\n", + " # Sampling frame: all 1000 individuals in ppl.index\n", " infected_indices = np.random.choice(ppl.index, size=int(len(ppl) * ATTACK_RATE), replace=False)\n", " ppl.loc[infected_indices, 'infected'] = True\n", "\n", " # Primary contact tracing: randomly decide which infected people get traced\n", + " # Sampling stage 3: Bernoulli sampling, each infected person traced with p = 0.20\n", " ppl.loc[ppl['infected'], 'traced'] = np.random.rand(sum(ppl['infected'])) < TRACE_SUCCESS\n", "\n", " # Secondary contact tracing based on event attendance\n", + " # Sampling stage 4: deterministic: all infected attendees at events with >= 2 traced cases are traced\n", " event_trace_counts = ppl[ppl['traced'] == True]['event'].value_counts()\n", " events_traced = 
event_trace_counts[event_trace_counts >= SECONDARY_TRACE_THRESHOLD].index\n", " ppl.loc[ppl['event'].isin(events_traced) & ppl['infected'], 'traced'] = True\n", @@ -145,50 +161,6 @@ "plt.tight_layout()\n", "plt.show()" ] - }, - { - "cell_type": "markdown", - "id": "f418c720", - "metadata": {}, - "source": [ - "## Criteria" - ] - }, - { - "cell_type": "markdown", - "id": "c0b3f93f", - "metadata": {}, - "source": [ - "|Criteria|Complete|Incomplete|\n", - "|--------|----|----|\n", - "|Alteration of the code|The code changes made, made it reproducible.|The code is still not reproducible.|\n", - "|Description of changes|The author answered questions and explained the reasonings for the changes made well.|The author did not answer questions or explain the reasonings for the changes made well.|" - ] - }, - { - "cell_type": "markdown", - "id": "83cec589", - "metadata": {}, - "source": [ - "## Submission Information\n", - "🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.\n", - "\n", - "### Submission Parameters:\n", - "* Submission Due Date: `23:59 - 02 February 2026`\n", - "* The branch name for your repo should be: `assignment-1`\n", - "* What to submit for this assignment:\n", - " * This markdown file (`a1_sampling_and_reproducibility.ipynb`) should be populated with the code changed.\n", - "* What the pull request link should look like for this assignment: `https://github.com//sampling/pull/`\n", - " * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. 
This helps the technical facilitator and learning support staff review your submission easily.\n", - "\n", - "#### Checklist:\n", - "- [ ] Create a branch called `assignment-1`.\n", - "- [ ] Ensure that the repository is public.\n", - "- [ ] Review [the PR description guidelines](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md#guidelines-for-pull-request-descriptions) and adhere to them.\n", - "- [ ] Verify that the link is accessible in a private browser window.\n", - "\n", - "If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via the help channel in Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.\n" - ] } ], "metadata": { @@ -198,18 +170,10 @@ "name": "python3" }, "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.0" + "version": "3.10.0" } }, "nbformat": 4, - "nbformat_minor": 5 + "nbformat_minor": 4 } diff --git a/02_activities/assignments/a2_survey_design_and_evaluation.ipynb b/02_activities/assignments/a2_survey_design_and_evaluation.ipynb new file mode 100644 index 00000000..00da38f5 --- /dev/null +++ b/02_activities/assignments/a2_survey_design_and_evaluation.ipynb @@ -0,0 +1,229 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Assignment: Questionnaire Design and Sample Evaluation\n", + "\n", + "## Requirements\n", + "\n", + "The goal of this assignment is to practice developing and evaluating sampling materials.\n", + "\n", + "### Part A - Survey Design:\n", + "\n", + "Select one of the scenarios below and design a survey to meet the need(s) outlined in the prompt.\n", + "\n", + "1. In two to three sentences, describe the purpose of your survey\n", + "2. 
Describe your target population, sampling frame, sampling units, and overall sampling strategy.\n", + "3. Write a 5-10 question survey to address your chosen scenario below.\n", + "\n", + "##### Scenarios\n", + "1. You work in the Human Resources Department at a large tech company. Over the past few months, the company has been experiencing a high turnover rate across many of its departments, specifically within the entry- and lower-level positions. The company wishes to understand why this turnover is happening, and what changes need to occur to improve employee satisfaction.\n", + "2. You work for a Canadian national political party during a federal election. Throughout the campaign period, your party has seen relatively high approval ratings, but an opposing party is also polling favorably and may still have a chance to win the election. You are one month away from the election and you want to understand what voters want from your party and its leader in order to maintain your lead and eventually win the election.\n", + "3. You are a student researcher in the sociology department at the University of Toronto. You are working on a research project that concerns the relationship between music taste and age. This involves both comparisons between different people of different ages and comparisons of the same individual at different ages during their lifetime. You wish to understand to what extent age influences music taste, specifically as it relates to perceptions of popular music. Your results will be written into an academic paper that you hope to publish.\n", + "\n", + "### Part B - Survey Evaluation:\n", + "\n", + "For the Canadian General Social Survey on Giving, Volunteering, and Participating, 2018 (cycle 33), conducted by Statistics Canada, find any and all available documentation for the data gathered and identify and describe the survey features indicated below.\n", + "\n", + "1. Sample type\n", + "2. Sample size\n", + "3. Target population\n", + "4. 
Sampling frame\n", + "5. Survey mode(s)\n", + "6. Timeline\n", + "7. Response rate\n", + "8. Weights\n", + "9. Data processing\n", + "10. Cleaning, imputation, etc.\n", + "11. Sources of error\n", + "12. Limitations, known biases, etc.\n", + "13. Link to documentation and any additional sources used" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "# Your Changes\n", + "\n", + "## Part A - Survey Design\n", + "\n", + "Chosen scenario: 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Purpose of the survey\n", + "\n", + "## This survey aims to examine the relationship between age and music taste, specifically how perceptions of popular music change across different stages of life. The survey is designed to support both cross-sectional comparisons between people of different ages and retrospective comparisons of the same individual's tastes over time. Findings will be written up for academic publication in a sociological journal." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Target population, sampling frame, sampling units, and observational units\n", + "\n", + "## Target population: All adults aged 18 and older residing in Canada, spanning a broad age range to support meaningful comparisons across life stages (e.g., 18–25, 26–40, 41–60, 61+).\n", + "\n", + "## Sampling frame: A list of University of Toronto students, staff, and alumni supplemented by members of the general public recruited through community postings, social media, and Prolific Academic, an online research participant platform that allows demographic filtering by age. Because a single university list would over-represent young, highly educated adults, the supplementary recruitment is essential to achieve age diversity.\n", + "\n", + "## Sampling units: Individual adults. One respondent = one sampling unit.\n", + "\n", + "## Observational units: Individual adults. 
Each respondent is also the unit of observation, since we are measuring their personal music preferences and retrospective self-reports.\n", + "\n", + "## Sampling strategy: Stratified random sampling, with age as the stratifying variable. The population is divided into four strata (18–25, 26–40, 41–60, 61+) and participants are recruited until a minimum quota per stratum is achieved. This ensures each age group is adequately represented for meaningful comparison; simple random sampling alone risks under-representing older adults, who are less likely to self-select into online studies. Within each stratum, participants are selected randomly from the available pool." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Survey questions\n", + "\n", + "## 1. How old are you?\n", + "## - [ ] 18–25\n", + "## - [ ] 26–40\n", + "## - [ ] 41–60\n", + "## - [ ] 61 or older\n", + "\n", + "## 2. How often do you listen to music that is currently in the popular charts (e.g., top 40, trending on streaming platforms)?\n", + "## - [ ] Daily\n", + "## - [ ] Several times a week\n", + "## - [ ] Once a week\n", + "## - [ ] Rarely\n", + "## - [ ] Never\n", + "\n", + "## 3. How would you describe your general attitude toward popular music today?\n", + "## - [ ] Very positive: I enjoy most of it\n", + "## - [ ] Somewhat positive: I like some of it\n", + "## - [ ] Neutral: I have no strong feelings either way\n", + "## - [ ] Somewhat negative: I find most of it unappealing\n", + "## - [ ] Very negative: I strongly dislike most of it\n", + "\n", + "## 4. Compared to the popular music of your teenage years (roughly ages 13–19), how do you rate today's popular music?\n", + "## - [ ] Much better\n", + "## - [ ] Somewhat better\n", + "## - [ ] About the same\n", + "## - [ ] Somewhat worse\n", + "## - [ ] Much worse\n", + "## - [ ] I'm not sure / can't compare\n", + "\n", + "## 5. 
At what age do you feel your music taste became relatively stable, that is, when did you stop discovering new genres or artists as frequently as before?\n", + "## - [ ] Before age 20\n", + "## - [ ] 20–29\n", + "## - [ ] 30–39\n", + "## - [ ] 40 or older\n", + "## - [ ] My taste is still actively changing\n", + "## - [ ] I'm not sure\n", + "\n", + "## 6. How do you usually discover new music? (Select all that apply)\n", + "## - [ ] Streaming platform recommendations (e.g., Spotify, Apple Music)\n", + "## - [ ] Social media (e.g., TikTok, Instagram, YouTube)\n", + "## - [ ] Radio\n", + "## - [ ] Recommendations from friends or family\n", + "## - [ ] I rarely seek out new music\n", + "\n", + "## 7. Do you feel that your appreciation for music you loved in your youth has increased, decreased, or stayed the same over time?\n", + "## - [ ] Increased significantly\n", + "## - [ ] Increased somewhat\n", + "## - [ ] Stayed about the same\n", + "## - [ ] Decreased somewhat\n", + "## - [ ] Decreased significantly\n", + "\n", + "## 8. To what extent do you agree with the following statement: \"Popular music made after I turned 30 is harder for me to connect with emotionally.\"\n", + "## - [ ] Strongly agree\n", + "## - [ ] Agree\n", + "## - [ ] Neither agree nor disagree\n", + "## - [ ] Disagree\n", + "## - [ ] Strongly disagree\n", + "\n", + "## 9. What is the highest level of education you have completed?\n", + "## - [ ] High school or less\n", + "## - [ ] Some college or university\n", + "## - [ ] Bachelor's degree\n", + "## - [ ] Graduate or professional degree\n", + "\n", + "## 10. 
How would you describe your gender?\n", + "## - [ ] Woman\n", + "## - [ ] Man\n", + "## - [ ] Non-binary / gender diverse\n", + "## - [ ] Prefer to self-describe: ________\n", + "## - [ ] Prefer not to say" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Part B - Survey Evaluation\n", + "\n", + "## Survey: Canadian General Social Survey on Giving, Volunteering, and Participating, 2018 (Cycle 33), Statistics Canada\n", + "## https://www150.statcan.gc.ca/n1/pub/45-25-0001/cat5/c33_2018.zip\n", + "\n", + "## 1. Sample type\n", + "## Stratified probability sample, cross-sectional design. Two-stage: telephone number groups sampled first, then one eligible individual randomly selected per household. \"Rejective sampling\" was used to oversample volunteers; non-volunteers were sub-sampled into long- and short-interview groups.\n", + "\n", + "## 2. Sample size\n", + "## Field sample of ~50,000 units; ~40,000 invitation letters sent; ~24,000 completed questionnaires targeted and achieved.\n", + "\n", + "## 3. Target population\n", + "## All persons aged 15+ living in private households across Canada's ten provinces. Excludes full-time institutional residents.\n", + "\n", + "## 4. Sampling frame\n", + "## A linked frame combining landline and cellular telephone numbers from the Census and administrative sources with Statistics Canada's dwelling frame. Stratified into 27 strata at the province/CMA level (15 individual CMAs, grouped CMA strata for QC, ON, BC, and one non-CMA stratum per province).\n", + "\n", + "## 5. Survey mode(s)\n", + "## Electronic questionnaire (EQ, introduced for the first time in 2018) or computer-assisted telephone interviewing (CATI). Available in English and French. Average completion time ~44 minutes.\n", + "\n", + "## 6. Timeline\n", + "## Collection: September 4 – December 28, 2018. Reference period: 12 months preceding each interview. Public release: January 26, 2021.\n", + "\n", + "## 7. 
Response rate\n", + "## 41.9% overall.\n", + "\n", + "## 8. Weights\n", + "## Person-level weight (WGHT_PER) assigned to each respondent. Adjusted for rejective sampling, non-response (using administrative data on non-responding households), and calibrated to match the 2017 Canadian Income Survey income distribution by province. Bootstrap weights provided for variance estimation.\n", + "\n", + "## 9. Data processing\n", + "## Followed Statistics Canada's SSPE framework. Automated and manual edits applied at multiple stages: family edits (household relational consistency), consistency edits (e.g., age vs. birth date), and flow edits (correct questionnaire pathing). CATI enforced valid response ranges in real time. Data linked to tax records (T1, T1FF, T4) for consenting respondents to improve income data quality.\n", + "\n", + "## 10. Cleaning, imputation, etc.\n", + "## Imputation carried out in nine steps. Nearest-neighbour donor imputation used for most variables; mean imputation applied where donor imputation was not feasible. Key targets: volunteering and donation variables. Income obtained via tax linkage for 81.9% of respondents (personal) and 81.7% of households (family); remainder imputed.\n", + "\n", + "## 11. 
Sources of error\n", + "## - Sampling error: Estimates vary across samples; bootstrap weights provided for variance estimation.\n", + "## - Coverage error: Households without telephones excluded; some over- and under-coverage remains despite linked frame.\n", + "## - Non-response error: Non-response at household and individual levels; partially addressed through weighting adjustments using administrative data.\n", + "## - Response error: Recall inaccuracy and social desirability bias, particularly for self-reported volunteering hours and donation amounts.\n", + "## - Processing error: Minimized through automated and manual edits, but cannot be fully eliminated.\n", + "\n", + "## 12. Limitations, known biases, etc.\n", + "## - 2018 was the first cycle to offer an electronic questionnaire, making direct comparisons with prior cycles unreliable.\n", + "## - Response rate of 41.9% is low; non-response bias remains a concern despite weighting.\n", + "## - Households without telephone access and institutional residents are excluded, potentially under-representing marginalized and older populations.\n", + "## - Voluntary participation may attract respondents with stronger views on giving and volunteering (self-selection bias).\n", + "## - Retrospective self-report of behaviour is subject to recall and social desirability bias." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/02_activities/assignments/a2_survey_design_and_evaluation.md b/02_activities/assignments/a2_survey_design_and_evaluation.md deleted file mode 100644 index b4f036f2..00000000 --- a/02_activities/assignments/a2_survey_design_and_evaluation.md +++ /dev/null @@ -1,101 +0,0 @@ -# Assignment: Questionnaire Design and Sample Evaluation - -## Requirements - -The goal of this assignment is to practice developing and evaluating sampling materials. - -### Part A - Survey Design: - -Select one of the scenarios below and design a survey to meet the need(s) outlined in the prompt. - -1. In two to three sentences, describe the purpose of your survey -2. Describe your target population, sampling frame, sampling units, and overall sampling strategy. -3. Write a 5-10 question survey to address your chosen scenario below. - -##### Scenarios -1. You work in the Human Resources Department at a large tech company. Over the past few months, the company has been experiencing a high turnover rate across many of its departments, specifically within the entry- and lower-level positions. The company wishes to understand why this turnover is happening, and what changes need to occur to improve employee satisfaction. -2. You work for a Canadian national political party during a federal election. Throughout the campaign period, your party has seen relatively high approval ratings, but an opposing party is also polling favorably and may still have a chance to win the election. You are one month away from the election and you want to understand what voters want from your party and its leader in order to maintain your lead and eventually win the election. -3. You are a student researcher in the sociology department at the University of Toronto. 
You are working on a research project that concerns the relationship between music taste and age. This involves both comparisons between different people of different ages and comparisons of the same individual at different ages during their lifetime. You wish to understand to what extent age influences music taste, specifically as it relates to perceptions of popular music. Your results will be written into an academic paper that you hope to publish. - -### Part B - Survey Evaluation: - -For the **Canadian General Social Survey on Giving, Volunteering, and Participating, 2018 (cycle 33)**, conducted by Statistics Canada find any and all available documentation for the data gathered and identify and describe the survey features indicated below. - -1. Sample type -2. Sample size -3. Target population -4. Sampling frame -5. Survey mode(s) -6. Timeline -7. Response rate -8. Weights -9. Data processing -10. Cleaning, imputation, etc -11. Sources of error -12. Limitations, known biases, etc -13. Link to documentation and any additional sources used - - -# Your Changes - -## Part A - Survey Design: - -The number of your chosen topic: `#` - -Describe the purpose of your survey: -``` -write your answer here... -``` - -Describe your target population, sampling frame, sampling units, and observational units: -``` -write your answer here... -``` - -Your 5-10 question survey: -``` -1. write your question here... -2. write your question here... -3. write your question here... -4. write your question here... -5. write your question here... -6. write your question here... (optional) -7. write your question here... (optional) -8. write your question here... (optional) -9. write your question here... (optional) -10. write your question here... 
(optional) -``` - -## Part B - Survey Evaluation: - -Identify and describe survey features: - -``` -write your answer here -``` - -## Rubric - -- All required components are present and complete **Complete / Incomplete** -- Choice of sampling strategy for Part A is justified and related to survey purpose **Complete / Incomplete** -- Information for Part B is complete and correct **Complete / Incomplete** - -## Submission Information - -🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly. - -### Submission Parameters: -* Submission Due Date: `23:59 - 09 February 2026` -* The branch name for your repo should be: `assignment-2` -* What to submit for this assignment: - * This markdown file (a2_survey_design_and_evaluation.md) should be populated and should be the only change in your pull request. -* What the pull request link should look like for this assignment: `https://github.com//sampling/pull/` - * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily. - -Checklist: -- [ ] Create a branch called `assignment-2`. -- [ ] Ensure that the repository is public. -- [ ] Review [the PR description guidelines](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md#guidelines-for-pull-request-descriptions) and adhere to them. -- [ ] Verify that the link is accessible in a private browser window. - -If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via the help channel in Slack. 
Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges. diff --git a/surveys/a1_sampling_and_reproducibility_1.ipynb b/surveys/a1_sampling_and_reproducibility_1.ipynb new file mode 100644 index 00000000..49028837 --- /dev/null +++ b/surveys/a1_sampling_and_reproducibility_1.ipynb @@ -0,0 +1,179 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Assignment 1: Sampling and Reproducibility\n", + "\n", + "The code at the end of this file explores contact tracing data about an outbreak of the flu, and demonstrates the dangers of incomplete and non-random samples. This assignment is modified from [Contact tracing can give a biased sample of COVID-19 cases](https://andrewwhitby.com/2020/11/24/contact-tracing-biased/) by Andrew Whitby.\n", + "\n", + "1. Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved.\n", + "\n", + "2. Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results.\n", + "\n", + "3. Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The script needs to produce the same output when run multiple times." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Identifying Sampling Stages\n", + "\n", + "## There are 4 sampling stages in this simulation.\n", + "\n", + "## Stage 1: Defining the sampling frame: The population is constructed as a fixed list of 1,000 individuals: 200 wedding attendees and 800 brunch attendees.\n", + "\n", + "## Stage 2: Infection sampling: Using `np.random.choice()`, a simple random sample of 100 individuals is drawn without replacement from the 1,000-person index (`size = int(1000 × ATTACK_RATE) = 100`). Each person has equal probability of selection.\n", + "\n", + "## Stage 3: Primary contact tracing: Each of the 100 infected individuals is independently assigned a traced status using `np.random.rand()`, with each person traced with probability `TRACE_SUCCESS = 0.20`. This is Bernoulli sampling applied independently to each infected individual, yielding an expected ~20 traced cases. The number traced follows a binomial distribution, Binomial(100, 0.20).\n", + "\n", + "## Stage 4: Secondary contact tracing: Any event for which at least `SECONDARY_TRACE_THRESHOLD = 2` attendees were traced in Stage 3 triggers tracing of ALL infected attendees at that event. This step introduces systematic bias: larger events (weddings, n = 200) are more likely than smaller ones to exceed the threshold by chance, so wedding-linked infections are over-represented in the final traced sample relative to their true share of infections." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Reproducibility at 10 and 100 Repetitions\n", + "\n", + "## When the number of repetitions is reduced to 10, the output histograms vary substantially across runs. The distributions are sparse and unstable: their shape, centre, and spread change meaningfully every time the script is executed. 
No consistent pattern is discernible.\n", + "\n", + "## With 100 repetitions, the histograms are closer in shape, but still differ between runs. The relative positions and heights of the distributions shift from run to run.\n", + "\n", + "## In all cases (10, 100, or 1,000 repetitions), re-running the unseeded script produces a different graph each time. This is because NumPy's global pseudo-random number generator (`np.random`) is not given a fixed seed: by default it is initialised from fresh operating-system entropy on every run, so the sequence of random numbers differs on every execution. The results are therefore not reproducible." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Making the Script Reproducible\n", + "\n", + "## A single line was added near the top of the script, after the imports and before the constants and the simulation loop:\n", + "\n", + "## ```python\n", + "## np.random.seed(42)\n", + "## ```\n", + "\n", + "## This sets NumPy's pseudo-random number generator to a fixed, deterministic initial state. All subsequent calls to `np.random.choice()` and `np.random.rand()` within `simulate_event()` draw from a sequence that is entirely determined by this seed value. As a result, the script produces identical output (the same histogram with the same bar heights) every time it is run in the same environment.\n", + "\n", + "## The value `42` is arbitrary; any fixed integer produces the same reproducibility guarantee. Without this line, the generator is initialised from an unpredictable state, making results irreproducible across runs. With the random seed set to 42, the simulation is fully reproducible." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import necessary libraries\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Note: Suppressing FutureWarnings to maintain a clean output. 
This is specifically to ignore warnings about\n", + "# deprecated features in the libraries we're using (e.g., 'use_inf_as_na' option in Pandas, used by Seaborn),\n", + "# which we currently have no direct control over. This action is taken to ensure that our output remains\n", + "# focused on relevant information, acknowledging that we rely on external library updates to fully resolve\n", + "# these deprecations. Always consider reviewing and removing this suppression after significant library updates.\n", + "import warnings\n", + "warnings.simplefilter(action='ignore', category=FutureWarning)\n", + "\n", + "# Set random seed for reproducibility\n", + "# This ensures that np.random.choice() and np.random.rand() produce the same\n", + "# sequence of random numbers on every run, making results fully reproducible.\n", + "np.random.seed(42)\n", + "\n", + "# Constants representing the parameters of the model\n", + "ATTACK_RATE = 0.10\n", + "TRACE_SUCCESS = 0.20\n", + "SECONDARY_TRACE_THRESHOLD = 2\n", + "\n", + "def simulate_event(m):\n", + " \"\"\"\n", + " Simulates the infection and tracing process for a series of events.\n", + " \n", + " This function creates a DataFrame representing individuals attending weddings and brunches,\n", + " infects a subset of them based on the ATTACK_RATE, performs primary and secondary contact tracing,\n", + " and calculates the proportions of infections and traced cases that are attributed to weddings.\n", + " \n", + " Parameters:\n", + " - m: Dummy parameter for iteration purposes.\n", + " \n", + " Returns:\n", + " - A tuple containing the proportion of infections and the proportion of traced cases\n", + " that are attributed to weddings.\n", + " \"\"\"\n", + " # Create DataFrame for people at events with initial infection and traced status\n", + " events = ['wedding'] * 200 + ['brunch'] * 800\n", + " ppl = pd.DataFrame({\n", + " 'event': events,\n", + " 'infected': False,\n", + " 'traced': np.nan # Initially setting traced status as 
NaN\n", + " })\n", + "\n", + " # Explicitly set 'traced' column to nullable boolean type\n", + " ppl['traced'] = ppl['traced'].astype(pd.BooleanDtype())\n", + "\n", + " # Infect a random subset of people\n", + " # Sampling stage 2: simple random sample without replacement, size = 100\n", + " # Sampling frame: all 1000 individuals in ppl.index\n", + " infected_indices = np.random.choice(ppl.index, size=int(len(ppl) * ATTACK_RATE), replace=False)\n", + " ppl.loc[infected_indices, 'infected'] = True\n", + "\n", + " # Primary contact tracing: randomly decide which infected people get traced\n", + " # Sampling stage 3: Bernoulli sampling, each infected person traced with p = 0.20\n", + " ppl.loc[ppl['infected'], 'traced'] = np.random.rand(sum(ppl['infected'])) < TRACE_SUCCESS\n", + "\n", + " # Secondary contact tracing based on event attendance\n", + " # Sampling stage 4: deterministic — all infected attendees at events with >= 2 traced cases are traced\n", + " event_trace_counts = ppl[ppl['traced'] == True]['event'].value_counts()\n", + " events_traced = event_trace_counts[event_trace_counts >= SECONDARY_TRACE_THRESHOLD].index\n", + " ppl.loc[ppl['event'].isin(events_traced) & ppl['infected'], 'traced'] = True\n", + "\n", + " # Calculate proportions of infections and traces attributed to each event type\n", + " ppl['event_type'] = ppl['event'].str[0] # 'w' for wedding, 'b' for brunch\n", + " wedding_infections = sum(ppl['infected'] & (ppl['event_type'] == 'w'))\n", + " brunch_infections = sum(ppl['infected'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_infections = wedding_infections / (wedding_infections + brunch_infections)\n", + "\n", + " wedding_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'w'))\n", + " brunch_traces = sum(ppl['infected'] & ppl['traced'] & (ppl['event_type'] == 'b'))\n", + " p_wedding_traces = wedding_traces / (wedding_traces + brunch_traces)\n", + "\n", + " return p_wedding_infections, p_wedding_traces\n", + "\n", + "# 
Run the simulation 1000 times\n", + "results = [simulate_event(m) for m in range(1000)]\n", + "props_df = pd.DataFrame(results, columns=[\"Infections\", \"Traces\"])\n", + "\n", + "# Plotting the results\n", + "plt.figure(figsize=(10, 6))\n", + "sns.histplot(props_df['Infections'], color=\"blue\", alpha=0.75, binwidth=0.05, kde=False, label='Infections from Weddings')\n", + "sns.histplot(props_df['Traces'], color=\"red\", alpha=0.75, binwidth=0.05, kde=False, label='Traced to Weddings')\n", + "plt.xlabel(\"Proportion of cases\")\n", + "plt.ylabel(\"Frequency\")\n", + "plt.title(\"Impact of Contact Tracing on Perceived Flu Infection Sources\")\n", + "plt.legend()\n", + "plt.tight_layout()\n", + "plt.show()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/surveys/a2_survey_design_and_evaluation.ipynb b/surveys/a2_survey_design_and_evaluation.ipynb new file mode 100644 index 00000000..00da38f5 --- /dev/null +++ b/surveys/a2_survey_design_and_evaluation.ipynb @@ -0,0 +1,229 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Assignment: Questionnaire Design and Sample Evaluation\n", + "\n", + "## Requirements\n", + "\n", + "The goal of this assignment is to practice developing and evaluating sampling materials.\n", + "\n", + "### Part A - Survey Design:\n", + "\n", + "Select one of the scenarios below and design a survey to meet the need(s) outlined in the prompt.\n", + "\n", + "1. In two to three sentences, describe the purpose of your survey\n", + "2. Describe your target population, sampling frame, sampling units, and overall sampling strategy.\n", + "3. Write a 5-10 question survey to address your chosen scenario below.\n", + "\n", + "##### Scenarios\n", + "1. You work in the Human Resources Department at a large tech company. 
Over the past few months, the company has been experiencing a high turnover rate across many of its departments, specifically within the entry- and lower-level positions. The company wishes to understand why this turnover is happening, and what changes need to occur to improve employee satisfaction.\n", + "2. You work for a Canadian national political party during a federal election. Throughout the campaign period, your party has seen relatively high approval ratings, but an opposing party is also polling favorably and may still have a chance to win the election. You are one month away from the election and you want to understand what voters want from your party and its leader in order to maintain your lead and eventually win the election.\n", + "3. You are a student researcher in the sociology department at the University of Toronto. You are working on a research project that concerns the relationship between music taste and age. This involves both comparisons between different people of different ages and comparisons of the same individual at different ages during their lifetime. You wish to understand to what extent age influences music taste, specifically as it relates to perceptions of popular music. Your results will be written into an academic paper that you hope to publish.\n", + "\n", + "### Part B - Survey Evaluation:\n", + "\n", + "For the Canadian General Social Survey on Giving, Volunteering, and Participating, 2018 (cycle 33), conducted by Statistics Canada, find any and all available documentation for the data gathered and identify and describe the survey features indicated below.\n", + "\n", + "1. Sample type\n", + "2. Sample size\n", + "3. Target population\n", + "4. Sampling frame\n", + "5. Survey mode(s)\n", + "6. Timeline\n", + "7. Response rate\n", + "8. Weights\n", + "9. Data processing\n", + "10. Cleaning, imputation, etc.\n", + "11. Sources of error\n", + "12. Limitations, known biases, etc.\n", + "13. 
Link to documentation and any additional sources used" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "# Your Changes\n", + "\n", + "## Part A - Survey Design\n", + "\n", + "Chosen scenario: 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Purpose of the survey\n", + "\n", + "## This survey aims to examine the relationship between age and music taste, specifically how perceptions of popular music change across different stages of life. The survey is designed to support both cross-sectional comparisons between people of different ages and retrospective comparisons of the same individual's tastes over time. Findings will be written up for academic publication in a sociological journal." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Target population, sampling frame, sampling units, and observational units\n", + "\n", + "## Target population: All adults aged 18 and older residing in Canada, spanning a broad age range to support meaningful comparisons across life stages (e.g., 18–25, 26–40, 41–60, 61+).\n", + "\n", + "## Sampling frame: A list of University of Toronto students, staff, and alumni supplemented by members of the general public recruited through community postings, social media, and Prolific Academic, an online research participant platform that allows demographic filtering by age. Because a single university list would over-represent young, highly educated adults, the supplementary recruitment is essential to achieve age diversity.\n", + "\n", + "## Sampling units: Individual adults. One respondent = one sampling unit.\n", + "\n", + "## Observational units: Individual adults. Each respondent is also the unit of observation, since we are measuring their personal music preferences and retrospective self-reports.\n", + "\n", + "## Sampling strategy: Stratified random sampling, with age as the stratifying variable. 
The population is divided into four strata (18–25, 26–40, 41–60, 61+) and participants are recruited until a minimum quota per stratum is achieved. This ensures each age group is adequately represented for meaningful comparison; unstratified random sampling risks under-representing older adults, who are less likely to self-select into online studies. Within each stratum, participants are selected randomly from the available pool." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Survey questions\n", + "\n", + "## 1. How old are you?\n", + "## - [ ] 18–25\n", + "## - [ ] 26–40\n", + "## - [ ] 41–60\n", + "## - [ ] 61 or older\n", + "\n", + "## 2. How often do you listen to music that is currently in the popular charts (e.g., top 40, trending on streaming platforms)?\n", + "## - [ ] Daily\n", + "## - [ ] Several times a week\n", + "## - [ ] Once a week\n", + "## - [ ] Rarely\n", + "## - [ ] Never\n", + "\n", + "## 3. How would you describe your general attitude toward popular music today?\n", + "## - [ ] Very positive: I enjoy most of it\n", + "## - [ ] Somewhat positive: I like some of it\n", + "## - [ ] Neutral: I have no strong feelings either way\n", + "## - [ ] Somewhat negative: I find most of it unappealing\n", + "## - [ ] Very negative: I strongly dislike most of it\n", + "\n", + "## 4. Compared to the popular music of your teenage years (roughly ages 13–19), how do you rate today's popular music?\n", + "## - [ ] Much better\n", + "## - [ ] Somewhat better\n", + "## - [ ] About the same\n", + "## - [ ] Somewhat worse\n", + "## - [ ] Much worse\n", + "## - [ ] I'm not sure / can't compare\n", + "\n", + "## 5. 
At what age do you feel your music taste became relatively stable, that is, when did you stop discovering new genres or artists as frequently as before?\n", + "## - [ ] Before age 20\n", + "## - [ ] 20–29\n", + "## - [ ] 30–39\n", + "## - [ ] 40 or older\n", + "## - [ ] My taste is still actively changing\n", + "## - [ ] I'm not sure\n", + "\n", + "## 6. How do you discover new music? (Select all that apply)\n", + "## - [ ] Streaming platform recommendations (e.g., Spotify, Apple Music)\n", + "## - [ ] Social media (e.g., TikTok, Instagram, YouTube)\n", + "## - [ ] Radio\n", + "## - [ ] Recommendations from friends or family\n", + "## - [ ] I rarely seek out new music\n", + "\n", + "## 7. Do you feel that your appreciation for music you loved in your youth has increased, decreased, or stayed the same over time?\n", + "## - [ ] Increased significantly\n", + "## - [ ] Increased somewhat\n", + "## - [ ] Stayed about the same\n", + "## - [ ] Decreased somewhat\n", + "## - [ ] Decreased significantly\n", + "\n", + "## 8. To what extent do you agree with the following statement: \"Popular music made after I turned 30 is harder for me to connect with emotionally.\"\n", + "## - [ ] Strongly agree\n", + "## - [ ] Agree\n", + "## - [ ] Neither agree nor disagree\n", + "## - [ ] Disagree\n", + "## - [ ] Strongly disagree\n", + "\n", + "## 9. What is the highest level of education you have completed?\n", + "## - [ ] High school or less\n", + "## - [ ] Some college or university\n", + "## - [ ] Bachelor's degree\n", + "## - [ ] Graduate or professional degree\n", + "\n", + "## 10. 
How would you describe your gender?\n", + "## - [ ] Woman\n", + "## - [ ] Man\n", + "## - [ ] Non-binary / gender diverse\n", + "## - [ ] Prefer to self-describe: ________\n", + "## - [ ] Prefer not to say" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Part B - Survey Evaluation\n", + "\n", + "## Survey: Canadian General Social Survey on Giving, Volunteering, and Participating, 2018 (Cycle 33), Statistics Canada\n", + "## https://www150.statcan.gc.ca/n1/pub/45-25-0001/cat5/c33_2018.zip\n", + "\n", + "## 1. Sample type\n", + "## Stratified probability sample, cross-sectional design. Two-stage: telephone number groups sampled first, then one eligible individual randomly selected per household. \"Rejective sampling\" was used to oversample volunteers; non-volunteers were sub-sampled into long- and short-interview groups.\n", + "\n", + "## 2. Sample size\n", + "## Field sample of ~50,000 units; ~40,000 invitation letters sent; ~24,000 completed questionnaires targeted and achieved.\n", + "\n", + "## 3. Target population\n", + "## All persons aged 15+ living in private households across Canada's ten provinces. Excludes full-time institutional residents.\n", + "\n", + "## 4. Sampling frame\n", + "## A linked frame combining landline and cellular telephone numbers from the Census and administrative sources with Statistics Canada's dwelling frame. Stratified into 27 strata at the province/CMA level (15 individual CMAs, grouped CMA strata for QC, ON, BC, and one non-CMA stratum per province).\n", + "\n", + "## 5. Survey mode(s)\n", + "## Electronic questionnaire (EQ, introduced for the first time in 2018) or computer-assisted telephone interviewing (CATI). Available in English and French. Average completion time ~44 minutes.\n", + "\n", + "## 6. Timeline\n", + "## Collection: September 4 – December 28, 2018. Reference period: 12 months preceding each interview. Public release: January 26, 2021.\n", + "\n", + "## 7. 
Response rate\n", + "## 41.9% overall.\n", + "\n", + "## 8. Weights\n", + "## Person-level weight (WGHT_PER) assigned to each respondent. Adjusted for rejective sampling, non-response (using administrative data on non-responding households), and calibrated to match the 2017 Canadian Income Survey income distribution by province. Bootstrap weights provided for variance estimation.\n", + "\n", + "## 9. Data processing\n", + "## Followed Statistics Canada's SSPE framework. Automated and manual edits applied at multiple stages: family edits (household relational consistency), consistency edits (e.g., age vs. birth date), and flow edits (correct questionnaire pathing). CATI enforced valid response ranges in real time. Data linked to tax records (T1, T1FF, T4) for consenting respondents to improve income data quality.\n", + "\n", + "## 10. Cleaning, imputation, etc.\n", + "## Imputation carried out in nine steps. Nearest-neighbour donor imputation used for most variables; mean imputation applied where donor imputation was not feasible. Key targets: volunteering and donation variables. Income obtained via tax linkage for 81.9% of respondents (personal) and 81.7% of households (family); remainder imputed.\n", + "\n", + "## 11. 
Sources of error\n", + "## - Sampling error: Estimates vary across samples; bootstrap weights provided for variance estimation.\n", + "## - Coverage error: Households without telephones excluded; some over- and under-coverage remains despite linked frame.\n", + "## - Non-response error: Non-response at household and individual levels; partially addressed through weighting adjustments using administrative data.\n", + "## - Response error: Recall inaccuracy and social desirability bias, particularly for self-reported volunteering hours and donation amounts.\n", + "## - Processing error: Minimized through automated and manual edits, but cannot be fully eliminated.\n", + "\n", + "## 12. Limitations, known biases, etc.\n", + "## - 2018 was the first cycle to offer an electronic questionnaire, making direct comparisons with prior cycles unreliable.\n", + "## - Response rate of 41.9% is low; non-response bias remains a concern despite weighting.\n", + "## - Households without telephone access and institutional residents are excluded, potentially under-representing marginalized and older populations.\n", + "## - Voluntary participation may attract respondents with stronger views on giving and volunteering (self-selection bias).\n", + "## - Retrospective self-report of behaviour is subject to recall and social desirability bias." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}