126 changes: 45 additions & 81 deletions 02_activities/assignments/a1_sampling_and_reproducibility.ipynb
@@ -2,62 +2,69 @@
"cells": [
{
"cell_type": "markdown",
"id": "ed39f379",
"metadata": {},
"source": [
"# Assignment 1: Sampling and Reproducibility\n",
"\n",
"The code at the end of this file explores contact tracing data about an outbreak of the flu, and demonstrates the dangers of incomplete and non-random samples. This assignment is modified from [Contact tracing can give a biased sample of COVID-19 cases](https://andrewwhitby.com/2020/11/24/contact-tracing-biased/) by Andrew Whitby.\n",
"\n",
"Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved. \n"
"1. Examine the code below. Identify all stages at which sampling is occurring in the model. Describe in words the sampling procedure, referencing the functions used, sample size, sampling frame, any underlying distributions involved.\n",
"\n",
"2. Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results.\n",
"\n",
"3. Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The script needs to produce the same output when run multiple times."
]
},
{
"cell_type": "markdown",
"id": "4ea73db3",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3d9b2ccc",
"metadata": {},
"source": [
"Modify the number of repetitions in the simulation to 10 and 100 (from the original 1000). Run the script multiple times and observe the outputted graphs. Comment on the reproducibility of the results."
"## Identifying Sampling Stages\n",
"\n",
"There are four sampling stages in this simulation.\n",
"\n",
"Stage 1 (defining the sampling frame): The population is constructed as a fixed list of 1,000 individuals: 200 wedding attendees and 800 brunch attendees. Nothing is drawn at random here; this step only fixes the frame from which the later stages sample.\n",
"\n",
"Stage 2 (infection sampling): Using `np.random.choice()`, a simple random sample of 100 individuals is drawn without replacement from the 1,000-person index (`size = int(1000 × ATTACK_RATE) = 100`). Each person has an equal probability of selection.\n",
"\n",
"Stage 3 (primary contact tracing): Each of the 100 infected individuals is independently assigned a traced status using `np.random.rand()`, with each person traced with probability `TRACE_SUCCESS = 0.20`. This is Bernoulli sampling applied independently to each infected individual, so the number traced follows a binomial distribution with n = 100 and p = 0.20, giving an expected 20 traced cases.\n",
"\n",
"Stage 4 (secondary contact tracing): Any event at which at least `SECONDARY_TRACE_THRESHOLD = 2` attendees were traced in Stage 3 triggers tracing of all infected attendees at that event. This step involves no new random draw, but it introduces systematic bias: larger events (weddings, n = 200) contain more infected attendees and are therefore more likely than smaller events to reach the threshold by chance, so wedding-linked infections are over-represented in the final traced sample relative to their true share of infections."
]
},
{
"cell_type": "markdown",
"id": "4cf5d993",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "32603ce7",
"metadata": {},
"source": [
"Alter the code so that it is reproducible. Describe the changes you made to the code and how they affected the reproducibility of the script. The script needs to produce the same output when run multiple times."
"## Reproducibility at 10 and 100 Repetitions\n",
"\n",
"When the number of repetitions is reduced to 10, the output histograms vary substantially across runs. The distributions are sparse and unstable: their shape, centre, and spread change noticeably every time the script is executed. No consistent pattern is discernible.\n",
"\n",
"With 100 repetitions, the histograms are closer in shape but still differ between runs. The relative positions and heights of the distributions shift from one run to the next.\n",
"\n",
"In all cases (10, 100, or 1,000 repetitions), re-running the script produces a different graph each time. This is because NumPy's global pseudo-random number generator (`np.random`) is never given a fixed seed, so it is initialised from fresh operating-system entropy on each execution and the sequence of random numbers differs every time. The results are therefore not reproducible."
]
},
{
"cell_type": "markdown",
"id": "77613cc3",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "30b4a74f",
"metadata": {},
"source": [
"## Code"
"## Making the Script Reproducible\n",
"\n",
"A single line was added near the top of the script, immediately after the imports and before the model constants:\n",
"\n",
"```python\n",
"np.random.seed(42)\n",
"```\n",
"\n",
"This sets NumPy's global pseudo-random number generator to a fixed, deterministic initial state. All subsequent calls to `np.random.choice()` and `np.random.rand()` within `simulate_event()` draw from a sequence that is entirely determined by this seed value. As a result, the script produces identical output (the same histogram with the same bar heights) every time it is run from the top in the same environment.\n",
"\n",
"The value `42` is arbitrary; any fixed integer gives the same reproducibility guarantee. Without this line, the generator is initialised from unpredictable system entropy on each run, so the results cannot be reproduced. With the seed fixed at 42, the simulation is fully reproducible."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab8587a0",
"metadata": {},
"outputs": [],
"source": [
@@ -75,6 +82,11 @@
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)\n",
"\n",
"# Set random seed for reproducibility\n",
"# This ensures that np.random.choice() and np.random.rand() produce the same\n",
"# sequence of random numbers on every run, making results fully reproducible.\n",
"np.random.seed(42)\n",
"\n",
"# Constants representing the parameters of the model\n",
"ATTACK_RATE = 0.10\n",
"TRACE_SUCCESS = 0.20\n",
@@ -107,13 +119,17 @@
"    ppl['traced'] = ppl['traced'].astype(pd.BooleanDtype())\n",
"\n",
"    # Infect a random subset of people\n",
"    # Sampling stage 2: simple random sample without replacement, size = 100\n",
"    # Sampling frame: all 1000 individuals in ppl.index\n",
"    infected_indices = np.random.choice(ppl.index, size=int(len(ppl) * ATTACK_RATE), replace=False)\n",
"    ppl.loc[infected_indices, 'infected'] = True\n",
"\n",
"    # Primary contact tracing: randomly decide which infected people get traced\n",
"    # Sampling stage 3: Bernoulli sampling, each infected person traced with p = 0.20\n",
"    ppl.loc[ppl['infected'], 'traced'] = np.random.rand(sum(ppl['infected'])) < TRACE_SUCCESS\n",
"\n",
"    # Secondary contact tracing based on event attendance\n",
"    # Sampling stage 4: no new random draw; all infected attendees at events with\n",
"    # at least SECONDARY_TRACE_THRESHOLD traced cases are marked as traced\n",
"    event_trace_counts = ppl[ppl['traced'] == True]['event'].value_counts()\n",
"    events_traced = event_trace_counts[event_trace_counts >= SECONDARY_TRACE_THRESHOLD].index\n",
"    ppl.loc[ppl['event'].isin(events_traced) & ppl['infected'], 'traced'] = True\n",
@@ -145,50 +161,6 @@
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "f418c720",
"metadata": {},
"source": [
"## Criteria"
]
},
{
"cell_type": "markdown",
"id": "c0b3f93f",
"metadata": {},
"source": [
"|Criteria|Complete|Incomplete|\n",
"|--------|----|----|\n",
"|Alteration of the code|The code changes made, made it reproducible.|The code is still not reproducible.|\n",
"|Description of changes|The author answered questions and explained the reasonings for the changes made well.|The author did not answer questions or explain the reasonings for the changes made well.|"
]
},
{
"cell_type": "markdown",
"id": "83cec589",
"metadata": {},
"source": [
"## Submission Information\n",
"🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.\n",
"\n",
"### Submission Parameters:\n",
"* Submission Due Date: `23:59 - 02 February 2026`\n",
"* The branch name for your repo should be: `assignment-1`\n",
"* What to submit for this assignment:\n",
" * This markdown file (`a1_sampling_and_reproducibility.ipynb`) should be populated with the code changed.\n",
"* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/sampling/pull/<pr_id>`\n",
" * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.\n",
"\n",
"#### Checklist:\n",
"- [ ] Create a branch called `assignment-1`.\n",
"- [ ] Ensure that the repository is public.\n",
"- [ ] Review [the PR description guidelines](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md#guidelines-for-pull-request-descriptions) and adhere to them.\n",
"- [ ] Verify that the link is accessible in a private browser window.\n",
"\n",
"If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via the help channel in Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.\n"
]
}
],
"metadata": {
@@ -198,18 +170,10 @@
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
"nbformat_minor": 4
}
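
The reproducibility claim in this notebook can be checked directly rather than by comparing histograms by eye. The following is a minimal sketch, not the notebook's own code: it re-implements the three random and deterministic stages with plain NumPy arrays and an explicit `np.random.default_rng(seed)` generator instead of the global `np.random.seed(42)` call. The function names `simulate_once` and `run` are hypothetical, and the event layout (one wedding of 200, one brunch of 800) is an assumption taken from the write-up, since the population-building code is not visible in this diff.

```python
import numpy as np

ATTACK_RATE = 0.10
TRACE_SUCCESS = 0.20
SECONDARY_TRACE_THRESHOLD = 2
POPULATION = 1000


def simulate_once(rng):
    """Run one simplified simulation.

    Returns (wedding share of infections, wedding share of traced cases);
    the second value is None if nobody was traced.
    """
    # Assumed layout (from the write-up): first 200 people attend the wedding,
    # the remaining 800 attend the brunch.
    is_wedding = np.arange(POPULATION) < 200

    # Stage 2: simple random sample of infected people, without replacement.
    infected = np.zeros(POPULATION, dtype=bool)
    n_infected = int(POPULATION * ATTACK_RATE)
    infected[rng.choice(POPULATION, size=n_infected, replace=False)] = True

    # Stage 3: one uniform draw per person; only infected people can be traced,
    # each with probability TRACE_SUCCESS.
    traced = infected & (rng.random(POPULATION) < TRACE_SUCCESS)

    # Stage 4: no new randomness. Any event that reaches the threshold has
    # all of its infected attendees marked as traced.
    for event_mask in (is_wedding, ~is_wedding):
        if (traced & event_mask).sum() >= SECONDARY_TRACE_THRESHOLD:
            traced = traced | (infected & event_mask)

    share_infected = float((infected & is_wedding).sum() / infected.sum())
    if not traced.any():
        return share_infected, None
    share_traced = float((traced & is_wedding).sum() / traced.sum())
    return share_infected, share_traced


def run(seed, repetitions=10):
    """Repeat the simulation with a generator created from a fixed seed."""
    rng = np.random.default_rng(seed)
    return [simulate_once(rng) for _ in range(repetitions)]


if __name__ == "__main__":
    # Two runs with the same seed should give identical results.
    print(run(42) == run(42))
```

Passing a generator into the function is a design choice for this sketch: it keeps the random state local, so other code that touches the global `np.random` state cannot change the results. The `np.random.seed(42)` line used in the notebook achieves the same end for a script that is always run from the top.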