<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Bill McMillin" />
<title>What Changed? Detecting statistically significant changes in electronic resource usage with R</title>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link rel="stylesheet" type="text/css" media="screen, projection, print"
href="http://www.w3.org/Talks/Tools/Slidy2/styles/slidy.css" />
<script src="http://www.w3.org/Talks/Tools/Slidy2/scripts/slidy.js"
charset="utf-8" type="text/javascript"></script>
</head>
<body>
<div class="slide titlepage">
<h1 class="title">What Changed? Detecting statistically significant changes in electronic resource usage with R</h1>
<p class="author">
Bill McMillin
</p>
<p class="date">7/20/2017</p>
</div>
<div class="slide section level2">
<p>Disclaimers</p>
<ul>
<li>Data analysis can show where to look, but can't understand the full context in which the data was created</li>
<li>Data used here is close enough to the original to demonstrate how the models were built, but it's not the actual data from the sources</li>
</ul>
</div>
<div id="crisp-dm" class="slide section level2">
<h2>CRISP-DM</h2>
<h4 id="cross-industry-standard-process-for-data-mining">Cross Industry Standard Process for Data Mining</h4>
<ol style="list-style-type: decimal">
<li>Business Understanding</li>
<li>Data Understanding</li>
<li>Data Preparation</li>
<li>Modeling</li>
<li>Evaluation</li>
<li>Deployment</li>
</ol>
<ul>
<li>It's cyclical! Lots of returning to previous steps.</li>
</ul>
</div>
<div id="crisp-dm-stage-1-business-understanding" class="slide section level2">
<h2>CRISP-DM Stage 1: Business Understanding</h2>
<h3 id="academic-library">Academic Library</h3>
<ul>
<li>Google Analytics in use on most Web sites across campus - consistently since Summer 2017</li>
<li>Electronic resource usage has been tracked consistently via a COUNTER aggregation service since 2014</li>
<li>Everybody wants "data analytics"</li>
</ul>
</div>
<div class="slide section level2">
<h3 id="questions-for-the-data-to-answer">Questions for the data to answer</h3>
<ul>
<li>Each question should be answerable with a number; yes/no questions are the most effective</li>
<li>Has use of our discovery service increased significantly in the last year?</li>
<li>Have article downloads increased since our website redesign?</li>
<li>Has vendor x's new interface resulted in increased searches? Full-text downloads?</li>
</ul>
<h3 id="questions-for-the-subject-matter-expert-sme-to-answer">Questions for the Subject Matter Expert (SME) to answer</h3>
<ul>
<li>Why did full-text downloads decrease by x%?</li>
<li>Should we continue to pay for access to resource y?</li>
<li>Did moving the link to the library's site on the university home page cause a decrease in Web traffic?</li>
</ul>
</div>
<div id="crisp-dm-stage-2-data-understanding" class="slide section level2">
<h2>CRISP-DM Stage 2: Data Understanding</h2>
<h3 id="data-sources">Data Sources</h3>
<p>Reports from Google Analytics were exported for</p>
<ul>
<li>Libraries Website - monthly sessions</li>
<li>Libraries Website - monthly pageviews</li>
<li>Libraries Website - monthly number of users</li>
<li>Libraries Catalog - monthly sessions</li>
<li>Libraries Catalog - monthly pageviews</li>
<li>Libraries Catalog - monthly number of users</li>
<li>Discovery Service (Summon) - monthly sessions</li>
<li>Discovery Service (Summon) - monthly pageviews</li>
<li>Discovery Service (Summon) - monthly number of users</li>
</ul>
</div>
<div id="scope-of-the-data" class="slide section level2">
<h2>Scope of the Data</h2>
<ul>
<li>All variables are continuous except for Month. Month is treated as nominal because its number is a label, not a quantity: December (12) is not equal to two Junes (6).</li>
<li>Different levels of granularity exist across platforms</li>
<li>When comparing across platforms, all data must be converted to the coarsest level of granularity that the platforms share (we're looking at monthly hit counts and session data even though Google Analytics offers much more detail)</li>
</ul>
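<p>As a rough illustration of converting finer-grained data to monthly totals, the sketch below sums a made-up daily series by month. The <code>daily</code> data frame and its column names are assumptions for the example, not part of the exported reports.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical daily export: 100 sessions per day for three months
daily <- data.frame(
  date = seq(as.Date("2016-01-01"), as.Date("2016-03-31"), by = "day"),
  sessions = 100
)
#label each day with its year and month, then sum within each month
daily$Month <- format(daily$date, "%Y-%m")
monthly <- aggregate(sessions ~ Month, data = daily, FUN = sum)
print(monthly)</code></pre></div>
<p>Each row of <code>monthly</code> then lines up with one row of a monthly report from another platform.</p>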
</div>
<div id="crisp-dm-stage-3-data-preparation" class="slide section level2">
<h2>CRISP-DM Stage 3: Data Preparation</h2>
<h3 id="cleaning">Cleaning</h3>
<h4 id="steps-for-each-google-analytics-report">Steps for each Google Analytics Report</h4>
<ol style="list-style-type: decimal">
<li>In LibreOffice change the month index to ISO-formatted date</li>
<li>Remove December 2013 data for web traffic because it's incomplete</li>
<li>Replace 0 with NULL to avoid skewing average</li>
</ol>
<h4 id="scripts-for-automating-the-cleaning-and-extraction-process">Scripts for automating the cleaning and extraction process</h4>
<ul>
<li>Much cleaning can be done in R</li>
<li>Complex aggregation (turning the values of multiple rows into one variable) may be easier with other tools</li>
<li>https://github.com/billmcmillin/usage_stats/blob/master/DB1_append.py</li>
</ul>
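<p>A minimal sketch of how cleaning steps 2 and 3 might look in R. The <code>report</code> data frame and its values are invented for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#toy stand-in for one exported report; names and values are made up
report <- data.frame(
  Month = c("2013-11", "2013-12", "2014-01", "2014-02"),
  sessions = c(1200, 40, 0, 1350),
  stringsAsFactors = FALSE
)
#step 2: drop the month known to be incomplete
report <- report[report$Month != "2013-12", ]
#step 3: treat a zero count as missing rather than as a real value
report$sessions[which(report$sessions == 0)] <- NA
#with the zero excluded, the mean reflects only months that have data
clean_mean <- mean(report$sessions, na.rm = TRUE)
print(clean_mean)</code></pre></div>
<p>Had the zero been left in place, it would have pulled the average down as though a real month had no traffic.</p>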
</div>
<div class="slide section level2">
<h4 id="importing-the-cleaned-data-into-r">Importing the Cleaned Data into R</h4>
<ol style="list-style-type: decimal">
<li>Import the csv file</li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">raw_usage <-<span class="st"> </span><span class="kw">read.table</span>(<span class="dt">file=</span><span class="st">"../data/all_consolidated.csv"</span>, <span class="dt">header=</span><span class="ot">TRUE</span>, <span class="dt">sep=</span><span class="st">","</span>, <span class="dt">na.strings =</span> <span class="st">"NULL"</span>)</code></pre></div>
<ol start="2" style="list-style-type: decimal">
<li>How many variables and observations do we have?</li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ls</span>(raw_usage)</code></pre></div>
<pre><code>## [1] "BR1" "catalog_pageviews"
## [3] "catalog_sessions" "catalog_users"
## [5] "DB1_record_views" "DB1_result_clicks"
## [7] "DB1_searches" "DB1_sessions"
## [9] "JR1_retrievals" "libraries_site_pageviews"
## [11] "libraries_site_sessions" "libraries_site_users"
## [13] "Month" "off_campus_all_sites_pageviews"
## [15] "off_campus_all_sites_sessions" "off_campus_all_sites_users"
## [17] "on_campus_all_sites_pageviews" "on_campus_all_sites_sessions"
## [19] "on_campus_all_sites_users" "summon_pageviews"
## [21] "summon_sessions" "summon_users"</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">nrow</span>(raw_usage)</code></pre></div>
<pre><code>## [1] 64</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(raw_usage)</code></pre></div>
<pre><code>## Month off_campus_all_sites_pageviews off_campus_all_sites_sessions
## 1 2012-01 NA NA
## 2 2012-02 NA NA
## 3 2012-03 NA NA
## 4 2012-04 NA NA
## 5 2012-05 NA NA
## 6 2012-06 NA NA
## off_campus_all_sites_users on_campus_all_sites_pageviews
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## on_campus_all_sites_sessions on_campus_all_sites_users
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## libraries_site_pageviews libraries_site_sessions libraries_site_users
## 1 437517 240274 137719
## 2 416660 242389 139184
## 3 346238 205562 121410
## 4 421705 249439 140058
## 5 411641 245991 130575
## 6 232712 131941 68827
## catalog_pageviews catalog_sessions catalog_users summon_pageviews
## 1 11 11 9 NA
## 2 0 0 0 129071
## 3 1258 30 18 107908
## 4 148 4 3 164398
## 5 127 15 5 176487
## 6 43 11 9 78581
## summon_sessions summon_users JR1_retrievals BR1 DB1_searches
## 1 NA NA NA NA NA
## 2 28354 14881 NA NA NA
## 3 24337 13109 NA NA NA
## 4 32273 16241 NA NA NA
## 5 36029 17244 NA NA NA
## 6 16179 8475 NA NA NA
## DB1_sessions DB1_record_views DB1_result_clicks
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(Hmisc) <span class="co">#provides describe()</span>
<span class="kw">describe</span>(raw_usage)</code></pre></div>
<pre><code>## raw_usage
##
## 22 Variables 64 Observations
## ---------------------------------------------------------------------------
## Month
## n missing distinct
## 64 0 64
##
## lowest : 2012-01 2012-02 2012-03 2012-04
## highest: 2016-11 2016-12 2017-01 2017-02 2017-03
## ---------------------------------------------------------------------------
## off_campus_all_sites_pageviews
## n missing distinct Info Mean Gmd .05 .10
## 34 30 34 1 9371461 10216852 3040690 3717012
## .25 .50 .75 .90 .95
## 4189599 5025962 5461030 6318733 6765830
##
## Value 1500000 2000000 3500000 4000000 4500000 5000000
## Frequency 1 1 2 5 3 10
## Proportion 0.029 0.029 0.059 0.147 0.088 0.294
##
## Value 5500000 6000000 6500000 7000000 159500000
## Frequency 7 1 2 1 1
## Proportion 0.206 0.029 0.059 0.029 0.029
## ---------------------------------------------------------------------------
## off_campus_all_sites_sessions
## n missing distinct Info Mean Gmd .05 .10
## 33 31 33 1 1705539 343080 1152359 1341652
## .25 .50 .75 .90 .95
## 1579658 1800016 1928174 1977192 2011710
##
## lowest : 459293 892557 1325560 1341519 1342184
## highest: 1962604 1980839 2010471 2013569 2097284
## ---------------------------------------------------------------------------
## off_campus_all_sites_users
## n missing distinct Info Mean Gmd .05 .10
## 33 31 33 1 675000 127848 496293 575655
## .25 .50 .75 .90 .95
## 635404 687697 731818 761306 797239
##
## lowest : 215313 390005 567152 572529 588157
## highest: 761121 761352 788001 811097 1031082
## ---------------------------------------------------------------------------
## on_campus_all_sites_pageviews
## n missing distinct Info Mean Gmd .05 .10
## 39 25 39 1 1884188 684089 927874 956022
## .25 .50 .75 .90 .95
## 1546636 2017640 2200134 2653493 2756755
##
## lowest : 802053 840741 937555 946894 958304
## highest: 2646888 2679913 2741307 2895786 3013732
## ---------------------------------------------------------------------------
## on_campus_all_sites_sessions
## n missing distinct Info Mean Gmd .05 .10
## 39 25 39 1 777822 328778 313523 336509
## .25 .50 .75 .90 .95
## 556458 832962 948695 1143116 1153880
##
## lowest : 299071 307357 314208 335977 336642
## highest: 1141523 1149490 1152374 1167437 1363493
## ---------------------------------------------------------------------------
## on_campus_all_sites_users
## n missing distinct Info Mean Gmd .05 .10
## 39 25 39 1 217649 114657 81002 86175
## .25 .50 .75 .90 .95
## 161304 207786 274304 357686 412793
##
## lowest : 76988 78012 81334 85164 86428, highest: 344792 409263 412245 417725 458796
## ---------------------------------------------------------------------------
## libraries_site_pageviews
## n missing distinct Info Mean Gmd .05 .10
## 63 1 63 1 217895 122213 23976 106680
## .25 .50 .75 .90 .95
## 157444 204708 269299 384002 414515
##
## lowest : 155 177 1001 16541 90892, highest: 411641 414834 416660 421705 437517
## ---------------------------------------------------------------------------
## libraries_site_sessions
## n missing distinct Info Mean Gmd .05 .10
## 63 1 63 1 120076 74094 15327 52096
## .25 .50 .75 .90 .95
## 80368 108162 139934 221567 242178
##
## lowest : 83 117 9072 12427 41429, highest: 240274 242389 245115 245991 249439
## ---------------------------------------------------------------------------
## libraries_site_users
## n missing distinct Info Mean Gmd .05 .10
## 63 1 63 1 59547 43900 7436 20738
## .25 .50 .75 .90 .95
## 30284 51499 76083 121163 130541
##
## lowest : 79 99 161 6229 18298, highest: 130231 130575 137719 139184 140058
## ---------------------------------------------------------------------------
## catalog_pageviews
## n missing distinct Info Mean Gmd .05 .10
## 63 1 62 1 78581 61966 12.9 37.4
## .25 .50 .75 .90 .95
## 2070.5 88004.0 114317.5 138575.6 165805.9
##
## lowest : 0 3 11 30 34, highest: 154969 167010 169413 180430 194813
## ---------------------------------------------------------------------------
## catalog_sessions
## n missing distinct Info Mean Gmd .05 .10
## 63 1 60 1 12891 9923 4.0 6.4
## .25 .50 .75 .90 .95
## 49.0 14735.0 18967.5 22801.6 25075.8
##
## lowest : 0 3 4 5 6, highest: 23823 25215 26423 27470 29864
## ---------------------------------------------------------------------------
## catalog_users
## n missing distinct Info Mean Gmd .05 .10
## 63 1 57 1 6498 5131 3.0 5.0
## .25 .50 .75 .90 .95
## 18.5 7490.0 9507.0 11819.4 12767.9
##
## lowest : 0 2 3 4 5, highest: 12641 12782 14017 14711 15875
## ---------------------------------------------------------------------------
## summon_pageviews
## n missing distinct Info Mean Gmd .05 .10
## 62 2 62 1 131895 59111 54064 78660
## .25 .50 .75 .90 .95
## 89794 126264 176471 204644 208119
##
## lowest : 39724 46681 50593 53232 69869, highest: 207086 208173 218593 228999 243986
## ---------------------------------------------------------------------------
## summon_sessions
## n missing distinct Info Mean Gmd .05 .10
## 62 2 62 1 25960 9370 15136 15972
## .25 .50 .75 .90 .95
## 18048 26813 33034 35957 37338
##
## lowest : 12223 13799 14618 15135 15146, highest: 37329 37338 38181 39506 40205
## ---------------------------------------------------------------------------
## summon_users
## n missing distinct Info Mean Gmd .05 .10
## 62 2 62 1 12853 4402 7347 7899
## .25 .50 .75 .90 .95
## 8776 13533 16124 17322 17710
##
## lowest : 7088 7189 7317 7337 7533, highest: 17589 17716 18104 20104 20504
## ---------------------------------------------------------------------------
## JR1_retrievals
## n missing distinct Info Mean Gmd .05 .10
## 36 28 36 1 146409 43951 93908 96278
## .25 .50 .75 .90 .95
## 116069 149737 171105 188149 203824
##
## lowest : 59062 89218 95472 95851 96704, highest: 187212 189086 199427 217016 222864
## ---------------------------------------------------------------------------
## BR1
## n missing distinct Info Mean Gmd .05 .10
## 36 28 36 1 31778 15684 12928 16594
## .25 .50 .75 .90 .95
## 21001 31063 41781 51232 55413
##
## lowest : 6590 12406 13102 15623 17565, highest: 49325 53140 55262 55867 58278
## ---------------------------------------------------------------------------
## DB1_searches
## n missing distinct Info Mean Gmd .05 .10
## 36 28 36 1 301221 176414 93385 112040
## .25 .50 .75 .90 .95
## 186124 265755 440438 502309 540238
##
## lowest : 56596 74171 99790 110628 113453, highest: 501846 502772 531451 566601 596172
## ---------------------------------------------------------------------------
## DB1_sessions
## n missing distinct Info Mean Gmd .05 .10
## 36 28 35 1 970.1 314.6 635.8 655.0
## .25 .50 .75 .90 .95
## 790.5 893.0 1028.8 1294.5 1752.0
##
## lowest : 559 617 642 646 664, highest: 1286 1303 1732 1812 1845
## ---------------------------------------------------------------------------
## DB1_record_views
## n missing distinct Info Mean Gmd .05 .10
## 36 28 36 1 53821 31855 22216 24124
## .25 .50 .75 .90 .95
## 31748 46788 70542 87484 90260
##
## lowest : 20776 21749 22372 23393 24854, highest: 87005 87962 90202 90434 182014
## ---------------------------------------------------------------------------
## DB1_result_clicks
## n missing distinct Info Mean Gmd .05 .10
## 36 28 36 1 80582 40807 35772 38192
## .25 .50 .75 .90 .95
## 49623 76194 106415 133546 143302
##
## lowest : 34131 35571 35839 37902 38482, highest: 130334 136758 141990 147239 147905
## ---------------------------------------------------------------------------</code></pre>
</div>
<div class="slide section level2">
<h3 id="cleaning-and-examining-the-data">Cleaning and examining the data</h3>
<p>The month is not in a format that R will recognize as a date, so it needs to be converted.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">clean_usage <-<span class="st"> </span>raw_usage
<span class="co">#first convert the factor data type to characters</span>
char_mon <-<span class="st"> </span><span class="kw">as.character</span>(raw_usage$Month)
clean_usage$Month <-<span class="st"> </span><span class="kw">as.Date</span>(<span class="kw">paste</span>(char_mon,<span class="st">"-01"</span>,<span class="dt">sep=</span><span class="st">""</span>))
clean_usage$Month</code></pre></div>
<pre><code>## [1] "2012-01-01" "2012-02-01" "2012-03-01" "2012-04-01" "2012-05-01"
## [6] "2012-06-01" "2012-07-01" "2012-08-01" "2012-09-01" "2012-10-01"
## [11] "2012-11-01" "2012-12-01" "2013-01-01" "2013-02-01" "2013-03-01"
## [16] "2013-04-01" "2013-05-01" "2013-06-01" "2013-07-01" "2013-08-01"
## [21] "2013-09-01" "2013-10-01" "2013-11-01" "2013-12-01" "2014-01-01"
## [26] "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01" "2014-06-01"
## [31] "2014-07-01" "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01"
## [36] "2014-12-01" "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01"
## [41] "2015-05-01" "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01"
## [46] "2015-10-01" "2015-11-01" "2015-12-01" "2016-01-01" "2016-02-01"
## [51] "2016-03-01" "2016-04-01" "2016-05-01" "2016-06-01" "2016-07-01"
## [56] "2016-08-01" "2016-09-01" "2016-10-01" "2016-11-01" "2016-12-01"
## [61] "2017-01-01" "2017-02-01" "2017-03-01" NA</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#we'll also want the number of the month</span>
<span class="kw">library</span>(lubridate) <span class="co">#provides month()</span>
clean_usage[<span class="st">"month_num"</span>] <-<span class="st"> </span><span class="kw">month</span>(clean_usage$Month)</code></pre></div>
</div>
<div class="slide section level2">
<p>Examine the set for missing values</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">comp_obs <-<span class="st"> </span>clean_usage[<span class="kw">complete.cases</span>(raw_usage), ]
<span class="kw">nrow</span>(comp_obs)</code></pre></div>
<pre><code>## [1] 30</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#Which months are included in our analysis?</span>
comp_obs$Month</code></pre></div>
<pre><code>## [1] "2014-07-01" "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01"
## [6] "2014-12-01" "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01"
## [11] "2015-05-01" "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01"
## [16] "2015-10-01" "2015-11-01" "2015-12-01" "2016-01-01" "2016-02-01"
## [21] "2016-03-01" "2016-04-01" "2016-05-01" "2016-06-01" "2016-07-01"
## [26] "2016-08-01" "2016-09-01" "2016-10-01" "2016-11-01" "2016-12-01"</code></pre>
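<p>To see which variables are responsible for the dropped rows, missing values can also be counted per column. The sketch below uses a small invented data frame, <code>toy</code>, in place of the real data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#toy frame standing in for the usage data; values are invented
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 7, 8),
  c = c(9, 10, 11, 12)
)
#number of missing values in each variable
na_counts <- colSums(is.na(toy))
print(na_counts)
#TRUE for rows with no missing value in any variable
keep <- complete.cases(toy)
print(sum(keep))</code></pre></div>
<p>A variable with many missing values sets the limit on how many complete observations remain.</p>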
</div>
<div class="slide section level2">
<p>Start to look at variables of interest</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(ggplot2) <span class="co">#provides qplot() and geom_line()</span>
<span class="kw">qplot</span>(comp_obs$Month,comp_obs$off_campus_all_sites_pageviews, <span class="dt">main=</span><span class="st">"All campus website hits from off campus by month"</span>, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"Hit count - all university sites"</span>) +<span class="st"> </span><span class="kw">geom_line</span>()</code></pre></div>
<div class="figure">
<img src="figure/firstplot-1.png" alt="plot of chunk firstplot" />
<p class="caption">plot of chunk firstplot</p>
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">qplot</span>(comp_obs$Month,comp_obs$on_campus_all_sites_users, <span class="dt">main=</span><span class="st">"All campus website users from on campus by month"</span>, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"User count - all university sites"</span>) +<span class="st"> </span><span class="kw">geom_line</span>()</code></pre></div>
<div class="figure">
<img src="figure/firstplot-2.png" alt="plot of chunk firstplot" />
<p class="caption">plot of chunk firstplot</p>
</div>
<ul>
<li>A quick look at these two plots shows that July of 2014 had both the most users and the fewest hits. Something is off. Given that this was the first month of collection, data errors are likely, so it will be safest to discard the July 2014 data.</li>
</ul>
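<p>The same check can be made without reading the plots. This sketch uses invented numbers that mimic the pattern described above; the <code>toy</code> data frame is an assumption for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#invented numbers: the first month has the most users but the fewest hits
toy <- data.frame(
  Month = as.Date(c("2014-07-01", "2014-08-01", "2014-09-01", "2014-10-01")),
  users = c(900, 610, 640, 655),
  pageviews = c(1500, 5200, 4900, 5100)
)
most_users <- toy$Month[which.max(toy$users)]
fewest_hits <- toy$Month[which.min(toy$pageviews)]
#if the same month sits at both extremes, flag it for a closer look
if (most_users == fewest_hits) {
  print(paste("Check the data for", format(most_users, "%Y-%m")))
}</code></pre></div>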
</div>
<div class="slide section level2">
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#get a subset of observations with month greater than 2014-07</span>
comp_obs2 <-<span class="st"> </span><span class="kw">subset</span>(comp_obs, comp_obs$Month ><span class="st"> '2014-07-01'</span>)
comp_obs2$Month</code></pre></div>
<pre><code>## [1] "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01"
## [6] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
## [11] "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01"
## [16] "2015-11-01" "2015-12-01" "2016-01-01" "2016-02-01" "2016-03-01"
## [21] "2016-04-01" "2016-05-01" "2016-06-01" "2016-07-01" "2016-08-01"
## [26] "2016-09-01" "2016-10-01" "2016-11-01" "2016-12-01"</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#in case we want to do more cleaning steps, we'll just use cur_data as the name of the most recent data in the cleaning process</span>
cur_data <-<span class="st"> </span>comp_obs2
cur_df <-<span class="st"> </span><span class="kw">as.data.frame</span>(cur_data[,-<span class="dv">1</span>])
<span class="co">#we'll want to access the variable names</span>
header <-<span class="st"> </span><span class="kw">colnames</span>(cur_df)</code></pre></div>
</div>
<div class="slide section level2">
<ul>
<li>We may want to label our time periods as t = 1,2...n instead of as months, so we can add a variable with:</li>
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">cur_data[<span class="st">"period"</span>] <-<span class="st"> </span><span class="kw">c</span>(<span class="kw">seq</span>(<span class="dt">from =</span> <span class="dv">1</span>, <span class="dt">to =</span> <span class="kw">nrow</span>(cur_data), <span class="dt">by =</span> <span class="dv">1</span>))
<span class="kw">print</span>(cur_data$Month[cur_data$period ==<span class="st"> </span><span class="dv">12</span>])</code></pre></div>
<pre><code>## [1] "2015-07-01"</code></pre>
<h3 id="data-exploration">Data Exploration</h3>
<h4 id="test-of-normality">Test of Normality</h4>
<ul>
<li>Our later tests assume a normal distribution, so let's take a look at the distributions of variables</li>
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#column 1 is Month, so start at column 2; a p-value under 0.05 suggests the variable is not normally distributed</span>
for (i in <span class="dv">2</span>:<span class="kw">length</span>(cur_data))
{
  swtest <-<span class="st"> </span><span class="kw">shapiro.test</span>((cur_data[,i]))
  pval <-<span class="st"> </span>swtest$p.value
  if(pval <<span class="st"> </span><span class="fl">0.05</span>)
  {
    <span class="co">#take the name from cur_data itself: header was built from cur_df, which lacks the Month column, so header[i] would be off by one</span>
    <span class="kw">print</span>(<span class="kw">colnames</span>(cur_data)[i])
    <span class="kw">print</span>(<span class="kw">shapiro.test</span>(cur_data[,i]))
  }
}</code></pre></div>
<pre><code>## [1] "off_campus_all_sites_sessions"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.87882, p-value = 0.003163
##
## [1] "off_campus_all_sites_users"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.87597, p-value = 0.002732
##
## [1] "on_campus_all_sites_pageviews"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.91295, p-value = 0.02028
##
## [1] "on_campus_all_sites_sessions"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.91023, p-value = 0.01737
##
## [1] "on_campus_all_sites_users"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.92745, p-value = 0.04728
##
## [1] "summon_pageviews"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.92177, p-value = 0.03382
##
## [1] "summon_sessions"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.86521, p-value = 0.001588
##
## [1] "summon_users"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.91301, p-value = 0.02036
##
## [1] "DB1_record_views"
##
## Shapiro-Wilk normality test
##
## data: cur_data[, i]
## W = 0.81417, p-value = 0.0001516</code></pre>
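<p>The per-variable results can also be gathered into one table, which is easier to scan than a run of printed tests. This is only a sketch: the <code>toy</code> data frame is constructed for the example, with one column built from normal quantiles and one that grows exponentially.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#constructed columns: one close to normal, one strongly right-skewed
toy <- data.frame(
  normal_ish = qnorm(ppoints(29)),
  skewed = exp(seq(0, 6, length.out = 29))
)
#one Shapiro-Wilk p-value per column
p_values <- vapply(toy, function(x) shapiro.test(x)$p.value, numeric(1))
normality <- data.frame(
  variable = names(p_values),
  p_value = p_values,
  reject_normality = p_values < 0.05,
  row.names = NULL
)
print(normality)</code></pre></div>
<p>Variables where <code>reject_normality</code> is TRUE are the ones to treat with caution in tests that assume a normal distribution.</p>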
</div>
<div class="slide section level2">
<h4 id="looking-for-relationships-between-variables">Looking for relationships between variables</h4>
<ul>
<li>We'll first look for relationships between pairs of variables to determine if any relationships exist that warrant further investigation</li>
<li>Covariance is a measure of how much two variables vary in relation to each other. They can have a positive relationship (a rise in temperature coincides with a rise in ice cream sales) or a negative relationship (a rise in temperature coincides with a drop in coat sales)</li>
<li>Covariance alone does not allow us to compare apples to apples, because its size depends on the units of each variable. To measure the relative strength of a relationship, we need to scale the covariance. For this, we'll use Pearson's Correlation Coefficient, which gives a result between -1 and 1: -1 indicates a perfect negative linear relationship, 0 no linear relationship, and 1 a perfect positive linear relationship.</li>
<li>We'll want to identify the variables that concern us most
<ul>
<li>Article Downloads (JR1_retrievals)</li>
<li>Discovery sessions and users (summon_sessions, summon_users)</li>
<li>Users and sessions on the library website (libraries_site_sessions, libraries_site_users)</li>
</ul></li>
<li>A function to run through all variables and find the most significant correlations</li>
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#adapted from https://stackoverflow.com/questions/21604997/how-to-find-significant-correlations-in-a-large-dataset</span>
<span class="kw">library</span>(corrgram) <span class="co">#provides corrgram()</span>
<span class="kw">corrgram</span>(cur_df)</code></pre></div>
<div class="figure">
<img src="figure/corr1-1.png" alt="plot of chunk corr1" />
<p class="caption">plot of chunk corr1</p>
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#what are the correlations with a P value under 0.05?</span>
get.correlations <-<span class="st"> </span>function(cur_df)
{
correlations <-<span class="st"> </span><span class="kw">rcorr</span>(<span class="kw">as.matrix</span>(cur_df))
for (i in <span class="dv">1</span>:<span class="kw">length</span>(cur_df))
{
for (j in <span class="dv">1</span>:<span class="kw">length</span>(cur_df))
{
if (!<span class="kw">is.na</span>(correlations$P[i,j]))
{
<span class="co">#if the p-value is less than 0.05</span>
if(correlations$P[i,j] <<span class="st"> </span><span class="fl">0.05</span>)
{
<span class="co">#Define thresholds for what constitutes a "strong" correlation</span>
positive.rel <-<span class="st"> </span><span class="fl">0.8</span>
negative.rel <-<span class="st"> </span>-<span class="fl">0.8</span>
<span class="co">#if the relationship passes our strength test</span>
if((correlations$r[i,j] ><span class="st"> </span>positive.rel) ||<span class="st"> </span>(correlations$r[i,j] <<span class="st"> </span>negative.rel))
{
<span class="kw">print</span>(<span class="kw">paste</span>(<span class="kw">rownames</span>(correlations$P)[i], <span class="st">"-"</span> , <span class="kw">colnames</span>(correlations$P)[j], <span class="st">": "</span>, correlations$r[i,j]))
}
}
}
}
}
<span class="kw">return</span>(correlations)
}
my.cor <-<span class="st"> </span><span class="kw">get.correlations</span>(cur_df)</code></pre></div>
<pre><code>## [1] "off_campus_all_sites_pageviews - off_campus_all_sites_users : 0.802825570106506"
## [1] "off_campus_all_sites_sessions - on_campus_all_sites_pageviews : 0.83315509557724"
## [1] "off_campus_all_sites_users - off_campus_all_sites_pageviews : 0.802825570106506"
## [1] "on_campus_all_sites_pageviews - off_campus_all_sites_sessions : 0.83315509557724"
## [1] "on_campus_all_sites_pageviews - on_campus_all_sites_sessions : 0.960258185863495"
## [1] "on_campus_all_sites_pageviews - on_campus_all_sites_users : 0.980291903018951"
## [1] "on_campus_all_sites_pageviews - libraries_site_sessions : 0.817626476287842"
## [1] "on_campus_all_sites_pageviews - catalog_pageviews : 0.882542729377747"
## [1] "on_campus_all_sites_pageviews - catalog_sessions : 0.93052738904953"
## [1] "on_campus_all_sites_pageviews - catalog_users : 0.888962030410767"
## [1] "on_campus_all_sites_pageviews - summon_sessions : 0.802591621875763"
## [1] "on_campus_all_sites_pageviews - summon_users : 0.891677916049957"
## [1] "on_campus_all_sites_sessions - on_campus_all_sites_pageviews : 0.960258185863495"
## [1] "on_campus_all_sites_sessions - on_campus_all_sites_users : 0.964977741241455"
## [1] "on_campus_all_sites_sessions - libraries_site_sessions : 0.856795430183411"
## [1] "on_campus_all_sites_sessions - catalog_sessions : 0.838036715984344"
## [1] "on_campus_all_sites_sessions - summon_sessions : 0.900032937526703"
## [1] "on_campus_all_sites_sessions - summon_users : 0.944520235061646"
## [1] "on_campus_all_sites_sessions - DB1_result_clicks : 0.80340051651001"
## [1] "on_campus_all_sites_users - on_campus_all_sites_pageviews : 0.980291903018951"
## [1] "on_campus_all_sites_users - on_campus_all_sites_sessions : 0.964977741241455"
## [1] "on_campus_all_sites_users - libraries_site_sessions : 0.824459075927734"
## [1] "on_campus_all_sites_users - catalog_pageviews : 0.847058236598969"
## [1] "on_campus_all_sites_users - catalog_sessions : 0.900045275688171"
## [1] "on_campus_all_sites_users - catalog_users : 0.868770956993103"
## [1] "on_campus_all_sites_users - summon_sessions : 0.825373888015747"
## [1] "on_campus_all_sites_users - summon_users : 0.904647767543793"
## [1] "libraries_site_sessions - on_campus_all_sites_pageviews : 0.817626476287842"
## [1] "libraries_site_sessions - on_campus_all_sites_sessions : 0.856795430183411"
## [1] "libraries_site_sessions - on_campus_all_sites_users : 0.824459075927734"
## [1] "libraries_site_sessions - libraries_site_users : 0.966796696186066"
## [1] "libraries_site_sessions - summon_sessions : 0.834935367107391"
## [1] "libraries_site_sessions - summon_users : 0.87566751241684"
## [1] "libraries_site_sessions - DB1_result_clicks : 0.803117275238037"
## [1] "libraries_site_users - libraries_site_sessions : 0.966796696186066"
## [1] "catalog_pageviews - on_campus_all_sites_pageviews : 0.882542729377747"
## [1] "catalog_pageviews - on_campus_all_sites_users : 0.847058236598969"
## [1] "catalog_pageviews - catalog_sessions : 0.978694677352905"
## [1] "catalog_pageviews - catalog_users : 0.973966360092163"
## [1] "catalog_sessions - on_campus_all_sites_pageviews : 0.93052738904953"
## [1] "catalog_sessions - on_campus_all_sites_sessions : 0.838036715984344"
## [1] "catalog_sessions - on_campus_all_sites_users : 0.900045275688171"
## [1] "catalog_sessions - catalog_pageviews : 0.978694677352905"
## [1] "catalog_sessions - catalog_users : 0.983946621417999"
## [1] "catalog_users - on_campus_all_sites_pageviews : 0.888962030410767"
## [1] "catalog_users - on_campus_all_sites_users : 0.868770956993103"
## [1] "catalog_users - catalog_pageviews : 0.973966360092163"
## [1] "catalog_users - catalog_sessions : 0.983946621417999"
## [1] "summon_pageviews - summon_users : 0.800552785396576"
## [1] "summon_pageviews - JR1_retrievals : 0.817062854766846"
## [1] "summon_sessions - on_campus_all_sites_pageviews : 0.802591621875763"
## [1] "summon_sessions - on_campus_all_sites_sessions : 0.900032937526703"
## [1] "summon_sessions - on_campus_all_sites_users : 0.825373888015747"
## [1] "summon_sessions - libraries_site_sessions : 0.834935367107391"
## [1] "summon_sessions - summon_users : 0.975864470005035"
## [1] "summon_sessions - DB1_result_clicks : 0.918987274169922"
## [1] "summon_users - on_campus_all_sites_pageviews : 0.891677916049957"
## [1] "summon_users - on_campus_all_sites_sessions : 0.944520235061646"
## [1] "summon_users - on_campus_all_sites_users : 0.904647767543793"
## [1] "summon_users - libraries_site_sessions : 0.87566751241684"
## [1] "summon_users - summon_pageviews : 0.800552785396576"
## [1] "summon_users - summon_sessions : 0.975864470005035"
## [1] "summon_users - DB1_result_clicks : 0.897328317165375"
## [1] "JR1_retrievals - summon_pageviews : 0.817062854766846"
## [1] "JR1_retrievals - DB1_result_clicks : 0.805746912956238"
## [1] "DB1_result_clicks - on_campus_all_sites_sessions : 0.80340051651001"
## [1] "DB1_result_clicks - libraries_site_sessions : 0.803117275238037"
## [1] "DB1_result_clicks - summon_sessions : 0.918987274169922"
## [1] "DB1_result_clicks - summon_users : 0.897328317165375"
## [1] "DB1_result_clicks - JR1_retrievals : 0.805746912956238"</code></pre>
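<p>Because the correlation matrix is symmetric, every pair appears twice in the output above. One possible way to list each strong pair only once is to look at the upper triangle of the matrix. The sketch below uses base R's <code>cor()</code> on a small simulated data frame (<code>toy_df</code> is hypothetical) and does not apply the p-value filter; in the real analysis the matrix would come from <code>cur_df</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical example data: b is built from a, c is unrelated noise
set.seed(1)
x <- rnorm(30)
toy_df <- data.frame(a = x, b = x + rnorm(30, sd = 0.2), c = rnorm(30))

r <- cor(toy_df)

#upper.tri() drops the diagonal and the mirrored lower half,
#so each pair of variables is considered exactly once
strong <- which(abs(r) > 0.8 & upper.tri(r), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(r)[strong[, "row"]],
  var2 = colnames(r)[strong[, "col"]],
  r = r[strong]
)
strong_pairs</code></pre></div>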
</div>
<div class="slide section level2">
<h4 id="looking-at-relationships-of-interest">Looking at relationships of interest</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#We can view individual correlations of interest with </span>
<span class="kw">cor</span>(cur_df$catalog_users, cur_df$summon_users)</code></pre></div>
<pre><code>## [1] 0.7450736</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cor</span>(cur_df$catalog_pageviews, cur_data$JR1_retrievals)</code></pre></div>
<pre><code>## [1] 0.3562041</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cor</span>(cur_data$summon_sessions, cur_data$JR1_retrievals)</code></pre></div>
<pre><code>## [1] 0.7458323</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cor</span>(cur_data$libraries_site_sessions, cur_data$summon_sessions)</code></pre></div>
<pre><code>## [1] 0.8349354</code></pre>
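<p><code>cor()</code> returns only the coefficient. If we also want a p-value and a confidence interval for a single pair, base R's <code>cor.test()</code> is one option. The sketch below runs it on simulated vectors (<code>sessions_a</code> and <code>sessions_b</code> are hypothetical stand-ins for two columns of the usage data).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical stand-ins for two usage columns
set.seed(2)
sessions_a <- rnorm(40, mean = 1000, sd = 100)
sessions_b <- 0.8 * sessions_a + rnorm(40, sd = 50)

pair_test <- cor.test(sessions_a, sessions_b, method = "pearson")
pair_test$estimate   #Pearson's r
pair_test$p.value    #significance of the correlation
pair_test$conf.int   #95% confidence interval for r</code></pre></div>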
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#We want to know how various factors impact the number of sessions in the discovery layer</span>
<span class="kw">qplot</span>(cur_data$Month,cur_data$summon_sessions,
<span class="dt">main=</span><span class="st">"Summon Sessions and Library Site Sessions"</span>,
<span class="dt">xlab=</span><span class="st">"Month"</span>,
<span class="dt">ylab=</span><span class="st">"Summon session count"</span>,
<span class="dt">col=</span><span class="st">"Summon Sessions"</span>) +<span class="st"> </span>
<span class="st"> </span><span class="kw">geom_line</span>(<span class="dt">col =</span> <span class="st">"blue"</span>) +<span class="st"> </span>
<span class="st">  </span><span class="kw">geom_line</span>(<span class="kw">aes</span>(<span class="dt">y =</span> cur_data$libraries_site_sessions, <span class="dt">col=</span><span class="st">"Library Site Sessions"</span>))</code></pre></div>
<div class="figure">
<img src="figure/relationship_isolation-1.png" alt="plot of chunk relationship_isolation" />
<p class="caption">plot of chunk relationship_isolation</p>
</div>
</div>
<div id="back-to-stage-3" class="slide section level2">
<h2>Back to Stage 3</h2>
<h3 id="always-be-willing-to-return-to-a-previous-stage-and-start-again">Always be willing to return to a previous stage and start again</h3>
<ul>
<li>We initially discarded pre-2014 data because some variables were incomplete. If our variables of interest have complete observations going back further, we'll want to use them</li>
</ul>
</div>
<div id="crisp-dm-stage-3-data-preparation-1" class="slide section level2">
<h2>CRISP-DM Stage 3: Data Preparation</h2>
<h3 id="cleaning-1">Cleaning</h3>
<h3 id="subset-of-the-data-since-our-variables-of-interest-have-more-observations-lets-focus-on-those-for-now">Subset of the data: since our variables of interest have more observations, let's focus on those for now</h3>
<p>We want a single subset of the variables whose observations go back to 2013</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">comp_obs3 <-<span class="st"> </span><span class="kw">select</span>(clean_usage, Month, libraries_site_pageviews, libraries_site_sessions, libraries_site_users, summon_pageviews, summon_sessions, summon_users)
comp_obs4 <-<span class="st"> </span>comp_obs3[<span class="kw">complete.cases</span>(comp_obs3), ]
<span class="co">#give each observation a period number</span>
comp_obs4[<span class="st">"period"</span>] <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">1</span>:<span class="kw">nrow</span>(comp_obs4))
<span class="co">#if data is final subset, assign to cur_data</span>
cur_data <-<span class="st"> </span>comp_obs4</code></pre></div>
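<p>To make the effect of <code>complete.cases()</code> concrete, here is a small sketch on a made-up data frame (<code>toy_usage</code> and its values are hypothetical). Any row with a missing value in any selected column is dropped, and the period numbers are assigned only to the rows that remain.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical data with gaps in two different columns
toy_usage <- data.frame(
  Month = c("2013-01", "2013-02", "2013-03", "2013-04"),
  site_sessions = c(1200, NA, 1350, 1400),
  summon_sessions = c(300, 320, NA, 360)
)

#keep only rows with no NA in any column
complete_rows <- toy_usage[complete.cases(toy_usage), ]

#number the surviving observations consecutively
complete_rows$period <- seq_len(nrow(complete_rows))
complete_rows</code></pre></div>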
</div>
<div class="slide section level2">
<ol start="3" style="list-style-type: decimal">
<li>Take a look at the data</li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">qplot</span>(cur_data$period, cur_data$libraries_site_pageviews, <span class="dt">main=</span><span class="st">"Library website hits by month"</span>, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"Hit count - Library sites"</span>) +<span class="st"> </span><span class="kw">geom_line</span>()</code></pre></div>
<div class="figure">
<img src="figure/secondplot-1.png" alt="plot of chunk secondplot" />
<p class="caption">plot of chunk secondplot</p>
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">qplot</span>(cur_data$period, cur_data$libraries_site_users, <span class="dt">main=</span><span class="st">"Library website users from on campus by month"</span>, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"User count - library sites"</span>) +<span class="st"> </span><span class="kw">geom_line</span>()</code></pre></div>
<div class="figure">
<img src="figure/secondplot-2.png" alt="plot of chunk secondplot" />
<p class="caption">plot of chunk secondplot</p>
</div>
</div>
<div class="slide section level2">
<ul>
<li>Clearly something was wrong with stats collection in Summer 2014, so let's remove those observations (rows 28 through 31)</li>
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">err_remove <-<span class="st"> </span>cur_data[-<span class="kw">c</span>(<span class="dv">28</span>,<span class="dv">29</span>,<span class="dv">30</span>,<span class="dv">31</span>), ]
cur_data <-<span class="st"> </span>err_remove
<span class="kw">qplot</span>(cur_data$period, cur_data$libraries_site_pageviews, <span class="dt">main=</span><span class="st">"Library website hits and users by period"</span>, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"Hit count - Library sites"</span>) +<span class="st"> </span><span class="kw">geom_line</span>() +
  <span class="kw">geom_line</span>(<span class="kw">aes</span>(<span class="dt">y =</span> cur_data$libraries_site_sessions, <span class="dt">col=</span><span class="st">"Libraries Site Sessions"</span>))</code></pre></div>
<div class="figure">
<img src="figure/error_remove-1.png" alt="plot of chunk error_remove" />
<p class="caption">plot of chunk error_remove</p>
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">qplot</span>(cur_data$period, cur_data$libraries_site_sessions, <span class="dt">main=</span><span class="st">"Library website and Summon sessions by month"</span>, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"Hit count - Library sites"</span>, <span class="dt">col=</span><span class="st">"Library Site Sessions"</span>) +<span class="st"> </span><span class="kw">geom_line</span>() +
  <span class="kw">geom_line</span>(<span class="kw">aes</span>(<span class="dt">y =</span> cur_data$summon_sessions, <span class="dt">col=</span><span class="st">"Summon Sessions"</span>))</code></pre></div>
<div class="figure">
<img src="figure/error_remove-2.png" alt="plot of chunk error_remove" />
<p class="caption">plot of chunk error_remove</p>
</div>
</div>
<div id="crisp-dm-stage-4-modeling" class="slide section level2">
<h2>CRISP-DM Stage 4: Modeling</h2>
<h4 id="regression">Regression</h4>
<p>Regression can be used to estimate a trend (a gradual shift over time): <span class="math inline"><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>(<em>t</em>) + <em>ϵ</em></span>, where <em>t</em> is the time period</p>
<ul>
<li>Some assumptions of regression:</li>
</ul>
<ol style="list-style-type: decimal">
<li>Linear relationship between dependent and independent variable</li>
<li>No correlation between the explanatory variables and the error</li>
<li>Constant variance of errors</li>
<li>Error distribution is normal</li>
</ol>
<ul>
<li>A basic linear regression</li>
<li>We want to predict the number of sessions (the response or dependent variable) given a month (the explanatory or independent variable)</li>
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">basic_reg_smn <-<span class="st"> </span><span class="kw">lm</span>(summon_sessions~period, <span class="dt">data=</span>cur_data)
basic_reg_smn</code></pre></div>
<pre><code>##
## Call:
## lm(formula = summon_sessions ~ period, data = cur_data)
##
## Coefficients:
## (Intercept) period
## 25192.6 39.2</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">basic_reg_libses <-<span class="st"> </span><span class="kw">lm</span>(libraries_site_sessions~period, <span class="dt">data=</span>cur_data)
basic_reg_libses</code></pre></div>
<pre><code>##
## Call:
## lm(formula = libraries_site_sessions ~ period, data = cur_data)
##
## Coefficients:
## (Intercept) period
## 207358 -2574</code></pre>
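<p>The coefficients can be read directly as a trend line: the expected value for a given period is the intercept plus the slope times the period number. As a worked example, the sketch below plugs the coefficients printed above into that formula for period 12 (the helper name <code>trend_value</code> is ours, not part of the analysis).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#coefficients copied from the lm() output above
summon_intercept <- 25192.6
summon_slope <- 39.2
libses_intercept <- 207358
libses_slope <- -2574

#hypothetical helper: expected value on the trend line at a given period
trend_value <- function(intercept, slope, period) {
  intercept + slope * period
}

summon_p12 <- trend_value(summon_intercept, summon_slope, 12)   #25663
libses_p12 <- trend_value(libses_intercept, libses_slope, 12)   #176470</code></pre></div>
<p>Summon sessions gain only about 39 sessions per period, while library site sessions lose about 2574 per period.</p>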
</div>
<div class="slide section level2">
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plot</span>(cur_data$summon_sessions~cur_data$period, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"Summon Sessions"</span>)
<span class="kw">abline</span>(basic_reg_smn)</code></pre></div>
<div class="figure">
<img src="figure/reg_plot1-1.png" alt="plot of chunk reg_plot1" />
<p class="caption">plot of chunk reg_plot1</p>
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plot</span>(cur_data$libraries_site_sessions~cur_data$period, <span class="dt">xlab=</span><span class="st">"Month"</span>, <span class="dt">ylab=</span><span class="st">"Library Site Sessions"</span>)
<span class="kw">abline</span>(basic_reg_libses)</code></pre></div>
<div class="figure">
<img src="figure/reg_plot1-2.png" alt="plot of chunk reg_plot1" />
<p class="caption">plot of chunk reg_plot1</p>
</div>
<ul>
<li>We see two clear trends, but are the models accurate?</li>
</ul>
</div>
<div id="crisp-dm-stage-5-evaluation" class="slide section level2">
<h2>CRISP-DM Stage 5: Evaluation</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(basic_reg_smn)</code></pre></div>
<pre><code>##
## Call:
## lm(formula = summon_sessions ~ period, data = cur_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13244 -8769 2508 6760 13719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25192.59 2134.90 11.800 <2e-16 ***
## period 39.20 58.26 0.673 0.504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8204 on 56 degrees of freedom
## Multiple R-squared: 0.00802, Adjusted R-squared: -0.009694
## F-statistic: 0.4527 on 1 and 56 DF, p-value: 0.5038</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(basic_reg_libses)</code></pre></div>
<pre><code>##
## Call:
## lm(formula = libraries_site_sessions ~ period, data = cur_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62545 -22734 -2733 32984 65989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 207357.9 9279.3 22.35 < 2e-16 ***
## period -2574.4 253.2 -10.17 2.5e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35660 on 56 degrees of freedom
## Multiple R-squared: 0.6486, Adjusted R-squared: 0.6423
## F-statistic: 103.4 on 1 and 56 DF, p-value: 2.503e-14</code></pre>
<ul>
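<li>The model for Summon sessions is nowhere near our desired level of accuracy: the slope is not significant (p = 0.504) and R-squared is only 0.008.</li>
<li>The model for Library Website Sessions is very promising: the slope is highly significant (p = 2.5e-14) and R-squared is 0.65.</li>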
</ul>
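<p>The summary tables do not tell us whether the regression assumptions hold. A rough way to check them is to inspect the residuals. The sketch below does this on simulated data (the numbers are hypothetical, loosely shaped like the library site sessions series); with the real models, <code>fit</code> would be replaced by <code>basic_reg_libses</code> or <code>basic_reg_smn</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical series: downward trend plus normally distributed noise
set.seed(123)
period <- 1:58
sessions <- 200000 - 2500 * period + rnorm(58, sd = 30000)
fit <- lm(sessions ~ period)

res <- residuals(fit)

#normality of errors: a small p-value would suggest non-normal residuals
res_normality <- shapiro.test(res)
res_normality

#linearity and constant variance: the points should scatter evenly
#around zero with no curve or funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)</code></pre></div>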
</div>
<div id="back-to-crisp-dm-stage-4-modeling" class="slide section level2">
<h2>Back to CRISP-DM Stage 4: Modeling</h2>
<h3 id="additive-decomposition-model">Additive decomposition model</h3>
<ul>
<li>Essentially means we're adding or subtracting values to reduce the impact of seasons</li>
<li>Looking for a linear trend</li>
<li>For monthly data, there are 12 seasonal time periods, so we will need 12-1 dummy variables</li>
<li>d_Mon1 = 1 if Month is January, 0 otherwise</li>
<li>d_Mon2 = 1 if Month is February, 0 otherwise...</li>
<li>We'll omit December to avoid multicollinearity, so the number of dummy variables is always one less than the number of seasonal periods (n-1)</li>
<li><p>Model is <span class="math inline"><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>(<em>d</em>_<em>M</em><em>o</em><em>n</em>1) + <em>β</em><sub>2</sub>(<em>d</em>_<em>M</em><em>o</em><em>n</em>2) + ... + <em>β</em><sub>11</sub>(<em>d</em>_<em>M</em><em>o</em><em>n</em>11) + <em>β</em><sub>12</sub>(<em>t</em>) + <em>ϵ</em></span></p></li>
<li><p>So when we want to know what value to expect in January, our model is: E(y | t, Month=1) = <span class="math inline"><em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>(1) + <em>β</em><sub>2</sub>(0) + ... + <em>β</em><sub>12</sub></span> (t)</p></li>
<li><p>The coefficient <span class="math inline"><em>β</em><sub>12</sub></span> is the same for every month we look at. This gives us the slope of the trend line</p></li>
<li><p>The model for the omitted baseline month (December here, where every dummy variable is 0) is: <span class="math inline"><em>β</em><sub>0</sub> + <em>β</em><sub>12</sub></span> (t)</p></li>
</ul>
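<p>A minimal sketch of the model as written above, with both the month dummies and the trend term <em>t</em>, is shown below on simulated data. All values, including <code>seasonal_effect</code>, are hypothetical. R builds the 11 dummy variables automatically when a factor is used in the formula, and the first factor level (December, coded as 0) becomes the baseline.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical monthly series over three years
set.seed(42)
period <- 1:36
month_num <- ((period - 1) %% 12) + 1

#December (12 %% 12 = 0) becomes level "0", the omitted baseline
d_Mon <- as.factor(month_num %% 12)

#made-up seasonal offsets for January..December and a slope of 0.5 per period
seasonal_effect <- c(5, 8, 6, 7, 2, 0, 0, 1, 9, 10, 8, 0)
y <- 100 + seasonal_effect[month_num] + 0.5 * period + rnorm(36, sd = 0.5)
toy <- data.frame(y, period, d_Mon)

#the dummy variables R creates from the factor: an intercept plus d_Mon1..d_Mon11
dummies <- model.matrix(~ d_Mon, data = toy)
head(dummies, 3)

#seasonal dummies plus a linear trend
fit <- lm(y ~ d_Mon + period, data = toy)
length(coef(fit))     #13: intercept, 11 dummies, trend slope
coef(fit)["period"]   #estimate of the trend slope</code></pre></div>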
</div>
<div class="slide section level2">
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#we want the month numbers as a factor so they can be used as categorical variables</span>
cur_data$d_Mon <-<span class="st"> </span><span class="kw">as.factor</span>(<span class="kw">month</span>(cur_data$Month) %%<span class="st"> </span><span class="dv">12</span>)
seasonal.model <-<span class="st"> </span><span class="kw">lm</span>(libraries_site_sessions~d_Mon, <span class="dt">data=</span>cur_data)
seasonal.model2 <-<span class="st"> </span><span class="kw">lm</span>(summon_sessions~d_Mon, <span class="dt">data=</span>cur_data)
seasonal.model2</code></pre></div>
<pre><code>##
## Call:
## lm(formula = summon_sessions ~ d_Mon, data = cur_data)
##
## Coefficients:
## (Intercept) d_Mon1 d_Mon2 d_Mon3 d_Mon4
## 16646.2 8520.6 15764.1 10889.5 14154.8
## d_Mon5 d_Mon6 d_Mon7 d_Mon8 d_Mon9
## 4062.6 855.1 1440.3 -628.9 16875.4
## d_Mon10 d_Mon11
## 20451.4 16954.6</code></pre>
</div>
<div id="crisp-dm-stage-5-evaluation-1" class="slide section level2">
<h2>CRISP-DM Stage 5: Evaluation</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(seasonal.model)</code></pre></div>
<pre><code>##
## Call:
## lm(formula = libraries_site_sessions ~ d_Mon, data = cur_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89154 -37610 -21278 42056 125621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81830 26484 3.090 0.00339 **
## d_Mon1 36767 37454 0.982 0.33140
## d_Mon2 66974 35859 1.868 0.06818 .
## d_Mon3 48753 35859 1.360 0.18059
## d_Mon4 71988 37454 1.922 0.06080 .
## d_Mon5 38540 39726 0.970 0.33704
## d_Mon6 7585 39726 0.191 0.84942
## d_Mon7 7641 39726 0.192 0.84832
## d_Mon8 23682 39726 0.596 0.55401
## d_Mon9 77985 37454 2.082 0.04292 *
## d_Mon10 78743 37454 2.102 0.04102 *
## d_Mon11 45018 37454 1.202 0.23552
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59220 on 46 degrees of freedom
## Multiple R-squared: 0.204, Adjusted R-squared: 0.01363
## F-statistic: 1.072 on 11 and 46 DF, p-value: 0.404</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(seasonal.model2)</code></pre></div>
<pre><code>##
## Call:
## lm(formula = summon_sessions ~ d_Mon, data = cur_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10262 -2027 2 1804 15320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16646.2 1813.6 9.179 5.78e-12 ***
## d_Mon1 8520.6 2564.8 3.322 0.00176 **
## d_Mon2 15764.1 2455.6 6.420 6.78e-08 ***
## d_Mon3 10889.5 2455.6 4.435 5.70e-05 ***
## d_Mon4 14154.8 2564.8 5.519 1.51e-06 ***
## d_Mon5 4062.6 2720.4 1.493 0.14216
## d_Mon6 855.1 2720.4 0.314 0.75470
## d_Mon7 1440.3 2720.4 0.529 0.59904
## d_Mon8 -628.9 2720.4 -0.231 0.81818
## d_Mon9 16875.4 2564.8 6.580 3.90e-08 ***
## d_Mon10 20451.4 2564.8 7.974 3.23e-10 ***
## d_Mon11 16954.6 2564.8 6.611 3.50e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4055 on 46 degrees of freedom
## Multiple R-squared: 0.8009, Adjusted R-squared: 0.7533
## F-statistic: 16.82 on 11 and 46 DF, p-value: 1.234e-12</code></pre>
<ul>
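<li><p>Modeling the seasonal effect on its own has made the library site sessions model worse: adjusted R-squared falls from 0.64 to 0.01, and the overall F-test is not significant (p = 0.404).</p></li>
<li><p>The first regression on Summon sessions did not yield a useful model, but this model looks promising: adjusted R-squared is 0.75.</p></li>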
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#actual Summon sessions for January and February 2017: 26600, 34370</span>
new_data <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">d_Mon =</span> <span class="st">"1"</span>)
pred_basic_reg_smn_jan <-<span class="st"> </span><span class="kw">predict</span>(seasonal.model2, new_data, <span class="dt">interval=</span><span class="st">"confidence"</span>, <span class="dt">level=</span><span class="fl">0.95</span>)
pred_basic_reg_smn_jan</code></pre></div>
<pre><code>## fit lwr upr
## 1 25166.8 21516.26 28817.34</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">new_data <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">d_Mon =</span> <span class="st">"2"</span>)
pred_basic_reg_smn_feb <-<span class="st"> </span><span class="kw">predict</span>(seasonal.model2, new_data, <span class="dt">interval=</span><span class="st">"confidence"</span>, <span class="dt">level=</span><span class="fl">0.95</span>)
pred_basic_reg_smn_feb</code></pre></div>
<pre><code>## fit lwr upr
## 1 32410.33 29077.86 35742.81</code></pre>
<ul>
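<li>Seasonal Model 2 for Summon sessions seems to be working: the actual values for January and February 2017 (26600 and 34370) both fall inside the 95% intervals above.</li>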
</ul>
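<p>One caveat: <code>interval="confidence"</code> describes the uncertainty of the mean for a month, not of a single new observation. When comparing against one actual month, <code>interval="prediction"</code> may be the more appropriate choice, and it is always wider. The sketch below shows the difference on simulated data (all values are hypothetical).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">#hypothetical seasonal series: five years of monthly counts
set.seed(7)
d_Mon <- factor(rep(0:11, times = 5))
sessions <- 20000 + 1000 * as.integer(as.character(d_Mon)) + rnorm(60, sd = 3000)
toy <- data.frame(sessions, d_Mon)

fit <- lm(sessions ~ d_Mon, data = toy)
new_data <- data.frame(d_Mon = factor("1", levels = levels(d_Mon)))

#interval for the mean January value
conf_int <- predict(fit, new_data, interval = "confidence", level = 0.95)
#interval for a single new January observation
pred_int <- predict(fit, new_data, interval = "prediction", level = 0.95)

conf_int
pred_int</code></pre></div>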
<ul>
<li>The same thing done manually</li>
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#create a new variable for each month</span>
dummy.vars <-<span class="st"> </span>function(col,i){
new_val <-<span class="st"> </span>col
for (j in col){