Take the supplied arguments and insert them into the appropriate places in a new template diary, then write that new template diary to the supplied directory.
use_diary(
  st,
  type,
  author,
  path = "state/{st}/{type}/docs/{st}_{type}_diary.Rmd",
  auto = FALSE
)
st: The USPS state abbreviation. State data only, no federal agencies.

type: The type of data, one of "contribs", "expends", "lobby", "contracts", "salary", or "voters".

author: The author name of the new diary.

path: The file path, relative to your working directory, where the diary file will be created. If NA is supplied, the lines of the diary are instead returned as a character vector. If the path contains directories that do not exist, they will be created. By default, the path creates the diary in the directory expected by the Accountability Project GitHub repository.

auto: Must be set to TRUE for the diary to be created and opened, as shown in the sketch below.
The file path of the new diary, returned invisibly.
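A minimal sketch, assuming your working directory is the root of the accountability_datacleaning repository: with the default path and auto = TRUE the diary is written to disk and opened for editing.

# not run: write the diary to the default repository path and open it
use_diary("VT", "contribs", "Kiernan Nicholls", auto = TRUE)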
use_diary("VT", "contribs", "Kiernan Nicholls", NA, auto = FALSE)
#> [1] "---"
#> [2] "title: \"Vermont Contributions\""
#> [3] "author: \"Kiernan Nicholls\""
#> [4] "date: \"`r date()`\""
#> [5] "output:"
#> [6] " github_document: "
#> [7] " df_print: tibble"
#> [8] " toc: true"
#> [9] " toc_depth: 3"
#> [10] "editor_options: "
#> [11] " chunk_output_type: console"
#> [12] "---"
#> [13] ""
#> [14] "<!-- Place comments regarding knitting here -->"
#> [15] ""
#> [16] "```{r setup, include=FALSE, purl=FALSE}"
#> [17] "library(knitr)"
#> [18] "opts_chunk$set("
#> [19] " eval = TRUE,"
#> [20] " echo = TRUE,"
#> [21] " warning = FALSE,"
#> [22] " message = FALSE,"
#> [23] " error = FALSE,"
#> [24] " collapse = TRUE,"
#> [25] " comment = \"#>\","
#> [26] " fig.path = \"../plots/\","
#> [27] " fig.width = 10,"
#> [28] " dpi = 300"
#> [29] ")"
#> [30] "if (!interactive()) {"
#> [31] " options(width = 120)"
#> [32] " set.seed(5)"
#> [33] "}"
#> [34] "```"
#> [35] ""
#> [36] "```{r create-docs-dir, eval=FALSE, echo=FALSE, include=FALSE}"
#> [37] "doc_dir <- fs::dir_create(here::here(\"state\", \"vt\", \"contribs\", \"docs\"))"
#> [38] "```"
#> [39] ""
#> [40] "## Project"
#> [41] ""
#> [42] "The Accountability Project is an effort to cut across data silos and give"
#> [43] "journalists, policy professionals, activists, and the public at large a simple"
#> [44] "way to search across huge volumes of public data about people and organizations."
#> [45] ""
#> [46] "Our goal is to standardize public data on a few key fields by thinking of each"
#> [47] "dataset row as a transaction. For each transaction there should be (at least) 3"
#> [48] "variables:"
#> [49] ""
#> [50] "1. All **parties** to a transaction."
#> [51] "2. The **date** of the transaction."
#> [52] "3. The **amount** of money involved."
#> [53] ""
#> [54] "## Objectives"
#> [55] ""
#> [56] "This document describes the process used to complete the following objectives:"
#> [57] ""
#> [58] "1. How many records are in the database?"
#> [59] "1. Check for entirely duplicated records."
#> [60] "1. Check ranges of continuous variables."
#> [61] "1. Is there anything blank or missing?"
#> [62] "1. Check for consistency issues."
#> [63] "1. Create a five-digit ZIP Code called `zip`."
#> [64] "1. Create a `year` field from the transaction date."
#> [65] "1. Make sure there is data on both parties to a transaction."
#> [66] ""
#> [67] "## Packages"
#> [68] ""
#> [69] "The following packages are needed to collect, manipulate, visualize, analyze,"
#> [70] "and communicate these results. The `pacman` package will facilitate their"
#> [71] "installation and attachment."
#> [72] ""
#> [73] "```{r load-packages, message=FALSE, warning=FALSE, error=FALSE}"
#> [74] "if (!require(\"pacman\")) {"
#> [75] " install.packages(\"pacman\")"
#> [76] "}"
#> [77] "pacman::p_load("
#> [78] " tidyverse, # data manipulation"
#> [79] " lubridate, # datetime strings"
#> [80] " gluedown, # printing markdown"
#> [81] " jsonlite, # read json files"
#> [82] " janitor, # clean data frames"
#> [83] " campfin, # custom irw tools"
#> [84] " aws.s3, # aws cloud storage"
#> [85] " readxl, # read excel files"
#> [86] " refinr, # cluster & merge"
#> [87] " scales, # format strings"
#> [88] " knitr, # knit documents"
#> [89] " rvest, # scrape html"
#> [90] " glue, # code strings"
#> [91] " here, # project paths"
#> [92] " httr, # http requests"
#> [93] " fs # local storage "
#> [94] ")"
#> [95] "```"
#> [96] ""
#> [97] "This diary was run using `campfin` version `r packageVersion(\"campfin\")`."
#> [98] ""
#> [99] "```{r campfin-version}"
#> [100] "packageVersion(\"campfin\")"
#> [101] "```"
#> [102] ""
#> [103] "```{r package-options, echo=FALSE}"
#> [104] "options(knitr.kable.NA = \"\")"
#> [105] "```"
#> [106] ""
#> [107] "This document should be run as part of the `R_tap` project, which lives as a"
#> [108] "sub-directory of the more general, language-agnostic"
#> [109] "[`irworkshop/accountability_datacleaning`][tap] GitHub repository."
#> [110] ""
#> [111] "The `R_tap` project uses the [RStudio projects][rproj] feature and should be"
#> [112] "run as such. The project also uses the dynamic `here::here()` tool for file"
#> [113] "paths relative to _your_ machine."
#> [114] ""
#> [115] "```{r where-here}"
#> [116] "# where does this document knit?"
#> [117] "here::i_am(\"state/vt/contribs/docs/vt_contribs_diary.Rmd\")"
#> [118] "```"
#> [119] ""
#> [120] "[tap]: https://github.com/irworkshop/accountability_datacleaning"
#> [121] "[rproj]: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects"
#> [122] ""
#> [123] "## Source"
#> [124] ""
#> [125] ""
#> [126] ""
#> [127] "## Download"
#> [128] ""
#> [129] "```{r raw-dir}"
#> [130] "raw_url <- \"https://example.com/source_file.csv\""
#> [131] "raw_dir <- dir_create(here(\"state\", \"vt\", \"contribs\", \"data\", \"raw\"))"
#> [132] "raw_csv <- path(raw_dir, basename(raw_url))"
#> [133] "```"
#> [134] ""
#> [135] "```{r raw-download}"
#> [136] "if (!file_exists(raw_csv)) {"
#> [137] " download.file(raw_url, raw_csv)"
#> [138] "}"
#> [139] "```"
#> [140] ""
#> [141] "## Read"
#> [142] ""
#> [143] "```{r raw-read}"
#> [144] "vtc <- read_delim("
#> [145] " file = raw_csv,"
#> [146] " delim = \",\","
#> [147] " escape_backslash = FALSE,"
#> [148] " escape_double = FALSE,"
#> [149] " col_types = cols("
#> [150] " .default = col_character(),"
#> [151] " date = col_date_mdy(),"
#> [152] " amount = col_double()"
#> [153] " )"
#> [154] ")"
#> [155] "```"
#> [156] ""
#> [157] "```{r clean-names}"
#> [158] "vtc <- clean_names(vtc, case = \"snake\")"
#> [159] "```"
#> [160] ""
#> [161] "## Explore"
#> [162] ""
#> [163] "There are `r comma(nrow(vtc))` rows of `r ncol(vtc)` columns. Each record"
#> [164] "represents a single Contributions..."
#> [165] ""
#> [166] "```{r glimpse}"
#> [167] "glimpse(vtc)"
#> [168] "tail(vtc)"
#> [169] "```"
#> [170] ""
#> [171] "### Missing"
#> [172] ""
#> [173] "Columns vary in their degree of missing values."
#> [174] ""
#> [175] "```{r na-count}"
#> [176] "col_stats(vtc, count_na)"
#> [177] "```"
#> [178] ""
#> [179] "We can flag any record missing a key variable needed to identify a transaction."
#> [180] ""
#> [181] "```{r na-flag}"
#> [182] "key_vars <- c(\"date\", \"last_name\", \"amount\", \"committee_name\")"
#> [183] "vtc <- flag_na(vtc, all_of(key_vars))"
#> [184] "sum(vtc$na_flag)"
#> [185] "```"
#> [186] ""
#> [187] "```{r na-view}"
#> [188] "vtc %>% "
#> [189] " filter(na_flag) %>% "
#> [190] " select(all_of(key_vars))"
#> [191] "```"
#> [192] ""
#> [193] "### Duplicates"
#> [194] ""
#> [195] "We can also flag any record completely duplicated across every column."
#> [196] ""
#> [197] "```{r dupe-flag}"
#> [198] "vtc <- flag_dupes(vtc, -id)"
#> [199] "sum(vtc$dupe_flag)"
#> [200] "```"
#> [201] ""
#> [202] "```{r dupe-view}"
#> [203] "vtc %>% "
#> [204] " filter(dupe_flag) %>% "
#> [205] " select(all_of(key_vars)) %>% "
#> [206] " arrange(date)"
#> [207] "```"
#> [208] ""
#> [209] "### Categorical"
#> [210] ""
#> [211] "```{r distinct-count}"
#> [212] "col_stats(vtc, n_distinct)"
#> [213] "```"
#> [214] ""
#> [215] "```{r distinct-plots, echo=FALSE, fig.height=3}"
#> [216] "explore_plot(vtc, type)"
#> [217] "```"
#> [218] ""
#> [219] "### Amounts"
#> [220] ""
#> [221] "```{r amount-round}"
#> [222] "# fix floating point precision"
#> [223] "vtc$amount <- round(vtc$amount, digits = 2)"
#> [224] "```"
#> [225] ""
#> [226] "```{r amount-summary}"
#> [227] "summary(vtc$amount)"
#> [228] "mean(vtc$amount <= 0)"
#> [229] "```"
#> [230] ""
#> [231] "These are the records with the minimum and maximum amounts."
#> [232] ""
#> [233] "```{r amount-minmax}"
#> [234] "glimpse(vtc[c(which.max(vtc$amount), which.min(vtc$amount)), ])"
#> [235] "```"
#> [236] ""
#> [237] "The distribution of amount values is typically log-normal."
#> [238] ""
#> [239] "```{r hist-amount, echo=FALSE}"
#> [240] "vtc %>%"
#> [241] " ggplot(aes(amount)) +"
#> [242] " geom_histogram(fill = dark2[\"purple\"]) +"
#> [243] " scale_y_continuous(labels = comma) +"
#> [244] " scale_x_continuous("
#> [245] " breaks = c(1 %o% 10^(0:6)),"
#> [246] " labels = dollar,"
#> [247] " trans = \"log10\""
#> [248] " ) +"
#> [249] " labs("
#> [250] " title = \"Vermont Contributions Amount Distribution\","
#> [251] " caption = \"Source: {source}\","
#> [252] " x = \"Amount\","
#> [253] " y = \"Count\""
#> [254] " )"
#> [255] "```"
#> [256] ""
#> [257] "### Dates"
#> [258] ""
#> [259] "We can add the calendar year from `date` with `lubridate::year()`"
#> [260] ""
#> [261] "```{r date-year}"
#> [262] "vtc <- mutate(vtc, year = year(date))"
#> [263] "```"
#> [264] ""
#> [265] "```{r date-range}"
#> [266] "min(vtc$date)"
#> [267] "sum(vtc$year < 2000)"
#> [268] "max(vtc$date)"
#> [269] "sum(vtc$date > today())"
#> [270] "```"
#> [271] ""
#> [272] "It's common to see an increase in the number of contributions in election years."
#> [273] ""
#> [274] "```{r bar-year, echo=FALSE}"
#> [275] "vtc %>% "
#> [276] " count(year) %>% "
#> [277] " mutate(even = is_even(year)) %>% "
#> [278] " ggplot(aes(x = year, y = n)) +"
#> [279] " geom_col(aes(fill = even)) + "
#> [280] " scale_fill_brewer(palette = \"Dark2\") +"
#> [281] " scale_y_continuous(labels = comma) +"
#> [282] " scale_x_continuous(breaks = seq(2000, 2020, by = 2)) +"
#> [283] " theme(legend.position = \"bottom\") +"
#> [284] " labs("
#> [285] " title = \"Vermont Contributions by Year\","
#> [286] " caption = \"Source: {source}\","
#> [287] " fill = \"Election Year\","
#> [288] " x = \"Year Made\","
#> [289] " y = \"Count\""
#> [290] " )"
#> [291] "```"
#> [292] ""
#> [293] "## Wrangle"
#> [294] ""
#> [295] "To improve the searchability of the database, we will perform some consistent,"
#> [296] "confident string normalization. For geographic variables like city names and"
#> [297] "ZIP codes, the corresponding `campfin::normal_*()` functions are tailor made to "
#> [298] "facilitate this process."
#> [299] ""
#> [300] "### Address"
#> [301] ""
#> [302] "For the street `address` variable, the `campfin::normal_address()` function"
#> [303] "will force consistent case, remove punctuation, and abbreviate official "
#> [304] "USPS suffixes."
#> [305] ""
#> [306] "```{r address-norm}"
#> [307] "addr_norm <- vtc %>% "
#> [308] " distinct(address1, address2) %>% "
#> [309] " unite("
#> [310] " col = address_full,"
#> [311] " starts_with(\"address\"),"
#> [312] " sep = \" \","
#> [313] " remove = FALSE,"
#> [314] " na.rm = TRUE"
#> [315] " ) %>% "
#> [316] " mutate("
#> [317] " address_norm = normal_address("
#> [318] " address = address_full,"
#> [319] " abbs = usps_street,"
#> [320] " na_rep = TRUE"
#> [321] " )"
#> [322] " ) %>% "
#> [323] " select(-address_full)"
#> [324] "```"
#> [325] ""
#> [326] "```{r address-view}"
#> [327] "addr_norm"
#> [328] "```"
#> [329] ""
#> [330] "```{r address-join}"
#> [331] "vtc <- left_join(vtc, addr_norm, by = c(\"address1\", \"address2\"))"
#> [332] "```"
#> [333] ""
#> [334] "### ZIP"
#> [335] ""
#> [336] "For ZIP codes, the `campfin::normal_zip()` function will attempt to create"
#> [337] "valid _five_ digit codes by removing the ZIP+4 suffix and returning leading"
#> [338] "zeroes dropped by other programs like Microsoft Excel."
#> [339] ""
#> [340] "```{r zip-norm}"
#> [341] "vtc <- vtc %>% "
#> [342] " mutate("
#> [343] " zip_norm = normal_zip("
#> [344] " zip = zip,"
#> [345] " na_rep = TRUE"
#> [346] " )"
#> [347] " )"
#> [348] "```"
#> [349] ""
#> [350] "```{r zip-progress}"
#> [351] "progress_table("
#> [352] " vtc$zip,"
#> [353] " vtc$zip_norm,"
#> [354] " compare = valid_zip"
#> [355] ")"
#> [356] "```"
#> [357] ""
#> [358] "### State"
#> [359] ""
#> [360] "Valid two digit state abbreviations can be made using the "
#> [361] "`campfin::normal_state()` function."
#> [362] ""
#> [363] "```{r state-norm}"
#> [364] "vtc <- vtc %>% "
#> [365] " mutate("
#> [366] " state_norm = normal_state("
#> [367] " state = state,"
#> [368] " abbreviate = TRUE,"
#> [369] " na_rep = TRUE,"
#> [370] " valid = valid_state"
#> [371] " )"
#> [372] " )"
#> [373] "```"
#> [374] ""
#> [375] "```{r state-view}"
#> [376] "vtc %>% "
#> [377] " filter(state != state_norm) %>% "
#> [378] " count(state, state_norm, sort = TRUE)"
#> [379] "```"
#> [380] ""
#> [381] "```{r state-progress}"
#> [382] "progress_table("
#> [383] " vtc$state,"
#> [384] " vtc$state_norm,"
#> [385] " compare = valid_state"
#> [386] ")"
#> [387] "```"
#> [388] ""
#> [389] "### City"
#> [390] ""
#> [391] "Cities are the most difficult geographic variable to normalize, simply due to"
#> [392] "the wide variety of valid cities and formats."
#> [393] ""
#> [394] "#### Normal"
#> [395] ""
#> [396] "The `campfin::normal_city()` function is a good start, again converting case,"
#> [397] "removing punctuation, but _expanding_ USPS abbreviations. We can also remove"
#> [398] "`invalid_city` values."
#> [399] ""
#> [400] "```{r city-norm}"
#> [401] "norm_city <- vtc %>% "
#> [402] " distinct(city, state_norm, zip_norm) %>% "
#> [403] " mutate("
#> [404] " city_norm = normal_city("
#> [405] " city = city, "
#> [406] " abbs = usps_city,"
#> [407] " states = c(\"VT\", \"DC\", \"VERMONT\"),"
#> [408] " na = invalid_city,"
#> [409] " na_rep = TRUE"
#> [410] " )"
#> [411] " )"
#> [412] "```"
#> [413] ""
#> [414] "#### Swap"
#> [415] ""
#> [416] "We can further improve normalization by comparing our normalized value"
#> [417] "against the _expected_ value for that record's state abbreviation and ZIP code."
#> [418] "If the normalized value is either an abbreviation for or very similar to the"
#> [419] "expected value, we can confidently swap those two."
#> [420] ""
#> [421] "```{r city-swap}"
#> [422] "norm_city <- norm_city %>% "
#> [423] " rename(city_raw = city) %>% "
#> [424] " left_join("
#> [425] " y = zipcodes,"
#> [426] " by = c("
#> [427] " \"state_norm\" = \"state\","
#> [428] " \"zip_norm\" = \"zip\""
#> [429] " )"
#> [430] " ) %>% "
#> [431] " rename(city_match = city) %>% "
#> [432] " mutate("
#> [433] " match_abb = is_abbrev(city_norm, city_match),"
#> [434] " match_dist = str_dist(city_norm, city_match),"
#> [435] " city_swap = if_else("
#> [436] " condition = !is.na(match_dist) & (match_abb | match_dist == 1),"
#> [437] " true = city_match,"
#> [438] " false = city_norm"
#> [439] " )"
#> [440] " ) %>% "
#> [441] " select("
#> [442] " -city_match,"
#> [443] " -match_dist,"
#> [444] " -match_abb"
#> [445] " )"
#> [446] "```"
#> [447] ""
#> [448] "```{r city-rejoin}"
#> [449] "vtc <- left_join("
#> [450] " x = vtc,"
#> [451] " y = norm_city,"
#> [452] " by = c("
#> [453] " \"city\" = \"city_raw\", "
#> [454] " \"state_norm\", "
#> [455] " \"zip_norm\""
#> [456] " )"
#> [457] ")"
#> [458] "```"
#> [459] ""
#> [460] "#### Refine"
#> [461] ""
#> [462] "The [OpenRefine][or] algorithms can be used to group similar strings and replace"
#> [463] "the less common versions with their most common counterpart. This can greatly"
#> [464] "reduce inconsistency, but with low confidence; we will only keep any refined"
#> [465] "strings that have a valid city/state/zip combination."
#> [466] ""
#> [467] "[or]: https://openrefine.org/"
#> [468] ""
#> [469] "```{r city-refine}"
#> [470] "good_refine <- vtc %>% "
#> [471] " mutate("
#> [472] " city_refine = city_swap %>% "
#> [473] " key_collision_merge() %>% "
#> [474] " n_gram_merge(numgram = 1)"
#> [475] " ) %>% "
#> [476] " filter(city_refine != city_swap) %>% "
#> [477] " inner_join("
#> [478] " y = zipcodes,"
#> [479] " by = c("
#> [480] " \"city_refine\" = \"city\","
#> [481] " \"state_norm\" = \"state\","
#> [482] " \"zip_norm\" = \"zip\""
#> [483] " )"
#> [484] " )"
#> [485] "```"
#> [486] ""
#> [487] "```{r city-count, echo=FALSE}"
#> [488] "good_refine %>%"
#> [489] " count("
#> [490] " state_norm, "
#> [491] " zip_norm, "
#> [492] " city_swap, "
#> [493] " city_refine,"
#> [494] " sort = TRUE"
#> [495] " )"
#> [496] "```"
#> [497] ""
#> [498] "Then we can join the refined values back to the database."
#> [499] ""
#> [500] "```{r city-join}"
#> [501] "vtc <- vtc %>% "
#> [502] " left_join(good_refine, by = names(.)) %>% "
#> [503] " mutate(city_refine = coalesce(city_refine, city_swap))"
#> [504] "```"
#> [505] ""
#> [506] "#### Progress"
#> [507] ""
#> [508] "Our goal for normalization was to increase the proportion of city values known"
#> [509] "to be valid and reduce the total distinct values by correcting misspellings."
#> [510] ""
#> [511] "```{r city-progress, echo=FALSE}"
#> [512] "many_city <- c(valid_city, extra_city)"
#> [513] "progress <- progress_table("
#> [514] " str_to_upper(vtc$city),"
#> [515] " vtc$city_norm,"
#> [516] " vtc$city_swap,"
#> [517] " vtc$city_refine,"
#> [518] " compare = many_city"
#> [519] ") %>% mutate(stage = as_factor(stage))"
#> [520] "progress %>% "
#> [521] " mutate(across(stage, md_code)) %>% "
#> [522] " kable(digits = 3)"
#> [523] "```"
#> [524] ""
#> [525] "You can see how the percentage of valid values increased with each stage."
#> [526] ""
#> [527] "```{r bar-progress, echo=FALSE}"
#> [528] "raw_in <- percent(prop_in(vtc$city, valid_city))"
#> [529] "progress %>% "
#> [530] " ggplot(aes(x = stage, y = prop_in)) +"
#> [531] " geom_hline(yintercept = 0.99) +"
#> [532] " geom_col(fill = dark2[\"purple\"]) +"
#> [533] " coord_cartesian(ylim = c(0.75, 1)) +"
#> [534] " scale_y_continuous(labels = percent) +"
#> [535] " labs("
#> [536] " title = \"Vermont City Normalization Progress\","
#> [537] " subtitle = glue(\"Raw at {raw_in} before conversion to uppercase\"),"
#> [538] " x = \"Stage\","
#> [539] " y = \"Percent Valid\""
#> [540] " )"
#> [541] "```"
#> [542] ""
#> [543] "More importantly, the number of distinct values decreased each stage. We were"
#> [544] "able to confidently change many distinct invalid values to their valid"
#> [545] "equivalent."
#> [546] ""
#> [547] "```{r bar-distinct, echo=FALSE}"
#> [548] "progress %>% "
#> [549] " select("
#> [550] " stage, "
#> [551] " all = n_distinct,"
#> [552] " bad = n_diff"
#> [553] " ) %>% "
#> [554] " mutate(good = all - bad) %>% "
#> [555] " pivot_longer(c(\"good\", \"bad\")) %>% "
#> [556] " mutate(name = name == \"good\") %>% "
#> [557] " ggplot(aes(x = stage, y = value)) +"
#> [558] " geom_col(aes(fill = name)) +"
#> [559] " scale_fill_brewer(palette = \"Dark2\", direction = -1) +"
#> [560] " scale_y_continuous(labels = comma) +"
#> [561] " theme(legend.position = \"bottom\") +"
#> [562] " labs("
#> [563] " title = \"Vermont City Normalization Progress\","
#> [564] " subtitle = \"Distinct values, valid and invalid\","
#> [565] " x = \"Stage\","
#> [566] " y = \"Distinct Values\","
#> [567] " fill = \"Valid\""
#> [568] " )"
#> [569] "```"
#> [570] ""
#> [571] "Before exporting, we can remove the intermediary normalization columns and"
#> [572] "rename all added variables with the `_clean` suffix."
#> [573] ""
#> [574] "```{r clean-select}"
#> [575] "vtc <- vtc %>% "
#> [576] " select("
#> [577] " -city_norm,"
#> [578] " -city_swap,"
#> [579] " city_clean = city_refine"
#> [580] " ) %>% "
#> [581] " rename_all(~str_replace(., \"_norm\", \"_clean\")) %>% "
#> [582] " rename_all(~str_remove(., \"_raw\")) %>% "
#> [583] " relocate(address_clean, city_clean, state_clean, .before = zip_clean)"
#> [584] "```"
#> [585] ""
#> [586] "## Conclude"
#> [587] ""
#> [588] "```{r clean-glimpse}"
#> [589] "glimpse(sample_n(vtc, 1000))"
#> [590] "```"
#> [591] ""
#> [592] "1. There are `r comma(nrow(vtc))` records in the database."
#> [593] "1. There are `r comma(sum(vtc$dupe_flag))` duplicate records in the database."
#> [594] "1. The range and distribution of `amount` and `date` seem reasonable."
#> [595] "1. There are `r comma(sum(vtc$na_flag))` records missing key variables."
#> [596] "1. Consistency in geographic data has been improved with `campfin::normal_*()`."
#> [597] "1. The 4-digit `year` variable has been created with `lubridate::year()`."
#> [598] ""
#> [599] "## Export"
#> [600] ""
#> [601] "Now the file can be saved on disk for upload to the Accountability server. We"
#> [602] "will name the object using a date range of the records included."
#> [603] ""
#> [604] "```{r clean-timestamp}"
#> [605] "min_dt <- str_remove_all(min(vtc$date), \"-\")"
#> [606] "max_dt <- str_remove_all(max(vtc$date), \"-\")"
#> [607] "csv_ts <- paste(min_dt, max_dt, sep = \"-\")"
#> [608] "```"
#> [609] ""
#> [610] "```{r clean-dir}"
#> [611] "clean_dir <- dir_create(here(\"state\", \"vt\", \"contribs\", \"data\", \"clean\"))"
#> [612] "clean_csv <- path(clean_dir, glue(\"vt_contribs_{csv_ts}.csv\"))"
#> [613] "clean_rds <- path_ext_set(clean_csv, \"rds\")"
#> [614] "basename(clean_csv)"
#> [615] "```"
#> [616] ""
#> [617] "```{r clean-write}"
#> [618] "write_csv(vtc, clean_csv, na = \"\")"
#> [619] "write_rds(vtc, clean_rds, compress = \"xz\")"
#> [620] "(clean_size <- file_size(clean_csv))"
#> [621] "```"
#> [622] ""
#> [623] "## Upload"
#> [624] ""
#> [625] "We can use the `aws.s3::put_object()` to upload the text file to the IRW server."
#> [626] ""
#> [627] "```{r aws-upload, eval=FALSE}"
#> [628] "aws_key <- path(\"csv\", basename(clean_csv))"
#> [629] "if (!object_exists(aws_key, \"publicaccountability\")) {"
#> [630] " put_object("
#> [631] " file = clean_csv,"
#> [632] " object = aws_key, "
#> [633] " bucket = \"publicaccountability\","
#> [634] " acl = \"public-read\","
#> [635] " show_progress = TRUE,"
#> [636] " multipart = TRUE"
#> [637] " )"
#> [638] "}"
#> [639] "aws_head <- head_object(aws_key, \"publicaccountability\")"
#> [640] "(aws_size <- as_fs_bytes(attr(aws_head, \"content-length\")))"
#> [641] "unname(aws_size == clean_size)"
#> [642] "```"
use_diary("DC", "expends", "Kiernan Nicholls", tempfile(), auto = FALSE)
#> [1] "---"
#> [2] "title: \"District Of Columbia Expenditures\""
#> [3] "author: \"Kiernan Nicholls\""
#> [4] "date: \"`r date()`\""
#> [5] "output:"
#> [6] " github_document: "
#> [7] " df_print: tibble"
#> [8] " toc: true"
#> [9] " toc_depth: 3"
#> [10] "editor_options: "
#> [11] " chunk_output_type: console"
#> [12] "---"
#> [13] ""
#> [14] "<!-- Place comments regarding knitting here -->"
#> [15] ""
#> [16] "```{r setup, include=FALSE, purl=FALSE}"
#> [17] "library(knitr)"
#> [18] "opts_chunk$set("
#> [19] " eval = TRUE,"
#> [20] " echo = TRUE,"
#> [21] " warning = FALSE,"
#> [22] " message = FALSE,"
#> [23] " error = FALSE,"
#> [24] " collapse = TRUE,"
#> [25] " comment = \"#>\","
#> [26] " fig.path = \"../plots/\","
#> [27] " fig.width = 10,"
#> [28] " dpi = 300"
#> [29] ")"
#> [30] "if (!interactive()) {"
#> [31] " options(width = 120)"
#> [32] " set.seed(5)"
#> [33] "}"
#> [34] "```"
#> [35] ""
#> [36] "```{r create-docs-dir, eval=FALSE, echo=FALSE, include=FALSE}"
#> [37] "doc_dir <- fs::dir_create(here::here(\"state\", \"dc\", \"expends\", \"docs\"))"
#> [38] "```"
#> [39] ""
#> [40] "## Project"
#> [41] ""
#> [42] "The Accountability Project is an effort to cut across data silos and give"
#> [43] "journalists, policy professionals, activists, and the public at large a simple"
#> [44] "way to search across huge volumes of public data about people and organizations."
#> [45] ""
#> [46] "Our goal is to standardize public data on a few key fields by thinking of each"
#> [47] "dataset row as a transaction. For each transaction there should be (at least) 3"
#> [48] "variables:"
#> [49] ""
#> [50] "1. All **parties** to a transaction."
#> [51] "2. The **date** of the transaction."
#> [52] "3. The **amount** of money involved."
#> [53] ""
#> [54] "## Objectives"
#> [55] ""
#> [56] "This document describes the process used to complete the following objectives:"
#> [57] ""
#> [58] "1. How many records are in the database?"
#> [59] "1. Check for entirely duplicated records."
#> [60] "1. Check ranges of continuous variables."
#> [61] "1. Is there anything blank or missing?"
#> [62] "1. Check for consistency issues."
#> [63] "1. Create a five-digit ZIP Code called `zip`."
#> [64] "1. Create a `year` field from the transaction date."
#> [65] "1. Make sure there is data on both parties to a transaction."
#> [66] ""
#> [67] "## Packages"
#> [68] ""
#> [69] "The following packages are needed to collect, manipulate, visualize, analyze,"
#> [70] "and communicate these results. The `pacman` package will facilitate their"
#> [71] "installation and attachment."
#> [72] ""
#> [73] "```{r load-packages, message=FALSE, warning=FALSE, error=FALSE}"
#> [74] "if (!require(\"pacman\")) {"
#> [75] " install.packages(\"pacman\")"
#> [76] "}"
#> [77] "pacman::p_load("
#> [78] " tidyverse, # data manipulation"
#> [79] " lubridate, # datetime strings"
#> [80] " gluedown, # printing markdown"
#> [81] " jsonlite, # read json files"
#> [82] " janitor, # clean data frames"
#> [83] " campfin, # custom irw tools"
#> [84] " aws.s3, # aws cloud storage"
#> [85] " readxl, # read excel files"
#> [86] " refinr, # cluster & merge"
#> [87] " scales, # format strings"
#> [88] " knitr, # knit documents"
#> [89] " rvest, # scrape html"
#> [90] " glue, # code strings"
#> [91] " here, # project paths"
#> [92] " httr, # http requests"
#> [93] " fs # local storage "
#> [94] ")"
#> [95] "```"
#> [96] ""
#> [97] "This diary was run using `campfin` version `r packageVersion(\"campfin\")`."
#> [98] ""
#> [99] "```{r campfin-version}"
#> [100] "packageVersion(\"campfin\")"
#> [101] "```"
#> [102] ""
#> [103] "```{r package-options, echo=FALSE}"
#> [104] "options(knitr.kable.NA = \"\")"
#> [105] "```"
#> [106] ""
#> [107] "This document should be run as part of the `R_tap` project, which lives as a"
#> [108] "sub-directory of the more general, language-agnostic"
#> [109] "[`irworkshop/accountability_datacleaning`][tap] GitHub repository."
#> [110] ""
#> [111] "The `R_tap` project uses the [RStudio projects][rproj] feature and should be"
#> [112] "run as such. The project also uses the dynamic `here::here()` tool for file"
#> [113] "paths relative to _your_ machine."
#> [114] ""
#> [115] "```{r where-here}"
#> [116] "# where does this document knit?"
#> [117] "here::i_am(\"state/dc/expends/docs/dc_expends_diary.Rmd\")"
#> [118] "```"
#> [119] ""
#> [120] "[tap]: https://github.com/irworkshop/accountability_datacleaning"
#> [121] "[rproj]: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects"
#> [122] ""
#> [123] "## Source"
#> [124] ""
#> [125] ""
#> [126] ""
#> [127] "## Download"
#> [128] ""
#> [129] "```{r raw-dir}"
#> [130] "raw_url <- \"https://example.com/source_file.csv\""
#> [131] "raw_dir <- dir_create(here(\"state\", \"dc\", \"expends\", \"data\", \"raw\"))"
#> [132] "raw_csv <- path(raw_dir, basename(raw_url))"
#> [133] "```"
#> [134] ""
#> [135] "```{r raw-download}"
#> [136] "if (!file_exists(raw_csv)) {"
#> [137] " download.file(raw_url, raw_csv)"
#> [138] "}"
#> [139] "```"
#> [140] ""
#> [141] "## Read"
#> [142] ""
#> [143] "```{r raw-read}"
#> [144] "dce <- read_delim("
#> [145] " file = raw_csv,"
#> [146] " delim = \",\","
#> [147] " escape_backslash = FALSE,"
#> [148] " escape_double = FALSE,"
#> [149] " col_types = cols("
#> [150] " .default = col_character(),"
#> [151] " date = col_date_mdy(),"
#> [152] " amount = col_double()"
#> [153] " )"
#> [154] ")"
#> [155] "```"
#> [156] ""
#> [157] "```{r clean-names}"
#> [158] "dce <- clean_names(dce, case = \"snake\")"
#> [159] "```"
#> [160] ""
#> [161] "## Explore"
#> [162] ""
#> [163] "There are `r comma(nrow(dce))` rows of `r ncol(dce)` columns. Each record"
#> [164] "represents a single Expenditures..."
#> [165] ""
#> [166] "```{r glimpse}"
#> [167] "glimpse(dce)"
#> [168] "tail(dce)"
#> [169] "```"
#> [170] ""
#> [171] "### Missing"
#> [172] ""
#> [173] "Columns vary in their degree of missing values."
#> [174] ""
#> [175] "```{r na-count}"
#> [176] "col_stats(dce, count_na)"
#> [177] "```"
#> [178] ""
#> [179] "We can flag any record missing a key variable needed to identify a transaction."
#> [180] ""
#> [181] "```{r na-flag}"
#> [182] "key_vars <- c(\"date\", \"last_name\", \"amount\", \"committee_name\")"
#> [183] "dce <- flag_na(dce, all_of(key_vars))"
#> [184] "sum(dce$na_flag)"
#> [185] "```"
#> [186] ""
#> [187] "```{r na-view}"
#> [188] "dce %>% "
#> [189] " filter(na_flag) %>% "
#> [190] " select(all_of(key_vars))"
#> [191] "```"
#> [192] ""
#> [193] "### Duplicates"
#> [194] ""
#> [195] "We can also flag any record completely duplicated across every column."
#> [196] ""
#> [197] "```{r dupe-flag}"
#> [198] "dce <- flag_dupes(dce, -id)"
#> [199] "sum(dce$dupe_flag)"
#> [200] "```"
#> [201] ""
#> [202] "```{r dupe-view}"
#> [203] "dce %>% "
#> [204] " filter(dupe_flag) %>% "
#> [205] " select(all_of(key_vars)) %>% "
#> [206] " arrange(date)"
#> [207] "```"
#> [208] ""
#> [209] "### Categorical"
#> [210] ""
#> [211] "```{r distinct-count}"
#> [212] "col_stats(dce, n_distinct)"
#> [213] "```"
#> [214] ""
#> [215] "```{r distinct-plots, echo=FALSE, fig.height=3}"
#> [216] "explore_plot(dce, type)"
#> [217] "```"
#> [218] ""
#> [219] "### Amounts"
#> [220] ""
#> [221] "```{r amount-round}"
#> [222] "# fix floating point precision"
#> [223] "dce$amount <- round(dce$amount, digits = 2)"
#> [224] "```"
#> [225] ""
#> [226] "```{r amount-summary}"
#> [227] "summary(dce$amount)"
#> [228] "mean(dce$amount <= 0)"
#> [229] "```"
#> [230] ""
#> [231] "These are the records with the minimum and maximum amounts."
#> [232] ""
#> [233] "```{r amount-minmax}"
#> [234] "glimpse(dce[c(which.max(dce$amount), which.min(dce$amount)), ])"
#> [235] "```"
#> [236] ""
#> [237] "The distribution of amount values is typically log-normal."
#> [238] ""
#> [239] "```{r hist-amount, echo=FALSE}"
#> [240] "dce %>%"
#> [241] " ggplot(aes(amount)) +"
#> [242] " geom_histogram(fill = dark2[\"purple\"]) +"
#> [243] " scale_y_continuous(labels = comma) +"
#> [244] " scale_x_continuous("
#> [245] " breaks = c(1 %o% 10^(0:6)),"
#> [246] " labels = dollar,"
#> [247] " trans = \"log10\""
#> [248] " ) +"
#> [249] " labs("
#> [250] " title = \"District Of Columbia Expenditures Amount Distribution\","
#> [251] " caption = \"Source: {source}\","
#> [252] " x = \"Amount\","
#> [253] " y = \"Count\""
#> [254] " )"
#> [255] "```"
#> [256] ""
#> [257] "### Dates"
#> [258] ""
#> [259] "We can add the calendar year from `date` with `lubridate::year()`"
#> [260] ""
#> [261] "```{r date-year}"
#> [262] "dce <- mutate(dce, year = year(date))"
#> [263] "```"
#> [264] ""
#> [265] "```{r date-range}"
#> [266] "min(dce$date)"
#> [267] "sum(dce$year < 2000)"
#> [268] "max(dce$date)"
#> [269] "sum(dce$date > today())"
#> [270] "```"
#> [271] ""
#> [272] "It's common to see an increase in the number of contributions in election years."
#> [273] ""
#> [274] "```{r bar-year, echo=FALSE}"
#> [275] "dce %>% "
#> [276] " count(year) %>% "
#> [277] " mutate(even = is_even(year)) %>% "
#> [278] " ggplot(aes(x = year, y = n)) +"
#> [279] " geom_col(aes(fill = even)) + "
#> [280] " scale_fill_brewer(palette = \"Dark2\") +"
#> [281] " scale_y_continuous(labels = comma) +"
#> [282] " scale_x_continuous(breaks = seq(2000, 2020, by = 2)) +"
#> [283] " theme(legend.position = \"bottom\") +"
#> [284] " labs("
#> [285] " title = \"District Of Columbia Expenditures by Year\","
#> [286] " caption = \"Source: {source}\","
#> [287] " fill = \"Election Year\","
#> [288] " x = \"Year Made\","
#> [289] " y = \"Count\""
#> [290] " )"
#> [291] "```"
#> [292] ""
#> [293] "## Wrangle"
#> [294] ""
#> [295] "To improve the searchability of the database, we will perform some consistent,"
#> [296] "confident string normalization. For geographic variables like city names and"
#> [297] "ZIP codes, the corresponding `campfin::normal_*()` functions are tailor made to "
#> [298] "facilitate this process."
#> [299] ""
#> [300] "### Address"
#> [301] ""
#> [302] "For the street `address` variable, the `campfin::normal_address()` function"
#> [303] "will force consistent case, remove punctuation, and abbreviate official "
#> [304] "USPS suffixes."
#> [305] ""
#> [306] "```{r address-norm}"
#> [307] "addr_norm <- dce %>% "
#> [308] " distinct(address1, address2) %>% "
#> [309] " unite("
#> [310] " col = address_full,"
#> [311] " starts_with(\"address\"),"
#> [312] " sep = \" \","
#> [313] " remove = FALSE,"
#> [314] " na.rm = TRUE"
#> [315] " ) %>% "
#> [316] " mutate("
#> [317] " address_norm = normal_address("
#> [318] " address = address_full,"
#> [319] " abbs = usps_street,"
#> [320] " na_rep = TRUE"
#> [321] " )"
#> [322] " ) %>% "
#> [323] " select(-address_full)"
#> [324] "```"
#> [325] ""
#> [326] "```{r address-view}"
#> [327] "addr_norm"
#> [328] "```"
#> [329] ""
#> [330] "```{r address-join}"
#> [331] "dce <- left_join(dce, addr_norm, by = c(\"address1\", \"address2\"))"
#> [332] "```"
#> [333] ""
#> [334] "### ZIP"
#> [335] ""
#> [336] "For ZIP codes, the `campfin::normal_zip()` function will attempt to create"
#> [337] "valid _five_ digit codes by removing the ZIP+4 suffix and returning leading"
#> [338] "zeroes dropped by other programs like Microsoft Excel."
#> [339] ""
#> [340] "```{r zip-norm}"
#> [341] "dce <- dce %>% "
#> [342] " mutate("
#> [343] " zip_norm = normal_zip("
#> [344] " zip = zip,"
#> [345] " na_rep = TRUE"
#> [346] " )"
#> [347] " )"
#> [348] "```"
#> [349] ""
#> [350] "```{r zip-progress}"
#> [351] "progress_table("
#> [352] " dce$zip,"
#> [353] " dce$zip_norm,"
#> [354] " compare = valid_zip"
#> [355] ")"
#> [356] "```"
#> [357] ""
#> [358] "### State"
#> [359] ""
#> [360] "Valid two digit state abbreviations can be made using the "
#> [361] "`campfin::normal_state()` function."
#> [362] ""
#> [363] "```{r state-norm}"
#> [364] "dce <- dce %>% "
#> [365] " mutate("
#> [366] " state_norm = normal_state("
#> [367] " state = state,"
#> [368] " abbreviate = TRUE,"
#> [369] " na_rep = TRUE,"
#> [370] " valid = valid_state"
#> [371] " )"
#> [372] " )"
#> [373] "```"
#> [374] ""
#> [375] "```{r state-view}"
#> [376] "dce %>% "
#> [377] " filter(state != state_norm) %>% "
#> [378] " count(state, state_norm, sort = TRUE)"
#> [379] "```"
#> [380] ""
#> [381] "```{r state-progress}"
#> [382] "progress_table("
#> [383] " dce$state,"
#> [384] " dce$state_norm,"
#> [385] " compare = valid_state"
#> [386] ")"
#> [387] "```"
#> [388] ""
#> [389] "### City"
#> [390] ""
#> [391] "Cities are the most difficult geographic variable to normalize, simply due to"
#> [392] "the wide variety of valid cities and formats."
#> [393] ""
#> [394] "#### Normal"
#> [395] ""
#> [396] "The `campfin::normal_city()` function is a good start, again converting case,"
#> [397] "removing punctuation, but _expanding_ USPS abbreviations. We can also remove"
#> [398] "`invalid_city` values."
#> [399] ""
#> [400] "```{r city-norm}"
#> [401] "norm_city <- dce %>% "
#> [402] " distinct(city, state_norm, zip_norm) %>% "
#> [403] " mutate("
#> [404] " city_norm = normal_city("
#> [405] " city = city, "
#> [406] " abbs = usps_city,"
#> [407] " states = c(\"DC\", \"DC\", \"DISTRICT OF COLUMBIA\"),"
#> [408] " na = invalid_city,"
#> [409] " na_rep = TRUE"
#> [410] " )"
#> [411] " )"
#> [412] "```"
#> [413] ""
#> [414] "#### Swap"
#> [415] ""
#> [416] "We can further improve normalization by comparing our normalized value"
#> [417] "against the _expected_ value for that record's state abbreviation and ZIP code."
#> [418] "If the normalized value is either an abbreviation for or very similar to the"
#> [419] "expected value, we can confidently swap those two."
#> [420] ""
#> [421] "```{r city-swap}"
#> [422] "norm_city <- norm_city %>% "
#> [423] " rename(city_raw = city) %>% "
#> [424] " left_join("
#> [425] " y = zipcodes,"
#> [426] " by = c("
#> [427] " \"state_norm\" = \"state\","
#> [428] " \"zip_norm\" = \"zip\""
#> [429] " )"
#> [430] " ) %>% "
#> [431] " rename(city_match = city) %>% "
#> [432] " mutate("
#> [433] " match_abb = is_abbrev(city_norm, city_match),"
#> [434] " match_dist = str_dist(city_norm, city_match),"
#> [435] " city_swap = if_else("
#> [436] " condition = !is.na(match_dist) & (match_abb | match_dist == 1),"
#> [437] " true = city_match,"
#> [438] " false = city_norm"
#> [439] " )"
#> [440] " ) %>% "
#> [441] " select("
#> [442] " -city_match,"
#> [443] " -match_dist,"
#> [444] " -match_abb"
#> [445] " )"
#> [446] "```"
#> [447] ""
#> [448] "```{r city-rejoin}"
#> [449] "dce <- left_join("
#> [450] " x = dce,"
#> [451] " y = norm_city,"
#> [452] " by = c("
#> [453] " \"city\" = \"city_raw\", "
#> [454] " \"state_norm\", "
#> [455] " \"zip_norm\""
#> [456] " )"
#> [457] ")"
#> [458] "```"
#> [459] ""
#> [460] "#### Refine"
#> [461] ""
#> [462] "The [OpenRefine][or] algorithms can be used to group similar strings and replace"
#> [463] "the less common versions with their most common counterpart. This can greatly"
#> [464] "reduce inconsistency, but with low confidence; we will only keep any refined"
#> [465] "strings that have a valid city/state/zip combination."
#> [466] ""
#> [467] "[or]: https://openrefine.org/"
#> [468] ""
#> [469] "```{r city-refine}"
#> [470] "good_refine <- dce %>% "
#> [471] " mutate("
#> [472] " city_refine = city_swap %>% "
#> [473] " key_collision_merge() %>% "
#> [474] " n_gram_merge(numgram = 1)"
#> [475] " ) %>% "
#> [476] " filter(city_refine != city_swap) %>% "
#> [477] " inner_join("
#> [478] " y = zipcodes,"
#> [479] " by = c("
#> [480] " \"city_refine\" = \"city\","
#> [481] " \"state_norm\" = \"state\","
#> [482] " \"zip_norm\" = \"zip\""
#> [483] " )"
#> [484] " )"
#> [485] "```"
#> [486] ""
#> [487] "```{r city-count, echo=FALSE}"
#> [488] "good_refine %>%"
#> [489] " count("
#> [490] " state_norm, "
#> [491] " zip_norm, "
#> [492] " city_swap, "
#> [493] " city_refine,"
#> [494] " sort = TRUE"
#> [495] " )"
#> [496] "```"
#> [497] ""
#> [498] "Then we can join the refined values back to the database."
#> [499] ""
#> [500] "```{r city-join}"
#> [501] "dce <- dce %>% "
#> [502] " left_join(good_refine, by = names(.)) %>% "
#> [503] " mutate(city_refine = coalesce(city_refine, city_swap))"
#> [504] "```"
#> [505] ""
#> [506] "#### Progress"
#> [507] ""
#> [508] "Our goal for normalization was to increase the proportion of city values known"
#> [509] "to be valid and reduce the total distinct values by correcting misspellings."
#> [510] ""
#> [511] "```{r city-progress, echo=FALSE}"
#> [512] "many_city <- c(valid_city, extra_city)"
#> [513] "progress <- progress_table("
#> [514] " str_to_upper(dce$city),"
#> [515] " dce$city_norm,"
#> [516] " dce$city_swap,"
#> [517] " dce$city_refine,"
#> [518] " compare = many_city"
#> [519] ") %>% mutate(stage = as_factor(stage))"
#> [520] "progress %>% "
#> [521] " mutate(across(stage, md_code)) %>% "
#> [522] " kable(digits = 3)"
#> [523] "```"
#> [524] ""
#> [525] "You can see how the percentage of valid values increased with each stage."
#> [526] ""
#> [527] "```{r bar-progress, echo=FALSE}"
#> [528] "raw_in <- percent(prop_in(dce$city, valid_city))"
#> [529] "progress %>% "
#> [530] " ggplot(aes(x = stage, y = prop_in)) +"
#> [531] " geom_hline(yintercept = 0.99) +"
#> [532] " geom_col(fill = dark2[\"purple\"]) +"
#> [533] " coord_cartesian(ylim = c(0.75, 1)) +"
#> [534] " scale_y_continuous(labels = percent) +"
#> [535] " labs("
#> [536] " title = \"District Of Columbia City Normalization Progress\","
#> [537] " subtitle = glue(\"Raw at {raw_in} before conversion to uppercase\"),"
#> [538] " x = \"Stage\","
#> [539] " y = \"Percent Valid\""
#> [540] " )"
#> [541] "```"
#> [542] ""
#> [543] "More importantly, the number of distinct values decreased each stage. We were"
#> [544] "able to confidently change many distinct invalid values to their valid"
#> [545] "equivalent."
#> [546] ""
#> [547] "```{r bar-distinct, echo=FALSE}"
#> [548] "progress %>% "
#> [549] " select("
#> [550] " stage, "
#> [551] " all = n_distinct,"
#> [552] " bad = n_diff"
#> [553] " ) %>% "
#> [554] " mutate(good = all - bad) %>% "
#> [555] " pivot_longer(c(\"good\", \"bad\")) %>% "
#> [556] " mutate(name = name == \"good\") %>% "
#> [557] " ggplot(aes(x = stage, y = value)) +"
#> [558] " geom_col(aes(fill = name)) +"
#> [559] " scale_fill_brewer(palette = \"Dark2\", direction = -1) +"
#> [560] " scale_y_continuous(labels = comma) +"
#> [561] " theme(legend.position = \"bottom\") +"
#> [562] " labs("
#> [563] " title = \"District Of Columbia City Normalization Progress\","
#> [564] " subtitle = \"Distinct values, valid and invalid\","
#> [565] " x = \"Stage\","
#> [566] " y = \"Distinct Values\","
#> [567] " fill = \"Valid\""
#> [568] " )"
#> [569] "```"
#> [570] ""
#> [571] "Before exporting, we can remove the intermediary normalization columns and"
#> [572] "rename all added variables with the `_clean` suffix."
#> [573] ""
#> [574] "```{r clean-select}"
#> [575] "dce <- dce %>% "
#> [576] " select("
#> [577] " -city_norm,"
#> [578] " -city_swap,"
#> [579] " city_clean = city_refine"
#> [580] " ) %>% "
#> [581] " rename_all(~str_replace(., \"_norm\", \"_clean\")) %>% "
#> [582] " rename_all(~str_remove(., \"_raw\")) %>% "
#> [583] " relocate(address_clean, city_clean, state_clean, .before = zip_clean)"
#> [584] "```"
#> [585] ""
#> [586] "## Conclude"
#> [587] ""
#> [588] "```{r clean-glimpse}"
#> [589] "glimpse(sample_n(dce, 1000))"
#> [590] "```"
#> [591] ""
#> [592] "1. There are `r comma(nrow(dce))` records in the database."
#> [593] "1. There are `r comma(sum(dce$dupe_flag))` duplicate records in the database."
#> [594] "1. The range and distribution of `amount` and `date` seem reasonable."
#> [595] "1. There are `r comma(sum(dce$na_flag))` records missing key variables."
#> [596] "1. Consistency in geographic data has been improved with `campfin::normal_*()`."
#> [597] "1. The 4-digit `year` variable has been created with `lubridate::year()`."
#> [598] ""
#> [599] "## Export"
#> [600] ""
#> [601] "Now the file can be saved on disk for upload to the Accountability server. We"
#> [602] "will name the object using a date range of the records included."
#> [603] ""
#> [604] "```{r clean-timestamp}"
#> [605] "min_dt <- str_remove_all(min(dce$date), \"-\")"
#> [606] "max_dt <- str_remove_all(max(dce$date), \"-\")"
#> [607] "csv_ts <- paste(min_dt, max_dt, sep = \"-\")"
#> [608] "```"
#> [609] ""
#> [610] "```{r clean-dir}"
#> [611] "clean_dir <- dir_create(here(\"state\", \"dc\", \"expends\", \"data\", \"clean\"))"
#> [612] "clean_csv <- path(clean_dir, glue(\"dc_expends_{csv_ts}.csv\"))"
#> [613] "clean_rds <- path_ext_set(clean_csv, \"rds\")"
#> [614] "basename(clean_csv)"
#> [615] "```"
#> [616] ""
#> [617] "```{r clean-write}"
#> [618] "write_csv(dce, clean_csv, na = \"\")"
#> [619] "write_rds(dce, clean_rds, compress = \"xz\")"
#> [620] "(clean_size <- file_size(clean_csv))"
#> [621] "```"
#> [622] ""
#> [623] "## Upload"
#> [624] ""
#> [625] "We can use the `aws.s3::put_object()` to upload the text file to the IRW server."
#> [626] ""
#> [627] "```{r aws-upload, eval=FALSE}"
#> [628] "aws_key <- path(\"csv\", basename(clean_csv))"
#> [629] "if (!object_exists(aws_key, \"publicaccountability\")) {"
#> [630] " put_object("
#> [631] " file = clean_csv,"
#> [632] " object = aws_key, "
#> [633] " bucket = \"publicaccountability\","
#> [634] " acl = \"public-read\","
#> [635] " show_progress = TRUE,"
#> [636] " multipart = TRUE"
#> [637] " )"
#> [638] "}"
#> [639] "aws_head <- head_object(aws_key, \"publicaccountability\")"
#> [640] "(aws_size <- as_fs_bytes(attr(aws_head, \"content-length\")))"
#> [641] "unname(aws_size == clean_size)"
#> [642] "```"
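Because a path of NA returns the template as a character vector, the lines can also be written by hand to any location. A hedged sketch using base writeLines() and an illustrative temporary file name:

# illustrative only: capture the template lines and write them manually
dc_lines <- use_diary("DC", "expends", "Kiernan Nicholls", NA, auto = FALSE)
writeLines(dc_lines, file.path(tempdir(), "dc_expends_diary.Rmd"))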