Take the supplied arguments and insert them into the appropriate places in a new template diary, then write that new template diary to the supplied directory.
use_diary(
  st,
  type,
  author,
  path = "state/{st}/{type}/docs/{st}_{type}_diary.Rmd",
  auto = FALSE
)
st: The USPS state abbreviation. State data only, no federal agencies.

type: The type of data, one of "contribs", "expends", "lobby", "contracts", "salary", or "voters".

author: The author name of the new diary.

path: The file path, relative to your working directory, where the diary file will be created. If NA is supplied, the lines of the diary are instead returned as a character vector. If the path contains directories that do not exist, they will be created. By default, the path creates the diary in the directory expected by the Accountability Project GitHub repository.

auto: Must be set to TRUE for the diary to be created and opened, as shown in the sketch below.
The file path of the new diary, returned invisibly.
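A minimal sketch, assuming your working directory is the root of the accountability_datacleaning repository: with the default path and auto = TRUE the diary is written to disk and opened for editing.

# not run: write the diary to the default repository path and open it
use_diary("VT", "contribs", "Kiernan Nicholls", auto = TRUE)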
use_diary("VT", "contribs", "Kiernan Nicholls", NA, auto = FALSE)
#> [1] "---"
#> [2] "title: \"Vermont Contributions\""
#> [3] "author: \"Kiernan Nicholls\""
#> [4] "date: \"`r date()`\""
#> [5] "output:"
#> [6] " github_document: "
#> [7] " df_print: tibble"
#> [8] " toc: true"
#> [9] " toc_depth: 3"
#> [10] "editor_options: "
#> [11] " chunk_output_type: console"
#> [12] "---"
#> [13] ""
#> [14] "<!-- Place comments regarding knitting here -->"
#> [15] ""
#> [16] "```{r setup, include=FALSE, purl=FALSE}"
#> [17] "library(knitr)"
#> [18] "opts_chunk$set("
#> [19] " eval = TRUE,"
#> [20] " echo = TRUE,"
#> [21] " warning = FALSE,"
#> [22] " message = FALSE,"
#> [23] " error = FALSE,"
#> [24] " collapse = TRUE,"
#> [25] " comment = \"#>\","
#> [26] " fig.path = \"../plots/\","
#> [27] " fig.width = 10,"
#> [28] " dpi = 300"
#> [29] ")"
#> [30] "if (!interactive()) {"
#> [31] " options(width = 120)"
#> [32] " set.seed(5)"
#> [33] "}"
#> [34] "```"
#> [35] ""
#> [36] "```{r create-docs-dir, eval=FALSE, echo=FALSE, include=FALSE}"
#> [37] "doc_dir <- fs::dir_create(here::here(\"state\", \"vt\", \"contribs\", \"docs\"))"
#> [38] "```"
#> [39] ""
#> [40] "## Project"
#> [41] ""
#> [42] "The Accountability Project is an effort to cut across data silos and give"
#> [43] "journalists, policy professionals, activists, and the public at large a simple"
#> [44] "way to search across huge volumes of public data about people and organizations."
#> [45] ""
#> [46] "Our goal is to standardize public data on a few key fields by thinking of each"
#> [47] "dataset row as a transaction. For each transaction there should be (at least) 3"
#> [48] "variables:"
#> [49] ""
#> [50] "1. All **parties** to a transaction."
#> [51] "2. The **date** of the transaction."
#> [52] "3. The **amount** of money involved."
#> [53] ""
#> [54] "## Objectives"
#> [55] ""
#> [56] "This document describes the process used to complete the following objectives:"
#> [57] ""
#> [58] "1. How many records are in the database?"
#> [59] "1. Check for entirely duplicated records."
#> [60] "1. Check ranges of continuous variables."
#> [61] "1. Is there anything blank or missing?"
#> [62] "1. Check for consistency issues."
#> [63] "1. Create a five-digit ZIP Code called `zip`."
#> [64] "1. Create a `year` field from the transaction date."
#> [65] "1. Make sure there is data on both parties to a transaction."
#> [66] ""
#> [67] "## Packages"
#> [68] ""
#> [69] "The following packages are needed to collect, manipulate, visualize, analyze,"
#> [70] "and communicate these results. The `pacman` package will facilitate their"
#> [71] "installation and attachment."
#> [72] ""
#> [73] "```{r load-packages, message=FALSE, warning=FALSE, error=FALSE}"
#> [74] "if (!require(\"pacman\")) {"
#> [75] " install.packages(\"pacman\")"
#> [76] "}"
#> [77] "pacman::p_load("
#> [78] " tidyverse, # data manipulation"
#> [79] " lubridate, # datetime strings"
#> [80] " gluedown, # printing markdown"
#> [81] " jsonlite, # read json files"
#> [82] " janitor, # clean data frames"
#> [83] " campfin, # custom irw tools"
#> [84] " aws.s3, # aws cloud storage"
#> [85] " readxl, # read excel files"
#> [86] " refinr, # cluster & merge"
#> [87] " scales, # format strings"
#> [88] " knitr, # knit documents"
#> [89] " rvest, # scrape html"
#> [90] " glue, # code strings"
#> [91] " here, # project paths"
#> [92] " httr, # http requests"
#> [93] " fs # local storage "
#> [94] ")"
#> [95] "```"
#> [96] ""
#> [97] "This diary was run using `campfin` version `r packageVersion(\"campfin\")`."
#> [98] ""
#> [99] "```{r campfin-version}"
#> [100] "packageVersion(\"campfin\")"
#> [101] "```"
#> [102] ""
#> [103] "```{r package-options, echo=FALSE}"
#> [104] "options(knitr.kable.NA = \"\")"
#> [105] "```"
#> [106] ""
#> [107] "This document should be run as part of the `R_tap` project, which lives as a"
#> [108] "sub-directory of the more general, language-agnostic"
#> [109] "[`irworkshop/accountability_datacleaning`][tap] GitHub repository."
#> [110] ""
#> [111] "The `R_tap` project uses the [RStudio projects][rproj] feature and should be"
#> [112] "run as such. The project also uses the dynamic `here::here()` tool for file"
#> [113] "paths relative to _your_ machine."
#> [114] ""
#> [115] "```{r where-here}"
#> [116] "# where does this document knit?"
#> [117] "here::i_am(\"state/vt/contribs/docs/vt_contribs_diary.Rmd\")"
#> [118] "```"
#> [119] ""
#> [120] "[tap]: https://github.com/irworkshop/accountability_datacleaning"
#> [121] "[rproj]: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects"
#> [122] ""
#> [123] "## Source"
#> [124] ""
#> [125] ""
#> [126] ""
#> [127] "## Download"
#> [128] ""
#> [129] "```{r raw-dir}"
#> [130] "raw_url <- \"https://example.com/source_file.csv\""
#> [131] "raw_dir <- dir_create(here(\"state\", \"vt\", \"contribs\", \"data\", \"raw\"))"
#> [132] "raw_csv <- path(raw_dir, basename(raw_url))"
#> [133] "```"
#> [134] ""
#> [135] "```{r raw-download}"
#> [136] "if (!file_exists(raw_csv)) {"
#> [137] " download.file(raw_url, raw_csv)"
#> [138] "}"
#> [139] "```"
#> [140] ""
#> [141] "## Read"
#> [142] ""
#> [143] "```{r raw-read}"
#> [144] "vtc <- read_delim("
#> [145] " file = raw_csv,"
#> [146] " delim = \",\","
#> [147] " escape_backslash = FALSE,"
#> [148] " escape_double = FALSE,"
#> [149] " col_types = cols("
#> [150] " .default = col_character(),"
#> [151] " date = col_date_mdy(),"
#> [152] " amount = col_double()"
#> [153] " )"
#> [154] ")"
#> [155] "```"
#> [156] ""
#> [157] "```{r clean-names}"
#> [158] "vtc <- clean_names(vtc, case = \"snake\")"
#> [159] "```"
#> [160] ""
#> [161] "## Explore"
#> [162] ""
#> [163] "There are `r comma(nrow(vtc))` rows of `r ncol(vtc)` columns. Each record"
#> [164] "represents a single Contributions..."
#> [165] ""
#> [166] "```{r glimpse}"
#> [167] "glimpse(vtc)"
#> [168] "tail(vtc)"
#> [169] "```"
#> [170] ""
#> [171] "### Missing"
#> [172] ""
#> [173] "Columns vary in their degree of missing values."
#> [174] ""
#> [175] "```{r na-count}"
#> [176] "col_stats(vtc, count_na)"
#> [177] "```"
#> [178] ""
#> [179] "We can flag any record missing a key variable needed to identify a transaction."
#> [180] ""
#> [181] "```{r na-flag}"
#> [182] "key_vars <- c(\"date\", \"last_name\", \"amount\", \"committee_name\")"
#> [183] "vtc <- flag_na(vtc, all_of(key_vars))"
#> [184] "sum(vtc$na_flag)"
#> [185] "```"
#> [186] ""
#> [187] "```{r na-view}"
#> [188] "vtc %>% "
#> [189] " filter(na_flag) %>% "
#> [190] " select(all_of(key_vars))"
#> [191] "```"
#> [192] ""
#> [193] "### Duplicates"
#> [194] ""
#> [195] "We can also flag any record completely duplicated across every column."
#> [196] ""
#> [197] "```{r dupe-flag}"
#> [198] "vtc <- flag_dupes(vtc, -id)"
#> [199] "sum(vtc$dupe_flag)"
#> [200] "```"
#> [201] ""
#> [202] "```{r dupe-view}"
#> [203] "vtc %>% "
#> [204] " filter(dupe_flag) %>% "
#> [205] " select(all_of(key_vars)) %>% "
#> [206] " arrange(date)"
#> [207] "```"
#> [208] ""
#> [209] "### Categorical"
#> [210] ""
#> [211] "```{r distinct-count}"
#> [212] "col_stats(vtc, n_distinct)"
#> [213] "```"
#> [214] ""
#> [215] "```{r distinct-plots, echo=FALSE, fig.height=3}"
#> [216] "explore_plot(vtc, type)"
#> [217] "```"
#> [218] ""
#> [219] "### Amounts"
#> [220] ""
#> [221] "```{r amount-round}"
#> [222] "# fix floating point precision"
#> [223] "vtc$amount <- round(vtc$amount, digits = 2)"
#> [224] "```"
#> [225] ""
#> [226] "```{r amount-summary}"
#> [227] "summary(vtc$amount)"
#> [228] "mean(vtc$amount <= 0)"
#> [229] "```"
#> [230] ""
#> [231] "These are the records with the minimum and maximum amounts."
#> [232] ""
#> [233] "```{r amount-minmax}"
#> [234] "glimpse(vtc[c(which.max(vtc$amount), which.min(vtc$amount)), ])"
#> [235] "```"
#> [236] ""
#> [237] "The distribution of amount values is typically log-normal."
#> [238] ""
#> [239] "```{r hist-amount, echo=FALSE}"
#> [240] "vtc %>%"
#> [241] " ggplot(aes(amount)) +"
#> [242] " geom_histogram(fill = dark2[\"purple\"]) +"
#> [243] " scale_y_continuous(labels = comma) +"
#> [244] " scale_x_continuous("
#> [245] " breaks = c(1 %o% 10^(0:6)),"
#> [246] " labels = dollar,"
#> [247] " trans = \"log10\""
#> [248] " ) +"
#> [249] " labs("
#> [250] " title = \"Vermont Contributions Amount Distribution\","
#> [251] " caption = \"Source: {source}\","
#> [252] " x = \"Amount\","
#> [253] " y = \"Count\""
#> [254] " )"
#> [255] "```"
#> [256] ""
#> [257] "### Dates"
#> [258] ""
#> [259] "We can add the calendar year from `date` with `lubridate::year()`"
#> [260] ""
#> [261] "```{r date-year}"
#> [262] "vtc <- mutate(vtc, year = year(date))"
#> [263] "```"
#> [264] ""
#> [265] "```{r date-range}"
#> [266] "min(vtc$date)"
#> [267] "sum(vtc$year < 2000)"
#> [268] "max(vtc$date)"
#> [269] "sum(vtc$date > today())"
#> [270] "```"
#> [271] ""
#> [272] "It's common to see an increase in the number of contributions in election years."
#> [273] ""
#> [274] "```{r bar-year, echo=FALSE}"
#> [275] "vtc %>% "
#> [276] " count(year) %>% "
#> [277] " mutate(even = is_even(year)) %>% "
#> [278] " ggplot(aes(x = year, y = n)) +"
#> [279] " geom_col(aes(fill = even)) + "
#> [280] " scale_fill_brewer(palette = \"Dark2\") +"
#> [281] " scale_y_continuous(labels = comma) +"
#> [282] " scale_x_continuous(breaks = seq(2000, 2020, by = 2)) +"
#> [283] " theme(legend.position = \"bottom\") +"
#> [284] " labs("
#> [285] " title = \"Vermont Contributions by Year\","
#> [286] " caption = \"Source: {source}\","
#> [287] " fill = \"Election Year\","
#> [288] " x = \"Year Made\","
#> [289] " y = \"Count\""
#> [290] " )"
#> [291] "```"
#> [292] ""
#> [293] "## Wrangle"
#> [294] ""
#> [295] "To improve the searchability of the database, we will perform some consistent,"
#> [296] "confident string normalization. For geographic variables like city names and"
#> [297] "ZIP codes, the corresponding `campfin::normal_*()` functions are tailor made to "
#> [298] "facilitate this process."
#> [299] ""
#> [300] "### Address"
#> [301] ""
#> [302] "For the street `address` variable, the `campfin::normal_address()` function"
#> [303] "will force consistent case, remove punctuation, and abbreviate official "
#> [304] "USPS suffixes."
#> [305] ""
#> [306] "```{r address-norm}"
#> [307] "addr_norm <- vtc %>% "
#> [308] " distinct(address1, address2) %>% "
#> [309] " unite("
#> [310] " col = address_full,"
#> [311] " starts_with(\"address\"),"
#> [312] " sep = \" \","
#> [313] " remove = FALSE,"
#> [314] " na.rm = TRUE"
#> [315] " ) %>% "
#> [316] " mutate("
#> [317] " address_norm = normal_address("
#> [318] " address = address_full,"
#> [319] " abbs = usps_street,"
#> [320] " na_rep = TRUE"
#> [321] " )"
#> [322] " ) %>% "
#> [323] " select(-address_full)"
#> [324] "```"
#> [325] ""
#> [326] "```{r address-view}"
#> [327] "addr_norm"
#> [328] "```"
#> [329] ""
#> [330] "```{r address-join}"
#> [331] "vtc <- left_join(vtc, addr_norm, by = c(\"address1\", \"address2\"))"
#> [332] "```"
#> [333] ""
#> [334] "### ZIP"
#> [335] ""
#> [336] "For ZIP codes, the `campfin::normal_zip()` function will attempt to create"
#> [337] "valid _five_ digit codes by removing the ZIP+4 suffix and returning leading"
#> [338] "zeroes dropped by other programs like Microsoft Excel."
#> [339] ""
#> [340] "```{r zip-norm}"
#> [341] "vtc <- vtc %>% "
#> [342] " mutate("
#> [343] " zip_norm = normal_zip("
#> [344] " zip = zip,"
#> [345] " na_rep = TRUE"
#> [346] " )"
#> [347] " )"
#> [348] "```"
#> [349] ""
#> [350] "```{r zip-progress}"
#> [351] "progress_table("
#> [352] " vtc$zip,"
#> [353] " vtc$zip_norm,"
#> [354] " compare = valid_zip"
#> [355] ")"
#> [356] "```"
#> [357] ""
#> [358] "### State"
#> [359] ""
#> [360] "Valid two digit state abbreviations can be made using the "
#> [361] "`campfin::normal_state()` function."
#> [362] ""
#> [363] "```{r state-norm}"
#> [364] "vtc <- vtc %>% "
#> [365] " mutate("
#> [366] " state_norm = normal_state("
#> [367] " state = state,"
#> [368] " abbreviate = TRUE,"
#> [369] " na_rep = TRUE,"
#> [370] " valid = valid_state"
#> [371] " )"
#> [372] " )"
#> [373] "```"
#> [374] ""
#> [375] "```{r state-view}"
#> [376] "vtc %>% "
#> [377] " filter(state != state_norm) %>% "
#> [378] " count(state, state_norm, sort = TRUE)"
#> [379] "```"
#> [380] ""
#> [381] "```{r state-progress}"
#> [382] "progress_table("
#> [383] " vtc$state,"
#> [384] " vtc$state_norm,"
#> [385] " compare = valid_state"
#> [386] ")"
#> [387] "```"
#> [388] ""
#> [389] "### City"
#> [390] ""
#> [391] "Cities are the most difficult geographic variable to normalize, simply due to"
#> [392] "the wide variety of valid cities and formats."
#> [393] ""
#> [394] "#### Normal"
#> [395] ""
#> [396] "The `campfin::normal_city()` function is a good start, again converting case,"
#> [397] "removing punctuation, but _expanding_ USPS abbreviations. We can also remove"
#> [398] "`invalid_city` values."
#> [399] ""
#> [400] "```{r city-norm}"
#> [401] "norm_city <- vtc %>% "
#> [402] " distinct(city, state_norm, zip_norm) %>% "
#> [403] " mutate("
#> [404] " city_norm = normal_city("
#> [405] " city = city, "
#> [406] " abbs = usps_city,"
#> [407] " states = c(\"VT\", \"DC\", \"VERMONT\"),"
#> [408] " na = invalid_city,"
#> [409] " na_rep = TRUE"
#> [410] " )"
#> [411] " )"
#> [412] "```"
#> [413] ""
#> [414] "#### Swap"
#> [415] ""
#> [416] "We can further improve normalization by comparing our normalized value"
#> [417] "against the _expected_ value for that record's state abbreviation and ZIP code."
#> [418] "If the normalized value is either an abbreviation for or very similar to the"
#> [419] "expected value, we can confidently swap those two."
#> [420] ""
#> [421] "```{r city-swap}"
#> [422] "norm_city <- norm_city %>% "
#> [423] " rename(city_raw = city) %>% "
#> [424] " left_join("
#> [425] " y = zipcodes,"
#> [426] " by = c("
#> [427] " \"state_norm\" = \"state\","
#> [428] " \"zip_norm\" = \"zip\""
#> [429] " )"
#> [430] " ) %>% "
#> [431] " rename(city_match = city) %>% "
#> [432] " mutate("
#> [433] " match_abb = is_abbrev(city_norm, city_match),"
#> [434] " match_dist = str_dist(city_norm, city_match),"
#> [435] " city_swap = if_else("
#> [436] " condition = !is.na(match_dist) & (match_abb | match_dist == 1),"
#> [437] " true = city_match,"
#> [438] " false = city_norm"
#> [439] " )"
#> [440] " ) %>% "
#> [441] " select("
#> [442] " -city_match,"
#> [443] " -match_dist,"
#> [444] " -match_abb"
#> [445] " )"
#> [446] "```"
#> [447] ""
#> [448] "```{r city-rejoin}"
#> [449] "vtc <- left_join("
#> [450] " x = vtc,"
#> [451] " y = norm_city,"
#> [452] " by = c("
#> [453] " \"city\" = \"city_raw\", "
#> [454] " \"state_norm\", "
#> [455] " \"zip_norm\""
#> [456] " )"
#> [457] ")"
#> [458] "```"
#> [459] ""
#> [460] "#### Refine"
#> [461] ""
#> [462] "The [OpenRefine][or] algorithms can be used to group similar strings and replace"
#> [463] "the less common versions with their most common counterpart. This can greatly"
#> [464] "reduce inconsistency, but with low confidence; we will only keep any refined"
#> [465] "strings that have a valid city/state/zip combination."
#> [466] ""
#> [467] "[or]: https://openrefine.org/"
#> [468] ""
#> [469] "```{r city-refine}"
#> [470] "good_refine <- vtc %>% "
#> [471] " mutate("
#> [472] " city_refine = city_swap %>% "
#> [473] " key_collision_merge() %>% "
#> [474] " n_gram_merge(numgram = 1)"
#> [475] " ) %>% "
#> [476] " filter(city_refine != city_swap) %>% "
#> [477] " inner_join("
#> [478] " y = zipcodes,"
#> [479] " by = c("
#> [480] " \"city_refine\" = \"city\","
#> [481] " \"state_norm\" = \"state\","
#> [482] " \"zip_norm\" = \"zip\""
#> [483] " )"
#> [484] " )"
#> [485] "```"
#> [486] ""
#> [487] "```{r city-count, echo=FALSE}"
#> [488] "good_refine %>%"
#> [489] " count("
#> [490] " state_norm, "
#> [491] " zip_norm, "
#> [492] " city_swap, "
#> [493] " city_refine,"
#> [494] " sort = TRUE"
#> [495] " )"
#> [496] "```"
#> [497] ""
#> [498] "Then we can join the refined values back to the database."
#> [499] ""
#> [500] "```{r city-join}"
#> [501] "vtc <- vtc %>% "
#> [502] " left_join(good_refine, by = names(.)) %>% "
#> [503] " mutate(city_refine = coalesce(city_refine, city_swap))"
#> [504] "```"
#> [505] ""
#> [506] "#### Progress"
#> [507] ""
#> [508] "Our goal for normalization was to increase the proportion of city values known"
#> [509] "to be valid and reduce the total distinct values by correcting misspellings."
#> [510] ""
#> [511] "```{r city-progress, echo=FALSE}"
#> [512] "many_city <- c(valid_city, extra_city)"
#> [513] "progress <- progress_table("
#> [514] " str_to_upper(vtc$city),"
#> [515] " vtc$city_norm,"
#> [516] " vtc$city_swap,"
#> [517] " vtc$city_refine,"
#> [518] " compare = many_city"
#> [519] ") %>% mutate(stage = as_factor(stage))"
#> [520] "progress %>% "
#> [521] " mutate(across(stage, md_code)) %>% "
#> [522] " kable(digits = 3)"
#> [523] "```"
#> [524] ""
#> [525] "You can see how the percentage of valid values increased with each stage."
#> [526] ""
#> [527] "```{r bar-progress, echo=FALSE}"
#> [528] "raw_in <- percent(prop_in(vtc$city, valid_city))"
#> [529] "progress %>% "
#> [530] " ggplot(aes(x = stage, y = prop_in)) +"
#> [531] " geom_hline(yintercept = 0.99) +"
#> [532] " geom_col(fill = dark2[\"purple\"]) +"
#> [533] " coord_cartesian(ylim = c(0.75, 1)) +"
#> [534] " scale_y_continuous(labels = percent) +"
#> [535] " labs("
#> [536] " title = \"Vermont City Normalization Progress\","
#> [537] " subtitle = glue(\"Raw at {raw_in} before conversion to uppercase\"),"
#> [538] " x = \"Stage\","
#> [539] " y = \"Percent Valid\""
#> [540] " )"
#> [541] "```"
#> [542] ""
#> [543] "More importantly, the number of distinct values decreased each stage. We were"
#> [544] "able to confidently change many distinct invalid values to their valid"
#> [545] "equivalent."
#> [546] ""
#> [547] "```{r bar-distinct, echo=FALSE}"
#> [548] "progress %>% "
#> [549] " select("
#> [550] " stage, "
#> [551] " all = n_distinct,"
#> [552] " bad = n_diff"
#> [553] " ) %>% "
#> [554] " mutate(good = all - bad) %>% "
#> [555] " pivot_longer(c(\"good\", \"bad\")) %>% "
#> [556] " mutate(name = name == \"good\") %>% "
#> [557] " ggplot(aes(x = stage, y = value)) +"
#> [558] " geom_col(aes(fill = name)) +"
#> [559] " scale_fill_brewer(palette = \"Dark2\", direction = -1) +"
#> [560] " scale_y_continuous(labels = comma) +"
#> [561] " theme(legend.position = \"bottom\") +"
#> [562] " labs("
#> [563] " title = \"Vermont City Normalization Progress\","
#> [564] " subtitle = \"Distinct values, valid and invalid\","
#> [565] " x = \"Stage\","
#> [566] " y = \"Distinct Values\","
#> [567] " fill = \"Valid\""
#> [568] " )"
#> [569] "```"
#> [570] ""
#> [571] "Before exporting, we can remove the intermediary normalization columns and"
#> [572] "rename all added variables with the `_clean` suffix."
#> [573] ""
#> [574] "```{r clean-select}"
#> [575] "vtc <- vtc %>% "
#> [576] " select("
#> [577] " -city_norm,"
#> [578] " -city_swap,"
#> [579] " city_clean = city_refine"
#> [580] " ) %>% "
#> [581] " rename_all(~str_replace(., \"_norm\", \"_clean\")) %>% "
#> [582] " rename_all(~str_remove(., \"_raw\")) %>% "
#> [583] " relocate(address_clean, city_clean, state_clean, .before = zip_clean)"
#> [584] "```"
#> [585] ""
#> [586] "## Conclude"
#> [587] ""
#> [588] "```{r clean-glimpse}"
#> [589] "glimpse(sample_n(vtc, 1000))"
#> [590] "```"
#> [591] ""
#> [592] "1. There are `r comma(nrow(vtc))` records in the database."
#> [593] "1. There are `r comma(sum(vtc$dupe_flag))` duplicate records in the database."
#> [594] "1. The range and distribution of `amount` and `date` seem reasonable."
#> [595] "1. There are `r comma(sum(vtc$na_flag))` records missing key variables."
#> [596] "1. Consistency in geographic data has been improved with `campfin::normal_*()`."
#> [597] "1. The 4-digit `year` variable has been created with `lubridate::year()`."
#> [598] ""
#> [599] "## Export"
#> [600] ""
#> [601] "Now the file can be saved on disk for upload to the Accountability server. We"
#> [602] "will name the object using a date range of the records included."
#> [603] ""
#> [604] "```{r clean-timestamp}"
#> [605] "min_dt <- str_remove_all(min(vtc$date), \"-\")"
#> [606] "max_dt <- str_remove_all(max(vtc$date), \"-\")"
#> [607] "csv_ts <- paste(min_dt, max_dt, sep = \"-\")"
#> [608] "```"
#> [609] ""
#> [610] "```{r clean-dir}"
#> [611] "clean_dir <- dir_create(here(\"state\", \"vt\", \"contribs\", \"data\", \"clean\"))"
#> [612] "clean_csv <- path(clean_dir, glue(\"vt_contribs_{csv_ts}.csv\"))"
#> [613] "clean_rds <- path_ext_set(clean_csv, \"rds\")"
#> [614] "basename(clean_csv)"
#> [615] "```"
#> [616] ""
#> [617] "```{r clean-write}"
#> [618] "write_csv(vtc, clean_csv, na = \"\")"
#> [619] "write_rds(vtc, clean_rds, compress = \"xz\")"
#> [620] "(clean_size <- file_size(clean_csv))"
#> [621] "```"
#> [622] ""
#> [623] "## Upload"
#> [624] ""
#> [625] "We can use the `aws.s3::put_object()` to upload the text file to the IRW server."
#> [626] ""
#> [627] "```{r aws-upload, eval=FALSE}"
#> [628] "aws_key <- path(\"csv\", basename(clean_csv))"
#> [629] "if (!object_exists(aws_key, \"publicaccountability\")) {"
#> [630] " put_object("
#> [631] " file = clean_csv,"
#> [632] " object = aws_key, "
#> [633] " bucket = \"publicaccountability\","
#> [634] " acl = \"public-read\","
#> [635] " show_progress = TRUE,"
#> [636] " multipart = TRUE"
#> [637] " )"
#> [638] "}"
#> [639] "aws_head <- head_object(aws_key, \"publicaccountability\")"
#> [640] "(aws_size <- as_fs_bytes(attr(aws_head, \"content-length\")))"
#> [641] "unname(aws_size == clean_size)"
#> [642] "```"
use_diary("DC", "expends", "Kiernan Nicholls", tempfile(), auto = FALSE)
#> [1] "---"
#> [2] "title: \"District Of Columbia Expenditures\""
#> [3] "author: \"Kiernan Nicholls\""
#> [4] "date: \"`r date()`\""
#> [5] "output:"
#> [6] " github_document: "
#> [7] " df_print: tibble"
#> [8] " toc: true"
#> [9] " toc_depth: 3"
#> [10] "editor_options: "
#> [11] " chunk_output_type: console"
#> [12] "---"
#> [13] ""
#> [14] "<!-- Place comments regarding knitting here -->"
#> [15] ""
#> [16] "```{r setup, include=FALSE, purl=FALSE}"
#> [17] "library(knitr)"
#> [18] "opts_chunk$set("
#> [19] " eval = TRUE,"
#> [20] " echo = TRUE,"
#> [21] " warning = FALSE,"
#> [22] " message = FALSE,"
#> [23] " error = FALSE,"
#> [24] " collapse = TRUE,"
#> [25] " comment = \"#>\","
#> [26] " fig.path = \"../plots/\","
#> [27] " fig.width = 10,"
#> [28] " dpi = 300"
#> [29] ")"
#> [30] "if (!interactive()) {"
#> [31] " options(width = 120)"
#> [32] " set.seed(5)"
#> [33] "}"
#> [34] "```"
#> [35] ""
#> [36] "```{r create-docs-dir, eval=FALSE, echo=FALSE, include=FALSE}"
#> [37] "doc_dir <- fs::dir_create(here::here(\"state\", \"dc\", \"expends\", \"docs\"))"
#> [38] "```"
#> [39] ""
#> [40] "## Project"
#> [41] ""
#> [42] "The Accountability Project is an effort to cut across data silos and give"
#> [43] "journalists, policy professionals, activists, and the public at large a simple"
#> [44] "way to search across huge volumes of public data about people and organizations."
#> [45] ""
#> [46] "Our goal is to standardize public data on a few key fields by thinking of each"
#> [47] "dataset row as a transaction. For each transaction there should be (at least) 3"
#> [48] "variables:"
#> [49] ""
#> [50] "1. All **parties** to a transaction."
#> [51] "2. The **date** of the transaction."
#> [52] "3. The **amount** of money involved."
#> [53] ""
#> [54] "## Objectives"
#> [55] ""
#> [56] "This document describes the process used to complete the following objectives:"
#> [57] ""
#> [58] "1. How many records are in the database?"
#> [59] "1. Check for entirely duplicated records."
#> [60] "1. Check ranges of continuous variables."
#> [61] "1. Is there anything blank or missing?"
#> [62] "1. Check for consistency issues."
#> [63] "1. Create a five-digit ZIP Code called `zip`."
#> [64] "1. Create a `year` field from the transaction date."
#> [65] "1. Make sure there is data on both parties to a transaction."
#> [66] ""
#> [67] "## Packages"
#> [68] ""
#> [69] "The following packages are needed to collect, manipulate, visualize, analyze,"
#> [70] "and communicate these results. The `pacman` package will facilitate their"
#> [71] "installation and attachment."
#> [72] ""
#> [73] "```{r load-packages, message=FALSE, warning=FALSE, error=FALSE}"
#> [74] "if (!require(\"pacman\")) {"
#> [75] " install.packages(\"pacman\")"
#> [76] "}"
#> [77] "pacman::p_load("
#> [78] " tidyverse, # data manipulation"
#> [79] " lubridate, # datetime strings"
#> [80] " gluedown, # printing markdown"
#> [81] " jsonlite, # read json files"
#> [82] " janitor, # clean data frames"
#> [83] " campfin, # custom irw tools"
#> [84] " aws.s3, # aws cloud storage"
#> [85] " readxl, # read excel files"
#> [86] " refinr, # cluster & merge"
#> [87] " scales, # format strings"
#> [88] " knitr, # knit documents"
#> [89] " rvest, # scrape html"
#> [90] " glue, # code strings"
#> [91] " here, # project paths"
#> [92] " httr, # http requests"
#> [93] " fs # local storage "
#> [94] ")"
#> [95] "```"
#> [96] ""
#> [97] "This diary was run using `campfin` version `r packageVersion(\"campfin\")`."
#> [98] ""
#> [99] "```{r campfin-version}"
#> [100] "packageVersion(\"campfin\")"
#> [101] "```"
#> [102] ""
#> [103] "```{r package-options, echo=FALSE}"
#> [104] "options(knitr.kable.NA = \"\")"
#> [105] "```"
#> [106] ""
#> [107] "This document should be run as part of the `R_tap` project, which lives as a"
#> [108] "sub-directory of the more general, language-agnostic"
#> [109] "[`irworkshop/accountability_datacleaning`][tap] GitHub repository."
#> [110] ""
#> [111] "The `R_tap` project uses the [RStudio projects][rproj] feature and should be"
#> [112] "run as such. The project also uses the dynamic `here::here()` tool for file"
#> [113] "paths relative to _your_ machine."
#> [114] ""
#> [115] "```{r where-here}"
#> [116] "# where does this document knit?"
#> [117] "here::i_am(\"state/dc/expends/docs/dc_expends_diary.Rmd\")"
#> [118] "```"
#> [119] ""
#> [120] "[tap]: https://github.com/irworkshop/accountability_datacleaning"
#> [121] "[rproj]: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects"
#> [122] ""
#> [123] "## Source"
#> [124] ""
#> [125] ""
#> [126] ""
#> [127] "## Download"
#> [128] ""
#> [129] "```{r raw-dir}"
#> [130] "raw_url <- \"https://example.com/source_file.csv\""
#> [131] "raw_dir <- dir_create(here(\"state\", \"dc\", \"expends\", \"data\", \"raw\"))"
#> [132] "raw_csv <- path(raw_dir, basename(raw_url))"
#> [133] "```"
#> [134] ""
#> [135] "```{r raw-download}"
#> [136] "if (!file_exists(raw_csv)) {"
#> [137] " download.file(raw_url, raw_csv)"
#> [138] "}"
#> [139] "```"
#> [140] ""
#> [141] "## Read"
#> [142] ""
#> [143] "```{r raw-read}"
#> [144] "dce <- read_delim("
#> [145] " file = raw_csv,"
#> [146] " delim = \",\","
#> [147] " escape_backslash = FALSE,"
#> [148] " escape_double = FALSE,"
#> [149] " col_types = cols("
#> [150] " .default = col_character(),"
#> [151] " date = col_date_mdy(),"
#> [152] " amount = col_double()"
#> [153] " )"
#> [154] ")"
#> [155] "```"
#> [156] ""
#> [157] "```{r clean-names}"
#> [158] "dce <- clean_names(dce, case = \"snake\")"
#> [159] "```"
#> [160] ""
#> [161] "## Explore"
#> [162] ""
#> [163] "There are `r comma(nrow(dce))` rows of `r ncol(dce)` columns. Each record"
#> [164] "represents a single Expenditures..."
#> [165] ""
#> [166] "```{r glimpse}"
#> [167] "glimpse(dce)"
#> [168] "tail(dce)"
#> [169] "```"
#> [170] ""
#> [171] "### Missing"
#> [172] ""
#> [173] "Columns vary in their degree of missing values."
#> [174] ""
#> [175] "```{r na-count}"
#> [176] "col_stats(dce, count_na)"
#> [177] "```"
#> [178] ""
#> [179] "We can flag any record missing a key variable needed to identify a transaction."
#> [180] ""
#> [181] "```{r na-flag}"
#> [182] "key_vars <- c(\"date\", \"last_name\", \"amount\", \"committee_name\")"
#> [183] "dce <- flag_na(dce, all_of(key_vars))"
#> [184] "sum(dce$na_flag)"
#> [185] "```"
#> [186] ""
#> [187] "```{r na-view}"
#> [188] "dce %>% "
#> [189] " filter(na_flag) %>% "
#> [190] " select(all_of(key_vars))"
#> [191] "```"
#> [192] ""
#> [193] "### Duplicates"
#> [194] ""
#> [195] "We can also flag any record completely duplicated across every column."
#> [196] ""
#> [197] "```{r dupe-flag}"
#> [198] "dce <- flag_dupes(dce, -id)"
#> [199] "sum(dce$dupe_flag)"
#> [200] "```"
#> [201] ""
#> [202] "```{r dupe-view}"
#> [203] "dce %>% "
#> [204] " filter(dupe_flag) %>% "
#> [205] " select(all_of(key_vars)) %>% "
#> [206] " arrange(date)"
#> [207] "```"
#> [208] ""
#> [209] "### Categorical"
#> [210] ""
#> [211] "```{r distinct-count}"
#> [212] "col_stats(dce, n_distinct)"
#> [213] "```"
#> [214] ""
#> [215] "```{r distinct-plots, echo=FALSE, fig.height=3}"
#> [216] "explore_plot(dce, type)"
#> [217] "```"
#> [218] ""
#> [219] "### Amounts"
#> [220] ""
#> [221] "```{r amount-round}"
#> [222] "# fix floating point precision"
#> [223] "dce$amount <- round(dce$amount, digits = 2)"
#> [224] "```"
#> [225] ""
#> [226] "```{r amount-summary}"
#> [227] "summary(dce$amount)"
#> [228] "mean(dce$amount <= 0)"
#> [229] "```"
#> [230] ""
#> [231] "These are the records with the minimum and maximum amounts."
#> [232] ""
#> [233] "```{r amount-minmax}"
#> [234] "glimpse(dce[c(which.max(dce$amount), which.min(dce$amount)), ])"
#> [235] "```"
#> [236] ""
#> [237] "The distribution of amount values is typically log-normal."
#> [238] ""
#> [239] "```{r hist-amount, echo=FALSE}"
#> [240] "dce %>%"
#> [241] " ggplot(aes(amount)) +"
#> [242] " geom_histogram(fill = dark2[\"purple\"]) +"
#> [243] " scale_y_continuous(labels = comma) +"
#> [244] " scale_x_continuous("
#> [245] " breaks = c(1 %o% 10^(0:6)),"
#> [246] " labels = dollar,"
#> [247] " trans = \"log10\""
#> [248] " ) +"
#> [249] " labs("
#> [250] " title = \"District Of Columbia Expenditures Amount Distribution\","
#> [251] " caption = \"Source: {source}\","
#> [252] " x = \"Amount\","
#> [253] " y = \"Count\""
#> [254] " )"
#> [255] "```"
#> [256] ""
#> [257] "### Dates"
#> [258] ""
#> [259] "We can add the calendar year from `date` with `lubridate::year()`"
#> [260] ""
#> [261] "```{r date-year}"
#> [262] "dce <- mutate(dce, year = year(date))"
#> [263] "```"
#> [264] ""
#> [265] "```{r date-range}"
#> [266] "min(dce$date)"
#> [267] "sum(dce$year < 2000)"
#> [268] "max(dce$date)"
#> [269] "sum(dce$date > today())"
#> [270] "```"
#> [271] ""
#> [272] "It's common to see an increase in the number of contributions in election years."
#> [273] ""
#> [274] "```{r bar-year, echo=FALSE}"
#> [275] "dce %>% "
#> [276] " count(year) %>% "
#> [277] " mutate(even = is_even(year)) %>% "
#> [278] " ggplot(aes(x = year, y = n)) +"
#> [279] " geom_col(aes(fill = even)) + "
#> [280] " scale_fill_brewer(palette = \"Dark2\") +"
#> [281] " scale_y_continuous(labels = comma) +"
#> [282] " scale_x_continuous(breaks = seq(2000, 2020, by = 2)) +"
#> [283] " theme(legend.position = \"bottom\") +"
#> [284] " labs("
#> [285] " title = \"District Of Columbia Expenditures by Year\","
#> [286] " caption = \"Source: {source}\","
#> [287] " fill = \"Election Year\","
#> [288] " x = \"Year Made\","
#> [289] " y = \"Count\""
#> [290] " )"
#> [291] "```"
#> [292] ""
#> [293] "## Wrangle"
#> [294] ""
#> [295] "To improve the searchability of the database, we will perform some consistent,"
#> [296] "confident string normalization. For geographic variables like city names and"
#> [297] "ZIP codes, the corresponding `campfin::normal_*()` functions are tailor made to "
#> [298] "facilitate this process."
#> [299] ""
#> [300] "### Address"
#> [301] ""
#> [302] "For the street `address` variable, the `campfin::normal_address()` function"
#> [303] "will force consistent case, remove punctuation, and abbreviate official "
#> [304] "USPS suffixes."
#> [305] ""
#> [306] "```{r address-norm}"
#> [307] "addr_norm <- dce %>% "
#> [308] " distinct(address1, address2) %>% "
#> [309] " unite("
#> [310] " col = address_full,"
#> [311] " starts_with(\"address\"),"
#> [312] " sep = \" \","
#> [313] " remove = FALSE,"
#> [314] " na.rm = TRUE"
#> [315] " ) %>% "
#> [316] " mutate("
#> [317] " address_norm = normal_address("
#> [318] " address = address_full,"
#> [319] " abbs = usps_street,"
#> [320] " na_rep = TRUE"
#> [321] " )"
#> [322] " ) %>% "
#> [323] " select(-address_full)"
#> [324] "```"
#> [325] ""
#> [326] "```{r address-view}"
#> [327] "addr_norm"
#> [328] "```"
#> [329] ""
#> [330] "```{r address-join}"
#> [331] "dce <- left_join(dce, addr_norm, by = c(\"address1\", \"address2\"))"
#> [332] "```"
#> [333] ""
#> [334] "### ZIP"
#> [335] ""
#> [336] "For ZIP codes, the `campfin::normal_zip()` function will attempt to create"
#> [337] "valid _five_ digit codes by removing the ZIP+4 suffix and returning leading"
#> [338] "zeroes dropped by other programs like Microsoft Excel."
#> [339] ""
#> [340] "```{r zip-norm}"
#> [341] "dce <- dce %>% "
#> [342] " mutate("
#> [343] " zip_norm = normal_zip("
#> [344] " zip = zip,"
#> [345] " na_rep = TRUE"
#> [346] " )"
#> [347] " )"
#> [348] "```"
#> [349] ""
#> [350] "```{r zip-progress}"
#> [351] "progress_table("
#> [352] " dce$zip,"
#> [353] " dce$zip_norm,"
#> [354] " compare = valid_zip"
#> [355] ")"
#> [356] "```"
#> [357] ""
#> [358] "### State"
#> [359] ""
#> [360] "Valid two digit state abbreviations can be made using the "
#> [361] "`campfin::normal_state()` function."
#> [362] ""
#> [363] "```{r state-norm}"
#> [364] "dce <- dce %>% "
#> [365] " mutate("
#> [366] " state_norm = normal_state("
#> [367] " state = state,"
#> [368] " abbreviate = TRUE,"
#> [369] " na_rep = TRUE,"
#> [370] " valid = valid_state"
#> [371] " )"
#> [372] " )"
#> [373] "```"
#> [374] ""
#> [375] "```{r state-view}"
#> [376] "dce %>% "
#> [377] " filter(state != state_norm) %>% "
#> [378] " count(state, state_norm, sort = TRUE)"
#> [379] "```"
#> [380] ""
#> [381] "```{r state-progress}"
#> [382] "progress_table("
#> [383] " dce$state,"
#> [384] " dce$state_norm,"
#> [385] " compare = valid_state"
#> [386] ")"
#> [387] "```"
#> [388] ""
#> [389] "### City"
#> [390] ""
#> [391] "Cities are the most difficult geographic variable to normalize, simply due to"
#> [392] "the wide variety of valid cities and formats."
#> [393] ""
#> [394] "#### Normal"
#> [395] ""
#> [396] "The `campfin::normal_city()` function is a good start, again converting case,"
#> [397] "removing punctuation, but _expanding_ USPS abbreviations. We can also remove"
#> [398] "`invalid_city` values."
#> [399] ""
#> [400] "```{r city-norm}"
#> [401] "norm_city <- dce %>% "
#> [402] " distinct(city, state_norm, zip_norm) %>% "
#> [403] " mutate("
#> [404] " city_norm = normal_city("
#> [405] " city = city, "
#> [406] " abbs = usps_city,"
#> [407] " states = c(\"DC\", \"DC\", \"DISTRICT OF COLUMBIA\"),"
#> [408] " na = invalid_city,"
#> [409] " na_rep = TRUE"
#> [410] " )"
#> [411] " )"
#> [412] "```"
#> [413] ""
#> [414] "#### Swap"
#> [415] ""
#> [416] "We can further improve normalization by comparing our normalized value"
#> [417] "against the _expected_ value for that record's state abbreviation and ZIP code."
#> [418] "If the normalized value is either an abbreviation for or very similar to the"
#> [419] "expected value, we can confidently swap those two."
#> [420] ""
#> [421] "```{r city-swap}"
#> [422] "norm_city <- norm_city %>% "
#> [423] " rename(city_raw = city) %>% "
#> [424] " left_join("
#> [425] " y = zipcodes,"
#> [426] " by = c("
#> [427] " \"state_norm\" = \"state\","
#> [428] " \"zip_norm\" = \"zip\""
#> [429] " )"
#> [430] " ) %>% "
#> [431] " rename(city_match = city) %>% "
#> [432] " mutate("
#> [433] " match_abb = is_abbrev(city_norm, city_match),"
#> [434] " match_dist = str_dist(city_norm, city_match),"
#> [435] " city_swap = if_else("
#> [436] " condition = !is.na(match_dist) & (match_abb | match_dist == 1),"
#> [437] " true = city_match,"
#> [438] " false = city_norm"
#> [439] " )"
#> [440] " ) %>% "
#> [441] " select("
#> [442] " -city_match,"
#> [443] " -match_dist,"
#> [444] " -match_abb"
#> [445] " )"
#> [446] "```"
#> [447] ""
#> [448] "```{r city-rejoin}"
#> [449] "dce <- left_join("
#> [450] " x = dce,"
#> [451] " y = norm_city,"
#> [452] " by = c("
#> [453] " \"city\" = \"city_raw\", "
#> [454] " \"state_norm\", "
#> [455] " \"zip_norm\""
#> [456] " )"
#> [457] ")"
#> [458] "```"
#> [459] ""
#> [460] "#### Refine"
#> [461] ""
#> [462] "The [OpenRefine][or] algorithms can be used to group similar strings and replace"
#> [463] "the less common versions with their most common counterpart. This can greatly"
#> [464] "reduce inconsistency, but with low confidence; we will only keep any refined"
#> [465] "strings that have a valid city/state/zip combination."
#> [466] ""
#> [467] "[or]: https://openrefine.org/"
#> [468] ""
#> [469] "```{r city-refine}"
#> [470] "good_refine <- dce %>% "
#> [471] " mutate("
#> [472] " city_refine = city_swap %>% "
#> [473] " key_collision_merge() %>% "
#> [474] " n_gram_merge(numgram = 1)"
#> [475] " ) %>% "
#> [476] " filter(city_refine != city_swap) %>% "
#> [477] " inner_join("
#> [478] " y = zipcodes,"
#> [479] " by = c("
#> [480] " \"city_refine\" = \"city\","
#> [481] " \"state_norm\" = \"state\","
#> [482] " \"zip_norm\" = \"zip\""
#> [483] " )"
#> [484] " )"
#> [485] "```"
#> [486] ""
#> [487] "```{r city-count, echo=FALSE}"
#> [488] "good_refine %>%"
#> [489] " count("
#> [490] " state_norm, "
#> [491] " zip_norm, "
#> [492] " city_swap, "
#> [493] " city_refine,"
#> [494] " sort = TRUE"
#> [495] " )"
#> [496] "```"
#> [497] ""
#> [498] "Then we can join the refined values back to the database."
#> [499] ""
#> [500] "```{r city-join}"
#> [501] "dce <- dce %>% "
#> [502] " left_join(good_refine, by = names(.)) %>% "
#> [503] " mutate(city_refine = coalesce(city_refine, city_swap))"
#> [504] "```"
#> [505] ""
#> [506] "#### Progress"
#> [507] ""
#> [508] "Our goal for normalization was to increase the proportion of city values known"
#> [509] "to be valid and reduce the total distinct values by correcting misspellings."
#> [510] ""
#> [511] "```{r city-progress, echo=FALSE}"
#> [512] "many_city <- c(valid_city, extra_city)"
#> [513] "progress <- progress_table("
#> [514] " str_to_upper(dce$city),"
#> [515] " dce$city_norm,"
#> [516] " dce$city_swap,"
#> [517] " dce$city_refine,"
#> [518] " compare = many_city"
#> [519] ") %>% mutate(stage = as_factor(stage))"
#> [520] "progress %>% "
#> [521] " mutate(across(stage, md_code)) %>% "
#> [522] " kable(digits = 3)"
#> [523] "```"
#> [524] ""
#> [525] "You can see how the percentage of valid values increased with each stage."
#> [526] ""
#> [527] "```{r bar-progress, echo=FALSE}"
#> [528] "raw_in <- percent(prop_in(dce$city, valid_city))"
#> [529] "progress %>% "
#> [530] " ggplot(aes(x = stage, y = prop_in)) +"
#> [531] " geom_hline(yintercept = 0.99) +"
#> [532] " geom_col(fill = dark2[\"purple\"]) +"
#> [533] " coord_cartesian(ylim = c(0.75, 1)) +"
#> [534] " scale_y_continuous(labels = percent) +"
#> [535] " labs("
#> [536] " title = \"District Of Columbia City Normalization Progress\","
#> [537] " subtitle = glue(\"Raw at {raw_in} before conversion to uppercase\"),"
#> [538] " x = \"Stage\","
#> [539] " y = \"Percent Valid\""
#> [540] " )"
#> [541] "```"
#> [542] ""
#> [543] "More importantly, the number of distinct values decreased each stage. We were"
#> [544] "able to confidently change many distinct invalid values to their valid"
#> [545] "equivalent."
#> [546] ""
#> [547] "```{r bar-distinct, echo=FALSE}"
#> [548] "progress %>% "
#> [549] " select("
#> [550] " stage, "
#> [551] " all = n_distinct,"
#> [552] " bad = n_diff"
#> [553] " ) %>% "
#> [554] " mutate(good = all - bad) %>% "
#> [555] " pivot_longer(c(\"good\", \"bad\")) %>% "
#> [556] " mutate(name = name == \"good\") %>% "
#> [557] " ggplot(aes(x = stage, y = value)) +"
#> [558] " geom_col(aes(fill = name)) +"
#> [559] " scale_fill_brewer(palette = \"Dark2\", direction = -1) +"
#> [560] " scale_y_continuous(labels = comma) +"
#> [561] " theme(legend.position = \"bottom\") +"
#> [562] " labs("
#> [563] " title = \"District Of Columbia City Normalization Progress\","
#> [564] " subtitle = \"Distinct values, valid and invalid\","
#> [565] " x = \"Stage\","
#> [566] " y = \"Distinct Values\","
#> [567] " fill = \"Valid\""
#> [568] " )"
#> [569] "```"
#> [570] ""
#> [571] "Before exporting, we can remove the intermediary normalization columns and"
#> [572] "rename all added variables with the `_clean` suffix."
#> [573] ""
#> [574] "```{r clean-select}"
#> [575] "dce <- dce %>% "
#> [576] " select("
#> [577] " -city_norm,"
#> [578] " -city_swap,"
#> [579] " city_clean = city_refine"
#> [580] " ) %>% "
#> [581] " rename_all(~str_replace(., \"_norm\", \"_clean\")) %>% "
#> [582] " rename_all(~str_remove(., \"_raw\")) %>% "
#> [583] " relocate(address_clean, city_clean, state_clean, .before = zip_clean)"
#> [584] "```"
#> [585] ""
#> [586] "## Conclude"
#> [587] ""
#> [588] "```{r clean-glimpse}"
#> [589] "glimpse(sample_n(dce, 1000))"
#> [590] "```"
#> [591] ""
#> [592] "1. There are `r comma(nrow(dce))` records in the database."
#> [593] "1. There are `r comma(sum(dce$dupe_flag))` duplicate records in the database."
#> [594] "1. The range and distribution of `amount` and `date` seem reasonable."
#> [595] "1. There are `r comma(sum(dce$na_flag))` records missing key variables."
#> [596] "1. Consistency in geographic data has been improved with `campfin::normal_*()`."
#> [597] "1. The 4-digit `year` variable has been created with `lubridate::year()`."
#> [598] ""
#> [599] "## Export"
#> [600] ""
#> [601] "Now the file can be saved on disk for upload to the Accountability server. We"
#> [602] "will name the object using a date range of the records included."
#> [603] ""
#> [604] "```{r clean-timestamp}"
#> [605] "min_dt <- str_remove_all(min(dce$date), \"-\")"
#> [606] "max_dt <- str_remove_all(max(dce$date), \"-\")"
#> [607] "csv_ts <- paste(min_dt, max_dt, sep = \"-\")"
#> [608] "```"
#> [609] ""
#> [610] "```{r clean-dir}"
#> [611] "clean_dir <- dir_create(here(\"state\", \"dc\", \"expends\", \"data\", \"clean\"))"
#> [612] "clean_csv <- path(clean_dir, glue(\"dc_expends_{csv_ts}.csv\"))"
#> [613] "clean_rds <- path_ext_set(clean_csv, \"rds\")"
#> [614] "basename(clean_csv)"
#> [615] "```"
#> [616] ""
#> [617] "```{r clean-write}"
#> [618] "write_csv(dce, clean_csv, na = \"\")"
#> [619] "write_rds(dce, clean_rds, compress = \"xz\")"
#> [620] "(clean_size <- file_size(clean_csv))"
#> [621] "```"
#> [622] ""
#> [623] "## Upload"
#> [624] ""
#> [625] "We can use the `aws.s3::put_object()` to upload the text file to the IRW server."
#> [626] ""
#> [627] "```{r aws-upload, eval=FALSE}"
#> [628] "aws_key <- path(\"csv\", basename(clean_csv))"
#> [629] "if (!object_exists(aws_key, \"publicaccountability\")) {"
#> [630] " put_object("
#> [631] " file = clean_csv,"
#> [632] " object = aws_key, "
#> [633] " bucket = \"publicaccountability\","
#> [634] " acl = \"public-read\","
#> [635] " show_progress = TRUE,"
#> [636] " multipart = TRUE"
#> [637] " )"
#> [638] "}"
#> [639] "aws_head <- head_object(aws_key, \"publicaccountability\")"
#> [640] "(aws_size <- as_fs_bytes(attr(aws_head, \"content-length\")))"
#> [641] "unname(aws_size == clean_size)"
#> [642] "```"
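Because a path of NA returns the template as a character vector, the lines can also be written by hand to any location. A hedged sketch using base writeLines() and an illustrative temporary file name:

# illustrative only: capture the template lines and write them manually
dc_lines <- use_diary("DC", "expends", "Kiernan Nicholls", NA, auto = FALSE)
writeLines(dc_lines, file.path(tempdir(), "dc_expends_diary.Rmd"))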