Jonathan Falcon

Cleaning the Iowa Liquor Sales Dataset

This R coding project aimed to clean the Iowa Liquor Sales dataset. Missing data was imputed using appropriate logic and methods, and categories were condensed by merging similar ones for ease of analysis and meaningful interpretation.

Background

This data is taken from the State of Iowa’s publicly available data website. It was and still is collected by the Iowa Department of Commerce, Alcoholic Beverages Division, who publishes the data. The page for the dataset can be accessed here: https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy.

The data I’ll be working with is the following:

Original Column NameNew Column NameDescription
DatedateDate of order
CountycountyCounty where the store who ordered the liquor is located
CitycityCity where the store who ordered the liquor is located
Zip CodezipCodeZip code where the store who ordered the liquor is located
Store NumberstoreUnique number assigned to the store who ordered the liquor
Vendor NumbervendorThe vendor number of the company for the brand of liquor ordered
Category NamecategoryCategory of the liquor ordered
Item DescriptionitemDescription of the individual liquor product ordered
Sales (Dollars)salesTotal cost of liquor order (number of bottles multiplied by the state bottle retail)
State Bottle CostbottleCostThe amount that Alcoholic Beverages Division paid for each bottle of liquor ordered
State Bottle RetailbottleRetailThe amount the store paid for each bottle of liquor ordered

Preliminaries

Loading the Required Libraries

These two packages are required to better work with the data.

library(tidyverse)
library(lubridate)

Reading in and Sampling the Data

Here, we first create a named vector, consisting of the desired names for our columns and the names of the columns in the dataset.

selectedVars <- c(date = "Date",
                  store = "Store Number",
                  city = "City",
                  county = "County",
                  vendor = "Vendor Number",
                  bottleCost = "State Bottle Cost",
                  bottleRetail = "State Bottle Retail",
                  zipCode = "Zip Code",
                  category = "Category Name",
                  item = "Item Description",
                  sales = "Sale (Dollars)")

Before we sample the data, we set the seed to ensure reproducibility, otherwise the results of the analysis would differ each run despite the code being identical. In this instance, I used the due date as the seed.

After setting the seed, we read in the dataset using a readr function and pass it using a pipe %>% to a dplyr function to sample a fraction of the dataset, as working with a 6 gigabyte file is needlessly time-consuming. This function samples 25% of the available data.

set.seed(11282022)

liquorSales <- read_delim("./Data/Iowa_Liquor_Sales_Sample.tsv",
                          col_select = all_of(selectedVars),
                          col_types = cols("Zip Code" = "c",
                                           "Store Number" = "c",
                                           "Vendor Number" = "c")) %>%
  sample_frac(0.25)

Cleaning the Data

Converting date to Date Format

This step is relatively simple, but extremely important if we are to perform any time-series analyses.

liquorSales$date <- mdy(liquorSales$date)

Collapsing the Liquor Categories

There were too many different categories of liquor for any meaningful analyses, so collapsing the categories into twelve categories makes it easier. The exact item is still preserved though.

This first step creates a named vector like earlier. The name is structured as a regular expression—regex for short.

Breaking it down, .* means any character . zero or more times *. The vertical bar | is short for “or” in this case, e.g., tequila OR mezcal. All together, this finds matches for broad categories of alcohol.

patternList <- c(".*vodka.*" = "Vodka",
                 ".*schnapps.*" = "Schnapps",
                 ".*rum.*" = "Rum",
                 ".*tequila|mezcal.*" = "Tequila",
                 ".*gin.*" = "Gin",
                 ".*whisk|scotch|bourbon.*" = "Whiskey/Whisky",
                 ".*brand.*" = "Brandy",
                 ".*liqueur|cordial|creme|triple|anisette|amaretto.*" = "Liqueur/Cordial",
                 ".*spirit.*" = "Spirit",
                 ".*cocktail|rock.*" = "Cocktail",
                 ".*beer.*" = "Beer",
                 ".*special|american alcohol|delisted|holiday|iowa.*" = "Misc")

Here, we use a for loop to iterate over each broad category of alcohol listed above, searching for and substituting regex matches with the corresponding values. For example, categories that have some form of tequila or mezcal in it will be reassigned “Tequila.”

for (n in 1:length(patternList)) {
  liquorSales$category <- liquorSales$category %>%
    gsub(pattern = names(patternList[n]), ignore.case = TRUE, replacement = patternList[[n]])
}

All together, this section aimed to condense the number of alcohol categories to provide a more reasonable amount to analyze.

Manually Assigning Liquor Categories Based on Item

Now, admittedly, this section used pure brute force. Is there a more elegant and efficient solution? Perhaps, but I couldn’t find it in time.

In this first part, we select the category and item columns from liquorSales, keep observations with missing categories, group by item, and count the number of observations by that grouping variable item. Lastly, we sort the summary by the item count descending.

liquorSales %>%
  select(category, item) %>%
  filter(is.na(category)) %>%
  group_by(item) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

Using the above, we come up with these categories and assigned items.

vodka <- "(VODKA|BURNETTS|VAN GOGH|DEEP EDDY|GREY GOOSE|SKYY|PINNACLE|SMIRNOFF|ABSOLUT|SVEDKA|STOLICHNAYA|PEARL|360 |FOREX|BELVEDERE|BURNETT'S|CIROC|KETEL ONE|NEW AMSTERDAM|PRIDE OF THE WAPSI|SOBIESKI|THREE OLIVES|TWENTY GRAND GOLD)"
schnapps <- "(SCHNAPPS|DEKUYPER)"
rum <- "(RUM|BACARDI|CAPTAIN MORGAN|ADMIRAL NELSON|BLUE CHAIR|CRUZAN|APPLETON ESTATE RESERVE|SORGHURM|BRUGAL|CALICHE|CALICO|CHICK'S|DEL MAGUEY|DIPLOMATICO|DON Q LIMON|GOSLING'S|KRAKEN|LAWLESS|LEBLO|PALO VIEJO|RON ABUELO|RONDIAZ)"
tequila <- "(TEQUILA|MEZCAL|REPOSADO|PATRON|EL MAYOR|HORNITOS|AGAVE|PELIGROSO|1800 SILVER|DELEON|CORRALEJO|ANEJO|EXOTICO|LUNAZUL|MONTELOBOS|SAUZA CIEN ANOS BLANCO|TARANTULA)"
gin <- "(GIN|BOMBAY|TANQUERAY|BEEFEATER|FIFTY POUNDS)"
whiskey_whisky <- "(WHISKEY|GLENDRONACH|WHISKY|CROWN ROYAL|RYE|BLACK VELVET|MAKER'S MARK|SCOTCH|BOURBON|KNOB CREEK|WOODFORD RESERVE|GLENFIDDICH|SOUTHERN COMFORT|CRAIGELLACHIE|MALT|FIREBALL|JACK DANIEL'S|JACK DANIELS|ARDBEG|ABERLOUR|BIRD DOG|JIM BEAM|GENTLEMAN JACK|HENNESSY|JAMESON|HA |CODY ROAD|WHITE DOG|WILD TURKEY|CORN|GRAIN|OAT|BASTILLE|BENROMACH|BERENTZEN|CANADIAN HUNTER|CHIVAS REGAL|COLD SPELL MINT|CUTTY|DEANSTON|EAGLE RARE|FORTY CREEK|GLEN MORAY|GLENGOYNE|GLENLIVET|GREAT GLEN|JEFFERSON'S RESERVE|KAVALAN|KOVAL|LAPHROAIG|MAKERS MARK|MCCLELLAND'S|MICHTERS|OLD FORESTER|OLD MEDLEY|RED STAG|REVEL STOKE|RICH & RARE APPLE|RUSSELL'S RESERVE|TIM SMITH'S CLIMAX FIRE|TWO CASKS)"
brandy <- "(BRANDY|E & J|E&J|GRAPPA|PAUL MASSON|COGANC|COGNAC|ARMAGNAC|5 YEAR IRISH SINGLE GRAIN|BARSOL SELECTO ITALIA PISCO|CALVADOS|BUSCA|CHRISTIAN|COURVOISIER|CYNAR|D'USSE|ENCANTO|FIVE STAR PIRATE SWORD|PISCO|REMY MARTIN|TERRY CENTENARIO)"
liqueur_cordial <- "(LIQUER|LIQUEUR|CORDIAL|CREME|TRIPLE|ANISETTE|99 PINEAPPLE MINI|JAGERMEISTER|LIMONCELLO|CREAM|99|VALENTINO|ROTHMAN & WINTER|DR. MCGILLICUDDY'S|GRAND MARNIER|AMARETTO|DISARONNO|ALIZE|ANNIE'S APPLE PIE|AMARO|AVION ESPRESSO|CHERRY HEERING|COCOYAC|DI AMORE|DOMAINE DE CANTON|FRENCH KISS TROPICAL|GALLIANO|GAMLE ODE|KAHLUA|LICOR|MARASKA|MARIE BRIZARD|NORTH SHORE AQUAVIT|NUX ALPINA|PRAIRIE WOLF DARK|RICURA|THE BITTER TRUTH PIMENTO DRAM|TRAVIS HASSE'S)"
spirit <- "(SPIRIT|MOONSHINE|OLE SMOKY|EVERCLEAR|MIDNIGHT MOON|ABSINTHE|YENI RAKI|AALBORG|TY KU)"
cocktail <- "(COCKTAIL|ROCK|TGIFRIDAYS|INDIAN SUMMER|JOSE CUERVO AUTHENTIC LIME MARGARITA POUCH|JOSE CUERVO AUTHENTIC TEAGARITA MARGARITA 4PK CANS|PINA COLADA|SKINNYGIRL)"
beer <- "(BEER)"
misc <- "(SPECIAL|AMERICAN ALCOHOL|DESLISTED|HOLIDAY|IOWA|EGG NOG|SANGRIA|BARREL BUFFALO BILL|LA COLLECTION)"

After, we use another for loop to assign those categories. This time, the loop iterates over every row in liquorSales. To prevent the operation from overwriting whole data or running on observations with a missing item, a conditional is applied that checks if category is NA and if item is not NA. Then, the waterfall of if and else if statements checks if a match is found and assigns a category if there is a match. If no match is found for the first category, it checks the next category and so on.

for (x in 1:nrow(liquorSales)) {
  if (is.na(liquorSales$category[x]) && !is.na(liquorSales$item[x])) {
    if (str_detect(liquorSales$item[x], vodka)) {
      liquorSales$category[x] <- "Vodka"
    } else if (str_detect(liquorSales$item[x], schnapps)) {
      liquorSales$category[x] <- "Schnapps"
    } else if (str_detect(liquorSales$item[x], rum)) {
      liquorSales$category[x] <- "Rum"
    } else if (str_detect(liquorSales$item[x], tequila)) {
      liquorSales$category[x] <- "Tequila"
    } else if (str_detect(liquorSales$item[x], gin)) {
      liquorSales$category[x] <- "Gin"
    } else if (str_detect(liquorSales$item[x], whiskey_whisky)) {
      liquorSales$category[x] <- "Whiskey"
    } else if (str_detect(liquorSales$item[x], brandy)) {
      liquorSales$category[x] <- "Brandy"
    } else if (str_detect(liquorSales$item[x], liqueur_cordial)) {
      liquorSales$category[x] <- "Liqueur/Cordial"
    } else if (str_detect(liquorSales$item[x], spirit)) {
      liquorSales$category[x] <- "Spirit"
    } else if (str_detect(liquorSales$item[x], cocktail)) {
      liquorSales$category[x] <- "Cocktail"
    } else if (str_detect(liquorSales$item[x], beer)) {
      liquorSales$category[x] <- "Beer"
    } else if (str_detect(liquorSales$item[x], misc)) {
      liquorSales$category[x] <- "Misc"
    } 
  }
}

Exploratory Analysis

Histograms

To make it convenient, we first create a custom function that creates a histogram of a chosen variable using the ggplot2 package’s geom_histogram() function.

hist <- function(var = sales, binwidth = NULL, bins = NULL) {
  liquorSales %>%
    ggplot(aes(x = {{var}})) +
    geom_histogram(binwidth = {{binwidth}}, bins = {{bins}})
}

Sales: sales

hist(sales)
Histogram of Sales

Here, we see a positively skewed distribution. 90% of the values for sales are less than or equal to $256.44, while the maximum value is $181,962.00. Taking the natural log of sales should reveal a distribution that appears more normal.

hist(log(sales))
Histogram of Log Sales

As predicted, the histogram of the natural log of sales does show a distribution that appears more normal.

State Bottle Cost: bottleCost

hist(bottleCost)
Histogram of Bottle Cost

Here, we see another positively skewed distribution. 90% of the values for bottleCost are less than or equal to $18.49, while the maximum value is $549.99. Taking the natural log of bottleCost should again reveal a distribution that appears more normal.

hist(log(bottleCost))
Histogram of Log Bottle Cost

As with sales, the histogram of the natural log of bottleCost indeed shows a distribution that appears more normal.

Retail Bottle Cost: bottleRetail

hist(bottleRetail)
Histogram of Bottle Retail

Here, we see another positively skewed distribution. 90% of the values for bottleRetail are less than or equal to $27.74, while the maximum value is $824.99. Taking the natural log of bottleRetail should again reveal a distribution that appears more normal.

hist(log(bottleRetail))
Histogram of Log Bottle Retail

As with the prior two columns, the histogram of the natural log of bottleRetail indeed shows a distribution that appears more normal.

To do this, we first need to know what day of the week each date is. We use the weekdays() function from the lubridate package and create a new column appropriately named weekday.

liquorSales <- liquorSales %>%
  mutate(weekday = weekdays(date))

Next, we use the summarise() function after grouping by category and weekday to count the number of observations by both category and weekday. After counting, we ungroup to ensure no errors in the future. We then group again by weekday and use the slice_max() function to take the top row in each weekday group.

liquorSales %>%
  group_by(weekday, category) %>%
  summarise(category_count = n()) %>%
  ungroup() %>%
  group_by(weekday) %>%
  slice_max(order_by = category_count)