Skip to contents

Item-level validation

The documentation for data validation is illustrated using a snippet of the UK CPI microdata.

head(uk_cpi, n = 3) |> gt()
Year Date of quote COICOP5 Product code Product description Reference quantity Unit of reference quantity Region Shop identifier Type of shop Quantity observed Unit of observed quantity Price observed Reference quantity price
2018 201801 1010103 210111 WHITE SLICED LOAF BRANDED 750G 750 1 South East 1 Multiple 750 1 1.00 1.00
2018 201801 1010103 210111 WHITE SLICED LOAF BRANDED 750G 750 1 South East 1 Multiple 750 1 1.00 1.00
2018 201801 1010103 210111 WHITE SLICED LOAF BRANDED 750G 750 1 North West 1 Multiple 750 1 1.45 1.45

Intra- and inter-regional/country validation

  • Two step procedure following international recommendations (ICP 2021, 26) and (World Bank 2013, 249), trying to identify and eliminate non-sampling,1 and production errors in the underlying data.
  • Intra-regional validation: prices collected are edited and verified within a region or territorial area
  • Inter-region validation: prices validation across all regions/countries, ensuring that average prices are based on comparable products across countries and that products have been accurately priced.

Intra-regional validation

  • Intra-region validation establishes that price collectors within the same region have priced products that match the product specifications and that the prices they have reported are correct. This is done in two stages, which correspond to the outlier detection of (a) individual prices and (b) average price aggregates.
Stage 1: individual price outlier statistcs
  • For each product, a Price Observation Table is obtained, containing a characterisation of the individual product as well as two individual price outlier statistics, the Ratio-to-average price test and the t-value test; see World Bank (2013), table 9.1a for an extensive example.

  • Ratio-to-average price test: The ratio of an individual price observation \(i\), \(P_{i}\), of a specific product \(j\) and the observed average price for the product, \(\mu_j\). An observed price passes the this test if the ratio is between 0.5 and 1.5, meaning it is no less than half or no more than double the mean price. This simple check flags extreme values without relying on standard deviation, which can itself be distorted by outliers (World Bank 2013, 251).

\[ ratio-to-average = p_{ij}/\mu_j \]

  • t-value test: The ratio of the deviation of an individual price observation from the average reference quantity price for the product and the standard deviation of the product. To pass the test, the ratio must be 2.0 or less (any value greater than 2.0 is suspect because it falls outside the 95 percent confidence interval).

\[ t-val = (p_{ij} - \mu_{P_j}) / \sigma_{P_j} \]

  • Individual price quotes that do not pass these tests are flagged in the price observation table. The two statistics are implemented in the OECD price statistics package OECD_SPPP. The price observation table is generated with the function valid_pot(), which calls valid_r2a and valid_tval.

Example using UK CPI microdata

#Price Observation Table
uk_pot <- uk_cpi |>
  valid_pot()

head(uk_pot, n = 3) |>
  gt() |>
  tab_header(
    title = md("**Price Observation Table**"),
    subtitle = md("Example for `item_id` = 210111")
  ) |>
  fmt_number(
    columns = c(
      `Ratio-to-average price test`,
      `T-value test`
    ),
    decimals = 1
  )
Price Observation Table
Example for item_id = 210111
Year Product code Product description Reference quantity Unit of reference quantity Date of quote Region Shop identifier Type of shop Quantity observed Unit of observed quantity Price observed Reference quantity price Ratio-to-average price test T-value test Ratio-to-average price test FLAG T-value test FLAG
2018 210111 WHITE SLICED LOAF BRANDED 750G 750 1 201801 South East 1 Multiple 750 1 1.00 1.00 1.0 −0.2 FALSE FALSE
2018 210111 WHITE SLICED LOAF BRANDED 750G 750 1 201801 South East 1 Multiple 750 1 1.00 1.00 1.0 −0.2 FALSE FALSE
2018 210111 WHITE SLICED LOAF BRANDED 750G 750 1 201801 North West 1 Multiple 750 1 1.45 1.45 1.4 1.8 FALSE FALSE

uk_pot |>
  select(
    `Product code`,
    `Ratio-to-average price test`:`T-value test`
  ) |>
  pivot_longer(`Ratio-to-average price test`:`T-value test`) |>
  mutate(
    is.outlier = case_when(name == "Ratio-to-average price test" & (value < 0.5 | value > 1.5) ~ "Test not passed",
      name == "T-value test" & (value > 2) ~ "Test not passed",
      .default = "Test passed"
    ),
    is.outlier = factor(is.outlier, levels = c("Test passed", "Test not passed"))
  ) |>
  ggplot(aes(x = value, fill = is.outlier)) +
  facet_wrap(~name, scales = "free") +
  geom_histogram() +
  labs(
    title = "Individual price outlier statistcs",
    subtitle = "White sliced loaf branded 750G, 2018",
    x = "",
    fill = ""
  ) +
  theme_minimal()+
  theme(legend.position = "top")
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Stage 2: Average price statistics
  • This stage involves identifying extreme values among the average prices of the products listed in the average price table. An extreme value is defined as an individual price or average price that for a given test scores a value that falls outside a predetermined critical value and is build on two average price outlier statistics, which are summarised in the Average Price table; see World Bank (2013), table 9.2a and 9.2b for an extensive example. The two statisticis contained in this table are the max-min ratio test and the coefficient to variation test.

  • max-min ratio test: The ratio between the maximal and minimal observed price for product \(j\). Products where the maximal observed price is more than twice as big as the minimum are flagged

\[ max-min~ratio = max(p_j)/min(p_j) \]

  • coefficient to variation test: The standard deviation for the product expressed as a percentage of the average price for the product. Products with a coefficient of variation greater than 20% will be flagged.

\[ coefficient-to-variation: \sigma_{p_j} / \mu_{p_j} \]

  • Product price quotes that do not pass these tests are flagged in the average price table and is created using function valid_apt in the OECD_SPPPs package.

Example using UK CPI microdata

# Average Price Table
uk_apt <- uk_cpi |>
  valid_apt()

head(uk_apt) |>
  gt() |>
  tab_header(
    title = md("**Average Price Table**"),
    subtitle = md("Example for `item_id` = 210111, **pre-cleaning**")
  ) |>
  fmt_number(columns = -c(Year, `Product code`),decimals = 1)
Average Price Table
Example for item_id = 210111, pre-cleaning
Year Product code Product description Number of observations Average price of product Maximum price of product Minimum price of product Standard deviation Max-min ratio Coefficient of variation Max-min ratio FLAG Coefficient of variation FLAG
2018 210111 WHITE SLICED LOAF BRANDED 750G 3,528.0 1.0 2.8 0.5 0.2 6.2 0.2 TRUE TRUE
2018 410518 CARPENTER HOURLY RATE 2,976.0 26.2 102.0 8.0 14.8 12.8 0.6 TRUE TRUE
2020 210111 WHITE SLICED LOAF BRANDED 750G 1,576.0 1.0 1.5 0.5 0.2 3.4 0.2 TRUE TRUE
2020 410518 CARPENTER HOURLY RATE 920.0 26.2 102.0 8.0 12.3 12.8 0.5 TRUE TRUE
2023 210111 WHITE SLICED LOAF BRANDED 750G 1,741.0 1.3 2.6 0.4 0.3 7.4 0.2 TRUE TRUE
2023 410518 CARPENTER HOURLY RATE 1,066.0 28.2 102.0 10.0 12.2 10.2 0.4 TRUE TRUE
Linking validation pipelines
  • The degree of validation depends on the quality of the underlying microdata. With unconsolidated and/or raw data, more extensive revisions of the raw data may be necessary.
    • For example, these would require the validation and checking of the reference quantity price.
    • For the selected product, a bimodal distribution is observed (after removing outliers).
  • Eliminating these makes selected item pass all intra-regional validation checks.
uk_irv <- uk_cpi |>
  # Apply Price Observation Table checks
  valid_pot() |>
  # Condition on price quotes which pass the Price Observation Table tests
  filter(!`Ratio-to-average price test FLAG` & !`T-value test FLAG`) |>
  # Remove bimodal distribution
  filter(`Ratio-to-average price test` > 0.8) |>
  # Apply Average Price Table checks
  valid_apt()

uk_irv |>
  gt() |>
  tab_header(
    title = md("**Average Price Table**"),
    subtitle = md("Example for `item_id` = 210111, **post-cleaning**")
  ) |>
  fmt_number(
    columns = -c(Year, `Product code`),
    decimals = 1
  )
Average Price Table
Example for item_id = 210111, post-cleaning
Year Product code Product description Number of observations Average price of product Maximum price of product Minimum price of product Standard deviation Max-min ratio Coefficient of variation Max-min ratio FLAG Coefficient of variation FLAG
2018 210111 WHITE SLICED LOAF BRANDED 750G 2,980.0 1.1 1.4 0.8 0.1 1.7 0.1 FALSE FALSE
2018 410518 CARPENTER HOURLY RATE 1,412.0 27.4 36.0 21.0 4.5 1.7 0.2 FALSE FALSE
2020 210111 WHITE SLICED LOAF BRANDED 750G 1,276.0 1.1 1.4 0.8 0.1 1.7 0.1 FALSE FALSE
2020 410518 CARPENTER HOURLY RATE 481.0 27.8 38.4 21.0 4.7 1.8 0.2 FALSE FALSE
2023 210111 WHITE SLICED LOAF BRANDED 750G 1,290.0 1.4 1.8 1.1 0.1 1.6 0.1 FALSE FALSE
2023 410518 CARPENTER HOURLY RATE 583.0 31.5 42.0 23.0 4.9 1.8 0.2 FALSE FALSE

Inter-regional validation

References

ICP. 2021. “A Guide to the Compilation of Subnational Purchasing Power Parities (PPPs).” https://thedocs.worldbank.org/en/doc/5064f2288436664bc8f9811c8a5b8c55-0050022021/original/Guide-Subnational-PPPs.pdf.
World Bank. 2013. Measuring the Real Size of the World Economy: The Framework, Methodology, and Results of the International Comparison Program ICP. Washington DC: World Bank. https://thedocs.worldbank.org/en/doc/927971487091799574-0050022017/original/ICPBookeBookFINAL.pdf.