Validation • OECDsppps

library(dplyr)
library(tidyr)
library(gt)
library(ggplot2)
library(OECDsppps)

Item-level validation

The documentation for data validation is illustrated using a snippet of the UK CPI microdata.

head(uk_cpi, n = 3) |> gt()

Year	Date of quote	COICOP5	Product code	Product description	Reference quantity	Unit of reference quantity	Region	Shop identifier	Type of shop	Quantity observed	Unit of observed quantity	Price observed	Reference quantity price
2018	201801	1010103	210111	WHITE SLICED LOAF BRANDED 750G	750	1	South East	1	Multiple	750	1	1.00	1.00
2018	201801	1010103	210111	WHITE SLICED LOAF BRANDED 750G	750	1	South East	1	Multiple	750	1	1.00	1.00
2018	201801	1010103	210111	WHITE SLICED LOAF BRANDED 750G	750	1	North West	1	Multiple	750	1	1.45	1.45

Intra- and inter-regional/country validation

Two step procedure following international recommendations (ICP 2021, 26) and (World Bank 2013, 249), trying to identify and eliminate non-sampling,¹ and production errors in the underlying data.
Intra-regional validation: prices collected are edited and verified within a region or territorial area
Inter-region validation: prices validation across all regions/countries, ensuring that average prices are based on comparable products across countries and that products have been accurately priced.

Intra-regional validation

Intra-region validation establishes that price collectors within the same region have priced products that match the product specifications and that the prices they have reported are correct. This is done in two stages, which correspond to the outlier detection of (a) individual prices and (b) average price aggregates.

Stage 1: individual price outlier statistcs

For each product, a Price Observation Table is obtained, containing a characterisation of the individual product as well as two individual price outlier statistics, the Ratio-to-average price test and the t-value test; see World Bank (2013), table 9.1a for an extensive example.
Ratio-to-average price test: The ratio of an individual price observation \(i\), \(P_{i}\), of a specific product \(j\) and the observed average price for the product, \(\mu_j\). An observed price passes the this test if the ratio is between 0.5 and 1.5, meaning it is no less than half or no more than double the mean price. This simple check flags extreme values without relying on standard deviation, which can itself be distorted by outliers (World Bank 2013, 251).

\[ ratio-to-average = p_{ij}/\mu_j \]

t-value test: The ratio of the deviation of an individual price observation from the average reference quantity price for the product and the standard deviation of the product. To pass the test, the ratio must be 2.0 or less (any value greater than 2.0 is suspect because it falls outside the 95 percent confidence interval).

\[ t-val = (p_{ij} - \mu_{P_j}) / \sigma_{P_j} \]

Individual price quotes that do not pass these tests are flagged in the price observation table. The two statistics are implemented in the OECD price statistics package OECD_SPPP. The price observation table is generated with the function valid_pot(), which calls valid_r2a and valid_tval.

Example using UK CPI microdata

#Price Observation Table
uk_pot <- uk_cpi |>
  valid_pot()

head(uk_pot, n = 3) |>
  gt() |>
  tab_header(
    title = md("**Price Observation Table**"),
    subtitle = md("Example for `item_id` = 210111")
  ) |>
  fmt_number(
    columns = c(
      `Ratio-to-average price test`,
      `T-value test`
    ),
    decimals = 1
  )

Year	Product code	Product description	Reference quantity	Unit of reference quantity	Date of quote	Region	Shop identifier	Type of shop	Quantity observed	Unit of observed quantity	Price observed	Reference quantity price	Ratio-to-average price test	T-value test	Ratio-to-average price test FLAG	T-value test FLAG
Price Observation Table
Example for `item_id` = 210111
2018	210111	WHITE SLICED LOAF BRANDED 750G	750	1	201801	South East	1	Multiple	750	1	1.00	1.00	1.0	−0.2	FALSE	FALSE
2018	210111	WHITE SLICED LOAF BRANDED 750G	750	1	201801	South East	1	Multiple	750	1	1.00	1.00	1.0	−0.2	FALSE	FALSE
2018	210111	WHITE SLICED LOAF BRANDED 750G	750	1	201801	North West	1	Multiple	750	1	1.45	1.45	1.4	1.8	FALSE	FALSE


uk_pot |>
  select(
    `Product code`,
    `Ratio-to-average price test`:`T-value test`
  ) |>
  pivot_longer(`Ratio-to-average price test`:`T-value test`) |>
  mutate(
    is.outlier = case_when(name == "Ratio-to-average price test" & (value < 0.5 | value > 1.5) ~ "Test not passed",
      name == "T-value test" & (value > 2) ~ "Test not passed",
      .default = "Test passed"
    ),
    is.outlier = factor(is.outlier, levels = c("Test passed", "Test not passed"))
  ) |>
  ggplot(aes(x = value, fill = is.outlier)) +
  facet_wrap(~name, scales = "free") +
  geom_histogram() +
  labs(
    title = "Individual price outlier statistcs",
    subtitle = "White sliced loaf branded 750G, 2018",
    x = "",
    fill = ""
  ) +
  theme_minimal()+
  theme(legend.position = "top")
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Stage 2: Average price statistics

This stage involves identifying extreme values among the average prices of the products listed in the average price table. An extreme value is defined as an individual price or average price that for a given test scores a value that falls outside a predetermined critical value and is build on two average price outlier statistics, which are summarised in the Average Price table; see World Bank (2013), table 9.2a and 9.2b for an extensive example. The two statisticis contained in this table are the max-min ratio test and the coefficient to variation test.
max-min ratio test: The ratio between the maximal and minimal observed price for product \(j\). Products where the maximal observed price is more than twice as big as the minimum are flagged

\[ max-min~ratio = max(p_j)/min(p_j) \]

coefficient to variation test: The standard deviation for the product expressed as a percentage of the average price for the product. Products with a coefficient of variation greater than 20% will be flagged.

\[ coefficient-to-variation: \sigma_{p_j} / \mu_{p_j} \]

Product price quotes that do not pass these tests are flagged in the average price table and is created using function valid_apt in the OECD_SPPPs package.

Example using UK CPI microdata

# Average Price Table
uk_apt <- uk_cpi |>
  valid_apt()

head(uk_apt) |>
  gt() |>
  tab_header(
    title = md("**Average Price Table**"),
    subtitle = md("Example for `item_id` = 210111, **pre-cleaning**")
  ) |>
  fmt_number(columns = -c(Year, `Product code`),decimals = 1)

Year	Product code	Product description	Number of observations	Average price of product	Maximum price of product	Minimum price of product	Standard deviation	Max-min ratio	Coefficient of variation	Max-min ratio FLAG	Coefficient of variation FLAG
Average Price Table
Example for `item_id` = 210111, pre-cleaning
2018	210111	WHITE SLICED LOAF BRANDED 750G	3,528.0	1.0	2.8	0.5	0.2	6.2	0.2	TRUE	TRUE
2018	410518	CARPENTER HOURLY RATE	2,976.0	26.2	102.0	8.0	14.8	12.8	0.6	TRUE	TRUE
2020	210111	WHITE SLICED LOAF BRANDED 750G	1,576.0	1.0	1.5	0.5	0.2	3.4	0.2	TRUE	TRUE
2020	410518	CARPENTER HOURLY RATE	920.0	26.2	102.0	8.0	12.3	12.8	0.5	TRUE	TRUE
2023	210111	WHITE SLICED LOAF BRANDED 750G	1,741.0	1.3	2.6	0.4	0.3	7.4	0.2	TRUE	TRUE
2023	410518	CARPENTER HOURLY RATE	1,066.0	28.2	102.0	10.0	12.2	10.2	0.4	TRUE	TRUE

Linking validation pipelines

The degree of validation depends on the quality of the underlying microdata. With unconsolidated and/or raw data, more extensive revisions of the raw data may be necessary.
- For example, these would require the validation and checking of the reference quantity price.
- For the selected product, a bimodal distribution is observed (after removing outliers).
Eliminating these makes selected item pass all intra-regional validation checks.

uk_irv <- uk_cpi |>
  # Apply Price Observation Table checks
  valid_pot() |>
  # Condition on price quotes which pass the Price Observation Table tests
  filter(!`Ratio-to-average price test FLAG` & !`T-value test FLAG`) |>
  # Remove bimodal distribution
  filter(`Ratio-to-average price test` > 0.8) |>
  # Apply Average Price Table checks
  valid_apt()

uk_irv |>
  gt() |>
  tab_header(
    title = md("**Average Price Table**"),
    subtitle = md("Example for `item_id` = 210111, **post-cleaning**")
  ) |>
  fmt_number(
    columns = -c(Year, `Product code`),
    decimals = 1
  )

Year	Product code	Product description	Number of observations	Average price of product	Maximum price of product	Minimum price of product	Standard deviation	Max-min ratio	Coefficient of variation	Max-min ratio FLAG	Coefficient of variation FLAG
Average Price Table
Example for `item_id` = 210111, post-cleaning
2018	210111	WHITE SLICED LOAF BRANDED 750G	2,980.0	1.1	1.4	0.8	0.1	1.7	0.1	FALSE	FALSE
2018	410518	CARPENTER HOURLY RATE	1,412.0	27.4	36.0	21.0	4.5	1.7	0.2	FALSE	FALSE
2020	210111	WHITE SLICED LOAF BRANDED 750G	1,276.0	1.1	1.4	0.8	0.1	1.7	0.1	FALSE	FALSE
2020	410518	CARPENTER HOURLY RATE	481.0	27.8	38.4	21.0	4.7	1.8	0.2	FALSE	FALSE
2023	210111	WHITE SLICED LOAF BRANDED 750G	1,290.0	1.4	1.8	1.1	0.1	1.6	0.1	FALSE	FALSE
2023	410518	CARPENTER HOURLY RATE	583.0	31.5	42.0	23.0	4.9	1.8	0.2	FALSE	FALSE

Inter-regional validation

…

References

ICP. 2021. “A Guide to the Compilation of Subnational Purchasing Power Parities (PPPs).” https://thedocs.worldbank.org/en/doc/5064f2288436664bc8f9811c8a5b8c55-0050022021/original/Guide-Subnational-PPPs.pdf.

World Bank. 2013. Measuring the Real Size of the World Economy: The Framework, Methodology, and Results of the International Comparison Program ICP. Washington DC: World Bank. https://thedocs.worldbank.org/en/doc/927971487091799574-0050022017/original/ICPBookeBookFINAL.pdf.