Introduction
This vignette provides a complete workflow for downloading Brazil’s quarterly PNADC (Pesquisa Nacional por Amostra de Domicilios Continua) microdata and preparing it for mensalization. The workflow covers three steps:
- Downloading quarterly PNADC microdata from IBGE using the PNADcIBGE package
- Stacking multiple quarters into a single dataset (critical for high determination rates)
- Applying mensalization using the PNADCperiods package
If you already have PNADC data and want to learn the package API first, see Get Started. For algorithm details, see How PNADCperiods Works.
Prerequisites
Required Packages
# Install packages if needed
install.packages(c("PNADcIBGE", "data.table", "fst"))
# Install PNADCperiods from GitHub
# remotes::install_github("antrologos/PNADCperiods")
# Load packages
library(PNADcIBGE)
library(data.table)
library(fst)
library(PNADCperiods)
System Requirements
- Disk space: ~5 GB for 2020-2024 data, ~15 GB for full history (2012-present)
- RAM: At least 8 GB recommended; 16 GB for comfortable processing
- Time: 2-3 hours for downloading (depends on internet speed), ~5 minutes for processing
- Internet: Required for downloading data and for SIDRA API access (weight calibration)
Understanding PNADC Data
PNADC is Brazil’s primary household survey for labor market statistics, conducted by IBGE. The survey uses a rotating panel design where each household is interviewed five times over 15 months. Each quarterly release contains approximately 500,000 observations.
Why stack multiple quarters? The mensalization algorithm identifies reference months by tracking households across their panel interviews. With a single quarter, the determination rate is only ~70%. By stacking multiple quarters, the algorithm leverages the rotating panel structure to achieve over 97% determination.
| Quarters Stacked | Month % | Fortnight % | Week % |
|---|---|---|---|
| 1 (single quarter) | ~70% | ~7% | ~2% |
| 8 (2 years) | ~94% | ~9% | ~3% |
| 20 (5 years) | ~95% | ~8% | ~3% |
| 55+ (full history) | ~97% | ~9% | ~3% |
For most applications, we recommend stacking at least 2 years (8 quarters) of data.
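To see why stacking helps, here is a toy sketch of the rotating panel. It assumes, for illustration only, that a household is interviewed in five consecutive quarters, and it simply counts how many of those interviews fall inside a stacked window. The function name `visits_in_window` and the quarter indexing are hypothetical, and this is not the package's identification algorithm.

```r
# Toy illustration: a household entering the panel in quarter `entry`
# (quarters indexed 1, 2, 3, ...) is interviewed in five consecutive quarters.
visits_in_window <- function(entry, window) {
  interview_quarters <- entry:(entry + 4)
  sum(interview_quarters %in% window)
}

# Households entering in quarters 1..8 (illustrative)
entries <- 1:8

# Single quarter (quarter 8) vs. eight stacked quarters (1..8)
single  <- sapply(entries, visits_in_window, window = 8)
stacked <- sapply(entries, visits_in_window, window = 1:8)

data.frame(entry = entries, single = single, stacked = stacked)
```

In a single quarter each household is seen at most once, while in the stacked window most households are observed several times. That repeated observation is the panel structure the text above says the algorithm leverages.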
Step 1: Set Up Your Environment
# Set your data directory (adjust path as needed)
data_dir <- "path/to/your/pnadc_data/"
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
Step 2: Define Which Quarters to Download
Create a grid of year-quarter combinations. This example uses 2020-2024, which provides a good balance between data size and determination rate:
# Define quarters to download (2020-2024 example)
editions <- expand.grid(
  year = 2020:2024,
  quarter = 1:4
)
# If downloading recent years, filter out quarters not yet available:
# editions <- editions[!(editions$year == 2025 & editions$quarter > 3), ]
Step 3: Download the Data
The download loop fetches each quarter from IBGE and saves it in FST format for fast loading:
for (i in seq_len(nrow(editions))) {
  year_i <- editions$year[i]
  quarter_i <- editions$quarter[i]
  filename <- paste0("pnadc_", year_i, "-", quarter_i, "q.fst")
  cat("Downloading:", year_i, "Q", quarter_i, "\n")
  # Download from IBGE
  pnadc_quarter <- get_pnadc(
    year = year_i,
    quarter = quarter_i,
    labels = FALSE, # IMPORTANT: Use numeric codes, not labels
    deflator = FALSE,
    design = FALSE,
    savedir = data_dir
  )
  # Save as FST format (fast serialization)
  write_fst(pnadc_quarter, file.path(data_dir, filename))
  # Clean up temporary files created by PNADcIBGE
  temp_files <- list.files(data_dir,
                           pattern = "\\.(zip|sas|txt)$",
                           full.names = TRUE)
  file.remove(temp_files)
  # Free memory before the next quarter
  rm(pnadc_quarter)
  gc()
}
Important: Always use labels = FALSE when downloading. The mensalization algorithm requires numeric codes for the birthday variables (V2008, V20081, V20082). Using labeled factors will cause errors.
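Because downloads can take hours, it may help to make the loop resumable. The sketch below is one way to do that, assuming the file naming scheme from the loop above (`pnadc_<year>-<quarter>q.fst`). The helper `pending_editions()` is hypothetical and is not part of PNADcIBGE or PNADCperiods.

```r
# Return only the year-quarter rows whose FST file is not yet in data_dir.
pending_editions <- function(editions, data_dir) {
  filenames <- paste0("pnadc_", editions$year, "-", editions$quarter, "q.fst")
  missing <- !file.exists(file.path(data_dir, filenames))
  pending <- editions[missing, , drop = FALSE]
  # Sort chronologically so an interrupted run resumes in a predictable order
  pending[order(pending$year, pending$quarter), , drop = FALSE]
}

# Demo in a temporary directory with one quarter already "downloaded"
demo_dir <- file.path(tempdir(), "pnadc_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "pnadc_2020-1q.fst"))

demo_editions <- expand.grid(year = 2020:2021, quarter = 1:4)
todo <- pending_editions(demo_editions, demo_dir)
nrow(todo)
```

Looping over `pending_editions(editions, data_dir)` instead of `editions` would skip quarters that were already saved.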
Step 4: Stack the Quarterly Files
Stack all quarterly files into a single dataset. To save memory, only load the columns needed for mensalization:
# Columns needed for mensalization
cols_needed <- c(
  # Time and identifiers
  "Ano", "Trimestre", "UPA", "V1008", "V1014",
  # Birthday variables (for reference period algorithm)
  "V2008", "V20081", "V20082", "V2009",
  # Weight and stratification (for weight calibration)
  "V1028", "UF", "posest", "posest_sxi"
)
# Stack all quarters. The pattern matches only the quarterly files written in
# Step 3, so a previously saved pnadc_mensalized.fst is not picked up.
files <- list.files(data_dir, pattern = "^pnadc_[0-9]{4}-[1-4]q\\.fst$", full.names = TRUE)
pnadc_stacked <- rbindlist(lapply(files, function(f) {
  cat("Loading:", basename(f), "\n")
  read_fst(f, columns = cols_needed, as.data.table = TRUE)
}))
cat("Total observations:", format(nrow(pnadc_stacked), big.mark = ","), "\n")
Step 5: Apply Mensalization
Build the crosswalk (identify reference periods) and calibrate weights:
# Step 1: Build crosswalk (identify reference periods)
crosswalk <- pnadc_identify_periods(pnadc_stacked, verbose = TRUE)
# Check determination rates
crosswalk[, .(
  month_rate = mean(determined_month),
  fortnight_rate = mean(determined_fortnight),
  week_rate = mean(determined_week)
)]
# Step 2: Apply crosswalk and calibrate weights
result <- pnadc_apply_periods(
  data = pnadc_stacked,
  crosswalk = crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibrate = TRUE,
  calibration_unit = "month",
  verbose = TRUE
)
The verbose output shows progress and determination rates for each phase (month, fortnight, week). With 20 quarters stacked (2020-2024), expect ~95% month determination.
Step 6: Explore the Results
The result contains all original columns plus reference period indicators and calibrated weights:
# Key new columns
names(result)[grep("ref_|determined_|weight_", names(result))]
# Distribution of reference months within quarters
result[, .N, by = ref_month_in_quarter][order(ref_month_in_quarter)]
Key output columns:
| Column | Description |
|---|---|
| ref_month_in_quarter | Position within quarter (1, 2, or 3; NA if indeterminate) |
| ref_month_yyyymm | Reference month as YYYYMM integer (e.g., 202301) |
| determined_month | Logical flag (TRUE if month was determined) |
| weight_monthly | Calibrated monthly weight (if calibrate = TRUE) |
The distribution is approximately equal across months 1, 2, and 3 (each around 31-32%). The remaining observations are indeterminate and have NA in ref_month_in_quarter.
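As a complement, the share of observations per reference month and the determination rate per quarter can be tabulated directly. The sketch below uses a small made-up data frame in place of `result`, with the column names listed above plus `Ano` and `Trimestre` from the input. All values are invented for illustration.

```r
# Toy stand-in for `result` (values are invented)
toy <- data.frame(
  Ano = rep(2023, 8),
  Trimestre = c(1, 1, 1, 1, 2, 2, 2, 2),
  ref_month_in_quarter = c(1, 2, 3, NA, 1, 1, 2, 3)
)
# In this toy example, an observation is "determined" when its month is not NA
toy$determined_month <- !is.na(toy$ref_month_in_quarter)

# Share of observations by position in the quarter (NA = indeterminate)
month_share <- prop.table(table(toy$ref_month_in_quarter, useNA = "ifany"))

# Determination rate by quarter
rate_by_quarter <- aggregate(determined_month ~ Ano + Trimestre,
                             data = toy, FUN = mean)
rate_by_quarter
```

With real data, a quarter whose rate is far below the others may indicate that neighboring quarters are missing from the stack.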
Step 7: Save and Use the Results
Save the mensalized data for future use:
write_fst(result, file.path(data_dir, "pnadc_mensalized.fst"))
To compute monthly estimates, filter to determined observations and aggregate by ref_month_yyyymm:
# Monthly unemployment rate
# (uses VD4001 and VD4002, which must be added to cols_needed in Step 4)
monthly_unemployment <- result[determined_month == TRUE, .(
  unemployment_rate = sum((VD4002 == 2) * weight_monthly, na.rm = TRUE) /
    sum((VD4001 == 1) * weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]
# Monthly population
monthly_pop <- result[determined_month == TRUE, .(
  population = sum(weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]
For more analysis examples, see Applied Examples.
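For plotting or merging with other monthly series, the YYYYMM integer can be converted to a Date. This is a small base R sketch; the helper name `yyyymm_to_date` is hypothetical, and it assumes that the first day of the month is an acceptable convention.

```r
# Convert YYYYMM integers (e.g., 202301) to the first day of that month
yyyymm_to_date <- function(yyyymm) {
  as.Date(paste0(yyyymm, "01"), format = "%Y%m%d")
}

months <- c(202301L, 202302L, 202312L)
month_dates <- yyyymm_to_date(months)
month_dates
```

The same function could be applied to the `ref_month_yyyymm` column of the aggregated tables before plotting.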
Memory and Performance Tips
- Selective column loading: Only load the columns you need with read_fst(..., columns = ...). This dramatically reduces memory usage.
- Process in batches: For very large analyses, process one year at a time and combine results.
- Use FST format: FST is much faster than CSV or RDS for large datasets. A typical quarter loads in seconds rather than minutes.
- Clean up regularly: Use rm() and gc() to free memory after processing each quarter.
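One way to organize batch processing is to group the quarterly files by year and handle one group at a time. The sketch below only shows the grouping step, assuming the `pnadc_<year>-<quarter>q.fst` naming used in this vignette. The file names are made up and nothing is read from disk.

```r
# Illustrative file names following the naming scheme used in the download loop
files <- c("pnadc_2020-1q.fst", "pnadc_2020-2q.fst",
           "pnadc_2021-1q.fst", "pnadc_2021-4q.fst")

# Extract the four-digit year from each file name
file_year <- as.integer(
  sub("^pnadc_([0-9]{4})-[1-4]q\\.fst$", "\\1", basename(files))
)

# One character vector of files per year, ready to be processed in turn
batches <- split(files, file_year)
names(batches)
```

Each element of `batches` could then be loaded with the `rbindlist(lapply(...))` pattern from Step 4 and released with `rm()` and `gc()` before moving to the next year.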
Extending to Full History
For the best determination rate and longitudinal analysis, download all available quarters:
# Download all available data (2012-present)
editions_full <- expand.grid(
  year = 2012:2025,
  quarter = 1:4
)
editions_full <- editions_full[!(editions_full$year == 2025 &
                                   editions_full$quarter > 3), ]
# Use the same download and stacking workflow as above
The full history provides approximately 29 million observations and achieves the highest possible determination rate (~97% month).
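As a quick consistency check, the grid above can be counted and multiplied by the approximate quarterly sample size quoted earlier. The figures are rough: the 500,000 observations per quarter is the approximation stated in this vignette, not an exact count.

```r
# Rebuild the full-history grid (2012 Q1 through 2025 Q3)
editions_full <- expand.grid(year = 2012:2025, quarter = 1:4)
editions_full <- editions_full[!(editions_full$year == 2025 &
                                   editions_full$quarter > 3), ]

n_quarters <- nrow(editions_full)   # 14 years * 4 quarters - 1
approx_obs <- n_quarters * 500000   # rough order of magnitude only
format(approx_obs, big.mark = ",")
```

The grid has 55 quarters, matching the "55+" row of the determination table, and a flat 500,000 per quarter gives a total of the same order of magnitude as the roughly 29 million observations reported above.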
Troubleshooting
- “Column not found” errors: Ensure you used labels = FALSE when downloading. The algorithm requires numeric codes.
- Download failures: IBGE servers can be slow or unavailable. The PNADcIBGE package will retry automatically, but you may need to restart interrupted downloads.
- Memory errors: Try processing fewer quarters at a time, or use a machine with more RAM.
- SIDRA API errors: Weight calibration requires internet access to the SIDRA API. If it fails, try again later or use calibrate = FALSE for reference period identification without weight calibration.
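A quick pre-flight check can help narrow down the first kind of error. The sketch below is a hypothetical helper (`check_pnadc_input`, not part of PNADCperiods) that reports which expected columns are missing and whether the birthday variables are numeric. It runs here on an invented data frame.

```r
# Report missing columns and columns that should be numeric but are not.
check_pnadc_input <- function(data, required, numeric_cols) {
  missing_cols <- setdiff(required, names(data))
  present <- intersect(numeric_cols, names(data))
  is_num <- vapply(present, function(col) is.numeric(data[[col]]), logical(1))
  list(missing = missing_cols, non_numeric = present[!is_num])
}

# Invented example: V20081 was read as a labeled factor and V2009 is absent
toy <- data.frame(
  V2008  = c(15, 3),
  V20081 = factor(c("Janeiro", "Maio")),
  V20082 = c(1980, 1995)
)
check <- check_pnadc_input(
  toy,
  required     = c("V2008", "V20081", "V20082", "V2009"),
  numeric_cols = c("V2008", "V20081", "V20082")
)
check
```

On real data, `required` could be the `cols_needed` vector from Step 4. A non-empty `non_numeric` entry suggests the quarter was downloaded with labels and should be fetched again with labels = FALSE.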
Next Steps
- Follow the usage patterns in Get Started with your real data
- See analysis examples in Applied Examples
- Learn about the algorithm in How PNADCperiods Works
Working with annual PNADC data? Annual data (visit-specific microdata with comprehensive income variables) requires a different workflow. See Monthly Poverty Analysis with Annual PNADC Data for details on using pnadc_apply_periods() with anchor = "year".
References
- HECKSHER, Marcos. “Valor Impreciso por Mes Exato: Microdados e Indicadores Mensais Baseados na Pnad Continua”. IPEA - Nota Tecnica Disoc, n. 62. Brasilia, DF: IPEA, 2020. https://portalantigo.ipea.gov.br/portal/index.php?option=com_content&view=article&id=35453
- HECKSHER, Marcos. “Cinco meses de perdas de empregos e simulacao de um incentivo a contratacoes”. IPEA - Nota Tecnica Disoc, n. 87. Brasilia, DF: IPEA, 2020.
- HECKSHER, Marcos. “Mercado de trabalho: A queda da segunda quinzena de marco, aprofundada em abril”. IPEA - Carta de Conjuntura, v. 47, p. 1-6, 2020.
- BARBOSA, Rogerio J.; HECKSHER, Marcos. PNADCperiods: Identify Reference Periods in Brazil’s PNADC Survey Data. R package version 0.1.0, 2026. https://github.com/antrologos/PNADCperiods