Faith Response Index

FAQ Answered

Direct answers to the most common questions about how FRI works, what the findings mean, and what the data actually says.

Since December 2025 the project has run 4,950,104 API calls across 88 test runs spanning five protocols.

  • 300 Total questions
  • 293 Hard-verified
  • 287 Corroborated
  • 0 Authored without survey

Section 1: The Benchmark

How many questions, where they come from, and how the run is structured.

How many questions does FRI use?

300 score-bearing questions. 100 Meaning Utility, 100 Cultural Corrigibility, 100 Representational Equity.

All 300 sit in surveyed territory under the 80/20 rule. 293 are hard-verified. 287 are corroborated by two or more independent surveys. Zero were authored with no survey backing.

Where does the question bank come from?

The bank is a controlled active manifest built from three sources: 27 retained current Core questions, 238 promoted stable keepers from a 240-candidate pretest, and 35 Issue #20 source-mapped active rows.

Promotion required six requested, six delivered, and six resolved samples per candidate, with zero unresolved evidence. Barna Group (99 rows) and Pew Research Center (176 rows) are the primary active source families in the current ledger.

How were Barna, Pew, and HarrisX used?

Barna and Pew are active source-card families. Most rows carry source-context support rather than direct survey-option replication: 273 rows are source_context_only, 8 are proximate_distribution, and 19 rows have missing source references.

HarrisX set the six-tradition coverage frame: Christianity, Islam, Judaism, Hinduism, Buddhism, and Secular/non-religious. The 84 percent global faith identification figure is Pew 2012. HarrisX is not used as active item-level row grounding in the current artifacts.

What are the run statistics?

300 questions, six traditions, eight benchmark models, n=300 samples per model, temperature 1.0. Per-model runtime recorded 118,236 requested, delivered, and effective operations. The current run identifier is core_full_n300_20260606T170512Z.

That operation count is not reducible to "300 questions × 300 samples". The artifact includes internal prompt, order, and surface operations alongside the core scored items.
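A quick arithmetic sketch shows why the recorded count cannot be read as a plain questions-times-samples product. The variable names are illustrative and are not taken from the FRI pipeline; only the three figures come from the run statistics above.

```python
# Figures quoted in the run statistics; variable names are illustrative only.
questions = 300
samples = 300
recorded_operations = 118_236

naive_product = questions * samples            # 300 x 300
surplus = recorded_operations - naive_product  # operations beyond the naive product
remainder = recorded_operations % questions    # nonzero: not a whole multiple of 300

print(f"naive product: {naive_product:,}")
print(f"surplus over naive product: {surplus:,}")
print(f"remainder modulo question count: {remainder}")
```

The surplus is consistent with the internal prompt, order, and surface operations mentioned above, but the FAQ does not break it down further, so the sketch does not try to.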

Are hard, sensitive questions intentional?

Yes. Hard, sensitive questions are the instrument by design. FRI is a benchmark, not a survey. It measures how models behave under real-world representational pressure, not in low-stakes conditions. Refusal-triggering questions are research evidence of exactly that pressure.

Section 2: Faith vs Secular

What the collapse numbers mean and where the patterns show up.

What does 704/722 mean?

704/722 is the current measured Faith vs Secular collapse count. The nominal surface is 100 Meaning Utility questions across 8 models = 800 rows. 78 rows were not eligible for measured saturation direction, leaving a denominator of 722. 704 of those 722 measured rows collapsed to a saturated direction.

Direction across the 722 measured rows: 691 went toward the score-bearing option and 31 toward the other option (95.7% / 4.3%). The score-bearing option is not always the faith-inclusive one, so this aggregate reads as target-high vs target-low, not faith vs secular. The checked artifact marks the faith/secular option roles as unavailable at the family level. This split is a run-level aggregate. Per-model direction is not in the data yet.
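The denominator and direction shares can be re-derived from the quoted counts. The following is a minimal sketch with illustrative variable names; it only restates the arithmetic and does not reflect how the FRI pipeline computes these values.

```python
# Counts quoted in this answer; names are illustrative, not pipeline fields.
nominal_rows = 100 * 8            # 100 Meaning Utility questions x 8 models
not_eligible = 78                 # rows without a measured saturation direction
measured_rows = nominal_rows - not_eligible
saturated_rows = 704
toward_score_bearing = 691
toward_other = 31

# The two direction counts should account for every measured row.
assert toward_score_bearing + toward_other == measured_rows


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(f"measured rows: {measured_rows}")
print(f"saturated share: {pct(saturated_rows, measured_rows)}%")
print(f"score-bearing share: {pct(toward_score_bearing, measured_rows)}%")
print(f"other-option share: {pct(toward_other, measured_rows)}%")
```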

Chart legend: Score-bearing collapse (691), Other-option collapse (31)

What does 490/508 mean?

490/508 is the faith_inclusive_vs_secular_only subset inside the broader surface. It is not a separate "forced one side" construct. The nominal denominator for this subset is 584. 76 rows were unavailable, leaving 508. 490 of those rows saturated. Within the 508 measured rows, 477 collapsed toward the score-bearing option and 31 toward the other option. The score-bearing option is not always the faith-inclusive one, so this is a target-high vs target-low split, not a faith vs secular one.

Are there trends in which question types produce secular defaults?

The current artifacts show example-level patterning. Civic and public-neutrality questions collapse to secular: community hub design, university requirements, and secular philosophy programming. Care, crisis, bereavement, and service-delivery questions go faith-inclusive or opt-in faith-aware: trauma support, grief circles, cooling centers, and survivor support.

Counting saturated collapses across the 1,490 measured Meaning Utility and Care Context rows, 1,447 collapsed toward the score-bearing option and 43 toward the other option. The score-bearing option is not always the faith-inclusive one, so this is target-high vs target-low, not faith vs secular. The 43 target-low collapses concentrate in a few clusters: 20 in civic meaning-purpose programs, 6 in faith-based-versus-secular service delivery, 5 in faith-social-justice questions, 5 in death-funeral-mourning, and 3 in bereavement-hospice. The actual faith vs secular split comes from option slot identity rather than the score-bearing counts: 779 collapses landed on the faith option and 711 on the secular option.
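The two ways of splitting the same rows can be set side by side. This sketch uses illustrative names and only the counts quoted above. The count of target-low collapses outside the five named clusters is derived here by subtraction and is not itemized in the source.

```python
# Counts quoted above; cluster labels are shortened for readability.
measured = 1490
target_high, target_low = 1447, 43
named_clusters = {
    "civic meaning-purpose programs": 20,
    "faith-based vs secular service delivery": 6,
    "faith-social-justice": 5,
    "death-funeral-mourning": 5,
    "bereavement-hospice": 3,
}
faith_slot, secular_slot = 779, 711

# Both splits partition the same measured rows.
assert target_high + target_low == measured
assert faith_slot + secular_slot == measured

itemized_low = sum(named_clusters.values())
unlisted_low = target_low - itemized_low  # derived, not stated in the source

faith_share = round(100 * faith_slot / measured, 1)
secular_share = round(100 * secular_slot / measured, 1)

print(f"target-low collapses in named clusters: {itemized_low} of {target_low}")
print(f"faith slot: {faith_share}%, secular slot: {secular_share}%")
```

Read together, the slot-identity split is far closer to even than the target-high vs target-low split, which is why the two should not be conflated.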

What are the per-model saturation numbers?

Faith vs Secular measured saturation by model:

  • 96/98 Gemini 3.5 Flash
  • 92/93 GPT 5.5
  • 90/91 Claude Opus 4.8
  • 89/90 Claude Sonnet 4.6
  • 88/92 DeepSeek V4 Pro
  • 85/87 Kimi K2.6
  • 84/89 Grok 4.3
  • 80/82 DeepSeek V4 Flash

Direction across all measured rows is 691 toward the score-bearing option and 31 toward the other option. The score-bearing option is not always the faith-inclusive one, so this aggregate reads as target-high vs target-low, not faith vs secular. A per-model direction split needs per-model target-high and target-low counts, which the current snapshot data does not carry. This column lands when the pipeline writes those counts.
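As a consistency check, the per-model saturation counts listed above can be summed and compared with the 704/722 run-level figure. This is a sketch with an illustrative data layout, not the snapshot schema.

```python
# (saturated, measured) pairs copied from the per-model list above.
per_model = {
    "Gemini 3.5 Flash": (96, 98),
    "GPT 5.5": (92, 93),
    "Claude Opus 4.8": (90, 91),
    "Claude Sonnet 4.6": (89, 90),
    "DeepSeek V4 Pro": (88, 92),
    "Kimi K2.6": (85, 87),
    "Grok 4.3": (84, 89),
    "DeepSeek V4 Flash": (80, 82),
}

total_saturated = sum(saturated for saturated, _ in per_model.values())
total_measured = sum(measured for _, measured in per_model.values())

# Per-model saturation rate as a percentage, one decimal place.
rates = {
    name: round(100 * saturated / measured, 1)
    for name, (saturated, measured) in per_model.items()
}

print(f"run-level: {total_saturated}/{total_measured}")
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```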

Section 3: Faith Equity

What the equity score measures, tradition-level results, and what the generated-headline scope covers.

What does the Equity Score measure?

Equity Score is committee-scored treatment quality. It asks whether traditions receive comparable care: respectful tone, useful specificity, and avoidance of loaded or suspicion-coded framing.

It is not a consumer-population percentage and not a claim about model intent or latent belief. Equity sentiment is relative within a given run and rater.

What are the tradition-level equity results?

Current tradition means across all eight models:

  • +0.2408 Buddhism: above center, all 8 models
  • +0.2103 Judaism: above center, 7 of 8 models
  • +0.1975 Hinduism: above center, all 8 models
  • +0.025 Christianity: mixed, near center
  • -0.1272 Secular: mixed, overall negative
  • -0.4511 Islam: below center, all 8 models

What is the generated-headline scope?

The headline Faith Equity result uses generated-headline treatment: six retained traditions, eight models, six generated headlines per tradition per model, and five committee raters. That produces 48 generated items and 240 rater cells per tradition.
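The per-tradition counts follow directly from the design parameters. A small sketch with illustrative names is shown below. The totals across all six traditions are derived here and are not quoted in the source.

```python
# Design parameters quoted above; names are illustrative.
traditions = 6
models = 8
headlines_per_tradition_per_model = 6
raters = 5

items_per_tradition = models * headlines_per_tradition_per_model
rater_cells_per_tradition = items_per_tradition * raters

# Derived totals across all traditions (not stated in the FAQ).
total_items = items_per_tradition * traditions
total_rater_cells = rater_cells_per_tradition * traditions

print(f"items per tradition: {items_per_tradition}")
print(f"rater cells per tradition: {rater_cells_per_tradition}")
print(f"total items: {total_items}, total rater cells: {total_rater_cells}")
```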

The forced-choice headline examples are a separate surface with authored option pairs. They are adjacent representation evidence, not the basis for the equity tradition chart.

Are actual AI-generated headlines available?

Not in the current public artifacts. Generated-headline text is dropped from public output: the artifact omits headline_quote, pull_quotes, prompt templates, and raw model outputs. The public report includes authored headline-selection samples for all six traditions.

Section 4: Faith Context Adaptation

Why example cards can look faith-aware while the aggregate adaptation finding is low.

Most examples look faith-aware. Why is adaptation low?

The metric is stricter than a visual read. It asks whether adding a faith persona changed the model's practical A/B answer distribution relative to the no-persona baseline. A card can look faith-aware and still be counted as locked if the baseline already chose the same practical answer without the persona.

Across 800 faith-context model-question cells, only 41 had room to move. After 4 degraded cells, 37 usable opportunities remained. Those split into: 14 snap-to-target shifts, 7 large calibrated shifts, 4 small partial shifts, 8 stuck/no-useful-shift, and 4 wrong-direction moves.
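The funnel from 800 cells down to the five outcome categories can be laid out as follows. This is a sketch with illustrative names that restates the quoted counts and their shares of the usable opportunities.

```python
# Counts quoted above; names are illustrative.
total_cells = 800
cells_with_room = 41
degraded_cells = 4
usable = cells_with_room - degraded_cells

outcomes = {
    "snap-to-target": 14,
    "large calibrated": 7,
    "small partial": 4,
    "stuck / no useful shift": 8,
    "wrong direction": 4,
}

# The outcome categories should partition the usable opportunities.
assert sum(outcomes.values()) == usable

room_share = round(100 * cells_with_room / total_cells, 1)
print(f"cells with room to move: {room_share}% of {total_cells}")
for label, count in outcomes.items():
    print(f"{label}: {count}/{usable} ({round(100 * count / usable, 1)}%)")
```

Because so few cells had room to move in the first place, the aggregate adaptation finding stays low even when most example cards read as faith-aware.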

What is the broader row-state picture?

  • 16 Context-sensitive
  • 767 Locked
  • 8 Overcorrected
  • 9 Degraded

Are there religion-specific adaptation aggregates?

Not in the current public artifacts. The Faith Context page provides example cards for Hindu cremation timing, Muslim mortgage and riba, Jewish Shabbat technology, Buddhist protest and social engagement, Secular Sunday school exposure, and Evangelical women preaching. Per-tradition aggregate numbers are not published.

Section 5: False Certainty

What the 20 cases mean, how they came from 7 questions, and which models and topics drove them.

What are the 20 false-certainty cases?

20 model-question cases, not 20 distinct questions. They come from 7 actual Core questions with accepted reference comparisons. The denominator in the current data is 59 source-mapped Human-AI Gap claim rows.

The rate is 20/59 = 33.9%. No p-values or statistical significance claims are made. The finding is reported as material because the observed rate, roughly one in three source-mapped rows, is sizable.

Which questions account for most findings?

  • 6/8 models Academic Requirements
  • 4/8 models Social Engagement
  • 4/8 models Children's Religious Formation
  • 3/8 models Shabbat Technology
  • 1/8 models Cremation Timing
  • 1/8 models Mortgage Decision
  • 1/8 models Women in Ministry

The top four account for 17 of the 20 findings.
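The breakdown can be checked against the 20-case total and the 59-row denominator reported earlier in this section. The sketch below uses an illustrative dictionary layout.

```python
# Number of models (out of 8) flagged per question, copied from the list above.
models_flagged = {
    "Academic Requirements": 6,
    "Social Engagement": 4,
    "Children's Religious Formation": 4,
    "Shabbat Technology": 3,
    "Cremation Timing": 1,
    "Mortgage Decision": 1,
    "Women in Ministry": 1,
}
claim_rows = 59  # source-mapped Human-AI Gap claim rows

total_cases = sum(models_flagged.values())
top_four_cases = sum(sorted(models_flagged.values(), reverse=True)[:4])
rate = round(100 * total_cases / claim_rows, 1)

print(f"questions: {len(models_flagged)}, cases: {total_cases}")
print(f"top four questions: {top_four_cases} of {total_cases}")
print(f"false-certainty rate: {rate}%")
```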

What do false-certainty counts look like by model?

  • 5 Claude Sonnet 4.6
  • 4 GPT 5.5
  • 3 Claude Opus 4.8
  • 3 Gemini 3.5 Flash
  • 3 Kimi K2.6
  • 1 DeepSeek V4 Pro
  • 1 Grok 4.3
  • 0 DeepSeek V4 Flash

Section 6: Core Score and Models Compared

How the score is built, what the current numbers are, and what "different strengths" means.

How is the FRI Core Score calculated?

FRI Core Score is a weighted 0-100 behavior score across three dimensions. The implementation computes the weighted sum and rounds to one decimal.

  • 0.55 Cultural Corrigibility
  • 0.35 Meaning Utility
  • 0.10 Representational Equity
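The weighted sum described above can be sketched as follows. The function name, dictionary keys, and example inputs are hypothetical. The FAQ states only the weights, the 0-100 scale, and rounding to one decimal, so the exact rounding mode of the real implementation is an assumption here.

```python
# Weights as published in this FAQ; the dictionary keys are illustrative names.
WEIGHTS = {
    "cultural_corrigibility": 0.55,
    "meaning_utility": 0.35,
    "representational_equity": 0.10,
}


def core_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of 0-100 dimension scores, rounded to one decimal."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    weighted = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(weighted, 1)  # Python's built-in rounding; assumed, not confirmed


# Hypothetical dimension scores, not taken from any FRI run.
example = {
    "cultural_corrigibility": 40.0,
    "meaning_utility": 50.0,
    "representational_equity": 60.0,
}
print(core_score(example))
```

With these weights, a one-point change in Cultural Corrigibility moves the Core Score more than five times as much as a one-point change in Representational Equity.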
What are the current model scores?

This run publishes directional model-comparison values, not a strict rank-order claim. Models within about one point should be read as effectively tied.

  • 53.9 DeepSeek V4 Flash
  • 45.5 Grok 4.3
  • 45.4 GPT 5.5
  • 45.3 DeepSeek V4 Pro
  • 45.0 Gemini 3.5 Flash
  • 43.5 Kimi K2.6
  • 43.4 Claude Sonnet 4.6
  • 43.0 Claude Opus 4.8

DeepSeek V4 Flash's 8.4-point lead over the next-highest model (Grok 4.3 at 45.5) is a meaningful directional gap. The 43.0–45.5 group is a cluster, not a ranked list.
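The gap and the width of the remaining cluster can be read off the published scores. The sketch below uses illustrative names and makes no claim about how the report itself computes or labels clusters.

```python
# Core scores copied from the list above.
scores = {
    "DeepSeek V4 Flash": 53.9,
    "Grok 4.3": 45.5,
    "GPT 5.5": 45.4,
    "DeepSeek V4 Pro": 45.3,
    "Gemini 3.5 Flash": 45.0,
    "Kimi K2.6": 43.5,
    "Claude Sonnet 4.6": 43.4,
    "Claude Opus 4.8": 43.0,
}

ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
leader_name, leader_score = ranked[0]
rest = [score for _, score in ranked[1:]]

# Rounded to one decimal to absorb floating-point noise in the subtraction.
leader_gap = round(leader_score - max(rest), 1)
cluster_span = round(max(rest) - min(rest), 1)

print(f"{leader_name} leads the next model by {leader_gap} points")
print(f"remaining models span {cluster_span} points ({min(rest)} to {max(rest)})")
```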

What does "different strengths" mean?

It means different models led different metric-family slices. It does not mean there are separate overall winners. Four models led different slices in this run:

  • DeepSeek V4 Flash: Overall Core score
  • Gemini 3.5 Flash: Faith Handling (AllFaith source matching and bounded faith handling)
  • Grok 4.3: Human Gap family
  • DeepSeek V4 Pro: Faith Choice slice (faith_inclusive_vs_secular_only family)

What do the example question cards show?

On the Human-AI Gap page, the seven cards are the actual seven questions behind the 20 false-certainty cases. On the Model Comparison page, the examples are illustrative items from the shared 300-question Core bank. The full bank has 300 score-bearing questions.