Citation Fidelity in AI Research Synthesis

Cross-Domain Benchmark: 180 Queries × 9 Systems × 6 Domains

180 queries | 9 systems (3 Knitify + 4 Gemini + 2 Gemini with Google Search) | 6 domains | March 2026
Prepared by Innovo Health Labs

Abstract

We evaluated Knitify against six configurations of Google's Gemini, two of them with Google Search grounding, across 180 research queries in six domains (medical, scientific, veterinary, supplement, beauty, and wellness, the last labeled Health or Health Basic in the tables). Every citation was verified programmatically against PubMed and CrossRef.

Metric	Knitify Premium	Gemini 3.0 Pro	Gemini 3.0 Pro + Search
Citation fidelity (CF)	98%	71%	92%
Verified refs / response	17.4	5.3	4.4
Citations from last 2 years	61%	0%	6%
Response time	38s	52s	65s

Cross-domain averages. Knitify = Premium tier. All systems received identical queries.

1. Citation Fidelity Across 6 Domains

Citation fidelity measures whether each cited reference resolves to a real, on-topic paper. Each reference was verified programmatically via PubMed and CrossRef, and an independent AI verifier confirmed the topic match.
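The definition above can be sketched as a simple ratio. This is a minimal illustration, not the benchmark's actual pipeline: the `CitationCheck` record and the choice to pool citations before dividing are assumptions, since the report does not say whether CF is averaged per citation or per response.

```python
from dataclasses import dataclass


@dataclass
class CitationCheck:
    """Hypothetical per-citation verification record."""
    resolves: bool  # reference resolved to a real record (e.g. via PubMed or CrossRef)
    on_topic: bool  # resolved paper matches what the response claimed


def citation_fidelity(checks):
    """Share of citations that both resolve and are on topic (0.0 if none)."""
    if not checks:
        return 0.0
    verified = sum(1 for c in checks if c.resolves and c.on_topic)
    return verified / len(checks)


# Illustrative data only: two good citations, one off-topic, one unresolvable.
checks = [
    CitationCheck(True, True),
    CitationCheck(True, True),
    CitationCheck(True, False),
    CitationCheck(False, False),
]
print(f"CF = {citation_fidelity(checks):.0%}")
```

Under this reading, a reference that resolves to a real paper on a different topic counts against CF in the same way as a fabricated one.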

Figure 1: Citation Fidelity by Domain, All 9 Systems

Knitify Premium: Medical 100%, Health Basic 99%, Scientific 98%, Veterinary 98%, Supplement 97%, Beauty 94%
Gemini 3.0 Pro (no grounding): Supplement 83%, Health Basic 80%, Medical 72%, Beauty 71%, Scientific 66%, Veterinary 52%
Gemini 3.0 Pro + Google Search: Supplement 95%, Beauty 93%, Medical 92%, Health Basic 92%, Scientific 89%, Veterinary 89%

Figure 1. Panels show Knitify Premium (top), Gemini 3.0 Pro without grounding (middle), and Gemini 3.0 Pro with Google Search (bottom); bars are listed in the figure's order, highest first.

Full Citation Fidelity Table

System	Medical	Scientific	Veterinary	Supplement	Beauty	Health	Average
Knitify Premium	100%	98%	98%	97%	94%	99%	98%
Knitify Standard	100%	97%	98%	94%	96%	98%	97%
Knitify Fast	99%	96%	100%	90%	98%	97%	97%
Gemini 3.0 Pro + Search	92%	89%	89%	95%	93%	92%	92%
Gemini 3.0 Pro	72%	66%	52%	83%	71%	80%	71%
Gemini 3.0 Flash + Search	63%	50%	47%	66%	54%	71%	59%
Gemini 3.0 Flash	53%	40%	26%	51%	47%	72%	48%
Gemini 2.5 Pro	51%	42%	24%	54%	48%	67%	48%
Gemini 2.5 Flash	28%	20%	6%	25%	13%	26%	20%

Finding 1: Knitify's citation fidelity is domain-agnostic

Knitify Premium maintains 94-100% CF across all 6 domains — a 6-point range. Gemini 3.0 Pro varies from 52% to 83% — a 31-point range. With Google Search, Gemini 3.0 Pro narrows to 89-95% (6-point range) but at the cost of 61-70s response time and only 4-6 references.
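The ranges and averages quoted in Finding 1 can be reproduced from the table rows. The sketch below copies three rows from the Full Citation Fidelity Table; the dictionary layout and function name are illustrative.

```python
from statistics import mean

# Per-domain CF (%) in table order:
# Medical, Scientific, Veterinary, Supplement, Beauty, Health.
cf_by_domain = {
    "Knitify Premium": [100, 98, 98, 97, 94, 99],
    "Gemini 3.0 Pro": [72, 66, 52, 83, 71, 80],
    "Gemini 3.0 Pro + Search": [92, 89, 89, 95, 93, 92],
}


def cf_range(values):
    """Spread between the best and worst domain, in percentage points."""
    return max(values) - min(values)


ranges = {name: cf_range(v) for name, v in cf_by_domain.items()}
averages = {name: round(mean(v)) for name, v in cf_by_domain.items()}

for name in cf_by_domain:
    print(f"{name}: range {ranges[name]} pts, average {averages[name]}%")
```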

2. Reference Density

Average number of unique peer-reviewed sources cited per response. Stacked bars show verified references (color) and fabricated references (grey).

Figure 2: Verified References Per Response (cross-domain average)

Knitify Premium: 17.4 verified
Knitify Standard: 10.7 verified
Knitify Fast: 7.6 verified
Gemini 3.0 Pro: 5.3 verified, 2.2 fabricated (✗)
Gemini 3.0 Flash: 3.7 verified, 3.9 fabricated (✗)
Pro + Search: 4.4 verified, 0.4 fabricated (✗)
Flash + Search: 4.6 verified, 3.3 fabricated (✗)

Figure 2. Color = verified references. Grey (✗) = fabricated. Knitify references are all verified against PubMed.

System	Medical	Scientific	Veterinary	Supplement	Beauty	Health	Average
Knitify Premium	19.8	18.3	16.5	15.4	17.5	19.1	17.8
Knitify Standard	11.7	11.9	10.6	10.5	10.6	11.2	11.1
Knitify Fast	11.1	8.9	8.8	6.0	6.0	6.5	7.9
Gemini 3.0 Pro	7.7	7.7	5.8	7.8	7.4	8.7	7.5
Gemini 3.0 Flash	7.0	7.3	7.3	8.6	7.9	8.1	7.7
Pro + Search	4.9	4.0	4.7	4.8	4.7	5.6	4.8
Flash + Search	7.8	9.1	6.9	7.9	7.9	8.0	7.9
Gemini 2.5 Flash	5.4	10.5	10.8	12.8	13.5	13.3	11.1

References per response. Gemini 2.5 Flash generates high volumes (10.5-13.5 in five of six domains) but about 80% are fabricated (20% average CF). Pro+Search produces only 4.8 refs/response.

Finding 2: Knitify cites more verified references per response

Knitify Premium averages 17.8 references per response across all 6 domains, of which about 17.4 are verified at 98% fidelity. Gemini 3.0 Pro averages 7.5 references with 71% fidelity: approximately 5.3 verified per response (2.2 fabricated). Pro+Search improves fidelity to 92% but reduces output to only 4.8 references, of which 4.4 are verified. Knitify Premium delivers about 4× as many verified references as Pro+Search (17.4 vs 4.4).
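The verified and fabricated counts in this section are consistent with multiplying total references per response by the average CF. The sketch below assumes that derivation; the report does not state how the counts were computed, and the inputs are the rounded table averages, so results are approximate.

```python
def split_references(total_refs, cf):
    """Split an average reference count into (verified, fabricated), one decimal each."""
    verified = total_refs * cf
    return round(verified, 1), round(total_refs - verified, 1)


# (average references per response, average CF) taken from the tables.
systems = {
    "Knitify Premium": (17.8, 0.98),
    "Gemini 3.0 Pro": (7.5, 0.71),
    "Pro + Search": (4.8, 0.92),
}

for name, (refs, cf) in systems.items():
    verified, fabricated = split_references(refs, cf)
    print(f"{name}: {verified} verified, {fabricated} not verified")

# Ratio of verified references, Knitify Premium vs Pro + Search.
ratio = split_references(17.8, 0.98)[0] / split_references(4.8, 0.92)[0]
print(f"ratio: {ratio:.2f}x")
```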

3. Reference Recency

Percentage of citations from 2025-2026. Gemini without grounding generates from model weights only. With Google Search, Gemini gains some access to recent papers.

System	Medical	Scientific	Veterinary	Supplement	Beauty	Health	Average
Knitify Premium	70%	56%	21%	62%	65%	89%	61%
Knitify Standard	39%	43%	33%	65%	70%	79%	55%
Knitify Fast	43%	56%	31%	58%	56%	72%	53%
3.0 Flash + Search	19%	3%	2%	5%	2%	6%	6%
3.0 Pro + Search	11%	2%	3%	5%	6%	8%	6%
Gemini 3.0 Pro	0%	0%	0%	0%	0%	0%	0%
Gemini 3.0 Flash	0%	0%	0%	0%	0%	0%	0%

% of citations from 2025-2026. Gemini without grounding: 0%. With Google Search: 2-19% depending on domain. Knitify: 21-89%.

Finding 3: Knitify citations are substantially more recent

53-61% of Knitify citations are from 2025-2026. Gemini without grounding cites 0% from these years. With Google Search enabled, Gemini gains some recency (average 6%) but remains far behind Knitify. Comparing Knitify Premium with Pro+Search, the gap is widest in health basic (89% vs 8%) and also large in medical (70% vs 11%). Even on veterinary, where Knitify Premium's recency is lowest (21%), it leads Pro+Search (3%) by 7×.
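The recency metric is the share of a response's citations published in the 2025-2026 window. A minimal sketch follows; the function name and the sample years are illustrative and not taken from the benchmark data.

```python
def recent_share(years, window=(2025, 2026)):
    """Fraction of publication years that fall inside the inclusive window."""
    if not years:
        return 0.0
    first, last = window
    return sum(first <= y <= last for y in years) / len(years)


# Illustrative publication years for one response's citations.
sample_years = [2026, 2025, 2019, 2021]
print(f"recent: {recent_share(sample_years):.0%}")

# The 7x veterinary lead quoted above: 21% (Knitify Premium) vs 3% (Pro + Search).
veterinary_lead = 21 / 3
```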

4. Speed

System	Medical	Scientific	Veterinary	Supplement	Beauty	Health	Average
Knitify Fast	21.2s	13.0s	17.8s	10.3s	10.0s	10.2s	13.8s
Gemini 3.0 Flash*	14.1s	13.8s	13.5s	12.3s	12.5s	11.9s	13.0s
Knitify Standard	25.4s	12.8s	18.3s	17.7s	20.5s	21.1s	19.3s
3.0 Flash + Search*	25.2s	24.5s	23.4s	26.2s	21.5s	24.4s	24.2s
Knitify Premium	53.6s	28.9s	28.7s	25.9s	46.4s	44.4s	38.0s
Gemini 3.0 Pro*	55.7s	55.7s	55.7s	46.5s	49.4s	45.8s	51.5s
3.0 Pro + Search*	68.1s	69.6s	68.1s	61.9s	63.0s	60.9s	65.3s

*Gemini figures are total response time (non-streaming REST API). Knitify figures are true time-to-first-token (TTFT) via streaming, so the two are not measured at the same point in the response. Search grounding adds 10-15s to Gemini response time.
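The difference between time-to-first-token and total response time can be sketched with a small timer around any iterable of response chunks. `fake_stream` is a stand-in generator, not a real client; with a non-streaming call the whole response arrives as a single chunk and the two measurements coincide.

```python
import time


def time_stream(chunks):
    """Consume an iterable of chunks and return (ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    for _ in chunks:
        if ttft is None:
            # First chunk received: this is the time-to-first-token.
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    # An empty stream has no first token; fall back to the total.
    return (ttft if ttft is not None else total), total


def fake_stream(n=3, delay=0.01):
    """Hypothetical streaming response that yields n chunks with a delay before each."""
    for i in range(n):
        time.sleep(delay)
        yield f"chunk-{i}"


ttft, total = time_stream(fake_stream())
print(f"TTFT {ttft:.3f}s, total {total:.3f}s")
```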

Finding 4: Knitify Fast is competitive on speed; grounding adds latency

Knitify Fast averages 13.8s — within 1 second of Gemini 3.0 Flash (13.0s). Google Search grounding adds 10-15s to Gemini: Flash+Search averages 24.2s, Pro+Search averages 65.3s. Knitify Premium (38.0s) is faster than both Gemini 3.0 Pro (51.5s) and Pro+Search (65.3s) while delivering more references at higher fidelity.

5. Where Gemini Performs Best — and Worst

Gemini's Best Domains

Gemini 3.0 Pro achieves its highest CF on supplement (83%) and health basic (80%). These are mainstream wellness topics (magnesium, vitamin D, omega-3) with dense coverage in training data. With Google Search, supplement reaches 95%, the highest CF any Gemini configuration achieves in any domain.

Gemini's Worst Domains

Gemini 2.5 Flash drops to 6% CF on veterinary and 13% on beauty. Even Gemini 3.0 Pro with Google Search reaches only 89% on veterinary and scientific — domains where specialized literature requires targeted retrieval beyond general web search.

The Generational Improvement

Domain	2.5 Flash	3.0 Flash	3.0 Pro	3.0 Pro + Search	Knitify Premium
Medical	28%	53%	72%	92%	100%
Scientific	20%	40%	66%	89%	98%
Veterinary	6%	26%	52%	89%	98%
Supplement	25%	51%	83%	95%	97%
Beauty	13%	47%	71%	93%	94%
Health Basic	26%	72%	80%	92%	99%

Progression from 2.5 Flash → 3.0 Flash → 3.0 Pro → 3.0 Pro + Search → Knitify Premium. Each step improves CF, but specialized domains (veterinary, scientific) remain hardest.
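The claim that each step improves CF can be checked directly against the table. The sketch below copies the six rows and also computes the remaining gap between Pro + Search and Knitify Premium per domain; the variable names are illustrative.

```python
# CF (%) per domain in column order:
# 2.5 Flash, 3.0 Flash, 3.0 Pro, 3.0 Pro + Search, Knitify Premium.
progression = {
    "Medical": [28, 53, 72, 92, 100],
    "Scientific": [20, 40, 66, 89, 98],
    "Veterinary": [6, 26, 52, 89, 98],
    "Supplement": [25, 51, 83, 95, 97],
    "Beauty": [13, 47, 71, 93, 94],
    "Health Basic": [26, 72, 80, 92, 99],
}


def strictly_increasing(values):
    """True if every step is higher than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


monotonic = {d: strictly_increasing(v) for d, v in progression.items()}
# Percentage-point gap between Knitify Premium and Pro + Search.
gaps = {d: v[-1] - v[-2] for d, v in progression.items()}

for domain in progression:
    print(f"{domain}: improves at every step = {monotonic[domain]}, gap = {gaps[domain]} pts")
```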

Finding 5: Model scaling and search grounding help but cannot close the gap in specialized domains

Gemini improved significantly from 2.5 to 3.0 and further with Google Search. On supplement queries, Pro+Search reaches 95%, within 2 points of Knitify Premium. But on veterinary and scientific topics, where specialized literature is sparse in general web results, Pro+Search still trails Knitify Premium by 9 percentage points (89% vs 98%) while delivering only 4-5 references at 68-70s response time. Specialized domains require specialized retrieval.

6. Citation Fidelity by Query Difficulty

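Across all 6 domains (180 queries), CF broken down by the 4 difficulty tiers. The Overall column repeats the cross-domain averages from the Full Citation Fidelity Table rather than a mean of the four tier values.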

System	Common (48q)	Complex (48q)	Niche (42q)	Emerging (42q)	Overall
Knitify Premium	98%	98%	98%	97%	98%
Knitify Standard	97%	98%	97%	96%	97%
Knitify Fast	96%	96%	97%	96%	97%
Gemini 3.0 Pro	68%	68%	68%	64%	71%
Gemini 3.0 Flash	49%	40%	41%	42%	48%
Gemini 2.5 Pro	41%	33%	39%	32%	48%
Gemini 2.5 Flash	19%	15%	16%	18%	20%
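A query-weighted mean over the four tiers is one way to aggregate a row of this table. It is a different aggregate from a per-domain average, so the two need not coincide. The sketch below uses the tier sizes from the header row and the Knitify Premium row; the function name is illustrative.

```python
# Number of queries per difficulty tier, from the table header.
tier_sizes = {"common": 48, "complex": 48, "niche": 42, "emerging": 42}


def weighted_cf(tier_cf, sizes=tier_sizes):
    """Mean CF across tiers, weighting each tier by its number of queries."""
    total_queries = sum(sizes.values())
    return sum(tier_cf[tier] * sizes[tier] for tier in sizes) / total_queries


knitify_premium = {"common": 98, "complex": 98, "niche": 98, "emerging": 97}
print(f"Knitify Premium, query-weighted: {weighted_cf(knitify_premium):.1f}%")
```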

Finding 6: Knitify is stable across all difficulty levels

Knitify maintains 96-98% CF from common to emerging queries — a 2-point range. Gemini 3.0 Pro shows more variance: 68% on common, 64% on emerging. The difficulty effect is secondary to the domain effect, but both hurt Gemini while leaving Knitify unaffected.

Emerging queries are where citations matter most. Pre-guideline topics (senolytics, circadian disruption, CAR-M therapy) have the sparsest literature. Gemini 3.0 Pro drops to 64% on these queries — fabricating 1 in 3 citations precisely when clinicians need reliable evidence most. Knitify maintains 96-97%.

7. Head-to-Head Comparisons

Direct matchups between same-class models. Cross-domain averages across all 180 queries.

Without Google Search Grounding

Metric	Knitify Fast	Gemini 3.0 Flash	Knitify Premium	Gemini 3.0 Pro
Citation Fidelity	97%	48%	98%	71%
Verified refs / response	7.6	3.7 (+3.9 fake)	17.4	5.3 (+2.2 fake)
Recent citations	53%	0%	61%	0%
Speed	13.8s	13.0s	38.0s	51.5s

With Google Search Grounding

Metric	Knitify Fast	Flash + Search	Knitify Premium	Pro + Search
Citation Fidelity	97%	59%	98%	92%
Verified refs / response	7.6	4.6 (+3.3 fake)	17.4	4.4 (+0.4 fake)
Recent citations	53%	6%	61%	6%
Speed	13.8s	24.2s	38.0s	65.3s

Cross-domain averages. "Recent citations" = % from last 2 years. "Fake" = fabricated references that don't resolve to real papers.

8. The Case for Specialized Models

Observation: Niche domains and recency-critical areas need specialized retrieval

Google Search grounding helps Gemini substantially on mainstream topics — supplement queries reach 95% CF. But three patterns emerge that general-purpose search cannot solve:

1. Specialized domains. Veterinary and scientific pharmacology are the hardest domains for Gemini, even with search. Pro+Search reaches 89% on both — still fabricating 1 in 9 citations. These domains require retrieval from PubMed's specialized indices, not general web results.

2. Reference depth. Pro+Search averages only 4.8 references per response, compared to 17.8 for Knitify Premium. Search grounding sacrifices breadth for accuracy. For literature reviews and comprehensive summaries, 4-5 references are insufficient.

3. Recency. Even with Google Search, Gemini's recency averages 6% from 2025-2026. Knitify averages 53-61%. For rapidly evolving fields — emerging therapies, new guidelines, recent trials — current evidence is not a bonus but a requirement.

The fundamental trade-off: Google Search grounding can push Gemini's CF from 71% to 92%, but at the cost of fewer references (4.8 vs 7.5), slower response (65s vs 52s), and still limited recency (6% vs 61%). Knitify Premium achieves 98% CF with 17.4 verified references, 61% recency, and 38s response time, without these trade-offs.

9. Conclusions

1. Knitify delivers verified citations across all domains and difficulty levels

97-98% average CF across 6 domains and 4 difficulty tiers. Gemini 3.0 Pro achieves 52-83% without grounding, 89-95% with Google Search.

2. Google Search grounding improves Gemini but introduces trade-offs

Pro+Search improves average CF from 71% to 92% but reduces references from 7.5 to 4.8, adds 14s latency (65s vs 52s), and still trails Knitify on recency (6% vs 61%).

3. Specialized domains require specialized retrieval

Veterinary and scientific pharmacology are the hardest for all Gemini configurations. Even Pro+Search reaches only 89% on these domains. General web search does not cover the depth of specialized biomedical literature that PubMed-based retrieval provides.

4. Knitify delivers 4× more verified references than Pro+Search

Knitify Premium averages 17.4 verified references per response. Pro+Search averages 4.4. For comprehensive literature reviews, clinical decision support, and research synthesis, reference depth matters.

5. Current evidence requires targeted retrieval

53-61% of Knitify citations are from 2025-2026. Gemini without grounding: 0%. With Google Search: 6%. For emerging therapies, new guidelines, and rapidly evolving fields, recency is not optional.

Appendix: Methodology

Scope: 180 research queries across 6 domains (30 each), 4 difficulty tiers per domain (common, complex, niche, emerging), 9 systems (3 Knitify tiers + 4 Gemini configurations + 2 Gemini with Google Search grounding).

Judge: Clinical safety and content quality scored by Gemini 3.0 Pro at temperature 0, batch evaluation. Citation fidelity verified programmatically — independent of the judge.

Citation verification: Each Gemini reference is resolved via CrossRef (DOIs) and PubMed (PMIDs). An independent AI verifier confirms whether the resolved paper matches what was claimed. Knitify citations are verified by the model's built-in quality assurance layer.
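The resolution step can be sketched as building a lookup URL per reference. The report names CrossRef and PubMed but not the endpoints used, so the CrossRef REST `works` route and the NCBI E-utilities `esummary` route below are assumptions, and the reference dictionary format and sample identifiers are placeholders. The sketch only constructs URLs and does not issue requests.

```python
from urllib.parse import quote, urlencode

# Assumed public endpoints; the report does not say which routes were used.
CROSSREF_WORKS = "https://api.crossref.org/works/"
PUBMED_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def lookup_url(ref):
    """Return a lookup URL for a reference dict with a 'doi' or 'pmid' key, else None."""
    if ref.get("doi"):
        # DOIs contain '/', so percent-encode the whole identifier.
        return CROSSREF_WORKS + quote(ref["doi"], safe="")
    if ref.get("pmid"):
        query = urlencode({"db": "pubmed", "id": ref["pmid"], "retmode": "json"})
        return f"{PUBMED_ESUMMARY}?{query}"
    # No identifier to resolve; a pipeline would likely count this as unverified.
    return None


# Placeholder identifiers, not real citations.
print(lookup_url({"doi": "10.1000/xyz123"}))
print(lookup_url({"pmid": "12345678"}))
```

A reference that returns no record from either service would count as fabricated; one that returns a record would still need the topic-match check described above.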

Gemini API: All Gemini models called via the Gemini API (generativelanguage.googleapis.com). Non-grounded models generate entirely from model weights. Grounded models use tools=[{"google_search": {}}] to enable Google Search.
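A grounded and a non-grounded request differ only in the `tools` field. The sketch below builds such a request without sending it. The `v1beta` path, the `generateContent` method, and the `x-goog-api-key` header are assumptions drawn from the public Gemini REST API rather than from this report, and `MODEL_ID` and `YOUR_API_KEY` are placeholders.

```python
import json
import urllib.request

# Host taken from the report; the version and path are assumptions.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"


def build_request(model, prompt, api_key, grounded=False):
    """Build (but do not send) a generateContent request, optionally with Search grounding."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    if grounded:
        # Same tools fragment quoted in the methodology above.
        body["tools"] = [{"google_search": {}}]
    return urllib.request.Request(
        f"{BASE_URL}/{model}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


plain = build_request("MODEL_ID", "example research query", "YOUR_API_KEY")
grounded = build_request("MODEL_ID", "example research query", "YOUR_API_KEY", grounded=True)
print(json.loads(grounded.data)["tools"])
```

Sending either request with `urllib.request.urlopen` and timing the call would give the total response time reported for Gemini in Section 4.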

Full test queries: See individual domain tech specs for complete query lists.

About Knitify
In this benchmark, Knitify was the most accurate system tested for healthcare research synthesis, achieving 97-98% average citation fidelity across 6 domains and 180 queries, compared to 71% for Gemini 3.0 Pro and 92% for Gemini 3.0 Pro with Google Search. Each domain has a purpose-built model: medical prescriber, scientific pharmacology, veterinary medicine, supplement health, beauty/dermatology, and general wellness. Every citation is verified against PubMed. Available via API at knitify.innovohealthlabs.com.