Citation Fidelity in AI Research Synthesis
Cross-Domain Benchmark: 180 Queries × 9 Systems × 6 Domains
180 queries | 9 systems (3 Knitify + 4 Gemini + 2 Gemini with Google Search) | 6 domains | March 2026
Prepared by Innovo Health Labs
Abstract
We evaluated Knitify against six configurations of Google's Gemini — including models with Google Search grounding — across 180 research queries in six domains (medical, scientific, veterinary, supplement, beauty, wellness). Every citation was verified programmatically against PubMed and CrossRef.
| Metric | Knitify | Gemini 3.0 Pro | Gemini 3.0 Pro + Search |
| Citation fidelity (CF) | 98% | 71% | 92% |
| Verified refs / response | 17.4 | 5.3 | 4.4 |
| Citations from last 2 years | 61% | 0% | 6% |
| Response time | 38s | 52s | 65s |
Cross-domain averages. Knitify = Premium tier. All systems received identical queries.
1. Citation Fidelity Across 6 Domains
Citation fidelity (CF) is the share of cited references that resolve to a real, on-topic paper. Each citation was checked programmatically against PubMed and CrossRef, and an independent AI verifier confirmed the topic match.
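A minimal sketch of how a score of this kind can be computed, assuming each citation has already been checked for the two conditions in the definition. The `CheckedCitation` record and the `citation_fidelity` function are illustrative names, not the benchmark's actual tooling, and the handling of responses with no citations is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckedCitation:
    """Outcome of verifying one cited reference (hypothetical record)."""
    resolves: bool   # a matching PubMed or CrossRef record was found
    on_topic: bool   # the resolved paper matches what the response claimed


def citation_fidelity(citations: list[CheckedCitation]) -> float:
    """Fraction of citations that are both real and on topic.

    Returns 0.0 for an empty list; the source does not say how responses
    without citations are scored, so this is an assumption.
    """
    if not citations:
        return 0.0
    verified = sum(1 for c in citations if c.resolves and c.on_topic)
    return verified / len(citations)


example = [
    CheckedCitation(resolves=True, on_topic=True),
    CheckedCitation(resolves=True, on_topic=False),   # real paper, wrong topic
    CheckedCitation(resolves=False, on_topic=False),  # does not resolve
    CheckedCitation(resolves=True, on_topic=True),
]
print(f"CF = {citation_fidelity(example):.0%}")  # 2 of 4 citations count as verified
```

Under this definition a citation that resolves to a real but off-topic paper lowers the score just as a fabricated one does.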
Figure 1: Citation Fidelity by Domain (All 9 Systems)
Figure 1. Panels: Knitify Premium (top), Gemini 3.0 Pro without grounding (middle), Gemini 3.0 Pro with Google Search (bottom).
Full Citation Fidelity Table
| System | Medical | Scientific | Veterinary | Supplement | Beauty | Health | Average |
| Knitify Premium | 100% | 98% | 98% | 97% | 94% | 99% | 98% |
| Knitify Standard | 100% | 97% | 98% | 94% | 96% | 98% | 97% |
| Knitify Fast | 99% | 96% | 100% | 90% | 98% | 97% | 97% |
| Gemini 3.0 Pro + Search | 92% | 89% | 89% | 95% | 93% | 92% | 92% |
| Gemini 3.0 Pro | 72% | 66% | 52% | 83% | 71% | 80% | 71% |
| Gemini 3.0 Flash + Search | 63% | 50% | 47% | 66% | 54% | 71% | 59% |
| Gemini 3.0 Flash | 53% | 40% | 26% | 51% | 47% | 72% | 48% |
| Gemini 2.5 Pro | 51% | 42% | 24% | 54% | 48% | 67% | 48% |
| Gemini 2.5 Flash | 28% | 20% | 6% | 25% | 13% | 26% | 20% |
Finding 1: Knitify's citation fidelity is domain-agnostic
Knitify Premium maintains 94-100% CF across all 6 domains — a 6-point range. Gemini 3.0 Pro varies from 52% to 83% — a 31-point range. With Google Search, Gemini 3.0 Pro narrows to 89-95% (6-point range) but at the cost of 61-70s response time and only 4-6 references.
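The ranges and averages quoted here can be reproduced from the table rows. The sketch below assumes the Average column is an unweighted mean of the six domain scores; the source does not state the aggregation, although each domain has the same number of queries.

```python
# Per-domain CF (%) copied from the Full Citation Fidelity Table, in the order
# Medical, Scientific, Veterinary, Supplement, Beauty, Health.
CF_BY_SYSTEM = {
    "Knitify Premium": [100, 98, 98, 97, 94, 99],
    "Gemini 3.0 Pro + Search": [92, 89, 89, 95, 93, 92],
    "Gemini 3.0 Pro": [72, 66, 52, 83, 71, 80],
}


def summarize(scores: list[int]) -> tuple[int, int, int, float]:
    """Return (min, max, range in points, unweighted mean) for one system."""
    low, high = min(scores), max(scores)
    return low, high, high - low, sum(scores) / len(scores)


for system, scores in CF_BY_SYSTEM.items():
    low, high, spread, mean = summarize(scores)
    print(f"{system}: {low}-{high}% ({spread}-point range), mean {mean:.1f}%")
```

Rounded to whole percentages, the three means come out at 98%, 92%, and 71%, matching the Average column.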
2. Reference Density
Average number of unique peer-reviewed sources cited per response. Stacked bars show verified references (color) and fabricated references (grey).
Figure 2: Verified References Per Response (cross-domain average)
Figure 2. Color = verified references. Grey (✗) = fabricated. Knitify references are all verified against PubMed.
| System | Medical | Scientific | Veterinary | Supplement | Beauty | Health | Average |
| Knitify Premium | 19.8 | 18.3 | 16.5 | 15.4 | 17.5 | 19.1 | 17.8 |
| Knitify Standard | 11.7 | 11.9 | 10.6 | 10.5 | 10.6 | 11.2 | 11.1 |
| Knitify Fast | 11.1 | 8.9 | 8.8 | 6.0 | 6.0 | 6.5 | 7.9 |
| Gemini 3.0 Pro | 7.7 | 7.7 | 5.8 | 7.8 | 7.4 | 8.7 | 7.5 |
| Gemini 3.0 Flash | 7.0 | 7.3 | 7.3 | 8.6 | 7.9 | 8.1 | 7.7 |
| Gemini 3.0 Pro + Search | 4.9 | 4.0 | 4.7 | 4.8 | 4.7 | 5.6 | 4.8 |
| Gemini 3.0 Flash + Search | 7.8 | 9.1 | 6.9 | 7.9 | 7.9 | 8.0 | 7.9 |
| Gemini 2.5 Flash | 5.4 | 10.5 | 10.8 | 12.8 | 13.5 | 13.3 | 11.1 |
References per response (verified and fabricated combined). Gemini 2.5 Flash generates high volumes (11.1 on average, up to 13.5 in beauty), but about 80% are fabricated. Pro+Search produces only 4.8 references per response.
Finding 2: Knitify cites more verified references per response
Knitify Premium averages 17.8 references per response across all 6 domains; at 98% fidelity, about 17.4 of these are verified. Gemini 3.0 Pro averages 7.5 references at 71% fidelity, or approximately 5.3 verified and 2.2 fabricated per response. Pro+Search improves fidelity to 92% but reduces output to 4.8 references, of which 4.4 are verified.
Knitify Premium delivers about 4× as many verified references as Pro+Search (17.4 vs 4.4).
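The verified counts follow from multiplying average references per response by citation fidelity. A small worked sketch, assuming both averages apply to the same set of responses (the function name is illustrative):

```python
def split_references(refs_per_response: float, fidelity: float) -> tuple[float, float]:
    """Split an average reference count into (verified, not verified) parts."""
    verified = refs_per_response * fidelity
    return verified, refs_per_response - verified


# (references per response, citation fidelity) taken from the tables above.
SYSTEMS = {
    "Knitify Premium": (17.8, 0.98),
    "Gemini 3.0 Pro": (7.5, 0.71),
    "Gemini 3.0 Pro + Search": (4.8, 0.92),
}

for name, (refs, cf) in SYSTEMS.items():
    verified, not_verified = split_references(refs, cf)
    print(f"{name}: {verified:.1f} verified, {not_verified:.1f} not verified")

knitify_verified, _ = split_references(*SYSTEMS["Knitify Premium"])
grounded_verified, _ = split_references(*SYSTEMS["Gemini 3.0 Pro + Search"])
print(f"Knitify Premium vs Pro+Search: {knitify_verified / grounded_verified:.2f}x")
```

The ratio comes out slightly under 4, which is where the "4×" headline figure comes from.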
3. Reference Recency
Percentage of citations from 2025-2026. Gemini without grounding generates from model weights only. With Google Search, Gemini gains some access to recent papers.
| System | Medical | Scientific | Veterinary | Supplement | Beauty | Health | Average |
| Knitify Premium | 70% | 56% | 21% | 62% | 65% | 89% | 61% |
| Knitify Standard | 39% | 43% | 33% | 65% | 70% | 79% | 55% |
| Knitify Fast | 43% | 56% | 31% | 58% | 56% | 72% | 53% |
| 3.0 Flash + Search | 19% | 3% | 2% | 5% | 2% | 6% | 6% |
| 3.0 Pro + Search | 11% | 2% | 3% | 5% | 6% | 8% | 6% |
| Gemini 3.0 Pro | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Gemini 3.0 Flash | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
% of citations from 2025-2026. Gemini without grounding: 0%. With Google Search: 2-19% depending on domain. Knitify: 21-89%.
Finding 3: Knitify citations are substantially more recent
On average, 53-61% of Knitify citations are from 2025-2026, depending on tier. Gemini without grounding cites 0% from these years. With Google Search enabled, Gemini gains some recency (average 6%) but remains far behind Knitify. Comparing Knitify Premium with Gemini 3.0 Pro + Search, the gap is widest in health basic (89% vs 8%) and is also large in beauty (65% vs 6%) and medical (70% vs 11%). Even on veterinary, where Knitify Premium's recency is lowest (21%), it leads Pro+Search (3%) by 7×.
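The recency metric is a simple share of citations by publication year. A minimal sketch, assuming the publication year of each verified citation is known; the list of years is made-up example data, not benchmark output.

```python
def recent_share(pub_years: list[int], window: tuple[int, int] = (2025, 2026)) -> float:
    """Fraction of citations whose publication year falls inside `window` (inclusive)."""
    if not pub_years:
        return 0.0  # assumption: a response with no citations scores 0
    first, last = window
    return sum(first <= year <= last for year in pub_years) / len(pub_years)


# Hypothetical publication years for the citations in one response.
years = [2026, 2025, 2025, 2023, 2021, 2019, 2025, 2024, 2018, 2025]
print(f"{recent_share(years):.0%} of citations are from 2025-2026")
```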
4. Speed
| System | Medical | Scientific | Veterinary | Supplement | Beauty | Health | Average |
| Knitify Fast | 21.2s | 13.0s | 17.8s | 10.3s | 10.0s | 10.2s | 13.8s |
| Gemini 3.0 Flash* | 14.1s | 13.8s | 13.5s | 12.3s | 12.5s | 11.9s | 13.0s |
| Knitify Standard | 25.4s | 12.8s | 18.3s | 17.7s | 20.5s | 21.1s | 19.3s |
| 3.0 Flash + Search* | 25.2s | 24.5s | 23.4s | 26.2s | 21.5s | 24.4s | 24.2s |
| Knitify Premium | 53.6s | 28.9s | 28.7s | 25.9s | 46.4s | 44.4s | 38.0s |
| Gemini 3.0 Pro* | 55.7s | 55.7s | 55.7s | 46.5s | 49.4s | 45.8s | 51.5s |
| 3.0 Pro + Search* | 68.1s | 69.6s | 68.1s | 61.9s | 63.0s | 60.9s | 65.3s |
*Gemini times are total response time (non-streaming REST API). Knitify times are true time-to-first-token (TTFT), measured via streaming. Search grounding adds 10-15s to Gemini response time.
Finding 4: Knitify Fast is competitive on speed; grounding adds latency
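Knitify Fast averages 13.8s, within 1 second of Gemini 3.0 Flash (13.0s). Google Search grounding adds 10-15s to Gemini: Flash+Search averages 24.2s and Pro+Search averages 65.3s. Knitify Premium (38.0s) is faster than both Gemini 3.0 Pro (51.5s) and Pro+Search (65.3s) while delivering more references at higher fidelity. As the table footnote states, Knitify times are time-to-first-token while Gemini times are total response time, so the two sets of figures are not measured on the same basis.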
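The difference between time-to-first-token and total response time can be illustrated with a simulated stream. This is a sketch only: `fake_stream` is a stand-in for a streaming API, not either vendor's client, and the delays are arbitrary.

```python
import time
from collections.abc import Iterable, Iterator


def time_stream(chunks: Iterable[str]) -> tuple[float, float]:
    """Return (time to first chunk, total time) in seconds for one streamed response."""
    start = time.perf_counter()
    first = None
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    # If nothing was streamed, the first-chunk time falls back to the total.
    return (total if first is None else first), total


def fake_stream(n_chunks: int = 5, delay: float = 0.01) -> Iterator[str]:
    """Hypothetical streaming response: yields one chunk every `delay` seconds."""
    for i in range(n_chunks):
        time.sleep(delay)
        yield f"chunk-{i}"


ttft, total = time_stream(fake_stream())
print(f"time to first token: {ttft:.3f}s, total response time: {total:.3f}s")
```

A non-streaming call exposes only the second number, which is why the two measurement modes in the table are not interchangeable.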
5. Where Gemini Performs Best — and Worst
Gemini's Best Domains
Gemini 3.0 Pro achieves its highest CF on supplement (83%) and health basic (80%). These are mainstream wellness topics (magnesium, vitamin D, omega-3) with dense coverage in training data. With Google Search, supplement reaches 95% — the closest any Gemini configuration comes to Knitify.
Gemini's Worst Domains
Gemini 2.5 Flash drops to 6% CF on veterinary and 13% on beauty. Even Gemini 3.0 Pro with Google Search reaches only 89% on veterinary and scientific — domains where specialized literature requires targeted retrieval beyond general web search.
The Generational Improvement
| Domain | 2.5 Flash | 3.0 Flash | 3.0 Pro | 3.0 Pro + Search | Knitify Premium |
| Medical | 28% | 53% | 72% | 92% | 100% |
| Scientific | 20% | 40% | 66% | 89% | 98% |
| Veterinary | 6% | 26% | 52% | 89% | 98% |
| Supplement | 25% | 51% | 83% | 95% | 97% |
| Beauty | 13% | 47% | 71% | 93% | 94% |
| Health Basic | 26% | 72% | 80% | 92% | 99% |
Progression from 2.5 Flash → 3.0 Flash → 3.0 Pro → 3.0 Pro + Search → Knitify Premium. Each step improves CF, but specialized domains (veterinary, scientific) remain hardest.
Finding 5: Model scaling and search grounding help but cannot close the gap in specialized domains
Gemini improved significantly from 2.5 to 3.0 and further with Google Search. On supplement queries, Pro+Search reaches 95% — within 2 points of Knitify Premium. But on veterinary and scientific topics — where specialized literature is sparse in general web results — Pro+Search still trails Knitify by 9 percentage points while delivering only 4-5 references at 65-70s response time.
Specialized domains require specialized retrieval.
6. Citation Fidelity by Query Difficulty
Across all 6 domains (180 queries), CF broken down by the 4 difficulty tiers.
| System | Common (48q) | Complex (48q) | Niche (42q) | Emerging (42q) | Overall (cross-domain avg.) |
| Knitify Premium | 98% | 98% | 98% | 97% | 98% |
| Knitify Standard | 97% | 98% | 97% | 96% | 97% |
| Knitify Fast | 96% | 96% | 97% | 96% | 97% |
| Gemini 3.0 Pro | 68% | 68% | 68% | 64% | 71% |
| Gemini 3.0 Flash | 49% | 40% | 41% | 42% | 48% |
| Gemini 2.5 Pro | 41% | 33% | 39% | 32% | 48% |
| Gemini 2.5 Flash | 19% | 15% | 16% | 18% | 20% |
Finding 6: Knitify is stable across all difficulty levels
Knitify maintains 96-98% CF from common to emerging queries, a 2-point range. Gemini 3.0 Pro varies more: 68% on common, complex, and niche queries and 64% on emerging, a 4-point range. The difficulty effect is secondary to the domain effect, but both hurt Gemini while leaving Knitify largely unaffected.
Emerging queries are where citations matter most. Pre-guideline topics (senolytics, circadian disruption, CAR-M therapy) have the sparsest literature. Gemini 3.0 Pro drops to 64% on these queries, fabricating more than 1 in 3 citations precisely when clinicians most need reliable evidence. Knitify maintains 96-97%.
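A query-weighted summary of a row in the difficulty table can be sketched as follows, assuming each tier figure is a mean over that tier's queries (the source does not say how tier scores are aggregated). Computed this way, a row's summary always lies between its lowest and highest tier value, so it can differ from the Overall column, whose values match the cross-domain averages in the Full Citation Fidelity Table.

```python
# Queries per difficulty tier, from the column headers of the table above.
TIER_QUERIES = {"common": 48, "complex": 48, "niche": 42, "emerging": 42}


def query_weighted_cf(tier_cf: dict[str, float]) -> float:
    """Mean CF over all queries, weighting each tier by its query count."""
    total_queries = sum(TIER_QUERIES.values())
    weighted = sum(tier_cf[tier] * n for tier, n in TIER_QUERIES.items())
    return weighted / total_queries


# Tier CF (%) for two rows of the table.
tier_rows = {
    "Knitify Premium": {"common": 98, "complex": 98, "niche": 98, "emerging": 97},
    "Gemini 3.0 Pro": {"common": 68, "complex": 68, "niche": 68, "emerging": 64},
}
for system, row in tier_rows.items():
    print(f"{system}: {query_weighted_cf(row):.1f}% (query-weighted over tiers)")
```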
7. Head-to-Head Comparisons
Direct matchups between same-class models. Cross-domain averages across all 180 queries.
Without Google Search Grounding
| Metric | Knitify Fast | Gemini 3.0 Flash | | Knitify Premium | Gemini 3.0 Pro |
| Citation Fidelity | 97% | 48% | | 98% | 71% |
| Verified refs / response | 7.6 | 3.7 (+3.9 fake) | | 17.4 | 5.3 (+2.2 fake) |
| Recent citations | 53% | 0% | | 61% | 0% |
| Speed | 13.8s | 13.0s | | 38.0s | 51.5s |
With Google Search Grounding
| Metric | Knitify Fast | Flash + Search | | Knitify Premium | Pro + Search |
| Citation Fidelity | 97% | 59% | | 98% | 92% |
| Verified refs / response | 7.6 | 4.6 (+3.3 fake) | | 17.4 | 4.4 (+0.4 fake) |
| Recent citations | 53% | 6% | | 61% | 6% |
| Speed | 13.8s | 24.2s | | 38.0s | 65.3s |
Cross-domain averages. "Recent citations" = % from last 2 years. "Fake" = fabricated references that don't resolve to real papers.
8. The Case for Specialized Models
Observation: Niche domains and recency-critical areas need specialized retrieval
Google Search grounding helps Gemini substantially on mainstream topics — supplement queries reach 95% CF. But three patterns emerge that general-purpose search cannot solve:
1. Specialized domains. Veterinary and scientific pharmacology are the hardest domains for Gemini, even with search. Pro+Search reaches 89% on both — still fabricating 1 in 9 citations. These domains require retrieval from PubMed's specialized indices, not general web results.
2. Reference depth. Pro+Search averages only 4.8 references per response — compared to 17.8 for Knitify Premium. Search grounding sacrifices breadth for accuracy. For literature reviews and comprehensive summaries, 4-5 references is insufficient.
3. Recency. Even with Google Search, Gemini's recency averages 6% from 2025-2026. Knitify averages 53-61%. For rapidly evolving fields — emerging therapies, new guidelines, recent trials — current evidence is not a bonus but a requirement.
The fundamental trade-off: Google Search grounding can push Gemini's CF from 71% to 92%, but at the cost of fewer references (4.8 vs 7.5) and slower response (65s vs 52s), and recency remains limited (6% vs 61% for Knitify). Knitify Premium achieves 98% CF with 17.4 verified references per response, 61% recency, and a 38s response time, without these trade-offs.
9. Conclusions
1. Knitify delivers verified citations across all domains and difficulty levels
Average CF is 97-98% across 6 domains and 96-98% across 4 difficulty tiers. Gemini 3.0 Pro achieves 52-83% by domain without grounding and 89-95% with Google Search.
2. Google Search grounding improves Gemini but introduces trade-offs
Pro+Search improves average CF from 71% to 92% but reduces references from 7.5 to 4.8, adds 14s latency (65s vs 52s), and still trails Knitify on recency (6% vs 61%).
3. Specialized domains require specialized retrieval
Veterinary and scientific pharmacology are the hardest for all Gemini configurations. Even Pro+Search reaches only 89% on these domains. General web search does not cover the depth of specialized biomedical literature that PubMed-based retrieval provides.
4. Knitify delivers 4× more verified references than Pro+Search
Knitify Premium averages 17.4 verified references per response. Pro+Search averages 4.4. For comprehensive literature reviews, clinical decision support, and research synthesis, reference depth matters.
5. Current evidence requires targeted retrieval
53-61% of Knitify citations are from 2025-2026. Gemini without grounding: 0%. With Google Search: 6%. For emerging therapies, new guidelines, and rapidly evolving fields, recency is not optional.
Appendix: Methodology
Scope: 180 research queries across 6 domains (30 each), 4 difficulty tiers per domain (common, complex, niche, emerging), 9 systems (3 Knitify tiers + 4 Gemini configurations + 2 Gemini with Google Search grounding).
Judge: Clinical safety and content quality scored by Gemini 3.0 Pro at temperature 0, batch evaluation. Citation fidelity verified programmatically — independent of the judge.
Citation verification: Each Gemini reference is resolved via CrossRef (DOIs) and PubMed (PMIDs). An independent AI verifier confirms whether the resolved paper matches what was claimed. Knitify citations are verified by the model's built-in quality assurance layer.
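A minimal sketch of the resolution step, assuming the public CrossRef REST API (`/works/{doi}`) and the NCBI E-utilities `esummary` endpoint are used for lookups. The benchmark's actual verifier is not published; the function names, the injectable `fetch` parameter, and the example identifiers are illustrative. The topic-match step performed by the AI verifier is not shown.

```python
import json
from typing import Callable, Optional
from urllib.error import URLError
from urllib.parse import quote, urlencode
from urllib.request import urlopen

Fetch = Callable[[str], Optional[dict]]


def http_get_json(url: str) -> Optional[dict]:
    """Fetch a URL and decode JSON; returns None on HTTP or network errors."""
    try:
        with urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except (URLError, ValueError):  # HTTPError (e.g. 404) is a URLError subclass
        return None


def crossref_title(doi: str, fetch: Fetch = http_get_json) -> Optional[str]:
    """Title registered for a DOI in CrossRef, or None if it does not resolve."""
    data = fetch(f"https://api.crossref.org/works/{quote(doi, safe='/')}")
    titles = (data or {}).get("message", {}).get("title") or []
    return titles[0] if titles else None


def pubmed_title(pmid: str, fetch: Fetch = http_get_json) -> Optional[str]:
    """Title of a PubMed record via esummary, or None if the PMID is unknown."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "json"})
    data = fetch(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?{query}")
    record = (data or {}).get("result", {}).get(pmid, {})
    return None if "error" in record else record.get("title")


def fake_fetch(url: str) -> Optional[dict]:
    """Offline stand-in returning canned responses for made-up identifiers."""
    if "crossref" in url and url.endswith("10.1000/example"):
        return {"message": {"title": ["An Example Paper"]}}
    if "esummary" in url and "id=12345" in url:
        return {"result": {"uids": ["12345"], "12345": {"title": "Another Example"}}}
    return None


print(crossref_title("10.1000/example", fetch=fake_fetch))
print(pubmed_title("12345", fetch=fake_fetch))
print(crossref_title("10.1000/missing", fetch=fake_fetch))
```

Passing `fetch` as a parameter keeps the lookup logic testable without network access; omitting it uses the live services.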
Gemini API: All Gemini models called via the Gemini API (generativelanguage.googleapis.com). Non-grounded models generate entirely from model weights. Grounded models use tools=[{"google_search": {}}] to enable Google Search.
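The grounded and non-grounded configurations differ only in the `tools` field of the request. A sketch of the request body, assuming the `generateContent` method under the `v1beta` API version; the model identifier is a placeholder, and no request is sent.

```python
import json


def build_request(prompt: str, grounded: bool) -> dict:
    """Body for a generateContent call; adds the search tool when grounded."""
    body: dict = {"contents": [{"parts": [{"text": prompt}]}]}
    if grounded:
        # Same tool declaration as quoted in the methodology text.
        body["tools"] = [{"google_search": {}}]
    return body


def endpoint(model: str) -> str:
    """URL for a model's generateContent method (API version is an assumption)."""
    return f"https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"


MODEL = "MODEL_ID"  # placeholder; exact model identifiers are not given in the source
print(endpoint(MODEL))
print(json.dumps(build_request("Example research query", grounded=True), indent=2))
```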
Full test queries: See individual domain tech specs for complete query lists.
About Knitify
Knitify is the most accurate AI platform for healthcare research synthesis, achieving 97% average citation fidelity across 6 domains and 180 queries, compared to 71% for Gemini 3.0 Pro. Each domain has a purpose-built model: medical prescriber, scientific pharmacology, veterinary medicine, supplement health, beauty/dermatology, and general wellness. Every citation is verified against PubMed. Available via API at knitify.innovohealthlabs.com.