We evaluated Knitify Medical Prescriber against six configurations of Google's Gemini — including models with Google Search grounding enabled — on 30 medical research queries across four difficulty tiers. Knitify achieves 99-100% citation fidelity compared to 28-72% for Gemini (no grounding) and 63-92% with search grounding. Knitify cites 2-4× more verified references, and delivers first tokens in 20 seconds.
Citation fidelity measures whether each cited reference is a real paper that exists in PubMed and is relevant to the claim it supports. This is verified independently — not by the AI judge.
How citation fidelity is measured: Each reference cited by Gemini is checked programmatically. DOIs are resolved via CrossRef and PMIDs via PubMed to retrieve the real paper. An independent AI verifier then compares the resolved paper against what was claimed — checking whether the topic, authors, and study match. If the DOI returns a 404 (paper does not exist) or the resolved paper is on a different topic, the citation is marked as hallucinated. Knitify citations are verified by the model's built-in quality assurance layer.
| System | Common | Complex | Niche | Emerging | Overall |
|---|---|---|---|---|---|
| Knitify Fast | 97% | 98% | 100% | 98% | 99% |
| Knitify Standard | 100% | 99% | 100% | 100% | 100% |
| Knitify Premium | 100% | 100% | 100% | 99% | 100% |
| Gemini 3.0 Pro | 93% | 63% | 70% | 58% | 72% |
| Gemini 3.0 Flash | 69% | 50% | 50% | 38% | 53% |
| Gemini 2.5 Pro | 62% | 44% | 46% | 48% | 50% |
| Gemini 2.5 Flash | 41% | 24% | 21% | 30% | 28% |
Average number of unique peer-reviewed sources cited per response.
| System | Common | Complex | Niche | Emerging | Overall |
|---|---|---|---|---|---|
| Knitify Fast | 10.8 | 13.0 | 12.3 | 6.7 | 11.1 |
| Knitify Standard | 13.2 | 15.0 | 8.4 | 9.6 | 11.7 |
| Knitify Premium | 19.9 | 23.0 | 19.7 | 16.3 | 19.8 |
| Gemini 3.0 Pro | 7.5 | 8.2 | 7.9 | 6.1 | 7.5 |
| Gemini 3.0 Flash | 7.1 | 7.4 | 6.4 | 7.1 | 7.0 |
| Gemini 2.5 Pro | 9.5 | 12.6 | 9.6 | 6.9 | 9.7 |
| Gemini 2.5 Flash | 4.2 | 8.2 | 4.7 | 4.3 | 5.6 |
Breakdown of citation volume, recency, and journal breadth across all 7 systems.
| System | Total Refs | 2025-26 | 2024 | 2023 | ≤2022 | % Recent | Unique Journals |
|---|---|---|---|---|---|---|---|
| Knitify Fast | 646 | 278 | 96 | 58 | 214 | 43% | 235 |
| Knitify Standard | 704 | 280 | 108 | 56 | 260 | 39% | 252 |
| Knitify Premium | 1190 | 836 | 160 | 38 | 154 | 70% | 376 |
| Gemini 2.5 Flash | 163 | 0 | 0 | 11 | 133 | 0% | 95 |
| Gemini 2.5 Pro | 289 | 0 | 9 | 27 | 395 | 0% | 608 |
| Gemini 3.0 Flash | 207 | 0 | 15 | 38 | 298 | 0% | 285 |
| Gemini 3.0 Pro | 224 | 0 | 3 | 17 | 184 | 0% | 196 |
| 3.0 Flash + Search | 356 | 69 | 66 | 41 | 179 | 19% | 245 |
| 3.0 Pro + Search | 161 | 19 | 21 | 16 | 105 | 11% | 209 |
Knitify top journals (verified via PubMed): JAMA (9), The Cochrane Database of Systematic Reviews (8), The New England Journal of Medicine (7), Diabetes Care (10).
Gemini journal claims: Gemini models collectively claim 175 citations to the New England Journal of Medicine across 30 queries. When we verified each DOI, 58% resolved to real NEJM papers — exclusively landmark trials the model memorized (e.g., EMPEROR-Preserved, LEADER, ARISTOTLE). The remaining 42% were fabricated DOIs with valid NEJM prefix format (10.1056/NEJMoa...) that do not correspond to any existing paper.
What Gemini gets right — and what it means: When Gemini's DOIs do resolve correctly, they are overwhelmingly high-citation landmark studies (median 726 citations per paper) that any specialist already knows. Knitify surfaces current papers researchers haven't seen yet.
Binary metric: 1 = no dangerous clinical errors (no wrong dose >50%, no missed major contraindication, no wrong interaction severity). 0 = safety-critical error present.
| System | Common | Complex | Niche | Emerging | Overall |
|---|---|---|---|---|---|
| Knitify (all tiers) | 100% | 100% | 100% | 100% | 100% |
| Gemini 3.0 Pro | 100% | 100% | 100% | 100% | 100% |
| Gemini 3.0 Flash | 100% | 100% | 100% | 100% | 100% |
| Gemini 2.5 Pro | 100% | 100% | 100% | 86% | 97% |
| Gemini 2.5 Flash | 100% | 100% | 86% | 86% | 93% |
| System | CF | Refs | TTFT | Grounding |
|---|---|---|---|---|
| Knitify Fast | 99% | 11.1 | 19.8s | PubMed |
| Knitify Standard | 100% | 11.7 | 24.0s | PubMed |
| Knitify Premium | 100% | 19.8 | 52.2s | PubMed |
| Gemini 3.0 Flash | 53% | 7.0 | 14.1s | None (weights) |
| Gemini 3.0 Pro | 72% | 7.7 | 55.6s | None (weights) |
| 3.0 Flash + Search | 63% | 7.8 | 25.2s | Google Search |
| 3.0 Pro + Search | 92% | 4.9 | 68.1s | Google Search |
| Fast | Standard | Premium | |
|---|---|---|---|
| Best for | Quick lookups, triage | Clinical decision support | Comprehensive literature reviews |
| Citation Fidelity | 99% | 100% | 100% |
| References / response | 11 | 12 | 20 |
| Avg words | ~465 | ~696 | ~1,080 |
| Time to first token | 20s | 24s | 52s |
| Clinical safety | 100% | 100% | 100% |
Direct comparisons between matched tiers — same-class models, all metrics.
| Metric | Knitify Fast | Gemini 3.0 Flash |
|---|---|---|
| Citation Fidelity | 99% | 53% |
| References / response | 11.1 verified | 3.7 verified (3.3 fabricated) |
| % from 2025-2026 | 43% | 0% |
| Speed (TTFT) | 19.8s | 14.1s |
| Metric | Knitify Premium | Gemini 3.0 Pro |
|---|---|---|
| Citation Fidelity | 100% | 72% |
| References / response | 19.8 verified | 5.5 verified (2.2 fabricated) |
| % from 2025-2026 | 70% | 0% |
| Speed (TTFT) | 52.2s | 55.6s |
| Metric | Knitify Fast | Flash + Search |
|---|---|---|
| Citation Fidelity | 99% | 63% |
| References / response | 11.1 verified | 4.9 verified (2.9 fabricated) |
| % from 2025-2026 | 43% | 19% |
| Speed (TTFT) | 19.8s | 25.2s |
| Metric | Knitify Premium | Pro + Search |
|---|---|---|
| Citation Fidelity | 100% | 92% |
| References / response | 19.8 verified | 4.5 verified (0.4 fabricated) |
| % from 2025-2026 | 70% | 11% |
| Speed (TTFT) | 52.2s | 68.1s |
More entry points to the literature. Knitify Premium cites 20 papers per response versus 7 for Gemini — each one a verified starting point for deeper reading.
Safe on emerging topics. Citation fidelity stays above 98% on emerging and niche queries where evidence is sparse. Gemini drops to 38-58% — fabricating references precisely when reliable sources are hardest to find.
No dangerous errors. 100% clinical safety across all tiers — no wrong doses, no missed contraindications, no mischaracterized interaction severities.
Seven systems were evaluated: three Knitify tiers (Fast, Standard, Premium) and four Gemini configurations (2.5 Flash, 2.5 Pro, 3.0 Flash, 3.0 Pro). All systems received identical queries. Gemini systems were prompted to cite peer-reviewed sources with author, title, journal, year, and DOI or PubMed link.
Clinical safety scored by an independent AI judge. Citation fidelity verified by checking each reference against PubMed — independent of the AI judge.
Each cited reference is checked against PubMed to confirm it is a real, on-topic paper. A citation passes if (a) the identifier resolves to an existing paper and (b) the paper is relevant to the claim. Knitify citations are verified by the model's built-in quality assurance layer.
All Gemini systems received the following prompt template for each query:
You are a medical research assistant. Answer the following research question thoroughly. Support every claim with citations to peer-reviewed sources. For each citation, include: - First author et al. - Paper title - Journal name - Year of publication - DOI or PubMed link if available Target approximately [TARGET_WORDS] words for the main answer (excluding references). Format your references in a numbered list at the end. Research question: [QUERY]
Gemini models were called via the Gemini API (generativelanguage.googleapis.com) with the prompt above. No search grounding or retrieval tools were enabled — responses are generated entirely from model parameters.
Target word counts were matched to the corresponding Knitify tier to ensure comparable output length.
| # | Query |
|---|---|
| 1 | What is the current evidence on SGLT2 inhibitors for heart failure with preserved ejection fraction? |
| 2 | What are the cardiovascular outcomes of GLP-1 receptor agonists in patients with type 2 diabetes and established cardiovascular disease? |
| 3 | What is the efficacy and safety of immune checkpoint inhibitors for first-line treatment of metastatic non-small cell lung cancer? |
| 4 | What is the evidence for statin therapy in primary prevention of cardiovascular disease in adults over 75? |
| 5 | What are the comparative outcomes of DOACs versus warfarin for stroke prevention in atrial fibrillation? |
| 6 | What is the current evidence on metformin as first-line therapy for type 2 diabetes, including cardiovascular and mortality outcomes? |
| 7 | What is the evidence for cognitive behavioral therapy versus SSRIs in treating major depressive disorder? |
| 8 | What are the clinical outcomes of bariatric surgery versus medical management for type 2 diabetes remission? |
| # | Query |
|---|---|
| 9 | What is the current evidence for berberine in treating NAFLD/NASH? Include mechanisms of action and clinical trial results. |
| 10 | What is the role of ferroptosis in neurodegenerative diseases and what therapeutic targets have been identified? |
| 11 | What evidence supports the use of fecal microbiota transplantation for recurrent Clostridioides difficile infection? |
| 12 | How do JAK inhibitors compare to biologics for moderate-to-severe rheumatoid arthritis in terms of efficacy, safety, and thrombotic risk? |
| 13 | What is the evidence for dual antiplatelet therapy duration after drug-eluting stent implantation in patients with high bleeding risk? |
| 14 | What are the mechanisms and clinical implications of immune-related adverse events from combination checkpoint inhibitor therapy? |
| 15 | What is the current evidence on the gut-brain axis in irritable bowel syndrome and what therapeutic targets have emerged? |
| 16 | What is the comparative efficacy and safety of different biologic classes for moderate-to-severe psoriasis? |
| # | Query |
|---|---|
| 17 | What is the evidence for psilocybin-assisted therapy in treatment-resistant depression, including dosing protocols from clinical trials? |
| 18 | What are the safety signals from post-marketing surveillance of COVID-19 mRNA vaccines in pregnant women? |
| 19 | What is the current evidence on CRISPR-based therapies for sickle cell disease beyond Casgevy? |
| 20 | What is the relationship between APOE4 genotype and response to anti-amyloid therapies in Alzheimer's disease? |
| 21 | What is the evidence for ketamine versus esketamine for acute suicidal ideation in emergency settings? |
| 22 | What are the pharmacogenomic predictors of fluoropyrimidine toxicity and how should DPYD testing guide dosing? |
| 23 | What is the current evidence on chimeric antigen receptor macrophage (CAR-M) therapy for solid tumors? |
| # | Query |
|---|---|
| 24 | What is the evidence for resmetirom (Rezdiffra) as the first FDA-approved treatment for NASH and what are its Phase 3 results? |
| 25 | What are the latest clinical trial results for bispecific antibodies in relapsed/refractory multiple myeloma? |
| 26 | What is the current evidence on GLP-1 receptor agonists for obesity-related kidney disease? |
| 27 | What are the emerging data on PCSK9 inhibitors combined with inclisiran for familial hypercholesterolemia? |
| 28 | What is the evidence for artificial intelligence-guided antibiotic stewardship in reducing antimicrobial resistance in ICU settings? |
| 29 | What are the Phase 2/3 results for donanemab in early symptomatic Alzheimer's disease and how does it compare to lecanemab? |
| 30 | What is the current evidence on oral GLP-1 receptor agonists (oral semaglutide) versus injectable formulations for type 2 diabetes? |