We evaluated Knitify Scientific Advanced against six Gemini configurations (four plain, two with Google Search) and two OpenAI GPT-5.5 configurations on 30 scientific queries spanning drug mechanisms, PK/PD interactions, structure-activity relationships, and emerging therapeutics. Knitify achieves 96-98% citation fidelity, compared with 20-66% for plain Gemini and 76-92% for GPT-5.5. On complex PK queries, Gemini 2.5 Flash drops to 8%. Knitify Fast and Standard deliver first tokens in about 13 seconds, versus 110-125 seconds for GPT-5.5.
How citation fidelity is measured: Each reference cited by a comparison system (Gemini or GPT-5.5) is checked programmatically. DOIs are resolved via CrossRef and PMIDs via PubMed to retrieve the real paper. An independent AI verifier then compares the resolved paper against what was claimed, checking whether the topic, authors, and study match. If the DOI returns a 404 (the paper does not exist) or the resolved paper is on a different topic, the citation is marked as hallucinated. Knitify citations are verified by the model's built-in quality assurance layer.
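The verification flow described above can be sketched roughly as follows. This is a minimal illustration, not the evaluation's actual code: the function names are hypothetical, and `token_overlap` is a crude stand-in for the independent AI verifier, which the report does not specify in detail. The two URLs are the public CrossRef works endpoint and the NCBI esummary endpoint.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Optional

CROSSREF_URL = "https://api.crossref.org/works/{doi}"
ESUMMARY_URL = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
                "?db=pubmed&retmode=json&id={pmid}")


def fetch_json(url: str) -> Optional[dict]:
    """Return the parsed JSON body, or None when the server answers 404."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def resolve_title(doi: Optional[str] = None, pmid: Optional[str] = None,
                  fetch: Callable[[str], Optional[dict]] = fetch_json) -> Optional[str]:
    """Look up the real paper title for a DOI (CrossRef) or a PMID (PubMed)."""
    if doi:
        record = fetch(CROSSREF_URL.format(doi=urllib.parse.quote(doi)))
        titles = (record or {}).get("message", {}).get("title") or []
        return titles[0] if titles else None
    if pmid:
        record = fetch(ESUMMARY_URL.format(pmid=pmid))
        entry = (record or {}).get("result", {}).get(str(pmid), {})
        return entry.get("title") or None
    return None


def token_overlap(claimed: str, resolved: str, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in for the LLM verifier: Jaccard overlap of title words."""
    a, b = set(claimed.lower().split()), set(resolved.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def classify(claimed_title: str, resolved_title: Optional[str],
             same_paper: Callable[[str, str], bool] = token_overlap) -> str:
    """Mark a citation hallucinated if it does not resolve or resolves off-topic."""
    if resolved_title is None:
        return "hallucinated"
    return "verified" if same_paper(claimed_title, resolved_title) else "hallucinated"
```

Passing `fetch` as a parameter keeps the lookup testable without network access; a real run would also need rate limiting and retry handling, which are omitted here.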
Citation fidelity by query category:

| System | Common | Complex | Niche | Emerging | Overall |
|---|---|---|---|---|---|
| Knitify Fast | 91% | 100% | 95% | 96% | 96% |
| Knitify Standard | 95% | 100% | 96% | 94% | 97% |
| Knitify Premium | 98% | 99% | 96% | 97% | 98% |
| Gemini 3.0 Pro | 79% | 47% | 64% | 73% | 66% |
| Gemini 3.0 Flash | 48% | 21% | 35% | 55% | 40% |
| Gemini 2.5 Pro | 50% | 25% | 46% | 52% | 42% |
| Gemini 2.5 Flash | 25% | 8% | 16% | 35% | 20% |
| GPT-5.5 | 85% | 68% | 73% | 79% | 76% |
| GPT-5.5 + Web Search | 93% | 94% | 86% | 96% | 92% |
Average references per response by query category:

| System | Common | Complex | Niche | Emerging | Overall |
|---|---|---|---|---|---|
| Knitify Fast | 6.6 | 10.1 | 8.0 | 11.0 | 8.9 |
| Knitify Standard | 10.0 | 14.6 | 10.9 | 12.0 | 11.9 |
| Knitify Premium | 16.6 | 21.9 | 15.4 | 19.1 | 18.3 |
| Gemini 3.0 Pro | 7.8 | 7.6 | 7.1 | 8.1 | 7.7 |
| Gemini 3.0 Flash | 7.4 | 7.2 | 7.1 | 7.3 | 7.3 |
| Gemini 2.5 Pro | 6.2 | 7.0 | 6.3 | 6.7 | 6.6 |
| GPT-5.5 | 12.5 | 13.4 | 16.6 | 23.1 | 16.2 |
| GPT-5.5 + Web Search | 12.4 | 17.1 | 18.4 | 19.7 | 16.8 |
Citation volume, recency, and journal breadth across all 11 systems.
| System | Total Refs | 2025-26 | 2024 | 2023 | ≤2022 | % Recent | Unique Journals |
|---|---|---|---|---|---|---|---|
| Knitify Fast | 534 | 286 | 34 | 24 | 164 | 56% | 193 |
| Knitify Standard | 714 | 302 | 70 | 52 | 264 | 43% | 240 |
| Knitify Premium | 1100 | 586 | 86 | 56 | 314 | 56% | 342 |
| Gemini 2.5 Flash | 313 | 0 | 0 | 6 | 217 | 0% | 336 |
| Gemini 2.5 Pro | 192 | 0 | 0 | 6 | 260 | 0% | 358 |
| Gemini 3.0 Flash | 208 | 0 | 0 | 7 | 282 | 0% | 343 |
| Gemini 3.0 Pro | 230 | 0 | 0 | 4 | 155 | 0% | 206 |
| 3.0 Flash + Search | 369 | 15 | 19 | 19 | 316 | 4% | 343 |
| 3.0 Pro + Search | 109 | 2 | 3 | 7 | 97 | 1% | 206 |
| GPT-5.5 | 379 | 0 | 1 | 11 | 367 | 0% | 195 |
| GPT-5.5 + Web Search | 360 | 4 | 20 | 13 | 323 | 1% | 514 |
Knitify top journals (verified via PubMed): The New England Journal of Medicine (6), Clinical Pharmacology & Therapeutics (5), Journal of Medicinal Chemistry (4), Pharmacology & Therapeutics (4).
Gemini journal claims: Gemini models collectively claim 131 citations to the New England Journal of Medicine across 30 queries. When we verified each DOI, 58% resolved to real NEJM papers — exclusively landmark trials the model memorized. The remaining 42% were fabricated DOIs with valid NEJM prefix format (10.1056/NEJMoa...) that do not correspond to any existing paper.
GPT-5.5 recency cliff: Plain GPT-5.5 cites 0 papers from 2025-26 across all 30 queries. Even with the web_search tool enabled, only 1% of citations are recent (4 of 360). GPT-5.5's January 2026 training cutoff is structural — web grounding rarely promotes new literature into the response. Knitify Standard cites 43% from 2025-26 on the same queries.
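The % Recent column in the table above is not defined explicitly. The reported values are consistent with dividing the 2025-26 count by the sum of the four year buckets (which for some rows differs from Total Refs) and truncating to a whole percent. The sketch below checks that reading against several rows; the formula is an inference from the published numbers, not a stated method.

```python
# Year buckets copied from the table: (2025-26, 2024, 2023, <=2022).
YEAR_BUCKETS = {
    "Knitify Fast": (286, 34, 24, 164),
    "Knitify Standard": (302, 70, 52, 264),
    "Knitify Premium": (586, 86, 56, 314),
    "3.0 Flash + Search": (15, 19, 19, 316),
    "3.0 Pro + Search": (2, 3, 7, 97),
    "GPT-5.5": (0, 1, 11, 367),
    "GPT-5.5 + Web Search": (4, 20, 13, 323),
}


def pct_recent(buckets: tuple) -> int:
    """Share of dated references from 2025-26, truncated to a whole percent.

    Assumption: the denominator is the sum of the year buckets, not Total Refs.
    """
    dated = sum(buckets)
    return 100 * buckets[0] // dated if dated else 0


for system, buckets in YEAR_BUCKETS.items():
    print(f"{system}: {pct_recent(buckets)}% recent")
```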
Time to first token (TTFT) by query category:

| System | Common | Complex | Niche | Emerging | Overall |
|---|---|---|---|---|---|
| Knitify Fast | 12.0s | 13.3s | 13.2s | 13.6s | 13.0s |
| Knitify Standard | 12.8s | 13.0s | 12.3s | 13.1s | 12.8s |
| Knitify Premium | 28.4s | 27.6s | 26.5s | 33.5s | 28.9s |
| Gemini 3.0 Flash* | 14.0s | 13.8s | 14.1s | 13.3s | 13.8s |
| Gemini 3.0 Pro* | 60.4s | 58.4s | 52.6s | 50.3s | 55.7s |
| GPT-5.5* | ~110s overall (no streaming) | | | | 109.5s |
| GPT-5.5 + Web Search* | ~125s overall (no streaming) | | | | 125.3s |
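The report does not say how time to first token was instrumented. One plausible way to measure it, sketched here under that caveat, is to start a clock before consuming the response stream and record the elapsed time at the first non-empty chunk. The function name and the `clock` parameter are hypothetical.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_to_first_token(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (seconds until the first non-empty chunk, total seconds).

    `chunks` should be a lazy iterator so the request itself is issued
    inside the timed region. The first value is None if nothing arrives.
    """
    start = clock()
    first: Optional[float] = None
    for chunk in chunks:
        if first is None and chunk:
            first = clock() - start
    return first, clock() - start
```

For a non-streaming call the whole response arrives as a single chunk, so the first value is effectively the full response time, which matches how the GPT-5.5 rows are labeled "(no streaming)".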
Knitify tier summary:

| | Fast | Standard | Premium |
|---|---|---|---|
| Best for | Quick mechanistic lookups | Drug interaction analysis | Comprehensive SAR/PK reviews |
| Citation Fidelity | 96% | 97% | 98% |
| References / response | 9 | 12 | 18 |
| Avg words | ~401 | ~694 | ~1,120 |
| Time to first token | 13.0s | 12.8s | 28.9s |
Direct comparisons between matched tiers — same-class models, all metrics.
| Metric | Knitify Fast | Gemini 3.0 Flash |
|---|---|---|
| Citation Fidelity | 96% | 40% |
| References / response | 8.9 verified | 2.9 verified (4.4 fabricated) |
| % from 2025-2026 | 56% | 0% |
| Speed (TTFT) | 13.0s | 13.8s |

| Metric | Knitify Premium | Gemini 3.0 Pro |
|---|---|---|
| Citation Fidelity | 98% | 66% |
| References / response | 18.3 verified | 5.1 verified (2.6 fabricated) |
| % from 2025-2026 | 56% | 0% |
| Speed (TTFT) | 28.9s | 55.7s |

| Metric | Knitify Fast | Flash + Search |
|---|---|---|
| Citation Fidelity | 96% | 50% |
| References / response | 8.9 verified | 4.5 verified (4.5 fabricated) |
| % from 2025-2026 | 56% | 4% |
| Speed (TTFT) | 13.0s | 24.5s |

| Metric | Knitify Premium | Pro + Search |
|---|---|---|
| Citation Fidelity | 98% | 89% |
| References / response | 18.3 verified | 3.6 verified (0.4 fabricated) |
| % from 2025-2026 | 56% | 1% |
| Speed (TTFT) | 28.9s | 69.6s |

| Metric | Knitify Standard | GPT-5.5 (plain) |
|---|---|---|
| Citation Fidelity | 97% | 76% |
| References / response | 11.9 verified | 12.3 verified (3.9 fabricated) |
| % from 2025-2026 | 43% | 0% |
| Speed (TTFT) | 12.8s | 109.5s |

| Metric | Knitify Premium | GPT-5.5 + Web Search |
|---|---|---|
| Citation Fidelity | 98% | 92% |
| References / response | 18.3 verified | 15.5 verified (1.3 fabricated) |
| % from 2025-2026 | 56% | 1% |
| Speed (TTFT) | 28.9s | 125.3s |
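The citation fidelity rows above can be cross-checked against the verified and fabricated counts in the same tables. The sketch below assumes citation fidelity is the share of cited references that verify, which the report implies but does not state as a formula. Because the per-response averages are already rounded, the result can differ from the reported figure by a point.

```python
# (verified, fabricated) references per response, copied from the tables above.
PER_RESPONSE = {
    "Gemini 3.0 Flash": (2.9, 4.4),
    "Gemini 3.0 Pro": (5.1, 2.6),
    "Flash + Search": (4.5, 4.5),
    "Pro + Search": (3.6, 0.4),
    "GPT-5.5 (plain)": (12.3, 3.9),
    "GPT-5.5 + Web Search": (15.5, 1.3),
}


def fidelity_pct(verified: float, fabricated: float) -> int:
    """Citation fidelity as verified / (verified + fabricated), in whole percent."""
    return round(100 * verified / (verified + fabricated))


for system, (verified, fabricated) in PER_RESPONSE.items():
    print(f"{system}: {fidelity_pct(verified, fabricated)}%")
```

Five of the six systems reproduce the reported percentage exactly; Pro + Search comes out one point above the reported 89%, which is within what rounding of the inputs would explain.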
Complex queries are where it matters most. On PK/drug interaction questions, Gemini 2.5 Flash drops to 8% citation fidelity (CF). Knitify stays at 99-100%.
2.4× as many references. Knitify Premium cites 18.3 verified papers per response versus 7.7 cited by Gemini 3.0 Pro, of which only 5.1 verify.
GPT-5.5 with web_search beats Gemini 3.0 Pro + Search (92% vs 89% CF) and lifts the verified reference count to 15.5 per response — close to Knitify Premium's 18.3. Without the web tool, plain GPT-5.5 falls to 76% CF, still ahead of every plain Gemini variant but well below Knitify.
Recency is the structural ceiling. GPT-5.5's January 2026 training cutoff means plain calls cite zero papers from 2025-26. Even web_search grounding lifts that to only 1%. Knitify Standard cites 43% recent literature on the same queries.
Latency penalty is severe. GPT-5.5 takes 110-125s to return a response (no streaming) because ~65% of output tokens are invisible reasoning. Knitify Standard delivers its first token on the same queries in 12.8s.
Failure mode for plain GPT-5.5: fabricated identifiers. The model often gets author, journal, year, and title correct but invents the matching PMID. Example from this evaluation: GPT-5.5 cited PMID 18315556 as Wong et al.'s apixaban Factor Xa paper — but that PMID actually resolves to a von Willebrand factor study (the real Wong PMID is 18315548). The DOI for the same reference was correct. This is a structural risk for any clinical workflow that relies on the PMID for traceability.
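A mismatch of this kind can be caught mechanically when a citation carries both identifiers. The sketch below assumes the PubMed esummary JSON lists a record's identifiers under `articleids` with an `idtype` of `doi`, which is the shape returned as far as I know. The function names are hypothetical, and the record used in the usage example is a made-up placeholder, not data from this evaluation.

```python
from typing import Optional


def doi_from_esummary(summary: dict, pmid: str) -> Optional[str]:
    """Extract the DOI that PubMed lists for a PMID, if any."""
    entry = summary.get("result", {}).get(str(pmid), {})
    for art_id in entry.get("articleids", []):
        if art_id.get("idtype") == "doi":
            return art_id.get("value")
    return None


def pmid_matches_doi(summary: dict, pmid: str, cited_doi: str) -> Optional[bool]:
    """True if the PMID and the cited DOI point at the same record.

    Returns None when PubMed lists no DOI, so the check cannot decide.
    """
    listed = doi_from_esummary(summary, pmid)
    if listed is None:
        return None
    return listed.strip().lower() == cited_doi.strip().lower()


# Placeholder record shaped like an esummary response (values are invented).
example = {"result": {"111": {"articleids": [
    {"idtype": "pubmed", "value": "111"},
    {"idtype": "doi", "value": "10.1000/ABC"},
]}}}
print(pmid_matches_doi(example, "111", "10.1000/abc"))
```

DOIs are compared case-insensitively. A `False` result flags exactly the situation described above: a correct DOI paired with a PMID that belongs to a different paper.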
Eleven systems evaluated: three Knitify tiers, four plain Gemini configurations, two Gemini + Google Search configurations, and two OpenAI GPT-5.5 configurations (plain and with the web_search tool). All received identical queries. Non-Knitify systems were prompted to cite sources with DOI/PubMed links. Clinical safety scored by Gemini 3.0 Pro (temp=0, batch). Citation fidelity verified by resolving each DOI via CrossRef and each PMID via PubMed esummary, then matching the resolved paper against the cited claim with an independent LLM verifier (Gemini Flash-Lite, temp=0).
GPT-5.5 calls used OpenAI's /v1/responses endpoint (Chat Completions rejects the web_search tool). GPT-5.5 was added to this report on May 15, 2026, three weeks after its April 24 release.
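For illustration, a request to that endpoint with the web_search tool might be built as shown below. This is a sketch only: the model identifier string, the helper name, and the absence of any other parameters are assumptions, and the report does not publish its actual request settings. The request is constructed but not sent.

```python
import json
import urllib.request

RESPONSES_URL = "https://api.openai.com/v1/responses"


def build_request(prompt: str, api_key: str, model: str = "gpt-5.5",
                  web_search: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Responses endpoint.

    The default model string is an assumed identifier, not taken from the report.
    """
    payload = {"model": model, "input": prompt}
    if web_search:
        # Enables the hosted web search tool for this call.
        payload["tools"] = [{"type": "web_search"}]
    return urllib.request.Request(
        RESPONSES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The plain and web-search configurations then differ only in the `web_search` flag, which keeps the two GPT-5.5 runs comparable.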
All Gemini systems received the following prompt template for each query:
> You are a medical research assistant. Answer the following research question thoroughly. Support every claim with citations to peer-reviewed sources.
>
> For each citation, include:
> - First author et al.
> - Paper title
> - Journal name
> - Year of publication
> - DOI or PubMed link if available
>
> Target approximately [TARGET_WORDS] words for the main answer (excluding references). Format your references in a numbered list at the end.
>
> Research question: [QUERY]
Gemini models were called via the Gemini API (generativelanguage.googleapis.com) with the prompt above. For the plain configurations, no search grounding or retrieval tools were enabled, so responses are generated entirely from model parameters.
Target word counts were matched to the corresponding Knitify tier to ensure comparable output length.
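Filling the template per query and tier could look like the sketch below. The line breaks in the template are a reconstruction, and the word targets reuse the average word counts reported for each Knitify tier (~401, ~694, ~1,120); the exact targets passed to the baseline models are not stated, so treat those numbers as assumptions.

```python
PROMPT_TEMPLATE = (
    "You are a medical research assistant. Answer the following research "
    "question thoroughly. Support every claim with citations to "
    "peer-reviewed sources.\n\n"
    "For each citation, include:\n"
    "- First author et al.\n"
    "- Paper title\n"
    "- Journal name\n"
    "- Year of publication\n"
    "- DOI or PubMed link if available\n\n"
    "Target approximately [TARGET_WORDS] words for the main answer "
    "(excluding references). Format your references in a numbered list "
    "at the end.\n\n"
    "Research question: [QUERY]"
)

# Assumed targets, taken from the average word counts per Knitify tier.
TARGET_WORDS = {"fast": 401, "standard": 694, "premium": 1120}


def build_prompt(query: str, tier: str) -> str:
    """Substitute the word target for the matched tier, then the query."""
    prompt = PROMPT_TEMPLATE.replace("[TARGET_WORDS]", str(TARGET_WORDS[tier]))
    return prompt.replace("[QUERY]", query)


print(build_prompt("What is the prodrug design rationale for enalapril vs lisinopril?", "fast"))
```

The query is substituted last so that any bracketed text inside a query is left untouched.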
Common queries (1-8):

| # | Query |
|---|---|
| 1 | What is apixaban's mechanism of action, its chemical structure properties (MW, LogP), and the key clinical evidence from the ARISTOTLE trial? |
| 2 | What is metformin's mechanism of action including OCT1/OCT2 transporter dependence, its physicochemical properties, and cardiovascular outcome evidence? |
| 3 | What is atorvastatin's HMG-CoA reductase binding mechanism, the role of its ortho-fluorophenyl pharmacophore, and evidence from the ASTEROID trial? |
| 4 | What is the mechanism and CYP2D6 metabolism of tamoxifen to endoxifen, and what is the evidence from the NSABP P-1 trial? |
| 5 | How do GLP-1 receptor agonists (liraglutide vs semaglutide) differ in chemical structure, half-life, and cardiovascular outcomes? |
| 6 | What are the chemical and pharmacological differences between SGLT2 inhibitors and their renal outcome evidence? |
| 7 | What is lithium's mechanism of action (GSK-3β, inositol phosphatase), its narrow therapeutic index, and suicide risk reduction evidence? |
| 8 | What are the PK/PD differences between concentration-dependent vs time-dependent antibiotics and their dosing implications? |
Complex queries (9-16):

| # | Query |
|---|---|
| 9 | How does tacrolimus interact with azole antifungals via CYP3A4, what is the AUC increase magnitude, and what FDA guidance exists? |
| 10 | What is the amiodarone-warfarin interaction mechanism including CYP2C9 inhibition and quantitative INR changes? |
| 11 | How do JAK inhibitors differ in JAK selectivity, and what is the comparative safety data for thrombosis and malignancy? |
| 12 | What are the pharmacogenomic predictors of fluoropyrimidine toxicity and how should DPYD testing guide dosing? |
| 13 | How do DOACs perform in obese patients — PK changes by weight category and clinical outcome data? |
| 14 | What is the evidence for CYP2D6 genotype-guided dosing of venlafaxine? |
| 15 | What are the mechanisms of PPI-clopidogrel interaction via CYP2C19, including cardiovascular outcome meta-analyses? |
| 16 | How does rifampicin induce CYP3A4/CYP2C9/P-gp and what is the impact on oral contraceptive hormone levels? |
Niche queries (17-23):

| # | Query |
|---|---|
| 17 | Why does curcumin have poor oral bioavailability and what formulation strategies have been tried? |
| 18 | What is the prodrug design rationale for enalapril vs lisinopril? |
| 19 | How does paclitaxel Cremophor-EL compare to nab-paclitaxel in PK and clinical outcomes? |
| 20 | What chemical properties of fentanyl enable transdermal delivery and what are the FDA dosing conversion ratios? |
| 21 | What is the SAR of fluoroquinolones and the chemical basis for QTc prolongation risk? |
| 22 | How do esomeprazole and omeprazole differ and what is the evidence for clinical superiority? |
| 23 | Why do monoclonal antibodies have 60-80% SC bioavailability and what role does FcRn play? |
Emerging queries (24-30):

| # | Query |
|---|---|
| 24 | What are the differences between BTK inhibitors (ibrutinib, acalabrutinib, zanubrutinib) and their trial evidence? |
| 25 | What is venetoclax's BCL-2 selectivity and evidence from MURANO and CLL14 trials? |
| 26 | What are the PARP inhibitor differences for BRCA-mutant vs HRD-positive patient selection? |
| 27 | What is the mechanism and evidence for dupilumab across atopic dermatitis, asthma, and CRS? |
| 28 | How do CDK4/6 inhibitors differ in selectivity and comparative evidence in HR+ breast cancer? |
| 29 | What is the current evidence for CAR-T therapy in solid tumors? |
| 30 | What is the evidence for ASO and siRNA therapeutics including nusinersen and patisiran? |