AIR·MS AI Agent: How Large Language Models Work (and Why Users Should Verify Their Responses)

The AIR·MS AI Agent uses Large Language Models (LLMs).
LLMs, such as Llama (Meta), Gemma (Google), and Phi (Microsoft), are advanced artificial intelligence systems built on a type of neural network architecture known as transformers.
Transformers enable models to process and generate text by considering the relationships between words and concepts across long passages of text, allowing them to capture context and meaning far more effectively than earlier neural network designs.
These models are trained on vast amounts of text from books, articles, websites, and other sources. Through training, they learn patterns in language and how words and ideas relate to one another.
When a user provides a prompt, the model predicts the most likely next words based on these learned patterns, producing text that is coherent, relevant, and contextually appropriate.
However, while LLMs are powerful tools for writing, research, and problem-solving, they do not “understand” information in the human sense or have access to verified facts.
Their responses are generated probabilistically rather than retrieved from a verified source, so they can occasionally be incorrect, outdated, or even fabricated, an issue known as hallucination. These errors can be convincing because the models produce fluent, confident-sounding text whether or not the content is correct.
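
To make "predicting the most likely next words" more concrete, the following is a minimal toy sketch in Python, not the AIR·MS AI Agent's actual implementation. The prompt, the candidate words, and their scores are invented for illustration; a real LLM computes scores over tens of thousands of tokens using a transformer. The sketch turns scores into probabilities and samples from them, which shows why a plausible but wrong continuation is sometimes chosen.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(candidates, scores, rng, temperature=1.0):
    """Pick one candidate at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidates and scores for the prompt "The capital of France is".
# These numbers are made up; they do not come from any real model.
candidates = ["Paris", "Lyon", "Berlin"]
scores = [2.0, 1.0, 0.5]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
counts = {c: 0 for c in candidates}
for _ in range(1000):
    counts[sample_next_token(candidates, scores, rng)] += 1

# The highest-scoring word is chosen most often, but not every time.
print(counts)
```

In this sketch the incorrect candidates still receive some probability, so they appear in a fraction of the samples. The same mechanism, at a much larger scale, is one reason a fluent answer is not necessarily a correct one.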

Further reading on hallucinations, particularly in biomedical contexts, is available here:

Because of this, users should always verify important facts, data, and sources when using LLMs—especially for academic, professional, or decision-making purposes. Cross-referencing information with trusted references or original sources helps ensure accuracy. Used responsibly, LLMs can be highly effective assistants, but human judgment remains essential for verifying and interpreting their outputs.

Situations That Require Fact-Checking

AIR·MS AI Agent users should be especially cautious in the following scenarios (this list is not exhaustive):

Lists of medical codes (diagnoses, procedures, drugs, etc.):
LLMs may generate inaccurate or nonexistent codes. Verify all lists against peer-reviewed or authoritative resources. Future improvements may incorporate retrieval-augmented generation (RAG) to reduce this issue.
Clinical decision support information (e.g., drug suggestions, interventions, differential diagnoses):
Always confirm outputs with up-to-date clinical references before applying them in practice.
Generated code (e.g., SQL, Python, or R):
Review and test all generated code to ensure it performs as intended and adheres to security and data integrity standards.
Any information that might impact clinical care:
Verify through reliable medical sources before acting on it.
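
As one illustration of the first scenario, the sketch below shows how a list of codes returned by an LLM might be screened against a reference set before use. It is a minimal example under stated assumptions: the function names `normalize` and `check_codes` are hypothetical, and `reference_codes` is a tiny stand-in that would in practice be loaded from an authoritative code set. A code that passes this check is only confirmed to exist in the reference set; whether it is the right code for the clinical concept still requires human review.

```python
def normalize(code):
    """Trim whitespace and uppercase so formatting differences do not cause misses."""
    return code.strip().upper()


def check_codes(generated_codes, reference_codes):
    """Split LLM-generated codes into those found in the reference set and those not found."""
    reference = {normalize(c) for c in reference_codes}
    verified, unverified = [], []
    for code in generated_codes:
        cleaned = normalize(code)
        if cleaned in reference:
            verified.append(cleaned)
        else:
            unverified.append(cleaned)
    return verified, unverified


# Stand-in reference set for illustration only; load the real one from an
# authoritative source rather than hard-coding it.
reference_codes = {"E11.9", "I10"}

# Hypothetical output from an LLM, including one code absent from the stand-in set.
generated_codes = [" e11.9", "I10", "Z99.999"]

verified, unverified = check_codes(generated_codes, reference_codes)
print("Found in reference set:", verified)
print("Needs manual review:", unverified)
```

Anything placed in the "needs manual review" list should be checked by hand rather than silently dropped, since a miss can mean either a fabricated code or an incomplete reference set.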

Feedback

We are committed to improving the AIR·MS AI Agent by incorporating RAG methods, keeping models current, and identifying key areas where responses may be unreliable.

If you have any comments or questions, please do not hesitate to reach out to Andrew Deonarine or Edwin Thrower.