The Incident That Sparked Concern
On July 14th, 2025, a Reddit user posted something that sent shockwaves through the healthcare AI community. What started as an innocent query about sandpaper for woodworking turned into a potential HIPAA compliance nightmare.

The user was simply asking ChatGPT for advice on cone-shaped sandpaper to reach corners in their wooden skull project. Instead of receiving woodworking tips, they got something completely unexpected: a detailed breakdown of someone else’s LabCorp drug test results.
ChatGPT’s unexpected response included:
- A complete analysis of a drug test custody and control form
- Personal medical information from an unrelated individual
- Detailed sections of what appeared to be legitimate medical documentation
When the user probed further, asking ChatGPT to reference the document again, the AI continued discussing the medical file and even provided additional details about the drug test report, including the actual document file. Most concerning, this wasn't a case of mistaken identity: when asked, ChatGPT correctly identified the original user's personal information.

What Really Happened?
The Reddit community quickly mobilized to understand this potential data breach. Several technical theories emerged:
Hash Collision Theory
The most plausible explanation involves ChatGPT’s file storage system. When users upload binary files (like documents or images), the system likely uses hash IDs to identify and retrieve them. A hash collision—where two different files generate the same or similar hash values—could cause the system to retrieve the wrong document.
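To make the failure mode concrete, here is a deliberately simplified Python sketch. It assumes a hypothetical file store that keys uploads by a truncated content hash; the store, key size, and file contents are all invented for illustration and say nothing about OpenAI's actual (non-public) implementation.

```python
import hashlib

def truncated_hash(data: bytes, bits: int = 16) -> int:
    """Key derived from a truncated SHA-256 digest; the truncation is what invites collisions."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

# Hypothetical store that assumes the truncated hash uniquely identifies a file.
store: dict[int, bytes] = {}

def put(data: bytes) -> int:
    key = truncated_hash(data)
    store.setdefault(key, data)  # silently keeps the *first* file ever seen for this key
    return key

def get(key: int) -> bytes:
    return store[key]

# One user "uploads" a document, then we keep generating unrelated blobs
# until one of them happens to share the same truncated key.
doc_key = put(b"LabCorp custody-and-control form (user B)")

i = 0
while True:
    i += 1
    other = f"woodworking photo {i} (user A)".encode()
    if truncated_hash(other) == doc_key:
        break

put(other)           # user A's upload maps to the already-occupied key
print(get(doc_key))  # retrieval by that key returns user B's document instead
```

No production system would truncate digests this aggressively, but the same logic applies whenever two different inputs can map to one storage key.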
Cross-Session Data Leakage
Another possibility is that ChatGPT’s infrastructure experienced a glitch where documents from one user’s session became accessible to another user, potentially due to issues with ID generation or bucket management across different servers.
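Purely as an illustration of this second hypothesis, the sketch below shows how IDs that are only unique within a single storage shard can resolve to another user's file if a request is routed to the wrong shard. The shards, IDs, and routing are hypothetical, not a description of ChatGPT's real infrastructure.

```python
from dataclasses import dataclass, field

@dataclass
class Shard:
    """One storage bucket that hands out its own auto-incrementing file IDs."""
    name: str
    next_id: int = 1
    files: dict = field(default_factory=dict)

    def store(self, content: str) -> int:
        file_id = self.next_id
        self.next_id += 1
        self.files[file_id] = content
        return file_id

    def fetch(self, file_id: int) -> str:
        return self.files[file_id]

shard_us = Shard("us-east")
shard_eu = Shard("eu-west")

id_a = shard_us.store("user A: sandpaper photo")
id_b = shard_eu.store("user B: drug test form")
assert id_a == id_b == 1  # IDs are only unique *within* a shard

# If user A's request is ever routed to the wrong shard, the "same" ID
# quietly resolves to another user's document.
print(shard_eu.fetch(id_a))  # -> user B: drug test form
```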
Public Document Hypothesis
Some users theorized the medical document might have been publicly available online and indexed by ChatGPT. However, reverse image searches failed to locate the document in public databases, making this explanation less likely.
The HIPAA Compliance Reality Check
This incident highlights a critical fact that many healthcare professionals and organizations overlook: ChatGPT is not HIPAA compliant and never claimed to be.
According to the HIPAA Journal, OpenAI will not currently sign Business Associate Agreements (BAAs) with HIPAA-regulated entities. This means:
- No legal protection for healthcare data processed through ChatGPT
- No compliance guarantees for handling Protected Health Information (PHI)
- Potential violations when healthcare providers use ChatGPT with patient data
Understanding the Legal Framework
PHI (Protected Health Information) encompasses all health-related information that can be linked to a specific individual, including:
- Medical records and test results
- Treatment plans and diagnoses
- Insurance information
- Any health data combined with personal identifiers
PII (Personally Identifiable Information) includes data that can identify specific individuals:
- Names, addresses, phone numbers
- Social Security numbers
- Dates of birth
- Medical record numbers
For HIPAA compliance, healthcare entities must ensure that any third-party service handling PHI signs a BAA, guaranteeing proper data protection and handling protocols.
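One nuance worth noting (and discussed further in the video below): PHI that has been properly de-identified under the HIPAA Privacy Rule is no longer PHI. The snippet below is only a minimal illustration of that idea, stripping a few obvious identifiers with ad-hoc regular expressions; genuine Safe Harbor de-identification covers 18 identifier categories and should rely on vetted tooling and expert review, not a script like this.

```python
import re

# A few obvious identifier patterns. Real Safe Harbor de-identification covers
# 18 categories (names, geography, dates, device IDs, ...) and needs vetted tooling.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with labeled placeholders before the text leaves your environment."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

note = "Pt John Doe, MRN 448812, DOB 03/14/1987, phone 555-123-4567: THC screen negative."
print(redact(note))
# The name still leaks, which is exactly why ad-hoc redaction is not a
# substitute for a BAA or a HIPAA-permitted de-identification method.
```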
Risk vs. Reward: A Balanced Perspective
Despite the risks, AI platforms like ChatGPT offer significant potential benefits for healthcare:
Potential Benefits:
- Clinical decision support through symptom analysis and differential diagnosis assistance
- Administrative efficiency via automated scheduling and patient communication
- Research capabilities for literature review and data analysis
- Educational tools for medical training and patient education
- Cost reduction through automation of routine tasks
Documented Risks:
- Data breaches as demonstrated by the Reddit incident
- Privacy violations through cross-session data exposure
- Regulatory penalties for HIPAA non-compliance
- Reputational damage from security incidents
- Legal liability from mishandled patient information
Recommendations for Different Users
For Individuals
If you're thinking about using ChatGPT for personal health insights, weigh this risk-reward balance:
Lower-risk uses:
- General health education and information
- Wellness and fitness guidance
- De-identified health data analysis
- Research into medical conditions
Higher-risk uses to avoid:
- Uploading actual medical records
- Sharing specific test results with identifiers
- Discussing sensitive diagnoses with personal details
- Analyzing insurance or billing information
Personal approach: Use your judgment about what health information you’re comfortable potentially exposing. The convenience and insights may outweigh risks for general health queries, but avoid sharing anything you’d consider catastrophic if leaked.
For Healthcare Organizations
The recommendation is clear: never use public-facing, cloud-hosted LLM platforms like ChatGPT for processing patient PHI.
Compliant Alternatives for Healthcare Organizations
Option 1: Business Associate Agreements
While a BAA is theoretically possible, most major AI providers (including OpenAI) currently refuse to sign them with healthcare entities, and even where available, this approach is likely to be expensive and time-consuming.
Option 2: Self-Hosted LLM Solutions
Major cloud providers offer AI services that can be configured for HIPAA compliance (a minimal integration sketch follows the lists below):
- AWS provides various machine learning services with BAA coverage
- Microsoft Azure offers healthcare-specific AI tools
- Google Cloud has developed Med-PaLM 2 with HIPAA compliance support
Benefits of self-hosted solutions:
- Complete control over data processing
- No cross-customer data exposure risk
- Customizable security configurations
- HIPAA compliance capabilities
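As a rough sketch of what "self-hosted" can look like in practice, the snippet below queries a model served inside your own network over an OpenAI-compatible chat endpoint, the interface exposed by inference servers such as vLLM or Ollama. The endpoint URL and model name are placeholders, not a specific vendor's configuration.

```python
import json
import urllib.request

# Placeholders: point these at your own deployment (e.g. a vLLM or Ollama
# server exposing an OpenAI-compatible API inside your network).
ENDPOINT = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "llama-3.1-8b-instruct",
    "messages": [
        {"role": "system", "content": "You are a clinical documentation assistant."},
        {"role": "user", "content": "Summarize this de-identified progress note: ..."},
    ],
    "temperature": 0.2,
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(PAYLOAD).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# The request never leaves your infrastructure, so no third party sees the text.
with urllib.request.urlopen(request) as response:
    reply = json.loads(response.read())

print(reply["choices"][0]["message"]["content"])
```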
Option 3: Specialized AI Consultancy
Working with specialized healthcare AI consultants offers several advantages:
Cost efficiency: Often 30-50% cheaper than large cloud providers due to optimized GPU hosting and reduced overhead.
Tailored solutions: Custom AI agents designed specifically for your healthcare workflows, not generic chatbots.
Compliance expertise: Specialized knowledge of HIPAA requirements and healthcare-specific security needs.
Advanced security: Purpose-built architectures with guardrails to prevent data leakage, hash collisions, and other technical vulnerabilities.
Ongoing support: Dedicated teams familiar with healthcare regulations and AI implementation challenges.
The Technical Prevention Strategy
Even with self-hosted solutions, the hash collision problem that likely caused the ChatGPT incident could theoretically occur. Proper implementation requires the following safeguards; a minimal sketch follows the list:
- Robust hashing algorithms with collision-resistant properties
- Multi-layer identification systems beyond simple hash matching
- Session isolation to prevent cross-user data access
- Regular security audits of file storage and retrieval systems
- Comprehensive logging for audit trails and incident investigation
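As a minimal sketch of the first three points, under assumed (not vendor-specific) storage semantics: files are keyed by the full SHA-256 digest together with the owning user and session, so a content-hash collision alone can never cross a session boundary, and every access is appended to an audit log.

```python
import hashlib
import uuid

class IsolatedFileStore:
    """Toy store whose keys combine user, session, and the full SHA-256 digest."""

    def __init__(self) -> None:
        self.blobs: dict[tuple[str, str, str], bytes] = {}
        self.audit_log: list[tuple[str, str, str, str]] = []

    def put(self, user_id: str, session_id: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()  # full digest, never truncated
        self.blobs[(user_id, session_id, digest)] = data
        self.audit_log.append(("put", user_id, session_id, digest))
        return digest

    def get(self, user_id: str, session_id: str, digest: str) -> bytes:
        # Lookups are scoped to the requesting user and session, so another
        # user's file with an identical digest is simply not addressable here.
        self.audit_log.append(("get", user_id, session_id, digest))
        return self.blobs[(user_id, session_id, digest)]

store = IsolatedFileStore()
session_a, session_b = str(uuid.uuid4()), str(uuid.uuid4())

mine = store.put("user-a", session_a, b"woodworking photo")
theirs = store.put("user-b", session_b, b"LabCorp form")

store.get("user-a", session_a, mine)  # OK: own upload, own session
try:
    store.get("user-a", session_a, theirs)
except KeyError:
    print("cross-session lookup refused")  # user A cannot reach user B's file
```

Whatever the storage backend, the design principle is the same: scope lookups to the requesting identity rather than trusting a content-derived ID on its own.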
Looking Forward: The Future of Healthcare AI
This incident shouldn’t derail healthcare AI adoption—it should inform better practices. The benefits of AI in healthcare are too significant to ignore:
- Improved diagnostic accuracy through pattern recognition
- Reduced administrative burden allowing more focus on patient care
- Personalized treatment plans based on comprehensive data analysis
- Population health insights for better public health outcomes
The key is implementing these technologies responsibly, with proper safeguards and compliance measures in place.
Key Takeaways
- Public AI platforms aren’t HIPAA compliant and shouldn’t be used with patient data
- Technical vulnerabilities exist in current AI systems that could cause data exposure
- Self-hosted solutions provide better security and compliance options
- Specialized expertise is crucial for healthcare AI implementation
- Risk-reward analysis should guide individual and organizational decisions
The ChatGPT medical data incident serves as a crucial reminder that while AI offers tremendous potential for healthcare, it must be implemented with appropriate security measures and regulatory compliance in mind.
For healthcare organizations looking to implement AI solutions safely and compliantly, DeepX specializes in building custom, HIPAA-compliant AI agents and data processing pipelines. Our team combines deep healthcare industry knowledge with cutting-edge AI technology to deliver secure, cost-effective solutions tailored to your specific needs.
Video discussion
For those of our followers who prefer a video format, here is a video recording of DeepX co-founder and Chief Product Officer, Taras Filatov, reviewing the original Reddit thread and discussing the issue:
Full transcript
Provided below is the full transcript of the video discussion.
Okay, it's the 16th of July, 2025, and today we are going to look at whether ChatGPT or other public LLMs are safe for sharing your medical data with, or for doing anything that might trigger issues with HIPAA compliance. We are going to look at one Reddit thread, which appeared two days ago when I spotted it, titled "ChatGPT gave me someone else's medical data from unrelated search".
I thought this would be a great example and a great way to explain this topic to my audience: to look into the potential issue together and decide whether it makes sense to use ChatGPT or other publicly available LLMs for your own medical data, or, in the business scenario, for companies to use them to process their customers' medical data.
So basically, what this user claims is that they were just asking what kind of sandpaper to use. Nothing to do with medical requests at all.
And what happened is that ChatGPT shared an overview of someone else's drug test, someone from across the country from the author of the original post.
A discussion followed, and the user had literally never asked for any medical information or for any files.
The conversation started with something like: I want to sand my wooden skull, I've been doing it with sandpaper, I can't get into the corners, so I need the cone-shaped kind, which I don't have. So the user asks ChatGPT for specific advice.
What's interesting is that ChatGPT responded with: "Thanks for sharing that. This scanned document is the LabCorp non-DOT drug test custody and control form. Here's a breakdown of the important sections in case you're reviewing it for clarity or for documentation."
Then it listed a bunch of information from that form which the topic starter had not originally posted on Reddit.
There was further discussion; some users believed it, some didn't, and some said it should be reported.
But let me open another tab where I have this thread open, because here the topic starter explained and provided more details about what happened. Basically, the topic starter then wanted to investigate this, and queried ChatGPT about what had happened and whether it would repeat its mistake.
And ChatGPT continued to talk about this medical file. The author then asked: can you highlight on the document the areas I should check out? And it talked about some sort of drug test report, which is probably what this document is. So again, this is potentially the drug test report of some other person.
I'm saying potentially because that is just one of the options here. Another option is that the LLM hallucinated and synthesized this file or document. If it were just text, that would be more likely; for a proper binary file it's less likely, but we should still look into the different possibilities.
Then, to investigate and dig deeper into this issue, the author asked ChatGPT: can you tell me what you know about me? The topic starter was under the impression that ChatGPT had somehow glitched and switched the sessions or identities between this user and some other person.
But ChatGPT actually responded with "Absolutely, Adam, here's what I know about you" and provided the topic starter's real personal information. So it was not a case of switched or swapped identities.
At the same time, going forward in that discussion, ChatGPT wrote that it doesn't have the ability to retrieve or resend that LabCorp custody and control form.
Your best bet, if you deleted it, is to contact your collector or HR department; or, if you still have the LabCorp accession or control number, they might send you a copy directly. Alternatively, you can check with your medical review officer (MRO), as they often keep records of the forms they review.
I think the topic starter then attempted some further probing of ChatGPT, trying to work around that limitation, and asked it to re-reference the same document that ChatGPT had referenced before and to add a star on top of it.
So I believe this is the document in question, the LabCorp document. The only thing is that the donor name and donor signature have been edited out by the topic starter. We can see that it looks like a LabCorp America drug test, or "BRUG" test, in a laboratory. I don't know if that is a correct abbreviation or not. Is it drug testing, is it BRUG? I'm not very familiar with this specific form, but otherwise the document looks real.
So again, if it were just text, I would say there is a high probability that ChatGPT hallucinated the document. Where we have a real graphical binary file, a real document, it is in my opinion less probable that ChatGPT generated it, but there is still some probability, depending on ChatGPT's current image generation capabilities.
Anyway, let's go back to the discussion. Here the topic starter kept digging and extracted information such as the collector name, designated employer representative, employer client, testing lab, and so on.
Back to the discussion. People in the thread are obviously concerned about this matter, but they're also skeptical; they're trying to understand what's going on.
One of the suggestions raised was to try a Google reverse image search, because one of the participants' hypotheses was that ChatGPT might have indexed a publicly shared file, maybe some document that is not confidential, that was made public and is available in Google Search. So: try the Google search.
I believe nobody was able to find that document through a public Google search; I tried myself as well. But it was a good hypothesis anyway. What would the situation be if that were the case? If it were, there would be a higher probability that there was no real issue: the document would either not be a real document, or it would be a real document that had been made public, presumably with the consent of the customer or patient in question. That would be less of an issue and potentially not a HIPAA compliance violation if it came from public sources. We'll talk a little more about that later.
Then another user provides their own hypothesis of what happened. The user says they had a somewhat similar issue and thinks there is a problem with the document uploading feature at the moment.
This user says: I uploaded something for it to analyze last night, an old digital newspaper clipping, and the analysis it gave was for something completely different. So the suggestion here is that maybe some other person uploaded their file to ChatGPT and, for some reason, ChatGPT thought it was yours.
Another person confirms that the same thing happened to them. Another user adds technical details: resources are identified by ID, so if you mix up IDs, you return something from a different person. That's just an assumption or hypothesis about what might have happened on the backend, on the infrastructure side of ChatGPT. For this issue to occur, one option could be that, because the system is storing a lot of binary files, there are IDs or hashes pointing to those files, and maybe something is wrong with that ID generation, addressing, or hashing system, which makes certain files swap their hashes. Here I'm continuing this train of thought.
Maybe there are different buckets hosted on different servers or instances, and IDs could be non-unique across those buckets. Something could then happen that results in the ChatGPT system trying to analyze a certain file but actually taking another file, purely because of this mix-up with IDs, and thinking that this is what the user uploaded.
While in reality it is a binary file from a completely different user, from a completely different ChatGPT session. So a hashing collision on the resources, as this user rightly, or potentially rightly, hypothesizes. That's the thought.
Another user makes a similar technical-level guess: there is an internal hashing algorithm for binary content, used to avoid reprocessing binary content they've already seen. While processing chunks of the binary, your binary's hash somehow matched, or was close enough to, the hash of this other binary, and based on that similarity of the hashes it pulled in the existing content as context instead of reprocessing your file.
Another user says these folks probably need to be notified ASAP, referencing a page about connectivity solutions, tests, orders and reports, because it advertises "our HIPAA compliant, secure, cloud-based provider portal" and so on. So users start to suggest reporting this issue.
Again, back to the discussion of the Google image search. Another user adds that they think this is a very important question to answer: whether this image is searchable and retrievable online. Because if the document is not freely available on the internet, then it is likely that someone uploaded it to the same ChatGPT database.
And that the program is using not only publicly available information, but also information from its users. Another user suggests also reporting it to OpenAI.
Another user, however, says it's definitely a hallucination, as if to say: aren't you used to ChatGPT making up detailed, plausible-sounding, fake information, and so on.
This is where the original topic starter added more information, including the file itself.
Another comment to the topic starter: have you internet-searched the drug-tested person? It is a pretty big deal if that person truly exists and these are their results. If you find someone that matches the info on the report, name, age, city, perhaps direct-message them to inquire whether they were at that medical center on that date, and if so, say you have information from that place that they might want to know. You and they could file together against the platform or the entity that was responsible for uploading it.
It is worrisome if it is not a hallucination.
Big if real.
Another user says you need to reach out to LabCorp and report it as a potential HIPAA violation. They are required by law to follow protocol and notify the person whose info it is. From there, obviously, if that person meant to have their drug test online, they won't be concerned. But if that's not the case, the investigation can be handled through the right channels.
And again, as we discussed earlier, another user comments: if the file is searchable on a search engine, which is how ChatGPT would have accessed it, then no, it's not a HIPAA violation.
It was probably uploaded to a medical research site or as a case study, where the patient's background info is usually included to provide a full profile for the patient, and that would have been done with the patient's knowledge and permission.
And again, a correct clarification: legally, it is for the patient whose information it is to decide, and LabCorp still has a legal obligation to confirm that.
So let's summarize our discussion and our findings a bit.
(1) Did ChatGPT break HIPAA compliance?
And (2) what can one do to avoid these kinds of flops when using AI in a personal or business environment?
Well, first of all, ChatGPT never claims to be HIPAA compliant.
Let's look into this information from the HIPAA Journal.
HIPAA, of course, is the Health Insurance Portability and Accountability Act, a US federal law that is recognized pretty much globally as the main standard for IT systems that deal with medical data and healthcare. Back to the publication in the HIPAA Journal: it says that OpenAI will not currently sign a business associate agreement with HIPAA-regulated entities.
What is a business associate agreement? A BAA, or business associate agreement, is one of the important terms here. PHI means protected health information: everything related to health, medical records and the like, that would be shared with the IT system or AI agent. PII, personally identifiable information, is something like a phone number, date of birth, full name, or address that allows the health information to be connected to a specific person.
One of the important aspects of the HIPAA Act and of HIPAA compliance is safeguarding: ensuring that providers maintain the confidentiality of protected health information, PHI.
To maintain this compliance chain, providers, software providers, infrastructure providers, IT services, sign a BAA with the entities that share their customers' or patients' potentially protected health information, PHI, with those providers.
This just reinforces the point that OpenAI will not sign this BAA, and they are not claiming to be HIPAA compliant; they are actually saying they are not. Hence the HIPAA Journal's claim that the tool, meaning OpenAI's ChatGPT, cannot be used in connection with any ePHI (electronic protected health information).
That said, there is one option: potentially, ChatGPT can be used in connection with de-identified protected health information. When you remove the PII, the personally identifiable information, from your PHI, you can potentially use ChatGPT to get some advice or do some analysis, as long as that PHI has been de-identified using a method permitted by the HIPAA Privacy Rule.
That means the data is no longer PHI and is not subject to the HIPAA compliance rules. That's something important to know.
Right then, back to our discussion. Some people say, and it used to be discussed as an option, that providers of platforms such as ChatGPT could sign a BAA. The HIPAA Journal claims that is actually not possible and that OpenAI will not sign it.
Even if it were possible, these large providers would be unlikely to sign one with every customer who wants it. It would probably be difficult and probably expensive.
Another point: whoever might have leaked this document into public access could have violated HIPAA. That would not be ChatGPT or OpenAI, but whoever leaked the document, for example any handling provider, this LabCorp company; I don't know who might have leaked it.
If this was done without the consent of the specific patient or customer, then it is possibly a violation of HIPAA. But we should also mention that we are not sure whether this document is real.
It is likely real, but if it was generated and does not include the PHI of any real individual, then there would be no HIPAA compliance violation in this specific case.
So what conclusions can we draw for businesses and individuals from this quick zoom-in on the topic?
Well, technically we cannot confirm with full confidence that ChatGPT exposed someone's medical data to another user.
There are reports, as we have seen, of users receiving data from other users' chat sessions, and there is this report where a medical file unrelated to the current user was exposed. We have also discussed technical mechanisms, such as hash or ID collisions for stored binary files, which might in theory lead to such unintended exposures.
For individuals, this means they should balance risk versus reward. The risk is that your personal medical data, your PHI, might leak; the reward is the convenience, the low cost, and the additional insights you obtain, which might result in better health outcomes or even be life-saving for some people.
Personally, I use public LLM platforms such as ChatGPT and Claude from time to time to get advice or a second opinion on my health, to analyze my quantified-self data, and so on. I'm simply conscious of the fact that I'm dealing with software that someone hosts somewhere on their servers, and that theoretically my data might leak. I think the probability of this happening with OpenAI or Anthropic, for example, is low, and the reward for me is higher, so I continue doing it. But I would not put in any of the super sensitive data that would be unacceptable for me to ever have leaked. As individuals, I recommend you do the same: balance the risk and reward here.
For organizations, my recommendation is simple: never use public-facing, cloud-hosted LLM platforms such as ChatGPT or Claude to handle anything that has to do with PHI. Always use self-hosted LLMs and AI agents to process your clients' data. As for practical ways to do this, you have a few options.
(1) One option is to get a BAA, a Business Associate Agreement, from the LLM provider. Theoretically it is possible, but in practice, as we saw in this publication, they may simply not be doing that at the moment, or it may be a lengthy and expensive process. So most likely I personally wouldn't go that way.
(2) The second option is to use a self-hosted LLM. Platforms such as AWS, Microsoft Azure, Google Cloud, and so on already have prebuilt LLM solutions you can self-host. That means other users and customers who have nothing to do with your business will not have access to your LLM, so there is no chance that somebody else's data will be exposed to you, or that your customers' data will be exposed outside of your customer base, even if some technical glitch occurs.
(3) The third option is to hire a specialized AI and LLM consultancy firm, such as our firm DeepX, that will advise on the best solution and implement it for you. There are multiple advantages here. It will likely be cheaper infrastructure-wise: instead of paying large providers' overheads, a specialized consultancy will find you cost-efficient GPU hosting and install the self-hosted LLM there, so the overall infrastructure cost will be something like 30 to 50 percent lower compared to large providers and infrastructure-as-a-service platforms. Most importantly, however, a specialized consultancy will not just deploy an LLM for you, but will also build a data processing and AI agent pipeline tailored to your needs, so it will actually solve your tasks, create more value, and be more secure. For example, issues such as possible hash collisions or other security and compliance problems could still occur in self-hosted LLMs; it just means the information will not be exposed outside of your employees and your customer base. However, it is still a bad thing, and possibly a HIPAA compliance violation, if one of your customers gets a response related to another customer's document. Specialized AI and LLM consultants will know how to set up the system architecture and install the necessary guardrails to prevent those kinds of things from happening.
I think those are probably the most important things to share on this topic. I wanted to share them based on this Reddit discussion; I think it's a useful update and a useful summary to provide.
Thank you for listening, and good luck applying AI agents in the healthcare space!