Popular AI Chatbots Struggle With Estate Planning, but Pass CFA Exams
In separate studies, artificial intelligence chatbots gave misleading answers about estate planning but passed a mock exam for chartered financial analysts.
It is common advice not to depend solely on artificial intelligence for complex tasks, and a new study by EncorEstate Plans Inc. demonstrated generative AI's limitations with financial know-how. Last month, the support team for Encore, which makes estate planning software for advisers, used four free versions of popular AI chatbots (OpenAI's ChatGPT, Anthropic's Claude, Perplexity and Google's AI Mode) to answer customers' most frequent questions about estate planning. All four chatbots gave incorrect or incomplete information.
Encore's team graded the answers to the 46 submitted questions on a scale from A, for correct and complete answers, to F, for incorrect or missing answers. Google AI Mode was the worst performer, answering only 19 of the 46 questions and earning a D or F on 61% of its responses. ChatGPT was the second worst, failing to answer 17 questions and earning a D or F on 48% of its responses.
Claude had the highest percentage of A’s (39%) and B’s (30%) and the lowest percentage of F’s (4%). Perplexity performed a few percentage points worse than Claude, with 37% A’s, 26% B’s and 7% F’s.
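For readers curious about the scoring arithmetic, a minimal sketch of how a D-or-F rate could be tallied is below. The grade list is hypothetical, since Encore did not publish per-question grades; only the method (letter grades over answered questions) comes from the study.

```python
from collections import Counter

# Hypothetical grade sheet for one chatbot; None marks an unanswered question.
# These letters are placeholders, not the study's actual data.
grades = ["A", "B", "F", "D", "C", "A", "D", None, "B", "F"]

answered = [g for g in grades if g is not None]
counts = Counter(answered)

# Share of answered questions graded D or F, mirroring the study's metric
d_or_f = (counts["D"] + counts["F"]) / len(answered)
print(f"Answered {len(answered)} of {len(grades)} questions; "
      f"{d_or_f:.0%} graded D or F")
```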
The chatbots’ highest-scoring responses (three A’s and a B) tended to be on simpler and more generic questions, such as “What makes a good trustee on my trust?”; “Should my child be my financial power of attorney?”; and “Should I do co-executors on my will?”
Low-performing responses included questions that required specifics, such as “What are the hardest states to prepare deeds into my trust?” (two C’s, two F’s); “Does my trust legally need to be notarized?” (one A, one D, two F’s); and “What documents need to be changed if I have another child?” (one C, one D, two F’s).
The study also found that Google AI Mode and Perplexity cited sources and answered the questions in the order they were given. ChatGPT and Claude did not cite sources, but organized questions and answers into themes.
In a statement, Matt Morris, EncorEstate Plans' CEO, warned clients and advisers against forgoing expert assistance and relying solely on generative AI to answer their questions.
“No matter how well an AI chatbot answers technical questions, what AI can’t account for is the incredibly important nuance of clients’ lives,” Morris said in the statement. “Having an estate planner or attorney … review an estate plan is well worth the peace of mind for your clients and ensures that all of the outputs are correct and accurately reflect all of their wishes for their loved ones.”
According to the Encore team, further studies could include testing whether the chatbots could be prompted to answer all of the questions and comparing the chatbots' free versions against their paid subscription tiers.
AI Models Without Special Training Pass Financial Exams
Meanwhile, Goodfin, an AI wealth platform for private market investments, announced Monday that general-purpose AI models were able to pass a mock exam for the highest level of financial certification.
Working in collaboration with Srikanth Jagabathula of New York University's Stern School of Business, Goodfin tested 23 leading AI models, including OpenAI's GPT-4, Google's Gemini 2.5 Pro and Anthropic's Claude Opus 4, using the CFA Institute's CFA Level III exam. Unlike the Level I and II exams, which are multiple choice, the Level III exam includes essay questions.
According to the associated research paper, the highest-scoring AI models with no domain-specific training included Claude Opus 4, Gemini 2.5 Pro and an OpenAI model. The high scorers were reasoning models, meaning they used "chain-of-thought" prompting to work through problems step by step before giving a final answer.
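In practice, chain-of-thought prompting simply asks a model to show its intermediate steps before answering. A minimal sketch using the OpenAI Python client is below; the model name, prompt wording and sample question are illustrative assumptions, not the study's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative CFA-style question; not drawn from the actual Level III exam.
question = (
    "A portfolio returns 8% with a standard deviation of 12%; "
    "the risk-free rate is 3%. Compute the Sharpe ratio."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # The system prompt elicits step-by-step reasoning (chain of thought)
        {"role": "system",
         "content": "Reason step by step, then state the final answer on its own line."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
# Expected final line: Sharpe ratio = (0.08 - 0.03) / 0.12 ≈ 0.42
```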
“This is a significant step forward in AI’s ability to reason in highly specialized, high-stakes domains,” Jagabathula said in a statement. “Passing the CFA Level III exam without specific training shows that these models are beginning to demonstrate deep generalizable intelligence. With the right design and oversight, this opens the door to a future where sophisticated financial expertise can be delivered far more broadly and affordably.”
However, researchers found that the most accurate reasoning models also carried higher computing costs, up to 11.1 times those of other models.
According to Goodfin’s announcement, certified CFA graders evaluated and scored the AI responses.