Over just a few months, ChatGPT went from correctly answering a simple math problem 98% of the time to just 2%, study finds

leninmummy@lemmy.ml · 1 year ago

TheSaneWriter@lemmy.thesanewriter.com · 1 year ago

I’m not too surprised, they’re probably downgrading the publicly available version of ChatGPT because of how expensive it is to run. Math was never its strong suit, but it could do it with enough resources. Without those resources, it’s essentially guessing random numbers.

PupBiru@kbin.social · 1 year ago

from what i understand, the big change in chat-gpt4 was that the model could “ask for help” from other tools: for maths, it knew it was a maths problem, transformed it to something a specialised calculation app could do, and then passed it off to that other code to do the actual calculation

same thing for a lot of its new features; it was asking specialised software to do the bits it wasn’t good at
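
A minimal sketch of the hand-off described above, under loud assumptions: this is not how ChatGPT is implemented, and `calculator_tool` and `answer` are hypothetical names made up for illustration. It only shows the general shape of the idea: recognise that the input is arithmetic, pass it to code that actually computes, and fall back to free text otherwise.

```python
import ast
import operator

# Operators the toy "calculator tool" is allowed to apply.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator_tool(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def evaluate(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("not a plain arithmetic expression")

    return evaluate(ast.parse(expression, mode="eval").body)


def answer(prompt: str) -> str:
    """Hypothetical router: use the calculator when it applies, else the 'model'."""
    try:
        return str(calculator_tool(prompt))
    except (ValueError, SyntaxError, ZeroDivisionError):
        # Placeholder for whatever the language model would generate on its own.
        return "(free-text answer from the language model would go here)"


print(answer("17 * 23"))
print(answer("why is the sky blue?"))
```

The point of the design is that the exact arithmetic never depends on the text generator guessing digits; it only has to recognise when to delegate.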

whyrat@lemmy.ml · 1 year ago

Chat GPT will just become a front end for Wolfram Alpha?

DrMux@kbin.social · 1 year ago

My guess is that it’s more a result of overfitting for alignment. Fine-tuning for “safety” (rather, more corporate-friendly outputs).

That is, by focusing on that specific outcome in training the model, they’ve compromised its ability to give well-“reasoned” “intelligent” sounding answers. A tradeoff between aspects of the model.

It’s something that can happen even in simple statistical models. Say you have a scatter plot of data that loosely follows some trend, and you come up with two equations to describe that trend. One is a simple equation that loosely follows it but makes a good general approximation, and the other is a more complicated equation that very tightly fits the existing data. Then you use those two models to predict future data. But you find that the complicated equation is making predictions way off the mark that no longer fit the trend, and the simple one still has a wide error (how far its prediction is from the actual data) but still more or less accurately fits the general trend. In the more complicated equation, you’ve traded predictive power for explanatory power. It describes the data you originally had but it’s not useful for forecasting data that follows.

That’s an example of overfitting. It can happen in super-advanced statistical models like GPT, too. Training the “equation” (or as it’s been called, spicy autocorrect) to predict outcomes that favor “safety” but losing the model’s power to predict accurate “well-reasoned” outcomes.

If that makes any sense.

I’m not a ML researcher or statistician (I just went through a phase in college), so if this is inaccurate I’m open to corrections.
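
The scatter-plot comparison above can be sketched in a few lines. This is only an illustration of overfitting in a simple statistical model, not a claim about what happened inside GPT. The data, the noise values and the function names are all made up: the "simple equation" is a least-squares line, and the "complicated equation" is the polynomial that passes exactly through every training point.

```python
def fit_line(xs, ys):
    """Simple model: ordinary least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_exact_polynomial(xs, ys):
    """Complicated model: Lagrange polynomial through every training point."""

    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total

    return predict


def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def trend(x):
    """Made-up underlying trend the data loosely follows."""
    return 2 * x + 1


# Fixed, made-up noise so the example is reproducible.
noise = [0.5, -0.8, 0.3, 0.9, -0.6, 0.4, -0.7, 0.2]
train_x = list(range(8))
train_y = [trend(x) + e for x, e in zip(train_x, noise)]
future_x = [8, 9, 10]
future_y = [trend(x) for x in future_x]

simple = fit_line(train_x, train_y)
complicated = fit_exact_polynomial(train_x, train_y)

for name, model in [("simple", simple), ("complicated", complicated)]:
    print(
        name,
        "train error:", round(mean_abs_error(model, train_x, train_y), 3),
        "future error:", round(mean_abs_error(model, future_x, future_y), 3),
    )
```

The complicated model has zero error on the data it was fitted to and a far larger error than the line on the points that follow, which is the trade described above.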

DR_Hero@programming.dev · 1 year ago

I’ve definitely experienced this.

I used ChatGPT to write cover letters based on my resume before, and other tasks.

I used to give it data and tell chatGPT to “do X with this data”. It worked great.
In a separate chat, I told it to “do Y with this data”, and it also knocked it out of the park.

Weeks later, excited about the tech, I repeat the process. I tell it to “do X with this data”. It does fine.

In a completely separate chat, I tell it to “do Y with this data”… and instead it gives me X. I tell it to “do Z with this data”, and it once again would really rather just do X with it.

For a while now, I have had to feed it more context and tailored prompts than I previously had to.

MrMamiya@feddit.de · edit-2 1 year ago

It’s gonna be so fucking rich that the staggering mass of stupidity online prevents us from improving an AI beyond our intelligence level.

Thank the shitposter in your life.

erwan@lemmy.ml · 1 year ago

You can’t really blame the amount of stupidity online.

The problem is that ChatGPT (and other LLMs) produces content of the average quality of its input data. AI is not limited to LLMs.

For chess we were able to build AI that vastly outperforms even the best human grandmasters. Imagine if we were to release a chess AI that is just as good as the average human…

Atomic@sh.itjust.works · edit-2 1 year ago

We call them chess AI, but they’re not actually real A.I. Chess bots work off of opening books (predetermined best practices) and then analyze each position and its potential offshoots with an evaluation function.

They will then start to brute-force positions until they find a path that is beneficial.

While it may sound very much alike, it works very differently than an A.I. However, it turned out that A.I. software became better than humans at writing these functions.

So in a sense, chess computers are not A.I.; they’re created by A.I. At least Stockfish 12 has these “A.I.-inspired” evaluations. (Currently they’re on Stockfish 15, I believe.)

And yes, we also did make “chess AI” that is as bad as the average player. We even made some that are worse, because we figured it would be nice if people could play a chess computer that is on the same skill level as the player, rather than just being destroyed every time.
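
A toy sketch of the book-plus-search-plus-evaluation design described above. To stay short it uses a stone-taking game instead of chess (take 1 to 3 stones from a pile, whoever takes the last stone wins), and it is not how Stockfish or any real engine is written. `OPENING_BOOK`, `evaluate`, `negamax` and `best_move` are hypothetical names for illustration only.

```python
# Hypothetical "opening book": a precomputed reply for a known position.
OPENING_BOOK = {21: 1}  # pile size -> stones to take


def legal_moves(pile):
    return [n for n in (1, 2, 3) if n <= pile]


def evaluate(pile):
    """Heuristic guess used when the search runs out of depth (0 = unknown)."""
    return 0


def negamax(pile, depth):
    """Score the position for the side to move: 1 = winning, -1 = losing."""
    if pile == 0:
        return -1  # the opponent took the last stone, so the side to move lost
    if depth == 0:
        return evaluate(pile)
    # Brute force: try every move and assume the opponent answers with their best.
    return max(-negamax(pile - move, depth - 1) for move in legal_moves(pile))


def best_move(pile, depth=6):
    if pile in OPENING_BOOK:
        return OPENING_BOOK[pile]  # known position: no search needed
    return max(legal_moves(pile), key=lambda move: -negamax(pile - move, depth - 1))


print(best_move(5), best_move(6), best_move(7))
```

In a real engine the evaluation function is where most of the chess knowledge lives, which is the part the comment says is now produced by machine-learned models rather than written by hand.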

erwan@lemmy.ml · 1 year ago

The definition of “AI” is fuzzy and keeps changing. Basically, when an AI use case becomes solved and widespread, it stops being seen as AI.

Face recognition, OCR, speech recognition, all those used to be considered AI but now they’re just an app on your phone.

I’m sure in a few years we’ll stop thinking about text generation as AI, but just one more tool we can leverage.

There is no clear definition of “real AI”.

Dr Cog@mander.xyz · edit-2 1 year ago

Those are all still AI. Scientists still have a functional definition that includes these plus more scripted AI like in video games.

Essentially, any algorithm that learns and acts on information that has not been explicitly programmed is considered AI.

Nonameuser678@kbin.social · 1 year ago

Shitposting saves jobs

rammer@sopuli.xyz · 1 year ago

Shitposters on the Internet are the new clogs in the machine

dugite-code@mastodon.social · 1 year ago

This is my experience in general. ChatGPT went from amazingly good to overall terrible. I was asking it for snippets of JavaScript and explanations of technical terms, and it was shockingly good. Now I’m lucky if even half of what it outputs is even remotely based on reality.

Pepperette@lemmy.ml · 1 year ago

They probably laid off the guy behind the curtain.

Send_me_nude_girls@feddit.de · 1 year ago

Must be because of all the censoring. The more they try to prevent DAN jailbreaking and controversial replies, the worse it gets.

Fisk400@lemmy.world · 1 year ago

Stop making a language model do math? We already have calculators.

ThreeHalflings@sh.itjust.works · 1 year ago

Do you think maybe it’s a simple and interesting way of discussing changes in the inner workings of the model, and that maybe people know that we already have calculators?

Fisk400@lemmy.world · 1 year ago

I think it’s a lazy way of doing it. OpenAI has clearly stated that math isn’t something that they are even trying to make it good at. It’s like testing how fast Usain Bolt is by having him bake a cake.

If ChatGPT is getting worse at math, it might just be a side effect of them making it better at reading comprehension or something else they want it to be good at. There is no way to know that.

Measure something it is supposed to be good at.

ThreeHalflings@sh.itjust.works · edit-2 1 year ago

All the things it’s supposed to be good at are completely subjectively judged.

That’s why, unless you have a panel of experts in your back pocket, you need something with a yes or no answer to have an interesting discussion.

If people were discussing ChatGPT’s code writing ability, you’d complain that it wasn’t designed to do that either. The problem is that it was designed to transform inputs to relatively believable outputs, representative of its training set. Great. That’s not super useful. Its actual utility comes from its emergent behaviours.

Lemme know when you make a post detailing the opinions of some university “Transform inputs to outputs” professors. Until then, we’ll continue to discuss its behaviour in observable, verifiable and useful areas.

Fisk400@lemmy.world · 1 year ago

We have people that assign numerical values to people’s ability to read and write every day. They are English teachers. They test all kinds of stuff like vocabulary, reading comprehension and grammar, and in the end they assign grades to those skills. I don’t even need tiny professors in my pocket, they are just out there being teachers to children of all ages.

One of the tasks I gave ChatGPT was to name and describe 10 dwarven characters. Their names have to be adjectives like Grumpy, but the description cannot be based on him being grumpy. He has to be something other than grumpy.

ChatGPT wrote 5 dwarves that followed the instructions and then defaulted to describing each dwarf based on their name. Sneezy was sickly, yawny was lazy and so on. This gives a score of 5/10 on the task I gave it.

There is a tapestry of clever tests you can give it with language in focus to test the ability of a natural language model without giving it a bunch of numbers.

ThreeHalflings@sh.itjust.works · 1 year ago

OK, you go get a panel of highschool English teachers together and see how useful their opinions are. Lemme know when your post is up, I’ll be interested then.

Fisk400@lemmy.world · 1 year ago

Sorry, I thought we were having a discussion when we were supposed to just be smug cunts. I will correct my behaviour in the future.

Fixbeat@lemmy.ml · 1 year ago

Can it still solve programming problems?

StarkillerX42@lemmy.ml · 1 year ago

I’ve never been able to get a solution that was even remotely correct. Granted, most of the times I ask ChatGPT is when I’m having a hard time solving it myself.

thisisnotcoincedence@lemmy.world · 1 year ago

If OpenAI is being roadblocked by all these social platforms why doesn’t it decentralize and use the fediverse to learn?

Perfide@reddthat.com · 1 year ago

I mean, who’s to say they aren’t? But also, the fediverse is worthless compared to the big players. The entirety of the fediverse’s content to date is like a day’s worth of Twitter or Reddit content.