I tried this yesterday, asking it to create a simple daily reminder task, which it happily did. Then when the time came and went, I simply got a chat message saying the task had failed, with no explanation of why or how. When I asked it why, it hallucinated that I had too many tasks (I only had the one). So now I don't know why it failed or how to fix it. Which leads to two related observations:
1) I find it interesting that the LLM rarely seems trained to understand its own features, your account, or how the LLM itself works. It seems strange that it has no idea about its own support.
2) Which leads me to the OpenAI support docs[0]. It seems pretty telling that they use old-school search rather than an LLM for their own help docs, right?
Same experience except mine insisted I had no tasks.
It does say it's a beta on the label, but the thing inside doesn't seem to know that, nor what it's supposed to know. Your point 1, for sure.
Point 2 is a SaaS from before LLMs+RAG beat the conventional approach. The status page: a SaaS. API membership, metrics, and billing: a SaaS. These are all undifferentiated, and while they were arguably good selections when they were made, unless better help docs are going to sell more subscriptions, they shouldn't spend time on undifferentiated heavy lifting.
How do you know it hallucinated? Maybe your task was one too many and it is only able to handle zero tasks (which would appear to be true in your case).
Re: 2 — for the same reason that you shouldn't host your site's status page on the same infrastructure that hosts your site (if people want to see your status page, that probably means your infra is broken), I would guess that OpenAI think that if you're looking at the support docs, it might be because the AI service is currently broken.
I've thought about this a lot too. My guess is that because foundation models take so much to train, they aren't retrained very often, and from my experience you can't easily train in new data, so you'd need some up-to-date side system. I suspect they're very deliberate about which "side systems" they bolt on; from trying to build agent orchestration myself, nothing ends up as simple as I expect with side systems, and things easily go off the rails. So my thought is that, given the scale they're dealing with, this is probably a low-priority feature that isn't actually particularly easy.
> So my thought is that, given the scale they're dealing with, this is probably a low-priority feature that isn't actually particularly easy.
"working like OpenAI said it should" is a weird thing to put low priority. Why do they continuously put out features that break and bug? I'm tired of stochastic outputs and being told that we should accept sub-90% success rates.
At their scale, being less than 99.99% right results in thousands of problems. So their scale, and the outsized impact of their statistical bugs, are part of the issue.
Why are you setting your bar this way? Is it because of how they do their feature releases (no warning that something is an alpha or beta feature)? Their product, ChatGPT, was released two years ago and is a fairly complicated product. My understanding is that the whole thing is still a pretty early product generally. It doesn't seem unusual for a startup doing something as big as they are to release features that don't have all the kinks ironed out. I've released some kinda janky features to 100,000s of users before, not totally knowing how they were going to perform with all of them at that scale; I don't think that's very controversial in product development.
Also, in my earlier comment I was specifically talking about it being able to understand the features it has; I don't think that's the same problem as the remind-me feature not working consistently.
> I've released some kinda janky features to 100,000s of users before, not totally knowing how they were going to perform with all of them at that scale; I don't think that's very controversial in product development.
Oh, that's because modern-day "ship fast, break things" product development is its own problem. The whole tech industry is built on principles that are antithetical to the profession of engineering. It's not controversial in product development because the people doing the development all decided to loosen their morals and think it's fine to release broken things and fix them later.
That my bar is high and OpenAI's is so low is its own issue. But then again, I haven't released a product that could randomly tell people to poison themselves by combining noxious chemicals, or whatever other dangerous hallucination ChatGPT spews. If I had engineered something like that, with the opportunity to harm people and no way to guarantee it wouldn't, if I had engineered a system where misinformation could be created at scale, I would have trouble sleeping...
I regularly use Perplexity and Cursor, which can search the internet and documentation to answer questions that aren't in their training data. It doesn't seem that hard for ChatGPT to search and summarize OpenAI's own docs when people ask about them.
You would want a feature like "self-awareness" to be pretty canonical, not based on a web search. And even if it had a discrete internal side system it could query that you controlled, if the training data was a year old, how would you keep the two matched from a systems point of view over time? It's also unclear how the model would interpret that data each time it ran on the new context. It seems like a pretty complicated system to build, tbh, especially when human-maintained help docs and FAQs are a lot simpler and a more reliable source of truth. That said, my understanding is that behind the scenes they are working toward the product we experience being built around the foundation model, rather than being THE foundation model, as it pretty much is today. Once they have a bunch of smaller LLMs doing discrete standard tasks, I would guess it will become considerably more "aware".
> 2) Which leads me to the OpenAI support docs[0]. It seems pretty telling that they use old-school search rather than an LLM for their own help docs, right?
I agree, but then again, if you're a dev in this space, presumably you know what keywords to use to refine your search. RAG'd search implies that the user (the dev) is not "in the know".
Very, very, very buggy, and it really looks extremely low-effort, as with many OpenAI feature rollouts. Nothing wrong with an MVP feature, but make it at least do what it's supposed to do, and maybe give it 10% more extensibility than the bare bones.
I question the same things frequently. I routinely ask ChatGPT to help me understand the OpenAI API documentation and how to use it, and it's rarely helpful, frequently telling me things that are just blatantly untrue. At least nowadays I can link it directly to the documentation for it to read.
But I don't understand why their own documentation, products, and lots of examples using them wouldn't be the number one thing they would want to train the models on (or fine-tune on, or at least make available through a tool).
It's all a matter of degree. Even in deterministic systems, bit flipping happens. Rarely, but it does. You don't throw out computers as a whole because of this phenomenon, do you? You just assess the risk and determine whether the scenario you care about sits above or below the threshold.
My point is that your confidence level depends on your task. There are many tasks for which I'll require ECC. There are other tasks where an LLM is sufficient. Just like there are some tasks where dropped packets aren't a big deal and others where they are absolutely unacceptable.
If you don't understand the tolerance of your scenario, then all this talk about LLM unreliability is wasted. You need to spend time understanding your requirements first.
You generally can't know, because we don't measure for it, especially not on personal computers. Maybe ECC RAM reports this information in some way?
In practice I think it happens often enough. I remember a Black Hat conference talk from around a decade ago where the hacker squatted bit-flipped variants of the domain of a popular Facebook game and caught requests from real end users, basing his attack on the random chance of bit flips during DNS lookups.
I'm trying to figure out how this would be useful with the existing feature set.
It seems like it would be good for summarizing daily updates against a search query, but all it would do is display them. I would probably want to connect it with some tools, at minimum, for it to be useful.
> ChatGPT has a limit on 10 active tasks at any time. If you reach this limit, ChatGPT will not be able to create a new task unless you pause or delete an existing active task or it completes per its scheduled time.
So this is pretty much useless for most real-world use cases.
I'm surprised it took OpenAI this long to launch scheduled tasks, but as we've seen from our users[0], pure LLM-based responses are quite limited in utility.
For context: ~50% of our users use a time-triggered Loop, often with an LLM component.
Simple stuff I've used it for: baby name idea generator, reminder to pay housekeeper, pre-natal notifications, etc.
We're moving away from cron-esque automations as one of our core value props (most new users use us for spinning up APIs really quickly), but the base functionality of LLM+code+cron will still be available in (and migrated to!) the next version of our product.
> None of these require an LLM. It seems like you own this service yet can't find any valuable use for it.
Sorry? My point was that these are the only overlapping features I've personally found useful that could be replaced with the new scheduled tasks from ChatGPT.
Even these shouldn't require an LLM. A simple cron+email would suffice.
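To be concrete, a minimal sketch of the cron+email version, assuming a Unix box with a local MTA; the crontab entry, script path, and addresses are all made up:

    # Hypothetical crontab entry: remind me at 9am on the 1st of every month.
    #   0 9 1 * * /usr/bin/python3 /home/me/remind.py
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Reminder: pay the housekeeper"
    msg["From"] = "reminders@example.com"  # made-up addresses
    msg["To"] = "me@example.com"
    msg.set_content("Monthly reminder: pay the housekeeper today.")

    # Assumes a local MTA listening on localhost:25.
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)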
The web-scraping component is neat, but for my personal use case (tide tracking) I've had to use LLM-generated code to get proper results. Pure LLM responses were bad at following the rules I wanted (tide less than 1 ft, between sunrise and sunset): sometimes the LLM would get it right, sometimes it would not.
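For what it's worth, the rule is trivial once it's code instead of a prompt. A sketch with hypothetical, hardcoded tide and sun data (a real version would pull these from an API):

    from datetime import datetime

    # Hypothetical, hardcoded data; a real version would fetch tide predictions
    # and sunrise/sunset times from an API.
    predictions = [
        {"time": datetime(2025, 1, 15, 7, 30), "height_ft": 0.4},
        {"time": datetime(2025, 1, 15, 15, 45), "height_ft": 1.8},
    ]
    sunrise = datetime(2025, 1, 15, 7, 15)
    sunset = datetime(2025, 1, 15, 17, 5)

    # The rule: low tides under 1 ft that fall between sunrise and sunset.
    good_tides = [
        p for p in predictions
        if p["height_ft"] < 1.0 and sunrise <= p["time"] <= sunset
    ]
    print(good_tides)  # only the 7:30 prediction qualifies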
For our customers, purely scheduling an LLM call isn't that useful. They require pairing multiple LLM and code execution steps to get repeatable and reliable results.
> ChatGPT tasks will become a powerful tool once incorporated into GPTs.
> Baby name generator: why would this be a scheduled task? Surely you aren't having that many children... :)
So far it's helped name two children :) -- my wife and I like to see the same 10 ideas each day (via text) so that we can discuss what we like and don't like daily. We tried the sift-through-1000-names thing and it didn't fit well with us.
> Reminder to pay, notifications: what value does OpenAI bring to the table here over other apps which provide calendar / reminder functionality?
That's exactly my point. Without further utility (i.e. custom code execution), I don't think this provides a ton of value at present.
This feature is really bad (unreliable), and they don't even make a good case for _why_ you would want to use it over literally any other reminder system. I guess it can run an LLM to decide what to send you at the scheduled time, but its unreliability would never have me relying on it.
Some use cases that might be interesting:
* Let me know the closing stock price for XXXXX
* Compile a list of highlights from the XXXX game after it finishes
But everything I can think of is just a toy: cool if it works, but not groundbreaking, and possible with much more reliable methods.
OpenAI really seems to just be throwing stuff at the wall to see if it sticks, then moving on and never iterating on previous features. DALL-E is kind of a joke compared to the alternatives (one-shot only), I trust Claude more for programming, o1 was ho-hum for my needs, the desktop app still feels like a solution in search of a problem, etc.
I tried it, and it failed to send me a desktop notification. I did receive emails (at the wrong time). I think it was too early to launch; a 5-minute test could have found these bugs. It really hurts their brand.
This will be a lot more useful when it's able to combine with more tools, such as in custom GPT actions, APIs, "computer use", the Python interpreter, etc.
Yeah, it's pretty bad, embarrassingly so, quite honestly. A single developer could probably significantly improve it in a day. I'm sure that's coming, but why don't they launch these MVP features at least a quarter baked? It's essentially unusable as is. If it could ping me on my phone, and Advanced Voice could open so I could do a basic task, great, I'm back to using it. But as rolled out, it's hilariously minimal and borderline unusable.
If it works correctly, wouldn't those still be peak times? Except with this they have to process the initial scheduling request in addition to the at-execution task.
Everyone else's crons are synced to wall clocks, vs. your centralized cron (a task scheduler, really) that is aware of the scheduled work and the current load on the systems consuming those tasks.
Controlling the ability to nudge wakeup times by small amounts can make a huge difference to your ability to manage spiky workloads like this.
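A minimal sketch of that nudging, assuming the scheduler stores a per-task tolerance window (all names made up):

    import random
    from datetime import datetime, timedelta

    def jittered_run_time(scheduled: datetime, tolerance: timedelta) -> datetime:
        # Pull the wakeup anywhere inside [scheduled - tolerance, scheduled],
        # so every "9am" task doesn't fire in the same second.
        offset = random.uniform(0, tolerance.total_seconds())
        return scheduled - timedelta(seconds=offset)

    # A 9:00 task with a 10-minute tolerance may run any time from 8:50 to 9:00.
    run_at = jittered_run_time(datetime(2025, 1, 15, 9, 0), timedelta(minutes=10))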
A lot of answers don't go stale for hours or days. They'll do the task early at an off-peak time, hidden from the user, double-check that it really wasn't time-sensitive, then surface the saved answer at the desired time.
Start with a regex (or a fast tiny model) to flag obvious time-sensitive tasks. Otherwise, do the task early by prompting it "if this requires up-to-the-minute information, output cancel, else [prompt]". At best, it's 1 regex + 1 full inference. At worst, it's 1 regex + 1 output token + 1 full inference.
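A sketch of that flow in Python; call_llm is a hypothetical stand-in for the real model call, and the regex is only illustrative:

    import re

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in; wire up the real model call here.
        raise NotImplementedError

    # Cheap first pass: obviously time-sensitive tasks wait for their slot.
    TIME_SENSITIVE = re.compile(
        r"\b(closing price|score|news|today|latest|current|live)\b", re.IGNORECASE
    )

    def maybe_run_early(task_prompt: str) -> str | None:
        # Returns a precomputed answer, or None if the task must run on schedule.
        if TIME_SENSITIVE.search(task_prompt):
            return None  # 1 regex so far; full inference happens at the scheduled time
        guarded = (
            "If answering this requires up-to-the-minute information, "
            "output only the word CANCEL. Otherwise, answer it:\n" + task_prompt
        )
        answer = call_llm(guarded)  # runs off-peak
        if answer.strip().upper().startswith("CANCEL"):
            return None  # worst case: regex + a few tokens now + 1 inference later
        return answer  # best case: the whole task was done off-peak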
OpenAI resembles the old Apple: ship the best experience. The ChatGPT app on every platform is the best in the business, and they are shipping polished features relatively quickly. It's quite the contrast with today's Apple, the world's largest company, so inept that it is releasing Apple Intelligence, which is quite literally using ChatGPT 3.5 tech in 2025. It just shows how valuable CEOs like Altman, Musk, and Jobs are to a corporation.
The ChatGPT UI/UX is pretty middling. They still don't have a proper answer to Claude Projects, and they are focusing on shipping stuff like this instead of fixing the numerous papercuts in the chat experience. How is it that I can access the most powerful AI on the planet with o1 pro, but if I paste more than a few pages of text there's no solution for that; it just overflows the input box and makes it impossible to navigate?
The "old" Apple certainly didn't ship anything quick or on the bleeding edge, nor did they ship the "best" experience. They did, however, have somewhat different priorities than their competitors. They still do to some extent.
Agreed. The vast majority of their audience doesn't understand the difference. And among the subset that do, I imagine there's a fair number of us that don't care about the distinction. I just want it to work well.
OpenAI creating an AI phone with Microsoft ... releasing Her (the movie) in your pocket.
Your AI assistant/agent is seen on the lock screen (like a FaceTime-call UI/UX), waiting at your beck and call to do everything for you / be there for you via text, voice, gestures, expressions, etc.
It interfaces with the AI agents of businesses, companies, your doctor, and friends & family to schedule things, and can be used as a knowledge base (ask for a friend's birthday, if they allow that info).
Apple is indeed stale & boring to me (heavy GPT user) in 2025.
[0] https://help.openai.com/