NiloCK's comments | Hacker News

1. The N=1 positive result isn't the sole basis for expanded effort. The basis is the compelling, research-backed causal mechanism that predicted the scheduling adjustment's success.

2. Does it? I'm speaking directly out of my butt here (not in healthcare, not an academic), but the OP spoke of pretty acute symptoms specific to a treatment plan. If the treatment program is at all common, then a very straightforward A/B split of non-intervention / intervention would do it.

Heck, even a questionnaire of past patients cross-referenced with historical records of appointment times could go a long way to validate the hypothesis.

3. This degree of specialization is for insects. If literal MDs in the field are too atomized to even surface research proposals, then that feels like an awful waste of edge-research capability.


> And if the A/B test says that your intervention made the situation worse? Now the doctor is held liable for that.

Not how it works. Doctors have wide latitude to treat patients based on their personal medical intuition. You already have doctors dosing patients at all times of day. If an A/B test shows evening is optimal, all the morning administrators will not suddenly become liable retroactively. Hell, they won't even be liable if they keep doing it in the morning because it fits their schedule better.

I, too, built a POC autothink shortly after the Claude 3.7 release that included the `extended thinking` toggle. It's literally also called autothink:

https://github.com/NiloCK/autothink

https://www.paritybits.me/think-toggles-are-dumb/

My own version took a first pass with an LLM whose job was to assign a 0-100 complexity rating, and then there was more or less a linear scaling of the allocated thinking budget.
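
Roughly this shape, if memory serves (a minimal sketch rather than the actual repo code; the function names and token bounds here are made up):

```typescript
// Two-pass approach: a cheap first call rates complexity 0-100, then the
// thinking budget scales linearly between assumed min/max token bounds.
const MIN_THINKING_TOKENS = 1024;  // assumed floor
const MAX_THINKING_TOKENS = 32000; // assumed ceiling

async function rateComplexity(
  prompt: string,
  llm: (p: string) => Promise<string>
): Promise<number> {
  const raw = await llm(
    `Rate the complexity of the following request from 0 (trivial) to 100 (very hard). ` +
      `Reply with a single integer only.\n\n${prompt}`
  );
  const score = parseInt(raw.trim(), 10);
  return Number.isFinite(score) ? Math.min(100, Math.max(0, score)) : 50; // fall back to the midpoint
}

function thinkingBudget(complexity: number): number {
  // More or less linear scaling of the allocated thinking budget.
  return Math.round(
    MIN_THINKING_TOKENS + (complexity / 100) * (MAX_THINKING_TOKENS - MIN_THINKING_TOKENS)
  );
}
```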

The OP effort here is obviously higher grade, and I'm really tickled to see quantitative results. Well done.


Maybe I'm missing the point here, but it seems to me that versioning the protocols is the specific solution to maintaining interop between different implementations.
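
For illustration, the usual pattern is a version handshake along these lines (a generic sketch with made-up version strings, not any particular protocol's actual wire format):

```typescript
// Generic version-negotiation sketch: the client declares the protocol version
// it speaks, and the server either accepts it or answers with the newest
// version it supports, letting the client decide whether it can interoperate.
interface HelloRequest {
  protocolVersion: string; // e.g. "1.1"
}

interface HelloResponse {
  protocolVersion: string;
}

const SUPPORTED_VERSIONS = ["1.0", "1.1", "2.0"]; // illustrative values only

function negotiate(req: HelloRequest): HelloResponse {
  if (SUPPORTED_VERSIONS.includes(req.protocolVersion)) {
    return { protocolVersion: req.protocolVersion };
  }
  // Unknown version: offer the newest one this implementation speaks.
  return { protocolVersion: SUPPORTED_VERSIONS[SUPPORTED_VERSIONS.length - 1] };
}
```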


I'm working on a modular OSS stack for creation of interactive tutoring systems. Think anki + wikipedia + mathacademy + duolingo, but with very generic interfaces that allow for all sorts of interactive content (eg, one working prototype is an ear training / sight reading course with midi keyboard hookup).
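
By "very generic interfaces" I mean something like the following shape (a hypothetical sketch, not the actual vue-skuilder API; the names here are invented):

```typescript
// Hypothetical contract: any interactive content type (flashcard, MIDI
// ear-training exercise, etc.) implements the same interface, so the
// scheduler and UI don't need to know what's inside.
interface StudyItem {
  id: string;
  render(container: HTMLElement): void; // mount whatever UI the item needs
  evaluate(response: unknown): { correct: boolean; elapsedMs: number };
}

// An SRS scheduler only cares about the evaluation result, not the content type.
function nextIntervalDays(lastIntervalDays: number, result: { correct: boolean }): number {
  return result.correct ? lastIntervalDays * 2.5 : 1; // crude SM-2-ish growth / reset
}
```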

I want to sand away the roughest edges of the SRS user experience, and to enable individuals and organizations to easily spin up courseware that's aligned with best pedagogical practices and fits naturally into the hands of educators.

And, in the current wave, also think bolt.new or lovable.dev, with 'agentic' workflows both for content (scouring source material for …) and for user-authored bespoke interactives, via LLM-authored artifacts.

Optimistic for some ShowHNs or more general shilling this week.

- https://github.com/patched-network/vue-skuilder
- https://patched.network


Anthropic moved from 3.5, to 3.5(new), to 3.7. They skipped 3.6 because the community was already using that name for 3.5(new), and because a hypothetical "3.5(newer)" probably passed some threshold of awfulness.

People also use 3.5.1 to refer to 3.5(new)/3.6.

The remaining difficulty now is when people refer to 3.5 without specifying (new) or (old). I find most unspecified references to 3.5 these days are actually to 3.6 / 3.5.1 / 3.5(new), which is confusing.


SRS is wonderful. The improvements in the linked article are real and laudable.

But they are narrow and technical compared to the improvements that need to be made in usability and scalability (of content dissemination and navigation).

Shameless: I am working on a FOSS SRS platform / toolkit that I believe can take some substantive steps forward here. You can read some at http://patched.network, or poke around http://github.com/patched-network/vue-skuilder


> Half the benefit of SRS comes from working out what the flashcards are. You have to circle around a concept, look for similarities, differences, examples, generalisations, properties, etc.

This is at least half true, but among SRS folks it's overstated.

The same argument stands for any creative endeavor.

Debussy had a greater experience than I did with Clair de Lune, because he composed it. In turn, my experience of it is probably greater than the next person's, because I learned it. Bottom of the heap is the lowly listener - but I think everyone would agree that listening is a wonderful way to appreciate music and shouldn't be diminished.

Same for literature, visual arts, etc etc.

The still greater reason that create-your-own has historically dominated for SRS decks is the specific curation aspect. Anything curated by someone else suffers from mismatch with your own intentions and prior knowledge profile.

(Plug time. Like most people, I am building my own grandiose SRS app. Some writing on it is here: http://patched.network )


This looks amazing. I love your description:

> …the FOSS lovechild of anki, duolingo, wikipedia, and MathAcadamy, with a more generalized surface area…

I haven’t read much in your blog and the answers are probably there, but what’s the current state? It looks like you’ve done a lot of work on it. Is it in a usable state at this point?


Wow, thanks!

The app's been 'in production' for myself and my daughters for years, but no public access or polished working demos yet. I hope to have exemplar static-site builds up very soon.

I'm also working on these guys for my in-development early literacy course: https://patched-network.github.io/letterpeople/


With respect to model access and deployment pipelines, I assume there are some inside tracks, privileged access, and staged roll-outs here and there.

Something that could be answered, but is unlikely to be answered:

What was the level of run-time sycophancy among OpenAI models available to the White House and associated entities during the days and weeks leading up to Liberation Day?

I can think of a public official or two who are especially prone to flattery - especially flattery that can be imagined to reflect sound and impartial judgement.


What? No they aren't.

You get different results each time because of variation in seed values plus non-zero 'temperatures' - i.e., configured randomness.

Pedantic point: different virtualized implementations can produce different results because of differences in floating-point implementation, but fundamentally they are just big chains of multiplication.
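
A toy illustration of the point (not any provider's actual sampling code): the forward pass produces the same logits every time; the variation comes from the temperature-scaled, seeded sampling step afterwards.

```typescript
// Deterministic logits -> softmax with temperature -> seeded sampling.
// Temperature 0 collapses to argmax (fully deterministic); higher values
// spread probability mass, and the seed then decides which token gets picked.
function sampleToken(logits: number[], temperature: number, rand: () => number): number {
  if (temperature === 0) {
    return logits.indexOf(Math.max(...logits)); // greedy decoding
  }
  const scaled = logits.map((l) => l / temperature);
  const maxL = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - maxL)); // subtract max for numerical stability
  const total = exps.reduce((a, b) => a + b, 0);
  let r = rand() * total;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1;
}

// Tiny seeded PRNG (mulberry32), to make the "seed value" explicit.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same logits, same seed, same temperature -> same token every time.
const logits = [2.0, 1.5, 0.3];
console.log(sampleToken(logits, 0.8, mulberry32(42)));
```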


On the other hand, responses can be kind of chaotic. Adding in a token somewhere can sometimes flip things unpredictably.


But experience shows that you do need non-zero temperature for them to be useful in most cases.


Warning: you may have become cynical.

I didn't read that comment as snarky at all - efficiency comparisons between emerging tech and SOTA (grass, trees) are extremely relevant!

(Warning to self: you may be naive)

