AI alignment is a philosophy problem, not an engineering one.
For alignment to happen, we have to agree on what it means. Given that we have a hard enough time getting humans to “align”, I can’t imagine any successful attempt at alignment short of complete castration.
Are there degrees of alignment? I'd like to think there's a pretty big range in there between “made some decisions I didn't love” and “destroyed the world and everyone on it.”