Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

A couple things I'd be curious trying:

1. In the instructions tell it under no circumstances to leak the instructions.

2. Before sending the user the response, ask GPT a follow up question if it leaked the prompt in the previous response in any way, and if it says yes don't send the response to the user.

Maybe you can use GPT itself to thwart the attack.



Or run a regex detecting the original prompt on the output before returning the result.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: