While writing a tool for myself that summarises the top N posts each day from HN, Google Trends, and my RSS feed subscriptions, I ran into the same problem.
The quick solution was to use Beautiful Soup and readability-lxml to try to extract the main article content, then send it to an LLM.
The results are OK when the markup is semantic, but often it isn't; then you run into tables, images, weirdly positioned footnotes, etc.
I believe the best way to extract information as it was intended to be presented is to screenshot the page and send it to a multimodal LLM for “interpretation”. Has anyone experimented with that approach?
——
The aspirational goal for the tool is to be the Presidential Daily Brief, but for everyone.