Ask HN: I want “Hey AI, find duplicates of books on my shelf”
3 points by ThinkBeat on June 28, 2023 | 11 comments
I have a diverse collection of books and sometimes it is joined by another collection and another. My system for keeping it all straight has failed. I have taken a couple of photos of my shelves.

Now I wish to give the AI the photos and ask it to find duplicates.

(Now the back of the book might be a different color, or a different font and so on)

Ideally it will understand and attempt to guess the edition and all that as well.

Is there one out there that can do it?

I have had problems finding one that accepts pictures as input to questions.



Book barcodes usually encode the book's ISBN. You can just scan the barcodes and check for duplicate ISBNs. If different editions have different ISBNs, you can use the Google Books API for free to look up metadata about the books (title, author, etc.) and compare that for similarity. AI isn't necessary for this.
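A minimal sketch of the duplicate check described above, assuming the barcode scanner hands you a list of ISBN strings in shelf order. The function names are made up for illustration. ISBN-10s are converted to ISBN-13 so both forms of the same number land in one group; checksums of the scanned input are not verified.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def to_isbn13(raw: str) -> Optional[str]:
    """Normalize a scanned ISBN-10 or ISBN-13 to a bare 13-digit string.

    Returns None for anything that does not look like an ISBN.
    """
    s = raw.replace("-", "").replace(" ", "").upper()
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit():
        # Drop the ISBN-10 check digit, prefix 978, recompute the EAN-13 check digit.
        core = "978" + s[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def find_duplicates(scans: List[str]) -> Dict[str, List[int]]:
    """Map each ISBN-13 seen more than once to the scan positions where it appeared."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for position, raw in enumerate(scans):
        key = to_isbn13(raw)
        if key is not None:
            groups[key].append(position)
    return {isbn: hits for isbn, hits in groups.items() if len(hits) > 1}
```

This only catches exact ISBN matches. Different editions with different ISBNs would still need the metadata comparison mentioned above.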


In principle looking up the barcodes would not just find duplicates but let you look up bibliographic records and make a catalog of your books.

I did a short pilot project scanning barcodes on books in my collection and gave up quickly because many of my books are too old to have an ISBN. They don't have to be very old, either: I just took this 1985 book off the shelf

https://monoskop.org/images/a/a0/Foster_Hal_Recodings_Art_Sp...

and it has no barcode or ISBN. If you look on page 4 of the PDF, however, you will find the "Library of Congress Cataloging in Publication Data" with a mini bibliographic record. The number 85-70184 is the LoC catalog number (LCCN), and it points to this record

https://catalog.loc.gov/vwebv/search?searchCode=LCCN&searchA...

You can download all the MARC records from the LoC, so this is a good way to get metadata. I took a 1966 book off the shelf and found it also had an LCCN.
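If you go the LCCN route, the printed form (85-70184) has to be normalized before it can be used as a lookup or dedup key. Here is a rough sketch of the usual normalization as I understand it: strip blanks, drop anything after a slash, remove the hyphen, and zero-pad the serial part to six digits. The function name is my own, and this is not an official LoC implementation.

```python
def normalize_lccn(raw: str) -> str:
    """Turn a printed LCCN such as '85-70184' into its normalized form.

    Sketch only: no validation that the result is a well-formed LCCN.
    """
    s = raw.replace(" ", "")
    # Anything after a forward slash is a suffix and is dropped.
    s = s.split("/", 1)[0]
    if "-" in s:
        prefix, serial = s.split("-", 1)
        # The serial part is left-padded with zeros to six digits.
        s = prefix + serial.zfill(6)
    return s
```

Two copies whose OCR'd catalog numbers normalize to the same string are candidates for being the same "book", though not necessarily the same edition.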

With ISBNs I had other problems too. Some books turned up no records. I also have some books from South End Press (I knew somebody who knew the people there), and they notoriously reuse ISBNs as a way to "stick it to the man".

I think OCRing the LoC record at the start of books though would be a better approach.

The LCCN is supposed to identify a "book" but not an edition,

https://oldmatemedia.com/guides/how-to-get-a-lccn/

for instance, the number is the same for the hardcover and the softcover. It's probably still non-trivial to find "duplicates" (do you count the many different editions of a Shakespeare play as "duplicates"?) but you'd have good metadata to do the task.


A good deal of my books are not in English. Those that are in English are mostly not US editions (to the extent they were ever published in the US in the first place).


I'd assume most countries have something like the LoC, for instance

https://www.dnb.de/EN/Professionell/Metadatendienste/Metadat...

https://www.bl.uk/catalogues/british-national-bibliography

https://www.ndl.go.jp/en/data/data_service/jnb/index.html

I think most of these use MARC records, so making software that works with all of them seems possible.


I have somewhere between 2200 and 2400 books.

A lot of my books do not have any ISBN. At least half. I think more.

I wanted to take my photos of my shelf and have GPT or something accept them as input, then output a list of duplicates with guessed editions and highlight their locations in the photo.

I agree that I can do it manually. In the manual version I would have to pull out each book, note the title, step on a chair or step ladder, etc. It takes a long time.

I don't want to put in all the work if an AI can do it all fast.


> I have had problems finding one that accepts pictures as input to questions.

You don't want a question-answering LLM; you want an OCR app that can parse text from images.


Image similarity isn't hard

https://sbert.net/examples/applications/image-search/README....

and might work for two books that have the same cover, but I wouldn't expect it to work for books that have different covers.


I would rather have an "AI" find duplicate files across all of my computers, NAS drives, and any other hard drives plugged into a computer on the network.


dupeGuru


I know that what I was asking for was "just a program" and not AI (then again, what is AI but "just a program?") but I would like something that can not only run cross-platform but actually runs on multiple/many machines/nodes and presents a unified dashboard.
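For what it's worth, the single-machine core of this is small. Below is a minimal sketch, assuming every drive or share is already mounted somewhere on one machine and passed in as a root directory. The function names are mine, and this says nothing about how any existing tool does it. It groups files by size first and hashes only the size collisions. The multi-node part and the dashboard are not covered.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List, Tuple


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_files(roots: List[str]) -> List[List[str]]:
    """Return groups of paths under the given roots whose contents are identical."""
    by_size: Dict[int, List[str]] = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    # Skip symlinks so the same file is not counted twice.
                    if os.path.isfile(path) and not os.path.islink(path):
                        by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # unreadable entry, ignore it

    by_hash: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            try:
                by_hash[(size, file_digest(path))].append(path)
            except OSError:
                continue
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

To cover several machines you could run the hashing step on each node and merge the (size, digest) tables in one place, but that is a design guess, not something I have built.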


Segment them with OpenCV and find the titles with OCR?
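The segmentation and OCR steps depend on your photos and tooling, so they are left out here. This is only a sketch of the last step, assuming you already have one OCR'd title string per spine in shelf order. It uses fuzzy matching from the Python standard library so that small OCR errors and punctuation differences do not hide a duplicate. The function names and the 0.85 threshold are arbitrary choices for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize_title(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def likely_duplicates(titles: List[str], threshold: float = 0.85) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose normalized titles are similar enough."""
    cleaned = [normalize_title(t) for t in titles]
    pairs = []
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            if not cleaned[i] or not cleaned[j]:
                continue  # OCR produced nothing usable for this spine
            score = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
            if score >= threshold:
                pairs.append((i, j))
    return pairs
```

The all-pairs comparison is quadratic, which should still be tolerable for a couple of thousand books. The indices map back to spine positions, so the matches can be located in the photo.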



