What is text? Are the contents of files text? How does one determine if somethin...

SAI_Peregrinus · 2025-03-19T15:46:11 1742399171

Oh, I agree that "text" isn't well-defined. The best I can come up with is that "text" is a valid sequence of bytes when interpreted in some text encoding. I think that something designed to search filenames should clearly document how to search for all valid filenames in its help or manual, not require looking up the docs of a dependency. Filenames are paths, which are weird on every platform. 99% of the time you can search paths using some sort of text encoding, but IMO it should be pointed out in the man page that non-unicode filenames can actually be searched for. `fd`'s man page just links to the regex crate docs, it doesn't generate a new man page for those & name that.

As for "filenames aren't even text" not being a useful model, to me text is a `&str` or `String` or `OsString`, filenames are a `Path` or `PathBuf`. We have different types for paths & strings because they represent different things, and have different valid contents. All I mean by that is the types are different, and the types you use for text shouldn't be the same as the types you use for paths.

burntsushi · 2025-03-19T17:40:34 1742406034

I'd suggest engaging with this question, which I think you ignored:

> Are the contents of files text?

It is perhaps the most prescient of all. What is the OS interface for files? Does it tell you, "This is a UTF-8 encoded text file containing short human readable lines"? No, it does not. All you get is bytes, and if you're lucky, you can maybe infer something about the extension of the file's path (but this is only a convention).

How do you turn bytes into a `&str`? Do you think ripgrep converts an entire file to `&str` before searching it? Does ripgrep even do UTF-8 validation at all? No no no, it does not.

I'd suggest giving https://burntsushi.net/bstr/#motivation-based-on-concepts and the crate docs of https://docs.rs/bstr/latest/bstr/ a read.

To be clear, there is no perfect answer here. You've got to do the best with what you've got. But the model I work with is, "treat file contents and file paths as text until you heuristically believe otherwise." But I work on Unix CLI tooling that needs to be fast. For most people, I would say, "validate file contents and file paths as text" is the right model to start with.

> but IMO it should be pointed out in the man page

Docs can always be improved, sure, but that is not what I'm trying to engage with you about. :-)

SAI_Peregrinus · 2025-03-19T17:54:10 1742406850

I'd say some files are text, some are not. And I agree that there's no good way to tell! I think ripgrep has a much harder job than fd, because at least fd can always know that all paths it's searching are valid paths for the OS in use.

burntsushi · 2025-03-19T18:27:57 1742408877

My point is that you can apply to the answer to the question "are the contents of files text?" to the question "are file paths text?"

SAI_Peregrinus · 2025-03-19T20:48:14 1742417294

I get it. I think you're right that they both have the same problem, but paths have a std type for handling them that, while file content's don't. As long as you're on an OS you can use std::path::Path (or PathBuf) for paths, and ensure they're valid. I suppose I should have said "Paths aren't Strings" or similar, they might be text but they might not be, and fundamentally the issue is that they're different data types. "Text" isn't universally defined.

burntsushi · 2025-03-20T02:14:33 1742436873

You can't really just use `std::path::Path` though. Because it's largely opaque. How do you run a regex or a glob on a `std::path::Path`? Doing a UTF-8 check first is expensive at ripgrep's scale. So it just gets it to `&[u8]` as quickly as it can and treats it as if it were text. (These days you can use `OsStr::as_encoded_bytes`.)

`std::path::Path` isn't necessarily a better design. I mean, on some days, I like it. On other days, I wonder if it was a mistake because it creates so much ceremony. And in many of those cases, the ceremony is totally unwarranted.

And I'm saying this as someone who has been adjudicating Rust's standard library API since Rust 1.0.

jakeogh · 2025-03-19T17:07:37 1742404057

Tools must be general. Im not going to invest time using a new one if it cant handle arb vaild filesystems. But thats just me.

https://github.com/jakeogh/angryfiles

burntsushi · 2025-03-19T17:32:24 1742405544

`fd` does, as pointed out in this thread in numerous places. So I don't know what your point is, and you didn't engage at all with my prompt.