I had never used youtube-dl until this story happened. I downloaded it for Windows and its speeds were throttled to around 50 kB/s. Posters on Stack recommended the yt-dlp fork, which I tried, and it was 5-10 MB/s. Just FYI.
One has to execute functions from base.js^1 to modify the "n" URL parameter and the "sig" parameter to get the fastest download speeds. (One can still download videos with the original n parameter, or without the n parameter, but download speeds will be slower.)
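For context, this is the step yt-dlp performs internally before it hands back usable URLs. A minimal sketch of pulling the already-deciphered URLs out of its Python API (the video ID is a placeholder):

    # Minimal sketch, assuming yt-dlp's Python API (pip install yt-dlp);
    # VIDEO_ID is a placeholder. extract_info() runs the base.js "n"/"sig"
    # transforms internally and returns already-deciphered format URLs.
    import yt_dlp

    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        info = ydl.extract_info("https://www.youtube.com/watch?v=VIDEO_ID",
                                download=False)
        for fmt in info["formats"]:
            print(fmt.get("format_id"), (fmt.get("url") or "")[:80])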
By itself, using the ANDROID API instead of the WEB API^2 does nothing to affect download speeds. I can block yt-dlp's POST request indicating the client and API name and this has no effect on download speed.
A website that forces users to run Javascript in order to get faster download speeds. This is not a new idea.
yt-dl and yt-dlp use Python to interpret the Javascript functions that modify "n" and "sig", but a faster language could be used. For example, V8 is written in C++.
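A minimal sketch of that idea, under two assumptions that are mine and not anything yt-dl/yt-dlp do today: the relevant transform function has already been extracted from base.js into a string, and Node.js (a V8-based runtime) is installed:

    # Sketch only: evaluate an extracted base.js transform with an external
    # V8-based runtime (Node.js) instead of a pure-Python JS interpreter.
    # js_func_source and the presence of `node` are assumptions.
    import json
    import subprocess

    def run_transform(js_func_source: str, value: str) -> str:
        script = f"const f = ({js_func_source}); console.log(f({json.dumps(value)}));"
        out = subprocess.run(["node", "-e", script],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()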
One of the reasons yt-dl and yt-dlp are so slow is that they do (too) many other things besides running the Javascript, modifying n and sig, and spitting out a fast download URL. Before they can run the JS, the video page needs to be retrieved, but waiting for Python to start up and do this is slow. A YouTube video page can be retrieved much faster using netcat, outside of Python. YouTube video can be downloaded much faster using an HTTP client like tnftp, directly, outside of Python. YouTube video can be converted much faster using ffmpeg, directly, outside of Python. And so on. These programs start instantaneously when compared to the slow startup time of Python. The Python startup latency is unbearable.
At the very least yt-dl and yt-dlp should accept an already downloaded YouTube video page as input instead of forcing the user to use Python (or Python calling another program) to download the page. The recalculation of "n" and "sig" does not have to occur within seconds of retrieving the video page. There is no need to require the user to use Python for downloading webpages. The values in the YouTube video page are good for a substantial amount of time.
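To illustrate what "accept an already downloaded page" could mean, here is a rough sketch of my own (not an existing yt-dl/yt-dlp feature) that pulls the initial stream data out of a watch page saved beforehand with any HTTP client; the regex is naive and a robust tool would do proper brace matching:

    # Rough sketch: read the initial "n"/"sig"-bearing stream data from an
    # already-saved watch page (watch.html fetched separately, e.g. with
    # netcat or curl). The regex is naive; real pages may need brace matching.
    import json
    import re

    html = open("watch.html", encoding="utf-8").read()
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;", html)
    assert match, "player response not found"
    player = json.loads(match.group(1))
    for fmt in player["streamingData"].get("adaptiveFormats", []):
        print(fmt.get("itag"),
              (fmt.get("url") or fmt.get("signatureCipher") or "")[:80])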
Both scripts have an option to output the new n and sig values in a download URL or as JSON containing the download URLs, so the user can use an HTTP client directly, outside of Python. But the user still has to use Python to download each video page. Using an HTTP client directly would be faster.
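For example (a sketch; -g/--get-url and -f are existing flags in both tools, and curl is just one possible external client):

    # Sketch: let yt-dlp do only the URL work (-g prints the final download
    # URL), then hand the transfer to an external HTTP client (curl here).
    import subprocess

    url = subprocess.run(
        ["yt-dlp", "-g", "-f", "best",
         "https://www.youtube.com/watch?v=VIDEO_ID"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    subprocess.run(["curl", "-L", "-o", "video.mp4", url], check=True)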
Using yt-dl and yt-dlp just to output optimal download URLs feels like overkill.
This is complete nonsense. The overwhelming majority of video download time is spent transferring the actual video data, limited by throttling or network speed. The time it takes for these tools to start up and deal with the API/JS stuff is inconsequential for any video longer than a minute or so. And even then most of the time is network latency for the API stuff, not local processing. Netcat isn't going to go any faster.
Absolutely nobody thinks optimizing the meta/API processing of yt-dlp & co is worth it. This is exactly why we have high-level programming languages that make all of this much easier, instead of trying to write HTML and JS parsing in plain C. Keep in mind these tools support dozens or hundreds of websites, not just YouTube.
If you think rewriting yt-dlp in C is worth it, go right ahead, but you're not going to make it significantly faster; you're just going to make maintenance a much bigger pain for yourself. Pick the right tool for the job. Python is absolutely (one of) the right tools for this job.
(For the record: I use C and Python on a daily basis, and will use whatever language is appropriate for each situation.)
"Python is absolutely (one of) the right tools for this job."
This opinion assumes it is a single job. I see multiple jobs. The number of "options" provided by yt-dl(p) gives us a clue.
There is nothing wrong with preferring to use larger, more complicated, "multi-purpose" utilities. There will always be plenty to choose from.
However, the idea of using smaller, less complicated, single-purpose utilities is not "nonsense". It makes sense in many cases and some users may prefer it.
The statements I make about speed are from day-to-day experience not conjecture.
What is especially slow on some hardware, like the original PinePhone, is that youtube-dl and yt-dlp are actually executables that extract themselves at runtime. Python itself is quite fast to start.
The startup time for netcat is less than for yt-dl/yt-dlp.
Plus I can request multiple video pages over a single TCP connection with netcat.
For example, in a single TCP connection, with a list of 30 videoIds, I can get initial sig and n values for 413 videoIds. No wait time for netcat to decompress or start up. Using netcat is quite fast. Then I have utilities written in C to extract URLs from stdin. As such, all I need is a utility to update the sig and n values in the download URLs to make them fast ones instead of throttled ones.
How long would it take to get sig and n values for 30 videoIds, let alone 413, with yt-dl or yt-dlp over a single TCP connection? The startup time for yt-dl/yt-dlp, plus the time spent waiting for the download of each video page, makes it much, much slower. These are "do-everything" scripts that are, as a result, quite inflexible.
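As a rough illustration of the single-connection approach (in Python rather than netcat, with placeholder video IDs; YouTube may still redirect or serve consent pages, which this sketch ignores):

    # Rough illustration: fetch several watch pages over one keep-alive
    # HTTPS connection. Video IDs are placeholders; each response must be
    # read fully before the next request is sent on the same connection.
    import http.client

    video_ids = ["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"]
    conn = http.client.HTTPSConnection("www.youtube.com")
    for vid in video_ids:
        conn.request("GET", f"/watch?v={vid}",
                     headers={"Connection": "keep-alive",
                              "User-Agent": "Mozilla/5.0"})
        resp = conn.getresponse()
        html = resp.read()
        print(vid, resp.status, len(html))
    conn.close()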
Fixed typos/misspellings, edited for more clarity:
yt-dl and yt-dlp use Python to run the Javascript functions that modify the initial "n" and "sig" values to make non-throttled download URLs, but a faster language could be used to interpret and run the snippet of Javascript. For example, V8 is written in C++, not Python.
One of the reasons yt-dl and yt-dlp are so slow, IMO, is because they do (too) many other things besides running the Javascript, modifying n and sig, and spitting out a fast download URL. Before they can run the JS, the video page needs to be retrieved. I have self-created utilities for downloading webpages with TCP clients that are significantly faster and more flexible than yt-dl/yt-dlp. I use these small dedicated programs every day. I am used to the speed. Waiting for yt-dl/yt-dlp to start up in order to download web pages is annoying. The latency is unbearable.
At the very least yt-dl and yt-dlp should accept an already downloaded YouTube video page as input instead of forcing the user to use Python (or Python calling another program) to download the page. The recalculation of "n" and "sig" does not have to occur within seconds of retrieving the video page. There is no need to require the user to use Python for downloading webpages. The values in the YouTube video page are good for a substantial amount of time.
Both yt-dl and yt-dlp have an option to output the new n and sig values in a download URL or as JSON containing the download URLs, so the user can use an HTTP client directly to download video, without needing to launch yt-dl/yt-dlp. But the user still has to use yt-dl/yt-dlp to download each video page. Using a TCP client directly would be faster.
Using yt-dl and yt-dlp just to output optimal download URLs feels like overkill. It would be nice to have a program that just focuses on running the necessary JS in base.js in order to output optimal download URLs. Then the user can use whatever programs she wants, directly, for downloading HTML/JSON, extracting URLs, downloading video files, converting video files, etc. Downloading video from most websites is generally easy. I never need a program like yt-dl/yt-dlp. It is only YouTube that plays games with users, trying to get them to enable Javascript and be tracked. One need only look at the size of the extractors in yt-dl/yt-dlp as evidence. The extractor for YouTube is 3x the size of the next largest, and over 10-20x the size of most of the others. I want a small utility that just focuses on YouTube. A simpler solution instead of a massive project.
If you find yt-dlp particularly slow, you can use aria2c to perform the download itself. It doesn't solve any startup latency problems, but it generally can do the HTTP requests faster.
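Something like the following, assuming aria2c is installed; --downloader and --downloader-args are yt-dlp options, and the specific aria2c flags here are just an example:

    # Sketch: have yt-dlp delegate the transfer itself to aria2c (multiple
    # connections per download); aria2c must be installed separately.
    import subprocess

    subprocess.run([
        "yt-dlp",
        "--downloader", "aria2c",
        "--downloader-args", "aria2c:-x 8 -s 8 -k 1M",
        "https://www.youtube.com/watch?v=VIDEO_ID",
    ], check=True)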
This is the reason why we must diversify these technologies. If the same company builds everything (mobile, OS, search engine, YouTube), it will end in a debacle in the future. If there is diversification, it simply takes toil and sweat to convince everyone.
Also, neither --format best nor --format bestvideo chooses the best encoding in all cases; they use bitrate as a heuristic for quality, and a less efficient codec can have higher bitrate but worse quality, resolution, or framerate. The workaround for this is specifying --format with an enumeration of every combination of codec, resolution, and framerate in preferred order, which goes like this:
I think there's a bit of variation in the exact order among the config files found online. If your goals are archival, consider also retrieving metadata, thumbnail, and subtitles in all languages; I also have in my config the options:
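(The exact format string and config options referred to above aren't reproduced here.) Purely as an illustration, and not the parent's actual config, an enumerated format preference using itags mentioned elsewhere in this thread, plus archival extras, could look roughly like this via yt-dlp's Python API:

    # Illustrative only (not the parent's actual config): prefer the AV1
    # itags mentioned in this thread, then 1080p AVC1, then the pre-merged
    # formats, and also save metadata, thumbnail, and subtitles for archival.
    import yt_dlp

    opts = {
        "format": "401+bestaudio/400+bestaudio/137+bestaudio/22/18/best",
        "writeinfojson": True,      # metadata
        "writethumbnail": True,     # thumbnail
        "writesubtitles": True,     # uploaded subtitles
        "subtitleslangs": ["all"],  # in all languages
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])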
Am I the only one totally chill with the throttling? Look, I'm not sure if it's the best idea for their own resource utilization (there are probably peak times when more throttling is better, and quiet internet times when throttling doesn't make sense). As long as they are being gracious hosts and allowing downloads, I'm good with being throttled, I guess?
> 50 kB/s means you'll be downloading a 10min video like 10 hours
Very misleading phrasing: in 10 hours you would be downloading a video that is itself roughly 10 hours long, which (of course) could have been downloaded in a fraction of that time.
The throttling has the user download at a speed similar to that required to view the video.
No, the throttling is much more aggressive than real time. I suspect it's Google's way of sneakily breaking ancient YouTube clients that won't run arbitrary JS as a countermeasure for downloaders, without really breaking them all the way, so users of e.g. ancient smart TVs just think their network is (unusably) slow instead of seeing an outright error, so they don't complain en masse that their TV no longer works properly.
Typical video bitrates for bog standard 1080p30 will be in the 1MB/s range, so the throttling is around 20x slower than real time.
I played a random 1080p YT video and the player-reported data rate is at least 2.5 MB/s, peaking at 20 MB/s (circa 10 MB/s average). 4K is of course much more than that.
I just took a video as a parameter - a recent engineering animation - and got the following:
format 18 - 640x360 AVC1, V+A : ~50kB/s
format 22 - 1280x720 AVC1, V+A : ~60kB/s
format 137 - 1920x1080 AVC1, V : ~128kB/s
format 400 - 2560x1440 av01, V : ~450kB/s
format 401 - 3840x2160 av01, V : ~850kB/s
Are you sure you are not reporting megabits? When you mention the «player reported data rate», are you sure that is not the "connection speed"? 10 MB/s means downloading 36 GB/h (one CD per minute, dozens of gigabytes per hour)...
You're right, it's somewhat weird. Perhaps the player shows decompressed figures? Check for yourself: right-click on the YT video player, show developer/debug info, and there's a chart.
I checked that interface (if you mean the "Stats for nerds"), hence I asked whether you were not by chance reading the "connection speed", which could be the closest to the values you reported and may be misleading at a glance.
I noticed this recently, too. I frequently scrape concerts from YouTube by adding them to a private playlist and then running a script on my HTPC to pull down everything from that playlist. As recently as six months ago, pulling down a 2-3GB playlist file (maybe a 30-40 minute concert at 720P) took ten to fifteen minutes. The last time I tried it, it estimated several hours.
FWIW, I just recently wrote a Go program that watches a YouTube playlist and downloads new videos to a configurable path: https://github.com/raine/ytdlwatch
It genuinely shocks me that there are still people who don't disable sending the referer header cross-origin in the browser: I have not encountered a single website that breaks when setting `network.http.referer.XOriginPolicy` to 1 in Firefox, and only 2 or 3 sites that break when setting it all the way to 2.
It not only completely prevents stuff like this, it profoundly increases your privacy on the web by preventing sites from tracking which domain you came from. There is no good reason any site needs to know that. I am surprised that Mozilla hasn't simply made this the default setting for all users.
This breaks in the case that the endpoint actually does use the referer for something. I have actually encountered plenty of cases where a site will use the referer internally, i.e. on the same origin, to check something.
Yeah, that's fair, but I haven't really noticed it myself. Then again, I disable Javascript by default and have compiled a lot of other stuff into the netwerk module that probably breaks sites even more; to the degree that it's noticeable, I just don't use the site.
Unless I'm mistaken, you've misunderstood the issue. From your link:
> it happened to me and I figured out it started when I disabled website.referrersEnabled
So if you completely disable sending the referer header, it breaks. This would probably also happen if you set `network.http.sendRefererHeader` to 0 or 1.
But that's not what I suggested! I only suggested disabling sending the header externally, to other sites, when the host domain (*.example.org) changes. In fact, someone in the thread you linked says that doing this instead of disabling it completely fixed the issue for them:
> ok I left network.http.sendRefererHeader on 2 (had it on 1), and changing network.http.referer.XOriginPolicy to 2 it works
DDG has been de-listing a ton of stuff recently. They recently de-listed rdrama, a reddit-trolling website that came up on HN last year. What, exactly, do they think people use them for? If I wanted "result curation" or whatever euphemism for censorship, I would just use Google.
Amazing that nowadays if you want the best "uncensored" results you have to go to the Russian Yandex (at least on all topics unrelated to Russia). How did our society get to such a point...
It has been like that for a long time, actually. I've seen examples where a somewhat political search query (in Russian) yields completely different results on Google & Yandex (it's actually hard to tell which one "tweaks" the results, and what tweaking even really means, when you think about it; my guess would be both, but differently). But in the sense of intentionally hiding something "neutral" that I know perfectly well must be there, but isn't (and it seems like you can guess why), I've never actually noticed it on Yandex, while on Google it has been like that for a whole decade already. (I don't find it that surprising, really. To me, it's perfectly logical, but I won't elaborate, because it would be political and extremely unpopular on HN, which seems like a worthless flamewar topic to me.)
But in my experience Yandex just isn't as good at searching for things in English as DDG/Bing. At least, I always preferred them for most use cases except for some really specific ones. So it's a pity that this thing with DDG happened. I guess I'll still continue to use it for now, because I don't want to hurt my productivity over that thing (and switching to a search engine I would have to adapt to most certainly would do that). But we'll see.
Capitalism (as practiced by democracies) allows you to start an alternative site that doesn't censor results as well. Totalitarianism (Communism) would not have. Choose your poison from amongst the various governments that have succeeded (for a while) in the past as I don't think Utopia is a possibility with humans as messed up as we are.
I'm not sure what your point is. The GP commenter asked what brought us here... it was capitalism. It was people starting other sites that didn't censor or didn't have ads or didn't do this or that. The winners of capitalism are what we have now.
Regardless of whether you support or oppose this kind of content, there are perfectly good reasons to search for it and share it. This whole "don't talk about it, don't look at it" strategy that tech companies are taking is downright scary. I support identifying and transparently outing state actors. I'm not comfortable with it, but even censoring or caveating content based on author or institution is better than censoring based on the content itself.
For example, someone close to me expressed support for the "Proud Boys." This person in particular has been duped by evangelical movements in the past that all have one curious detail in common: they prohibit masturbation. Half to make fun of him, half to help him, I wanted to share a link with him to the Proud Boys' official website because he wouldn't believe me when I said they ban jackin' it. Facebook Messenger (in private DMs) refused to send a message including the link. "Send failed. Operation could not be completed." I tried an archive.org archive of it, same thing.
It depends on what you want out of a search engine. Do you want a search that tries to give you what it thinks you want? Or do you want a search that functions more like a library index where you're going to get what you searched for, even if that might ironically not really be exactly what you wanted - so the onus of creating a more correct search term is on you.
This isn't a rhetorical question of course. There are major arguments for both. But I think a pretty good chunk of DDG users were more often after the library index than the recommendation, which ideally would be a librarian's recommendation but in reality is more like a stereotypical used-car-salesman-style recommendation.
I might agree with shite. Or I might wish to read it for my own amusement. Or perhaps I want to argue against the shite, in which case, I need to know what shite is out there. In any case, it doesn't matter. I only need the search engine to be an intermediary, not a curator.
Besides, if someone accidentally searches for shite, he can revise his search terms. Just like how you might reword something if a listener was confused.
This "curation" just doesn't need to exist. But that's a moral argument and those applying censorship aren't moral, so won't be partial to it.
I once ended up on Encyclopedia Dramatica before I knew what it was. It was some semi-informational article about some person and then I scrolled down and there was an absolutely disgusting picture I'm not even going to describe here. I, for one, don't mind that it's not on the top of my search results.
The entire job of a search engine is to curate results from a vast internet and boil them down to what is probably interesting for you. You also don't want all kinds of "shock sites" on top if you search for "gun homicide" or "ISIS beheadings". Most likely you want some background information on these things, not "shock sites". That's the service Google and such provide.
ED is an incredible repository of useful factual data dressed as a trolling/shock site. Google is absolutely not de-listing it because they're worried about you. There's much worse content available on Google; the difference is that the other stuff isn't politically sensitive.
But don't you think a query containing the exact name of a website is fairly likely in search of content from that site? In terms of recognizing user intent, I'd consider the need to use a search operator here a failing.
I commented because learning of such special-casing of a site by Google [a] was quite memorable for me at the time; since I also see plenty of other threads around us debating search engine curation, what follows is a closer look at how Google currently handles ED.
I came across some Reddit threads from a certain time in 2014 which noted that the top Google results for some subjects were ED pages that mentioned them. Perhaps ED's downranking was in response to that. Whatever the cause, this is actually a more severe downranking than that famously applied to thepiratebay.org [b]; for example, the query thepiratebay org foo returns only results from the actual site, [c] but the same query for ED returns no results from ED. [d]
I did find two more cases where Google does return a (single) page from ED. The sole query without any operators that does so is encyclopediadramatica online (and its punctuation/whitespace-equivalent variants), for which the Main Page is the first result and the only ED result. [e]
The other case is when searching for a phrase within quotes that occurs nowhere else in Google's index other than on ED. For example, the expected only result of the query "Encyclopedia Dramatica help pages" is the ED page containing that text. [f]
So, to be more precise, ED is severely downranked rather than delisted, albeit with the result that no one can find it on Google without already knowing its URL (or a hapax legomenon quoted from its pages).
-----
[a][b][c][d][e][f] If I may use a simplistic model of Google that determines a score for each potential search result (i.e., 'distinct' crawled URL) by starting with the same initial score for all results and applying to each result a sequence of steps that each change its score, the following is a speculative explanation for all these behaviours (sketched in toy code after the list):
- Following the meat of the algorithm is the downrank-specific-sites step, which decreases the score of each page of a downranked site by some amount (affects both TPB and ED: TPB scores are decreased moderately to implement [b]; ED scores are decreased massively, explaining [a]).
- Then comes the query-contains-site-URL step, which greatly increases the scores of all pages of that site (TPB results are now higher than non-TPB results, explaining [c]; ED scores were decreased so much that they are still lower than all other results, explaining [d]).
- Then comes the query-is-site-URL-exactly step, which makes the score for that site's base URL (which in the case of ED redirects to the Main Page) higher than any other score (explaining [e]).
- Search operators are last, and thus the highest-priority (explaining [f] and the site: operator).
These steps have held up in general where applicable for all the queries that I've tried.
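A toy rendering of that pipeline (my own speculation in code form; the numbers and the downranking table are invented for illustration):

    # Toy, speculative rendering of the scoring pipeline described above;
    # scores and the downranking table are made up for illustration only.
    DOWNRANK = {"thepiratebay.org": 5.0, "encyclopediadramatica": 500.0}

    def score(page, query):
        s = page["relevance"]                   # "meat of the algorithm"
        s -= DOWNRANK.get(page["site"], 0.0)    # downrank-specific-sites step
        if page["site"] in query:               # query-contains-site-URL step
            s += 50.0
        if query.strip() == page["site"] and page["is_main_page"]:
            s += 5000.0                         # query-is-site-URL-exactly step
        if f"site:{page['site']}" in query:     # operators trump everything
            s += 50000.0
        return s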
Question: I've seen Kagi and I'm very interested, is there a way to use it without creating an account? Is the account required due to it still being beta, or it will be required in the future as well?
I think an account is required, for at least a couple of reasons.
1. They will eventually require payment.
2. The service provides a lot of customisation options. Up- and down-ranking domains, for example, and setting up special filters. This kind of customisation can be built up over many years and should carry over with the user between browsers and devices.
Personally, I'm less worried about privacy and more about receiving accurate and unbiased search results. Of course, you can always use a VPN and use a burner email when you create the account.
Why does DDG de-list things in the first place; isn't it in the interest of search engines to be as useful as possible to users, and thus maximise the results provided?
Also curious to know the extent to which Google de-lists things?
Long term, perhaps a decentralised search engine could get around de-listing and provide a more reliable and rigorous search experience.
Before someone chimes in, yes I understand the humanitarian perspective. That's not my point. My point is that DDG is not neutral, and is politically biased.
BTW, if you wanted a biased platform, just use Google; it's significantly better in every way.
Ultimately what I'm getting at is:
There's no market for DDG. Use Google for biased searches, and use other search engines that are not biased (which excludes DDG) for unbiased searches.
"All"[0] in the revision of historical examples sense. Unless one really thinks that filter bubbles have no overlap with censorship.
Maybe I was one of the few who cared more about the filter-bubble angle and less about the "we're selling bottled privacy™" angle [1], and am not interested now that `yegg` has clearly reneged on the former.
Yeah DDG is just a worse google. If I want to use google I might as well use google.
When I want unbiased searches I've been using Kagi, but more are popping up. They approach search differently, so it's useful when Google feels too "sanitized" for certain searches.
> BTW, if you wanted a biased platform, just use Google; it's significantly better in every way.
It's not to me, when was the last time you used it? I haven't used Google in around 4 years now and I'm getting on just fine, majority of searches answered on first page.
If anything I now toggle between Duckduck and Ecosia as Ecosia still isn't 100% up to scratch (frequent 500 errors, slow, results are bad) but I like the idea of my searches planting trees.
I wonder how a decentralized search engine would fight its inevitable SEO war, should it become successful.
I'd expect that all sites wanting to draw traffic would attempt to grab the reins of the search engine to point toward themselves, and the result would be search results ordered by rein-grabbing power.
Not that centralized search engines are immune to this; they're almost as vulnerable (seeing as sponsored search results exist) but the maintainer at least has to balance that with the utility of the search engine overall, to prevent the search engine from falling out of favour.
With a decentralized engine, parties that have deeply invested in manipulating the results will still want the engine to be popular too, but I'm not sure how you resolve the prisoner's dilemma there as a whole.
> I wonder how a decentralized search engine would fight its inevitable SEO war, should it become successful.
> I'd expect that all sites wanting to draw traffic would attempt to grab the reins of the search engine to point toward themselves, and the result would be search results ordered by rein-grabbing power.
I would venture to say that a combination of allow-lists and block-lists from trusted parties, ranked using some kind of distributed web-of-trust system would work reasonably well.
Ages ago, I was a professional P2P developer, and I vaguely remember some of the research papers on P2P censorship-resistant reputation systems. Generally, you have public signing keys that sign ratings (say, -1.0 to 1.0) for both content and other raters. These signed ratings collections are then pushed into a distributed hash table.
The basic idea is that when you rate something, your client also looks up in the DHT other people who have rated the same content with similar ratings. Your client then pulls the latest ratings collections from those people and computes the cosine distance between your ratings and their ratings (over the intersection of content that both of you have rated). Periodically, your client signs and publishes an updated ratings document, where the rating for other raters is the cosine distance. The cosine distance, the size of the ratings intersection set, and maybe some other factors go into deciding which raters get published out in your ratings update.
When you query for the rating of a given piece of content, your client grabs the list of ratings for that content from the DHT. It then pulls the latest ratings published by those raters, computes the cosine distance, and then does something similar to Dijkstra's shortest-path algorithm to recursively search the DHT using these cosine distances as weights. In general, the DHT wouldn't have many signatures stored under the content's hash, but by recursively following the graph of other raters, your client hopefully finds other raters that rate things similarly to you and have rated this content. The path weight to a given rater is the product of cosine distances, and so by using a priority queue for querying, you get something close to a breadth-first search of the ratings graph. Once your client has accumulated enough weight of ratings for the given content, it stops and shows you the weighted average of the ratings (and maybe the weighted std. dev. is displayed as a confidence score to power users who have enabled it).
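A toy sketch of the core primitive (my illustration of the scheme described above, not any specific paper's protocol): similarity computed only over the intersection of content both raters have rated, then used to weight a simple average. The full scheme would additionally walk the rater graph recursively as described.

    # Toy sketch of the core primitive: cosine similarity over the
    # intersection of two raters' signed ratings, then a similarity-weighted
    # average of the ratings seen for one piece of content.
    import math

    def cosine_similarity(a: dict, b: dict) -> float:
        common = a.keys() & b.keys()            # intersection of rated content
        if not common:
            return 0.0
        dot = sum(a[k] * b[k] for k in common)
        na = math.sqrt(sum(a[k] * a[k] for k in common))
        nb = math.sqrt(sum(b[k] * b[k] for k in common))
        return dot / (na * nb) if na and nb else 0.0

    def weighted_rating(my_ratings, raters):
        # raters: (rating_for_this_content, that_rater's_full_ratings_dict)
        # pairs pulled from the DHT; the real scheme follows the rater graph
        # recursively, multiplying weights along the path.
        num = den = 0.0
        for rating, their_ratings in raters:
            w = max(cosine_similarity(my_ratings, their_ratings), 0.0)
            num += w * rating
            den += w
        return num / den if den else None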
Presumably, the UI for the ratings system maps 0 to 5 stars to 0.0 to 1.0 (probably not linearly, more likely the client locally keeps a histogram of the user's ratings and then maps the star rating back to a percentile rating), and the "spam" button rates the content as -1.0.
The tricks come down to the metrics used for how the DHT decides priorities for cache eviction of the per-content ratings and also the per-rater ratings. You don't want spammers or other censors to be able to easily force cache eviction. Getting cache eviction metrics right is the key to having the system scale well while also preventing spammers/censors from evicting the most useful sets of ratings.
What books and/or papers could you recommend for learning all about this: distributed hash tables, cosine distance, reputation systems, the whole deal?
I don’t think anyone wants to use a search engine that never delists anything. Ransomware, Markov-chain junk, plagiarism. A search engine that never delists anything is useless.
The problem is when delisting is used against the end-user’s interests.
The youtube-dl homepage has returned to the listings, as has thepiratebay. However, they still aren't indexing thepiratebay's or youtube-dl.org's contents, so you can't search within the sites; you only get the homepage. The complaint the other day was about the indexes too, so it's only partially fixed.
That's not actually true. Our site search is having issues, so better to just add the site name to the search, but note that youtube-dl.org is just one page and for other sites (that are essentially vertical search engines) you're better off going directly to them since their index is going to be more up to date.
I’ve been using this as it came as the default for a new Brave install I got. I’m oscillating between it and Google: the former because I don’t want to continue to give Google all this free data, the latter because Google’s results are good (or at least reliably what I expect)
Brave’s search is still kind of meh, but I appreciate that they’re trying.
Sometimes when the damage is done, it's done. I'm never going to use DDG ever again.
If this decision was because of legal pressure by Google, I don't see how that got resolved in a matter of days. Which means it wasn't because of Google, but rather a poor decision made by DDG management. How can people trust their product now?
The removals were a rarity. It's not as if you can't add a `!g` bang query to redirect to Google if you can't find something. And DDG is rampant with all sorts of stuff that shouldn't be there, so I don't think they're hellbent on censorship.
If I don’t know about YouTube-dl and I search “download YouTube command line”, how am I going to know that ddg is hiding the best result from me?
This definitely isn’t a small annoyance kind of problem, in my eyes. It’s a deal breaker. I’ll never use ddg again. If they’re going to censor like Google does, then I’m going to use Google because it generally has better results. Ddg needs to offer something beyond what Google offers to make up for their bad results, and they aren’t doing that.
> Ddg needs to offer something beyond what Google offers
If you're searching on google for medical terms it builds a profile on you, and you should be concerned that they could be selling that information to insurers, directly or indirectly.
I've been curious about this as a health care worker. I can't count the number of times I've looked up a client's condition to better help the person. So is Google's profile of me tainted?
I would be somewhat worried that information could be abused now or in the future. Since that data is necessarily going to be noisy, insurers probably aren't going to make black and white decisions based on it, but they could score you somewhat worse over it. You could maybe trust them that their algorithms would be able to determine that since you work in health care and are probably in the top 5% of people who search health terms that your individual data is polluted by your job, but I would never trust an insurer's black-box algorithms. They only need to be statistically correct over the population, they can always fuck you over individually and still make a fantastic profit for themselves.
Is this really going on? Maybe not, but is it worth it to ignore the risk? Can you just use DDG so it isn't a question?
Probably not, because they probably also know your occupation, or you crossed a threshold of too many searches to be normal/direct, etc.
(Except in reality it's less simple than "they know your occupation"; it's a huge cloud of data points from which an AI makes correlations and associations that no human actually knows. It means the searches would also be weighted by indirect things: not only your occupation but your associations. Say you don't have the medical occupation, but your computer makes a lot of medical searches, because your roommate in your college dorm has a medical major, etc. And right now, the AIs are still pretty stupid and absolutely making a lot of obvious unsafe conclusions, but they also get more and more spooky every day.)
But this doesn't make it any better. If they did a perfectly accurate job of profiling you, that is not better than doing an inaccurate job.
That much insight is like being married to someone, where they intimately know all your biases and motivations, know all your buttons, know how to manipulate you, know how to weigh any opinions you might express against their knowledge of where you got every idea you ever had,
except it isn't a marriage, and they aren't subject to all that same vulnerability to you, and they aren't even a human but a corporation, and they have this intimate knowledge of everyone not just one spouse or sibling or best friend.
I wonder if it was unintentional. They rely on Bing's database for their search engine; if these sites disappeared from their data source, might they have disappeared from DDG without any direct intention?
I also find that searching Bing does not result in youtube-dl.org, only the repository, while DDG returns both, but Bing is not the ONLY source for DDG results, at least according to them.
"We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google)."
I've tried comparing Google, Bing and DDG on a private window before, and I didn't find Bing and DDG more similar than Bing and Google. Searching for monkey: https://news.ycombinator.com/item?id=27598329
Is there a search alternative that isn't just using Google or Bing underneath? DDG is just Bing and all the recent filtering and what not that Bing has been doing also applies to DDG. Many things that used to show up on DDG no longer do.
Been using Kagi for the last week and the search results have been SURPRISINGLY good. Like, in the "I don't have to scroll to find it" category of good, and no having to deal with spam sites that just copy the actual answer but somehow rank higher than the original, like currently plagues Google.
> That revenue model is crazy, no way will they last.
Why? Their plan is to be an amazing SE, but only for the comparatively small number of people who are willing to pay. They aren’t VC-funded but bootstrapped, so they don’t need to ruin a good thing for dumb returns.
I believe that their pricing will keep their potential customer base too small to remain commercially viable. I'm sure you've seen a supply and demand curve before. The optimal price point isn't where the price is highest, because then no one buys the product/service. It's actually where sufficient customers can afford the product/service to maximise profit. There are very few people willing to pay $30/month for a search engine. Even if Kagi can convince 1,000 people to do so, I do not believe $30,000 a month in revenue will even keep the lights on.
It is highly likely that $5/month, for example, would attract millions of users, and $25 million/month would certainly keep the lights on.
$30 is not the price unless you need a lot of searches. Looking at my consumption tab, I’m usually below $10/month, so I’ll probably end up in a $10-$20 plan. Some napkin math gives around 2,550 searches per month at $30, Vlad said in Discord $30 is for people who want around 100 searches every day.
Also, you can’t forget that every search actually costs them money, as every single search incurs the fee from Google and Microsoft for their APIs. I think there was an idea around having a $5 trial plan with only 10 searches a day or something. Unlimited searches for $5 would be "We are losing money on every customer, but we are making it up in volume"
>Also, you can’t forget that every search actually costs them money, as every single search incurs the fee from Google and Microsoft for their APIs. I think there was an idea around having a $5 trial plan with only 10 searches a day or something. Unlimited searches for $5 would be "We are losing money on every customer, but we are making it up in volume"
Are they really querying Google and Microsoft for every search? This seems highly inefficient. Then Kagi is basically just a fancy wrapper. I thought their ambition was to build out Teclis and TinyGem to serve more and more content and rely less and less on other search providers. At the very least, I don't think they need to hit Google every time someone queries "games." This is hopefully queried once per period and stored in Teclis/cached.
To be blunt, if this is just a fancy wrapper/aggregator then there is no reason to use this over you.com.
It uses data from both Google and Bing, but reorders it substantially. In addition, they use data from their own index and some different sources. The interesting part is the ranking.
And I mean, it should be easy to compare. For me, kagi is substantially better than any other SE I tried, so I’ll pay for it when it releases, but just test it yourself.
We are not optimizing for revenue but for staying independent and sustainable. (Kagi dev here)
In other words it does not matter if we could have $100,000/month with another price point if it would cost us $150,000 to do so (every search has a fixed cost that does not go down with scale).
Our current price point includes a tiny margin that would allow us to break even at around 50,000 users. That is what we are optimizing for.
If I may ask, why do you believe that having more users and profit would undermine your ability to stay independent and sustainable? I would have thought that more users and profit would do the opposite: secure Kagi's ability to stay independent and sustainable. Intentionally reducing users and profit doesn't make any sense to me, given your stated goal.
I think you were arguing for optimal price point which would maximize revenue, not profit?
Our profit margin is already razor thin, I do not think it can be further optimized. Meaning we can not further reduce price to get more users without operating at a loss.
It sounds like you have a high marginal cost. As in, you pay every time someone searches for something. Can this not be reduced over time with greater efficiency of scale?
FYI I use Kagi and like it. I’ve replaced Google :)
Nope, the cost is pretty much fixed. What we charge the users is just a hair above it so we have a hope of reaching sustainability at around 50,000 users. In other words there is nothing to optimize further.
And it is only about one cent to search the entire web in 300 ms, with everything else Kagi does; that has to be impressive? :)
Very impressive :) So if Kagi profits $x per search, wouldn't they profit more with more searches? More users, more searches, more profit, more sustainable.
They are not back, it's misleading bullshit. "Pirate bay" was never hidden from search in that sense. Look for "<something> torrent" that is found on thepiratebay by Brave (I guess I won't provide the exact search term I tested, sorry). Now try to look for it on DDG. You'll find many torrent sites, but not thepiratebay.org
DDG is a fringe search engine. They are not doing themselves any good by crippling their search results. It's just going to push the 10 people still using their stuff away.
Sorry to be that guy but what was going on? This is alarming as someone who uses DDG and youtube-dl kinda too much probably.
Also, if anyone else is slightly hung over, searches DDG in DDG, and gets super confused for three seconds because DDG is apparently a rapper, just know you aren't alone. I'm right here with you, whoever you are.
From an April 17 Twitter thread from Gabriel Weinberg at DuckDuckGo:
> ... [W]e are not "purging" YouTube-dl or The Pirate Bay and they both have actually been continuously available in our results if you search for them by name (which most people do). Our site: operator (which hardly anyone uses) is having issues which we are looking into.
(Note that "site:" in his comment is how you restrict DDG searches to a specific domain e.g. "site:example.com")
>> Our site: operator (which hardly anyone uses) is having issues which we are looking into
I noticed problems with their site: operator too earlier today, and still now as well. In my case, when I used it in a search I saw that the word “site” itself was also bolded in some results. So it looks like it is using the operator itself also as a search term, which it shouldn’t.
I find it surprising that hardly anyone uses it though.
That's incomplete though: when I tried it (in response to seeing it posted here a couple of days ago) I couldn't get any yt-dl.org search results, i.e. adding site: appeared to function correctly, but returned no results.
> For example, searching for “site:thepiratebay.org” is supposed to return all results DuckDuckGo has indexed for The Pirate Bay’s main domain name. In this case, there are none.
> This whole-site removal isn’t limited to The Pirate Bay either. When we do similar searches for ["site:1337x.to", "site:NYAA.se", "site:Fmovies.to", site:"Lookmovie.io"], and ["site:123moviesfree.net"], no results appear.
So long, DDG. Censorship before they even gained enough momentum to get users to stick will undoubtedly be the beginning of the end for them. Looking forward to the next Google competitor. Eventually, someone will get it right.