“YouTube-dl” and “Pirate Bay” back on DDG (fosstodon.org)
470 points by ikt on April 17, 2022 | 216 comments


I had never used youtube-dl until the story happened. I downloaded it for Windows and its speeds were throttled to around 50 kB/s. Posters on Stack recommended the yt-dlp fork, which I tried, and it was 5-10 MB/s. Just FYI.


Yeah, youtube-dl is missing workarounds for the throttling implemented by Google. youtube-dl is pretty much unmaintained compared to yt-dlp:

https://github.com/ytdl-org/youtube-dl/graphs/commit-activit...

https://github.com/yt-dlp/yt-dlp/graphs/commit-activity


And yt-dlp uses an 'Android API' to stop throttling


One has to execute functions from base.js^1 to modify the "n" URL parameter and the "sig" parameter to get the fastest download speeds. (One can still download videos with the original n parameter, or without the n parameter, but download speeds will be slower.)

By itself, using the ANDROID API instead of the WEB API^2 does nothing to affect download speeds. I can block yt-dlp's POST request indicating the client and API name and this has no effect on download speed.

A website that forces users to run Javascript in order to get faster download speeds. This is not a new idea.

1. https://www.youtube.com/s/player/{player_version}/player_ias...

2. From the YouTube video page JSON:

ANDROID API KEY

AIzaSyA8eiZmM1FaDVjRy-df2KTyQ_vz_yYM39w

WEB API KEY

AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8


yt-dl and yt-dlp use Python to interpret the Javascript functions that modify "n" and "sig", but a faster language could be used. For example, V8 is written in C++.

One of the reasons yt-dl and yt-dlp are so slow is because they do (too) many other things besides running the Javascript, modifying n and sig, and spitting out a fast download URL. Before they can run the JS, the video page needs to be retrieved, but waiting for Python to start up and do this is slow. A YouTube video page can be retrieved much faster using netcat, outside of Python. YouTube video can be downloaded much faster using an HTTP client like tnftp, directly, outside of Python. YouTube video can be converted much faster using ffmpeg, directly, outside of Python. And so on. These programs start instantaneously when compared to the slow start up time of Python. The Python startup latency is unbearable.

At the very least yt-dl and yt-dlp should accept an already downloaded YouTube video page as input instead of forcing the user to use Python (or Python calling another program) to download the page. The recalculation of "n" and "sig" does not have to occur within seconds of retrieving the video page. There is no need to require the user to use Python for downloading webpages. The values in the YouTube video page are good for a substantial amount of time.

Both scripts have an option to output the new n and sig values in a download URL or JSON containing the download URLs, so the user can use an HTTP client directly, outside of Python. But the user still has to use Python to download each video page. Using an HTTP client directly would be faster.
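
For example (a minimal sketch of that option; the video ID is a placeholder and format 22 is just an example), the -g/--get-url and -j/--dump-json flags present in both tools let an external HTTP client do the actual transfer:

  # print only the direct media URL, then hand the transfer to curl
  yt-dlp -f 22 -g 'https://www.youtube.com/watch?v=VIDEOID' | xargs -n1 curl -L -o video.mp4
  # or dump the full JSON (all formats with their final download URLs) for your own tooling
  yt-dlp -j 'https://www.youtube.com/watch?v=VIDEOID' > video-info.json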

Using yt-dl and yt-dlp just to output optimal download URLs feels like overkill.


This is complete nonsense. The overwhelming majority of video download time is spent transferring the actual video data, limited by throttling or network speed. The time it takes for these tools to start up and deal with the API/JS stuff is inconsequential for any video longer than a minute or so. And even then most of the time is network latency for the API stuff, not local processing. Netcat isn't going to go any faster.

Absolutely nobody thinks optimizing the meta/API processing of yt-dlp & co is worth it. This is exactly why we have high level programming languages that make all of this much easier, instead of trying to write HTML and JS parsing in plain C. Keep in mind these tools support dozens or hundreds of websites, not just YouTube.

If you think rewriting yt-dlp in C is worth it, go right ahead, but you're not going to make it significantly faster; you're just going to make maintenance a much bigger pain for yourself. Pick the right tool for the job. Python is absolutely (one of) the right tools for this job.

(For the record: I use C and Python on a daily basis, and will use whatever language is appropriate for each situation.)


"Python is absolutely (one of) the right tools for this job."

This opinion assumes it is a single job. I see multiple jobs. The number of "options" provided by yt-dl(p) gives us a clue.

There is nothing wrong with preferring to use larger, more complicated, "multi-purpose" utilities. There will always be plenty to choose from.

However the idea of using smaller, less complicated, single purpose utilities is not "nonsense". It makes sense in many cases and some users may prefer it.

The statements I make about speed are from day-to-day experience not conjecture.


What is especially slow on some hardware like the original PinePhone is that youtube-dl and yt-dlp are actually executables that extract themselves at runtime. Python itself is quite fast to start.


I look forward to seeing your implementation of this solution.


The startup time for netcat is less than for yt-dl/yt-dlp.

Plus I can request multiple video pages over a single TCP connection with netcat.

For example, in a single TCP connection, with a list of 30 videoIds, I can get initial sig and n values for 413 videoIds. No wait time for netcat to decompress or start up. Using netcat is quite fast. Then I have utilities written in C to extract URLs from stdin. As such, all I need is a utility to update the sig and n values in the download URLs to make them fast ones instead of throttled.

How long would it take to get sig and n values for 30 videoIds, let alone 413, with yt-dl or yt-dlp over a single TCP connection? The startup time for yt-dl/yt-dlp plus the time waiting for the downloading of each video page makes it much, much slower. These are "do-everything" scripts that are, as a result, quite inflexible.
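
Roughly, the shape of it is something like this (a simplified sketch with placeholder video IDs, not my exact setup; since youtube.com is HTTPS-only you need a TLS-capable netcat such as ncat --ssl or openssl s_client rather than plain nc):

  {
    printf 'GET /watch?v=VIDEOID1 HTTP/1.1\r\nHost: www.youtube.com\r\nConnection: keep-alive\r\n\r\n'
    printf 'GET /watch?v=VIDEOID2 HTTP/1.1\r\nHost: www.youtube.com\r\nConnection: close\r\n\r\n'
    sleep 5   # crude: keep stdin open long enough for both responses to arrive
  } | ncat --ssl www.youtube.com 443 > pages.html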


There is a pure js version here: https://www.npmjs.com/package/ytdl-core


Fixed typos/misspellings, edited for more clarity:

yt-dl and yt-dlp use Python to run the Javascript functions that modify the initial "n" and "sig" values to make non-throttled download URLs, but a faster language could be used to interpret and run the snippet of Javascript. For example, V8 is written in C++, not Python.

One of the reasons yt-dl and yt-dlp are so slow, IMO, is because they do (too) many other things besides running the Javascript, modifying n and sig, and spitting out a fast download URL. Before they can run the JS, the video page needs to be retrieved. I have self-created utilities for downloading webpages with TCP clients that are significantly faster and more flexible than yt-dl/yt-dlp. I use these small dedicated programs every day. I am used to the speed. Waiting for yt-dl/yt-dlp to start up in order to download web pages is annoying. The latency is unbearable.

At the very least yt-dl and yt-dlp should accept an already downloaded YouTube video page as input instead of forcing the user to use Python (or Python calling another program) to download the page. The recalculation of "n" and "sig" does not have to occur within seconds of retrieving the video page. There is no need to require the user to use Python for downloading webpages. The values in the YouTube video page are good for a substantial amount of time.

Both yt-dl and yt-dlp have an option to output the new n and sig values in a download URL or as JSON containing the download URLs, so the user can use an HTTP client directly to download video, without needing to launch yt-dl/yt-dlp. But the user still has to use yt-dl/yt-dlp to download each video page. Using a TCP client directly would be faster.

Using yt-dl and yt-dlp just to output optimal download URLs feels like overkill. It would be nice to have a program that just focuses on running the necessary JS in base.js in order to output optimal download URLs. Then the user can use whatever programs she wants, directly, for downloading HTML/JSON, extracting URLs, downloading video files, converting video files, etc. Downloading video from most websites is generally easy. I never need a program like yt-dl/yt-dlp. It is only YouTube that plays games with users, trying to get them to enable Javascript and be tracked. One need only look at the size of the extractors in yt-dl/yt-dlp as evidence. The extractor for YouTube is 3x the size of the next largest, and over 10-20x the size of most of the others. I want a small utility that just focuses on YouTube. A simpler solution instead of a massive project.


If you find yt-dlp particularly slow, you can use aria2c to perform the download itself. It doesn't solve any startup latency problems, but it generally can do the HTTP requests faster.


I expect the ability to use APIs like this will be hindered once remote attestation becomes the norm.


This is a reason why we must diversify these technologies. If the same company builds everything (mobile, OS, search engine, YouTube), it will lead to a debacle in the future. If there is diversification, it simply takes toil and sweat to convince everyone.


This is why I come on HN, I always learn something useful.


The workaround for vanilla youtube-dl is to use it with aria2, with options like:

  --external-downloader aria2c
  --external-downloader-args "--continue --max-concurrent-downloads=3 --max-connection-per-server=3 --split 3 --min-split-size 1M"
(possibly in your config file)

Also, neither --format best nor --format bestvideo chooses the best encoding in all cases; they use bitrate as a heuristic for quality, and a less efficient codec can have higher bitrate but worse quality, resolution, or framerate. The workaround for this is specifying --format with an enumeration of every combination of codec, resolution, and framerate in preferred order, which goes like this:

  --format "(bestvideo[vcodec^=av01][height>=4320][fps>30]/bestvideo[vcodec^=vp9.2][height>=4320] ...
Here's a full example (hmm... they're using it with yt-dlp, which I thought had fixed this?):

https://github.com/TheFrenchGhosty/TheFrenchGhostys-Ultimate...

I think there's a bit of variation in the exact order among the config files found online. If your goals are archival, consider also retrieving metadata, thumbnail, and subtitles in all languages; I also have in my config the options:

  --verbose
  --download-archive ./ytdl-archive.txt
  --cookies ./ytdl-cookies.txt
  --merge-output-format mkv
  --add-metadata --all-subs --embed-subs
  --write-info-json --write-thumbnail
  --no-overwrites --continue
  --force-ipv4
(the only remaining workaround for age-restricted videos is to give it cookies extracted from a browser with a real Google account logged in)
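
(If you are on yt-dlp rather than vanilla youtube-dl, it also has a flag to pull cookies straight from a browser profile instead of a cookies.txt export; the browser name below is just an example:)

  --cookies-from-browser firefox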


Am I the only one totally chill with the throttling? Look, I'm not sure if it's the best idea for their own resource utilization (there are probably peak times when more throttling is better, and quiet internet times when throttling doesn't make sense). As long as they are being gracious hosts and allowing downloads, I'm good with being throttled, I guess?


50 kB/s means you'll be downloading a 10min video like 10 hours.


> 50 kB/s means you'll be downloading a 10min video like 10 hours

Very misleading phrasing: in 10 hours you would be downloading a video that is ~10 hours long, which (of course) could have been downloaded in a fraction of that time.

The throttling has the user downloading at a speed similar to that required to view the video.


No, the throttling is much more aggressive than real time. I suspect it's Google's way of sneakily breaking ancient YouTube clients that won't run arbitrary JS, as a countermeasure for downloaders, without really breaking them all the way, so users of e.g. ancient smart TVs just think their network is (unusably) slow instead of seeing an outright error, so they don't complain en masse that their TV no longer works properly.

Typical video bitrates for bog standard 1080p30 will be in the 1MB/s range, so the throttling is around 20x slower than real time.


Right: I forgot about the higher end encodings. 50kB/s is a speed similar to that of the "base" 360p ("format 18") - ~100MB every half hour.


Sorry, one detail (cannot edit the former reply): 1080p30 will be in the 1Mbit/s range (not «1MB/s»). That's maybe ~2.5x slower (not 20x).


I played a random 1080p YT video and the player-reported data rate is at least 2.5 MB/s, peaking to 20 MB/s (roughly 10 MB/s average). 4K is of course much more than that.


I just took a video as a parameter - a recent engineering animation - and got the following:

  format  18 -  640x360  AVC1, V+A :  ~50kB/s
  format  22 - 1280x720  AVC1, V+A :  ~60kB/s
  format 137 - 1920x1080 AVC1, V   : ~128kB/s
  format 400 - 2560x1440 av01, V   : ~450kB/s
  format 401 - 3840x2160 av01, V   : ~850kB/s
Are you sure you are not reporting megabits? When you mention the «player reported data rate», are you sure that is not the "connection speed"? 10MB/s means downloading 36GB/h (one CD per minute, dozens of gigabytes per hour)...


You're right it's somehow weird. Perhaps the player shows decompressed figures? Check for yourself - right-click on YT video player, show developer/debug info, and there's a chart.


I checked that interface (if you mean the "Stats for nerds"), hence I asked you if you were not by chance reading the "connection speed", which could be the closest to the values you reported and may be misleading at a glance.


For me the throttling without yt-dlp was often slower than the video bitrate, making it unusable for streaming.


I noticed this recently, too. I frequently scrape concerts from YouTube by adding them to a private playlist and then running a script on my HTPC to pull down everything from that playlist. As recently as six months ago, pulling down a 2-3GB playlist file (maybe a 30-40 minute concert at 720P) took ten to fifteen minutes. The last time I tried it, it estimated several hours.


FWIW, I just recently wrote a Go program that watches a YouTube playlist and downloads new videos to a configurable path: https://github.com/raine/ytdlwatch

I use it to download videos into a Plex library.


If you're downloading a playlist you could run it in parallel to achieve your max network bandwidth speed:

  # grab every video ID in the playlist, then run up to 200 youtube-dl audio downloads in parallel
  function ytp() { youtube-dl --get-id "$1" | xargs -I '{}' -P 200 youtube-dl -i --embed-thumbnail --add-metadata -f 'bestaudio[ext=m4a]' -o '%(title)s.%(ext)s' 'https://youtube.com/watch?v={}'; }


FireDM (https://pypi.org/project/FireDM) is an awesome front end for ytdlp for those interested.


GitHub Links from the project page all get a 404.

https://github.com/firedm/ shows no public repos and of course https://github.com/firedm/FireDM gets a 404


Yeah, trying to delete `youtube-dl` from GitHub triggered a huge Streisand Effect[1].

[1]: https://en.wikipedia.org/wiki/Streisand_effect


I always find jwz's youtubedown works flawlessly for me. https://www.jwz.org/hacks/youtubedown


Do not direct link to jwz.org from HN. Here's a click-safe link:

https://dereferer.me/?https%3A//www.jwz.org/hacks/youtubedow...


It genuinely shocks me that there are still people who don't disable sending the referer header cross-origin in the browser: I have not encountered a single website that breaks when setting `network.http.referer.XOriginPolicy` to 1 in Firefox, and only 2 or 3 sites that break when setting it all the way to 2.

It not only completely prevents stuff like this, it profoundly increases your privacy on the web by preventing sites from tracking which domain you came from. There is no good reason any site needs to know that. I am surprised that Mozilla hasn't simply made this the default setting for all users.
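
(If you would rather set it once in a user.js than flip it in about:config, the pref looks like the line below; 1 sends the referer cross-origin only when the base domains match, 2 only when the full hostnames match, and 0 is the always-send default:)

  user_pref("network.http.referer.XOriginPolicy", 1);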


> I am surprised that Mozilla hasn't simply made this the default setting for all users.

I was under the impression it was, I doubt I'm the only one, so thanks for drawing attention to it.


>> `network.http.referer.XOriginPolicy` to 1 in Firefox

Thanks! Just checked and it is at zero for the default. Reviewed https://wiki.mozilla.org/Security/Referrer and learned something new.


There is also `network.http.referer.spoofSource`, which I have used instead of that.


This breaks in the case that the endpoint actually does use the referer for something. I have actually encountered plenty of cases where a site will use the referer internally, i.e. on the same origin, to check something.


Yeah, that's fair, but I haven't really noticed it myself. Then again, I disable JavaScript by default and have compiled a lot of other stuff into the netwerk module that probably breaks sites even more; to the degree it's noticeable, I just don't use the site.


The Wayback Machine interface breaks.


Do you have an example of this? What exactly breaks? I use Wayback all the time and have had no obvious problems.


Has been the case for a bunch of people with them turned off for a while now:

https://old.reddit.com/r/WaybackMachine/comments/kzvzxl/fail...


Unless I'm mistaken, you've misunderstood the issue. From your link:

> it happened to me and I figured out it started when I disabled website.referrersEnabled

So if you completely disable sending the referer header, it breaks. This would probably also happen if you set `network.http.sendRefererHeader` to 0 or 1.

But that's not what I suggested! I only suggested disabling sending the header externally, to other sites, when the host domain (*.example.org) changes. In fact, someone in the thread you linked says that doing this instead of disabling it completely fixed the issue for them:

> ok I left network.http.sendRefererHeader on 2 (had it on 1), and changing network.http.referer.XOriginPolicy to 2 it works


I think people don’t know about it. I’m going to do this now that I do. Thank you!


> Do not direct link to jwz.org from HN

Why?


Because the admin detects HN referer explicitly and presents a joke page.


Perhaps I’m in the minority, but I find this kind of subversion amusing. Like all good jokes, it’s pretty close to the mark.


I agree, so it’s a safe bet you are in the minority.


I've since noticed...

Just copy the link to not send any referrer information.


Does not work either because apparently bit.ly's 301 preserves the Referer.


Weird, I had tested it. I’ve updated to a de-referrer site instead.


Might be a browser difference (firefox here). Thanks for the update, it works better.


oooohhhhhhh. The "egg". My bad.


How often do you update it? It seems to have seen quite a few updates since the beginning of 2022.


Every time I go to use it I grab a fresh copy.


I miss when youtube-dl wasn't getting so much attention and just worked. Ah well.


I started getting systematic

> Connection reset by peer

in a script that I have that downloads podcasts and immediately transcodes to low bitrate opus using ffmpeg.

It's a real bummer...
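
For anyone wanting to reproduce that kind of pipeline, here is a rough sketch with a placeholder URL (the flags exist in both youtube-dl and yt-dlp):

  # grab the best audio stream and let the ffmpeg post-processor convert it to ~32 kbps Opus
  yt-dlp -f bestaudio -x --audio-format opus --audio-quality 32K -o '%(title)s.%(ext)s' "$EPISODE_URL"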



DDG has been de-listing a ton of stuff recently. They recently de-listed rdrama, a reddit-trolling website that came up on HN last year. What, exactly, do they think people use them for? If I wanted "result curation" or whatever euphemism for censorship, I would just use Google.


They've removed Yandex and are now at the mercy of Bing's censorship.


Amazing that nowadays if you want the best "uncensored" results you have to go to the Russian Yandex (at least on all topics unrelated to Russia). How did our society get to such a point...


It has been like that for a long time, actually. I've seen examples where a somewhat political search query (in Russian) yields completely different results on Google and Yandex (it's actually hard to tell which one "tweaks" the results, and what tweaking even really means, when you think about it; my guess would be both, but differently). But in the sense of intentionally hiding something "neutral" that I know perfectly well must be there, but isn't (and it seems like you can guess why), I've never actually noticed it on Yandex, while on Google it has been like that for the whole decade already. (I don't find it that surprising, really. To me, it's perfectly logical, but I won't elaborate, because it would be political and extremely unpopular on HN, which seems like a worthless flamewar topic to me.)

But in my experience Yandex just isn't as good at searching things in English as DDG/Bing. At least, I always preferred them for most use cases, except for some really specific ones. So it's a pity that this thing with DDG happened. I guess I'll still continue to use it for now, because I don't want to hurt my productivity over that thing (and switching to a search engine I would have to adapt to most certainly would do that). But we'll see.


I suppose that's a real use case for a meta search engine: send queries to different search engines, then interlace the results removing duplicates.


DDG was supposed to be like this originally, but they stopped using Yandex a while ago so now it's just a wrapper on Bing...


i mean yes, it's suboptimal. but hardly news that for the real deal, you gotta leave the main road.


If that isn't a rhetorical question... capitalism.



> capitalism

If capitalism caused censorship you would expect the least capitalist places to have the least censorship but the opposite is true.


If capitalism causes censorship, that doesn't mean it's necessarily the worst at causing censorship.

Additionally, we don't live in a stationary society, so, whatever capitalism did or did not cause in the past might not apply as-is today.


Capitalism (as practiced by democracies) allows you to start an alternative site that doesn't censor results as well. Totalitarianism (Communism) would not have. Choose your poison from amongst the various governments that have succeeded (for a while) in the past as I don't think Utopia is a possibility with humans as messed up as we are.


I'm not sure what your point is. The GP commenter asked what brought us here... it was capitalism. It was people starting other sites that didn't censor or didn't have ads or didn't do this or that. The winners of capitalism are what we have now.


right, russians went from cyber-shady to cyber-pure-evil and now everyone has to cut ties...


Man, that's interesting, they removed rdrama.net but left KiwiFarms up, and now the main result for "rdrama" is the KiwiFarms thread about it


They've "memory-holed", 1984-style, other conservative sites as well. Just another narrative-enforcing search-engine.


Regardless of whether you support or oppose this kind of content, there are perfectly good reasons to search for it and share it. This whole "don't talk about it, don't look at it" strategy that tech companies are taking is downright scary. I support identifying and transparently outing state actors. I'm not comfortable with it, but even censoring or caveating content based on author or institution is better than censoring based on the content itself.

For example, someone close to me expressed support for the "Proud Boys." This person in particular has been duped by evangelical movements in the past that all have one curious detail in common: they prohibit masturbation. Half to make fun of him, half to help him, I wanted to share a link with him to the Proud Boys' official website because he wouldn't believe me when I said they ban jackin' it. Facebook Messenger (in private DMs) refused to send a message including the link. "Send failed. Operation could not be completed." I tried an archive.org archive of it, same thing.


I prefer to say “shite removal” after the last half a decade.

At least TPB adds value to peoples lives.


It depends on what you want out of a search engine. Do you want a search that tries to give you what it thinks you want? Or do you want a search that functions more like a library index where you're going to get what you searched for, even if that might ironically not really be exactly what you wanted - so the onus of creating a more correct search term is on you.

This isn't a rhetorical question of course. There are major arguments for both. But I think a pretty good chunk of DDG users were more often after the library index than what ideally would be a librarian recommendation, but in reality is more like a stereotypical used car salesman style recommendation.


I would be fine with either. But what I really don't want is a search engine that gives me what it thinks I should want.


The whole point is to have a place where you can see shite if you want. I don't need an online nanny.


I might agree with shite. Or I might wish to read it for my own amusement. Or perhaps I want to argue against the shite, in which case, I need to know what shite is out there. In any case, it doesn't matter. I only need the search engine to be an intermediary, not a curator.

Besides, if someone accidentally searches for shite, he can revise his search terms. Just like how you might reword something if a listener was confused.

This "curation" just doesn't need to exist. But that's a moral argument and those applying censorship aren't moral, so won't be partial to it.


Yes, the Overton Window must be enforced. We have always been at war with East Asia.


Removal of TPB was a positive too. Can't imagine how much loss of revenue these pirate sites have caused over the years.


Luckily searching "Marsey the Cat" on Bing still yields the correct knowyourmeme and Twitter pages.


> If I wanted "result curation" or whatever euphemism for censorship, I would just use Google.

Both youtube-dl and the pirate bay are available on google.


Now we have to wait for DDG to manually fix their results every time Bing censors something? Great…

I wonder what Yandex is like these days…


Isn't DDG just reskinned bing anyhow?


Well, youtube-dl.org is still delisted from bing, so clearly something is different.


Bing delisted rdrama too. I think it is related somehow (?)


I noticed some time ago that Google refuses to offer up Encyclopedia Dramatica unless you use the site: operator.

There was some drama in 2010 over ED being censored in Australia, but it looks like Google has since quietly delisted it completely.


I once ended up on Encyclopedia Dramatica before I knew what it was. It was some semi-informational article about some person and then I scrolled down and there was an absolutely disgusting picture I'm not even going to describe here. I, for one, don't mind that it's not on the top of my search results.

The entire job of a search engine is to curate results from a vast internet and boil it down to what is probably interesting for you. You also don't want all kind of "shock sites" on top if you search for "gun homicide" or "ISIS beheadings". Most likely you want some background information on these things, not "shock sites". That's the service Google and such provide.


ED is an incredible repository of useful factual data dressed as a trolling/shock site. Google is absolutely not de-listing it because they're worried about you. There's much worse content available on Google; the difference is that the other stuff isn't politically sensitive.


ED is an amazing time capsule of the 4chan and SA adjacent cultures that sprouted up in the early 2000s. Sad to see the state it's in now.


But don't you think a query containing the exact name of a website is fairly likely in search of content from that site? In terms of recognizing user intent, I'd consider the need to use a search operator here a failing.

I commented because learning of such special-casing of a site by Google [a] was quite memorable for me at the time; since I also see plenty of other threads around us debating search engine curation, what follows is a bit more detail.

I came across some Reddit threads from a certain time in 2014 which noted that the top Google results for some subjects were ED pages that mentioned them. Perhaps ED's downranking was in response to that. Whatever the cause, this is actually a more severe downranking than that famously applied to thepiratebay.org [b]; for example, the query thepiratebay org foo returns only results from the actual site, [c] but the same query for ED returns no results from ED. [d]

I did find two more cases where Google does return a (single) page from ED. The sole query without any operators that does so is encyclopediadramatica online (and its punctuation/whitespace-equivalent variants), for which the Main Page is the first result and the only ED result. [e]

The other case is when searching for a phrase within quotes that occurs nowhere else in Google's index other than on ED. For example, the expected only result of the query "Encyclopedia Dramatica help pages" is the ED page containing that text. [f]

So, to be more precise, ED is severely downranked rather than delisted, albeit with the result that no one can find it on Google without already knowing its URL (or a hapax legomenon quoted from its pages).

-----

[a][b][c][d][e][f] If I may use a simplistic model of Google that determines a score for each potential search result (i.e., 'distinct' crawled URL) by starting with the same initial score for all results and applying a sequence of steps to each result that each change its score, the following is a speculative explanation for all these behaviours:

- Following the meat of the algorithm is the downrank-specific-sites step, which decreases the score of each page of a downranked site by some amount (affects both TPB and ED: TPB scores are decreased moderately to implement [b]; ED scores are decreased massively, explaining [a]).

- Then comes the query-contains-site-URL step, which greatly increases the scores of all pages of that site (TPB results are now higher than non-TPB results, explaining [c]; ED scores were decreased so much that they are still lower than all other results, explaining [d]).

- Then comes the query-is-site-URL-exactly step, which makes the score for that site's base URL (which in the case of ED redirects to the Main Page) higher than any other score (explaining [e]).

- Search operators are last, and thus the highest-priority (explaining [f] and the site: operator).

These steps have held up in general where applicable for all the queries that I've tried.


Yep, I think the delisting comes from Bing, not DDG since DDG just pulls all their results from Bing since they stopped also using Yandex.


I've started using Kagi myself


Question: I've seen Kagi and I'm very interested, is there a way to use it without creating an account? Is the account required due to it still being beta, or it will be required in the future as well?


I think an account is required, for at least a couple of reasons.

1. They will eventually require payment.

2. The service provides a lot of customisation options. Up and down ranking domains, for example, and setting up special filters. This kind of customisation can be built over many years and should carry with the user between browsers and devices.

Personally, I'm less worried about privacy and more about receiving accurate and unbiased search results. Of course, you can always use a VPN and use a burner email when you create the account.


Why does DDG de-list things in the first place; isn't it in the interest of search engines to be as useful as possible to users, and thus maximise the results provided?

Also curious to know the extent to which Google de-lists things?

Long term, perhaps a decentralised search engine could get around de-listing and provide a more reliable and rigorous search experience.


DDG's CEO is clearly biased. It's no longer a neutral platform:

https://twitter.com/yegg/status/1501716484761997318?s=20&t=9...

Before someone chimes in, yes I understand the humanitarian perspective. That's not my point. My point is that DDG is not neutral, and is politically biased.


BTW, if you wanted a biased platform, just use Google, it's significantly better in every way.

Ultimately what I'm getting at is: There's no market for DDG. Use Google for biased searches, and use other search engines that are not biased(which excludes DDG) for unbiased searches.

What are the use cases for DDG?


Few people switched to DDG because of censorship, the main DDG selling point is privacy. All of their marketing stresses that aspect.


"All"[0] in the revision of historical examples sense. Unless one really thinks that filter bubbles have no overlap with censorship.

Maybe I was one of the few that cared more about the filter-bubble angle and less about the "we're selling bottled privacy™" angle [1], and am not interested now that `yegg` has clearly reneged on the former.

[0] https://techcrunch.com/2011/06/20/duckduckgo-to-google-bing-...

[1] https://pictobar.tumblr.com/post/63785124046/the-banality-of...


Yeah DDG is just a worse google. If I want to use google I might as well use google.

When I want unbiased searches I've been using Kagi, but more are popping up. They approach search differently, so it's useful when Google feels too "sanitized" for certain searches.


> BTW, if you wanted a biased platform, just use Google, its significantly better in every way.

It's not to me, when was the last time you used it? I haven't used Google in around 4 years now and I'm getting on just fine, majority of searches answered on first page.

If anything I now toggle between Duckduck and Ecosia as Ecosia still isn't 100% up to scratch (frequent 500 errors, slow, results are bad) but I like the idea of my searches planting trees.


Recently Google started ignoring double-quotes entirely which makes it pretty much useless for most of my daily search needs.


DDG respects privacy, or at least claims to. Google has nothing but contempt for it, and thrives on spying on you.


I use them as default search on my Android devices, to avoid the cookie banner of Google in private browsing mode.


I removed DDG from my browsers after his tweet. They are no different from Google.


I wonder how a decentralized search engine would fight its inevitable SEO war, should it become successful.

I'd expect that all sites wanting to draw traffic would attempt to grab the reins of the search engine to point toward themselves, and the result would be search results ordered by rein-grabbing power.

Not that centralized search engines are immune to this; they're almost as vulnerable (seeing as sponsored search results exist) but the maintainer at least has to balance that with the utility of the search engine overall, to prevent the search engine from falling out of favour.

With a decentralized engine, parties that have deeply invested in manipulating the results will still want the engine to be popular too, but I'm not sure how you resolve the prisoner's dilemma there as a whole.


> I wonder how a decentralized search engine would fight its inevitable SEO war, should it become successful.

> I'd expect that all sites wanting to draw traffic would attempt to grab the reins of the search engine to point toward themselves, and the result would be search results ordered by rein-grabbing power.

I would venture to say that a combination of allow-lists and block-lists from trusted parties, ranked using some kind of distributed web-of-trust system would work reasonably well.


Ages ago, I was a professional P2P developer, and I vaguely remember some of the research papers on P2P censorship-resistant reputation systems. Generally, you have public signing keys that sign ratings (say, -1.0 to 1.0) for both content and other raters. These signed ratings collections are then pushed into a distributed hash table.

The basic idea is when you rate something, your client also looks up in the DHT other people who have rated the same content with similar ratings. Your client then pulls the latest ratings collections from those people, and computes the cosine distance between your ratings and their ratings (over the intersection of content that both of you have rated). Periodically, your client signs and publishes an updated ratings document, where the rating for other raters is the cosine distance. The cosine distance, the size of the ratings intersection set, and maybe some other factors go into deciding which raters get published out in your ratings update.
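
(For reference, the cosine measure over the intersection I of items both raters have scored is the standard one: sim(a, b) = sum over i in I of a_i*b_i, divided by sqrt(sum of a_i^2) * sqrt(sum of b_i^2); it runs from -1 for opposite ratings to +1 for identical ones.)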

When you query for the rating for a given piece of content, your client grabs the list of ratings for that content from the DHT. It then pulls the latest ratings published by those raters, computes cosine distance, and then does something similar to Dijkstra's shortest-path algorithm to recursively search the DHT using these cosine distances as weights. In general, the DHT wouldn't have many signatures stored under the content's hash, but by recursively following the graph of other raters, your client hopefully finds other raters that rate things similarly to you and have rated this content. The path weight to a given rater is the product of cosine distances, and so by using a priority queue for querying, you get something close to a breadth-first search of the ratings graph. Once your client has accumulated enough weight of ratings for the given content, it stops and shows you the weighted average of the ratings (and maybe the weighted std. dev. is displayed as a confidence score to power users who have enabled it).

Presumably, the UI for the ratings system maps 0 to 5 stars to 0.0 to 1.0 (probably not linearly, more likely the client locally keeps a histogram of the user's ratings and then maps the star rating back to a percentile rating), and the "spam" button rates the content as -1.0.

The tricks come down to the metrics used for how the DHT decides priorities for cache eviction of the per-content ratings and also the per-rater ratings. You don't want spammers or other censors to be able to easily force cache eviction. Getting cache eviction metrics right is the key to having the system scale well while also preventing spammers/censors from evicting the most useful sets of ratings.


What books and/or papers could you recommend for learning all about this, distributed hash tables, cosign distance, reputation systems, the whole deal?


DDG uses the Bing index. If Bing de-lists something, then it disappears from DDG.

Microsoft (and any other big company) has many competing interests other than just being helpful to users.


> [big companies have] many competing interests

Are the main ones i) reducing competition and ii) managing their reputation?

If so, the case for a decentralised search engine got stronger.


> Why does DDG de-list things in the first place

I don’t think anyone wants to use a search engine that never delists anything. Ransomware, Markov chain junk, plagiarism. A search engine that never delists anything is useless.

The problem is when delisting is used against the end-user’s interests.


Early on, a major selling point of DuckDuckGo was that it filtered out SEO parked domain pages.


Copyright industry pressure.


> Why does DDG de-list things in the first place [...]

They use Bing as an index, and it was Bing who de-listed it.


Banning Russian news outlets from search results was generally supported here on HN.


That's certainly not how I remember that thread going.


There is no war just a special operation.


we must condemn russias unprovoked genocide of neo nazis


Wait, did Russians start committing suicide?


The youtube-dl homepage has returned to the listings, as has thepiratebay. However, they still aren't indexing thepiratebay or youtube-dl.org's contents, so you can't search within the sites; you only get the homepage. The complaint the other day was about the indexes too, so it's only partially fixed.


That's not actually true. Our site search is having issues, so better to just add the site name to the search, but note that youtube-dl.org is just one page and for other sites (that are essentially vertical search engines) you're better off going directly to them since their index is going to be more up to date.


I'm probably going to stop using DDG now. I don't want my results filtered in any way because of pearl clutching over 'piracy' or whatever.


Can highly recommend https://you.com

Great engine with some really nice features


Getting a blank "Please reload the page, something went wrong" page instead of rendering anything.

Seems to have an uncaught exception trying to use the beacon API (which I have disabled).

You.com looks like junk.


The beacon bug is fixed now and it should work when disabled.


Same


Discussed a few months ago at https://news.ycombinator.com/item?id=29165601


What are you going to use instead?


Brave has a new search engine.


I’ve been using this as it came as the default for a new Brave install I got. I’m oscillating between it and Google: the former because I don’t want to continue to give Google all this free data, the latter because Google’s results are good (or at least reliably what I expect)

Brave's search is still kind of meh, but I appreciate that they're trying.


yandex


Yandex is the worst alternative you could pick. https://www.protocol.com/policy/yandex-gershenzon-qa


“…tech giant Yandex has a handshake deal with government authorities to limit what news outlets the site will pull onto its homepage”

wow that sounds so incredibly different from the alternatives


Makes one wonder, however, what other 'handshake deals' with autocratic government there are.


Yeah. I'm sure Russia is the only place where the telecom, tech and news sectors have quid pro quos with the organs of state.

[1] https://www.wired.com/2006/05/att-whistle-blowers-evidence/


The difference is that I have no control what Russia does with my information but I have some control over what my country is doing with it.


Do you really? The likelihood of legislation passing in the US is totally independent of how popular it is.


Just don't get all your news from Yandex??


All kinds of info/website are also censored in Russia. They censor anything that could be taken as against the Duma-Putin.


Sometimes when the damage is done, it's done. I'm never going to use DDG ever again.

If this decision was because of legal pressure by Google, I don't see how that got resolved in a matter of days. Which means it wasn't because of Google, but rather a poor decision made by DDG management. How can people trust their product now?


The official stance according to DDG was that it was a bug and they fixed it.


Funny how these random "bugs" never delist major mainstream media.


Would you even know if DDG stopped returning results for site:cnn.com?


Would be on front page of HN, so yes.


> I'm never going to use DDG ever again.

The removals were a rarity. It's not as if you can't add a `!g` bang query to redirect to Google if you can't find something. And DDG is rampant with all sorts of stuff that shouldn't be there, so I don't think they're hellbent on censorship.


> if you can't find something

You don’t always know what you don’t know.

If I don’t know about YouTube-dl and I search “download YouTube command line”, how am I going to know that ddg is hiding the best result from me?

This definitely isn’t a small annoyance kind of problem, in my eyes. It’s a deal breaker. I’ll never use ddg again. If they’re going to censor like Google does, then I’m going to use Google because it generally has better results. Ddg needs to offer something beyond what Google offers to make up for their bad results, and they aren’t doing that.


> Ddg needs to offer something beyond what Google offers

If you're searching on google for medical terms it builds a profile on you, and you should be concerned that they could be selling that information to insurers, directly or indirectly.


I've been curious about this as a health care worker. I can't count the number of times I've looked up a client's condition to better help the person. So is Google's profile of me tainted?


I would be somewhat worried that information could be abused now or in the future. Since that data is necessarily going to be noisy, insurers probably aren't going to make black and white decisions based on it, but they could score you somewhat worse over it. You could maybe trust them that their algorithms would be able to determine that since you work in health care and are probably in the top 5% of people who search health terms that your individual data is polluted by your job, but I would never trust an insurer's black-box algorithms. They only need to be statistically correct over the population, they can always fuck you over individually and still make a fantastic profit for themselves.

Is this really going on? Maybe not, but is it worth it to ignore the risk? Can you just use DDG so it isn't a question?


I switched to DDG long ago, but was curious about it as a past Google user.


Probably not, because they probably also know your occupation, or you crossed a threshold of too many searches to be normal/direct, etc.

(Except in reality it's less simple than "they know your occupation"; it's a huge cloud of data points from which an AI makes correlations and associations that no human actually knows. It means the searches would also be weighted by indirect things, not only your occupation but your associations. Say you don't have the medical occupation, but your computer makes a lot of medical searches, because your roommate in your college dorm has a medical major, etc. And right now, the AIs are still pretty stupid and absolutely making a lot of obviously unsafe conclusions, but they also get more and more spooky every day.)

But this doesn't make it any better. If they did a perfectly accurate job of profiling you, that is not better than doing an inaccurate job.

That much insight is like being married to someone, where they intimately know all your biases and motivations, know all your buttons, know how to manipulate you, know how to weigh any opinions you might express against their knowledge of where you got every idea you ever had,

except it isn't a marriage, and they aren't subject to all that same vulnerability to you, and they aren't even a human but a corporation, and they have this intimate knowledge of everyone not just one spouse or sibling or best friend.


They probably have a good idea of what those profiles look like, even if they don't already know you work in healthcare.


!g leads to Google, who also blocks yt-dl and other similar tools.

Edit: I was wrong.


Uh, what? youtube-dl.org is the first result for youtube-dl on Google, followed by GitHub repo, ytdl-org.github.io, Wikipedia, etc.

TPB is the first result too. IIRC at one point searching for TPB only returned proxy sites, but that doesn’t appear to be the case now.


Any source on that? I just searched both "youtube-dl" and "pirate bay" and got reasonable-looking results.


You don't get "reasonable-looking" results for "pirate bay" on google.


My first result is https://thepiratebay.org and the rest of the page is proxies.


Weird, I only get shady proxies and the wikipedia page.


I wonder if it was unintentional. They rely on Bing's database for their search engine; if these sites disappeared from their data source, might it have disappeared from DDG without any direct intention?


Still missing from Bing. So this is DDG inserting an override.


I also find that searching Bing does not result in youtube-dl.org, only the repository, while DDG returns both, but Bing is not the ONLY source for DDG results, at least according to them.

https://help.duckduckgo.com/duckduckgo-help-pages/results/so...

"We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google)."

I've tried comparing Google, Bing and DDG on a private window before, and I didn't find Bing and DDG more similar than Bing and Google. Searching for monkey: https://news.ycombinator.com/item?id=27598329


That is really strange - I just searched for both on bing.com and got relevant results back. Search bubble?


Indeed. My search was done in a fresh private window on a computer I never typically use Bing on.

Also, throughout this whole situation I always got the Github page as the first result for "youtube-dl".


Or search region? I’ve had different results before based on your Bing/DDG region setting.


Is there a search alternative that isn't just using Google or Bing underneath? DDG is just Bing and all the recent filtering and what not that Bing has been doing also applies to DDG. Many things that used to show up on DDG no longer do.


There are plenty - Kagi, Brave, Mojeek, Yandex, Rightdao, Gigablast...


Been using Kagi for the last week and the search results have been SURPRISINGLY good. Like, in the "I don't have to scroll to find it" category of good, and no having to deal with spam sites that just copy the actual answer but somehow rank higher than the original, like currently plagues Google.


This is probably because Kagi uses Google and Bing indexes.


And even if you disagree with their ranking, you can change ranking weights for domains yourself.


That revenue model is crazy, no way will they last.


> That revenue model is crazy, no way will they last.

Why? Their plan is to be an amazing SE, but only for the comparatively small amount of people that are willing to pay. They aren’t VC funded but boostrapped, so they don’t need to ruin a good thing for dumb returns.


I believe that their pricing will keep their potential customer base too small to remain commercially viable. I'm sure you've seen a supply and demand curve before. The optimal price point isn't where the price is highest, because then no one buys the product/service. It's actually where sufficient customers can afford the product/service to maximise profit. There are very few people willing to pay $30/month for a search engine. Even if Kagi can convince 1,000 people to do so, I do not believe $30,000 a month in revenue will even keep the lights on.

It is highly likely that $5/month, for example, would attract millions of users, and $25 million/month would certainly keep the lights on.


$30 is not the price unless you need a lot of searches. Looking at my consumption tab, I’m usually below $10/month, so I’ll probably end up in a $10-$20 plan. Some napkin math gives around 2,550 searches per month at $30, Vlad said in Discord $30 is for people who want around 100 searches every day.

Also, you can’t forget that every search actually costs them money, as every single search incurs the fee from Google and Microsoft for their APIs. I think there was an idea around having a $5 trial plan with only 10 searches a day or something. Unlimited searches for $5 would be "We are losing money on every customer, but we are making it up in volume"


>Also, you can’t forget that every search actually costs them money, as every single search incurs the fee from Google and Microsoft for their APIs. I think there was an idea around having a $5 trial plan with only 10 searches a day or something. Unlimited searches for $5 would be "We are losing money on every customer, but we are making it up in volume"

Are they really querying Google and Microsoft for every search? This seems highly inefficient. Then Kagi is basically just a fancy wrapper. I thought their ambition was to build out Teclis and TinyGem to serve more and more content and rely less and less on other search providers. At the very least, I don't think they need to hit Google every time someone queries "games." This is hopefully queried once per period and stored in Teclis/cached.

To be blunt, if this is just a fancy wrapper/aggregator then there is no reason to use this over you.com.


It uses data from both Google and Bing, but reorders it substantially. In addition, they use data from their own index and some different sources. The interesting part is the ranking.

And I mean, it should be easy to compare. For me, kagi is substantially better than any other SE I tried, so I’ll pay for it when it releases, but just test it yourself.


We are not optimizing for revenue but for staying independent and sustainable. (Kagi dev here)

In other words it does not matter if we could have $100,000/month with another price point if it would cost us $150,000 to do so (every search has a fixed cost that does not go down with scale).

Our current price point includes a tiny margin that would allow us to break even at around 50,000 users. That is what we are optimizing for.


If I may ask, why do you believe that having more users and profit would undermine your ability to stay independent and sustainable? I would have thought that more users and profit would do the opposite: secure Kagi's ability to stay independent and sustainable. Intentionally reducing users and profit doesn't make any sense to me, given your stated goal.


I think you were arguing for optimal price point which would maximize revenue, not profit?

Our profit margin is already razor thin, I do not think it can be further optimized. Meaning we can not further reduce price to get more users without operating at a loss.


I see the confusion. Optimal price point optimises for profit: https://www.accountingtools.com/articles/2017/5/13/optimal-p...

It sounds like you have a high marginal cost. As in, you pay every time someone searches for something. Can this not be reduced over time with greater efficiency of scale?

FYI I use Kagi and like it. I’ve replaced Google :)


Nope, the cost is pretty much fixed. What we charge the users is just a hair above it so we have a hope of reaching sustainability at around 50,000 users. In other words there is nothing to optimize further.

And it is only about one cent to search the entire web in 300ms with everything else Kagi does, that has to be impressive ? :)


Very impressive :) So if Kagi profits $x per search, wouldn't they profit more with more searches? More users, more searches, more profit, more sustainable.


Yes, but we can not get more users by reducing price as it would not be $X any more, but 0 or -$X per search.


Gotcha! Thanks for explaining, and thanks for the cool work you guys are doing.


I guess we need an up-to-date wikipedia page about all those alternatives.


Kagi uses both Google and Bing index.


Doesn't Brave also use Bing underneath?


If their about page isn't lying, about 90% of their results are original.


They bought the cliqz team and their work / tech when that died, so I'd assume they're using what those people had been building.


> Brave Search is built on top of a completely independent index, and doesn’t track users, their searches, or their clicks.

https://brave.com/brave-search-beta/


Switched to Brave Search - not looking back. I don't even care if the index sucks, it's been "good enough" and it has bangs.


Maybe DDG understood that they have no purpose if all they have to propose is Google's bad sides and censorship, without Google's search power...


DDG's selling point was always privacy.


Right, then why would they bother with stuff like dontbubble.us?

[0] https://techcrunch.com/2011/06/20/duckduckgo-to-google-bing-...


They are not back, it's misleading bullshit. "Pirate bay" was never hidden from search in that sense. Look for "<something> torrent" that is found on thepiratebay by Brave (I guess I won't provide the exact search term I tested, sorry). Now try to look for it on DDG. You'll find many torrent sites, but not thepiratebay.org


if all you nerds stop talking about it maybe it will stay up


DDG is a fringe search engine. They are not doing themselves any good by crippling their search results. It's just going to push the 10 people still using their stuff away.


What is DDG?


Drop Dead Gorgeous

or

Duck Duck Go


Sorry to be that guy but what was going on? This is alarming as someone who uses DDG and youtube-dl kinda too much probably.

Also if anyone else is slightly hung over and searches DDG in DDG and gets super confused for three seconds because DDG is a rapper apparently just know you aren't alone. I'm right here with you, whoever you are.


From an April 17 Twitter thread from Gabriel Weinberg at DuckDuckGo:

> ... [W]e are not "purging" YouTube-dl or The Pirate Bay and they both have actually been continuously available in our results if you search for them by name (which most people do). Our site: operator (which hardly anyone uses) is having issues which we are looking into.

(Note that "site:" in his comment is how you restrict DDG searches to a specific domain e.g. "site:example.com")

https://twitter.com/yegg/status/1515636218691739653


>> Our site: operator (which hardly anyone uses) is having issues which we are looking into

I noticed problems with their site: operator too earlier today, and still now as well. In my case, when I used it in a search I saw that the word “site” itself was also bolded in some results. So it looks like it is using the operator itself also as a search term, which it shouldn’t.

I find it surprising that hardly anyone uses it though.


When you handle 10 million searches a day it's easy for even large groups of users to get lost in the crowd.


> I find it surprising that hardly anyone uses it though.

Same! I use it for searching reddit all the time.


That's incomplete though: when I tried it (in response to seeing it posted here a couple of days ago) I couldn't get any yt-dl.org search results, i.e. adding site: appeared to function correctly, just with no results.


Did the original complaints of those sites disappearing mention the usage of the site: filter?


Yes.

> For example, searching for “site:thepiratebay.org” is supposed to return all results DuckDuckGo has indexed for The Pirate Bay’s main domain name. In this case, there are none.

> This whole-site removal isn’t limited to The Pirate Bay either. When we do similar searches for ["site:1337x.to", "site:NYAA.se", "site:Fmovies.to", site:"Lookmovie.io"], and ["site:123moviesfree.net"], no results appear.

https://torrentfreak.com/duckduckgo-removes-pirate-sites-and...


These links should provide the additional context you’re looking for:

https://news.ycombinator.com/item?id=31044587

https://nitter.net/i/status/1515635886855233537


Too little too late. You're already dead for me.


So long, DDG. Censorship before they even gained enough momentum to get users to stick will undoubtedly be the beginning of the end for them. Looking forward to the next Google competitor. Eventually, someone will get it right.



