Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The less verbose way of doing semijoins is by an IN subquery.




>subquery

>less verbose

Well…

In any case, it depends. OP nicely guarded himself by writing “overusing”, so at that point his pro-tip is just a tautology and we are in agreement: not every use of DISTINCT is an immediate smell.


What do you mean? Here are your real alternatives for doing a semijoin (assuming ANSI SQL, no vendor extensions):

  SELECT * FROM t1 WHERE EXISTS ( SELECT * FROM t2 WHERE t2.x = t1.x );
  SELECT * FROM t1 WHERE x IN ( SELECT x FROM t2 );
  SELECT * FROM t1 JOIN ( SELECT DISTINCT x FROM t2 ) s1 USING (x);
Now tell me which one of these is the less verbose semijoin?

You could argue that you could fake a semijoin using

  SELECT DISTINCT * FROM t1 JOIN t2 USING (x);
or

  SELECT * FROM t1 JOIN t2 USING (x) GROUP BY t1.*;
but it doesn't give the same result if t1 has duplicate rows, or if there is more than one t2 matching t1. (You can try to fudge it by replacing * with something else, in which case the problem just moves around, since “duplicate rows” will mean something else.)

No, sorry, you’re certainly correct, I just meant that any subqueries are generally crazy verbose. And then you usually want additional Where clauses or even Joins in there, and it starts to stop looking like a Where clause, so I’m often happy when I can push that logic into From.

Yes, I would certainly prefer if you could write

SELECT * FROM t1 SEMIJOIN t2 USING (x);

although it creates some extra problems for the join optimizer.


It's great being able to use an any join (and the counterpart anti join) in Clickhouse to deal with these operations.



Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: