Tangentially related to the post: I have what I think is a related computer vision problem I would like to solve and need some pointers on how you would go about doing it.
My desk is currently set up with a large monitor in the middle. I'd like to look at the center of the screen when taking calls, but have it appear as though I am looking straight into the camera, with the camera pointed at my face. Obviously, I cannot physically place the camera right in front of the monitor, as that would be seriously inconvenient. Some laptops solve this, but I don't think their methods apply here, as the top of my monitor ends up being quite a bit higher than what would look "good" for simple eye correction.
I have multiple webcams that I can place around the monitor to my liking. I would like something similar to what you see when you open this webpage, but for video, and hopefully at higher quality since I'm not constrained to a monocular source.
I've dabbled a bit with OpenCV in the past, but the most I've done is a little camera calibration for de-warping fisheye lenses. Any ideas on what work I should look into to get started with this?
In my head, I'm picturing two camera sources: one above and one below the monitor. The "synthetic" projected perspective would be in the middle of the two.
Is capturing a point cloud from a stereo source and then reprojecting with splats the most "straightforward" way to do this? Any and all papers/advice are welcome. I'm a little rusty on the math side but I figure a healthy mix of Szeliski's Computer Vision, Wolfram Alpha, a chatbot, and of course perseverance will get me there.
This is a solved problem on some platforms (Zoom and Teams), which alter your eyes so they look like they are staring into the camera. Basically you drop your monitor down low (so the camera is more centered on your head) and let software fix your eyes.
If you want your head to actually be centered, there are also some "center screen webcams" that exist that plop into the middle of your screen during a call. There are a few types, thin webcams that drape down, and clear "webcam holders" that hold your webcam at the center of your screen, which are a bit less convenient.
Nvidia also has a software package you can use, but I believe it is a bit fiddly to get set up.
> Some laptops solve this, but I don't think their methods apply here, as the top of my monitor ends up being quite a bit higher than what would look "good" for simple eye correction.
I appreciate the pragmatism of buying another thing to solve the problem but I am hoping to solve this with stuff I already own.
I'd be lying if I said the nerd cred of overengineering the solution wasn't attractive as well.
If you want overengineered and some street cred, instead of changing the image to make it seem like you're looking in a new place, how about creating a virtual camera exactly where you want to look, from a 3D reconstruction?
Here's how I'd have done it in grad school a million years ago (my advisor was the main computer vision teacher at my uni).
If you have two webcams, you can put them on either side of your monitor at eye level (or halfway up the monitor), do stereo reconstruction in real time (using, e.g., OpenCV), create an artificial viewpoint between the two cameras, and re-project the reconstruction to the point that is the average of the two camera positions to create a new image. Then feed that image to a virtual camera device; the Zoom call connects to the virtual camera device. (On Linux this might be as simple as setting up a /dev/ node.)
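A minimal sketch of that loop, assuming the pair is already calibrated and rectified (the usual cv2.stereoCalibrate / cv2.stereoRectify / cv2.initUndistortRectifyMap dance) and skipping hole-filling entirely:

```python
import cv2
import numpy as np

# Assumes both cameras are calibrated and the frames below are already
# rectified, so epipolar lines are horizontal and disparity is purely along x.
left_cam, right_cam = cv2.VideoCapture(0), cv2.VideoCapture(1)

stereo = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=5,
    P1=8 * 3 * 5 ** 2, P2=32 * 3 * 5 ** 2,
)

while True:
    ok_l, left = left_cam.read()
    ok_r, right = right_cam.read()
    if not (ok_l and ok_r):
        break

    gray_l = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)
    disp = stereo.compute(gray_l, gray_r).astype(np.float32) / 16.0  # SGBM is fixed-point x16

    # Crude view interpolation: for a virtual camera halfway between the two,
    # every left-image pixel moves half its disparity toward the right camera.
    h, w = disp.shape
    ys, xs = np.mgrid[0:h, 0:w]
    new_x = (xs - disp / 2).astype(np.int32)
    ok = (disp > 0) & (new_x >= 0) & (new_x < w)

    mid = np.zeros_like(left)
    mid[ys[ok], new_x[ok]] = left[ys[ok], xs[ok]]   # forward splat, holes left black

    cv2.imshow("virtual midpoint view", mid)
    if cv2.waitKey(1) == 27:                        # Esc to quit
        break
```

The black holes and double-hits are where the jank comes from; z-buffering by disparity plus cv2.inpaint would clean it up a fair bit, and cv2.reprojectImageTo3D is the route if you want an actual point cloud rather than this 2D shortcut.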
It's much easier to reconstruct a little left / right of a face when you have both left and right images, than it is to reconstruct higher / lower when you have only above or below. This is because faces are not symmetric up/down.
This would work, it would be kinda janky, but it can be done realtime with modern hardware using cheap webcams, python, and some coding.
The hardest part is creating the virtual webcam device that the zoom call would connect to, but my guess is there's a pip for that.
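There is indeed a pip for that: pyvirtualcam, which on Linux sits on top of the v4l2loopback module, so it really is just a /dev/videoN node under the hood. A minimal sketch, where get_synthetic_frame() is a placeholder for the reconstruction loop above:

```python
import numpy as np
import pyvirtualcam

# Placeholder for the stereo reconstruction loop above; it should return an
# RGB uint8 frame of shape (720, 1280, 3).
def get_synthetic_frame() -> np.ndarray:
    return np.zeros((720, 1280, 3), dtype=np.uint8)

# On Linux, pyvirtualcam writes into a v4l2loopback device, so the kernel
# module needs to be loaded first (sudo modprobe v4l2loopback).
with pyvirtualcam.Camera(width=1280, height=720, fps=30) as cam:
    print(f"Serving virtual camera: {cam.device}")
    while True:
        cam.send(get_synthetic_frame())
        cam.sleep_until_next_frame()
```

Zoom then just sees one more webcam in its device list.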
Any imager would do, but quality would improve with:
* Synchronized capture - e.g., an edge-triggered camera, with, say, a Raspberry Pi triggering capture
* Additional range information, say, from a Kinect or cell phone lidar
* A little delay to buffer frames so you can do time-series matching and interpolation
If you really want to see some esoteric computer architecture ideas, check out Mill Computing: https://millcomputing.com/wiki/Architecture. I don't think they've etched any of their designs into silicon, but very fascinating ideas nonetheless.
Something which wasn't addressed fully but might be worth discussing further: Both Unity and Unreal Engine 4/5 make use of Vulkan, and thus every game made with these engines which runs on Windows and Linux almost certainly is using Vulkan somewhere (please correct me if I'm wrong!). I have a very hard time believing that people are making fewer games with these engines now.
This isn't to say that Windows games built on these engines aren't entirely running on DX12 code either. I think most games these days give you the choice to pick your graphics backend. It's an impossible ask, but I'd love to see what the stats are for Unreal/Unity graphics API usage across games.
That being said, on the iOS/macOS front, I don't know what these games are using to deploy to that platform. It could be that they use MoltenVK, but I could also see them using OpenGL or their own Metal rendering pipelines. As someone who grew up gaming on a desktop PC, I forget that smartphones and tablets are the future of the gaming industry. It felt weird to see Apple showing someone sitting on their couch with their iPhone 15 connected to the television, playing No Man's Sky via bluetooth controller, but simultaneously really cool.
It appears to me that languages born out of a design-by-committee process struggle to make anyone exceptionally happy, because the only way the language moves forward is by keeping all of its members equally miserable, or by masquerading as one language when in reality it's closer to five different languages held together by compiler flags and committee meetings.
To a first approximation, games using Unity/UE use D3D on Windows, OpenGL and Vulkan on Android, and Metal on iOS. Native Linux builds are not worth it, Proton (which uses Vulkan) is good enough.
Vulkan is not really a design-by-committee API; it is pretty much what happens when an IHV (AMD in this case) with poorly performing drivers gets to design an API (Mantle) without any adults in the room. D3D12 strikes a somewhat better balance, and Metal a much better one, in terms of usability.
The design-by-committee point is spot on; if anything, AMD rescued OpenGL vNext from turning into yet another Longs Peak, or OpenCL 2.0.
Had it not been for them offering Mantle to Khronos, to this day you would most likely be getting OpenGL 5 with another extension batch, not that Vulkan isn't already an extension soup anyway.
I know Unity at least defaults to DX11, and I believe it doesn't package other renderers at all unless you explicitly enable them in Project Settings or use a rendering feature that requires them. I can't imagine many people are digging into the renderer settings without good reason.
Perhaps I am projecting my experience from the before times, when GPUs only supported certain versions of DirectX, or certain features were causing crashes on different systems.
It was never super common, but I remember doing it recently for Path of Exile while trying to get the most out of my MacBook's performance.
This is the sort of thing I expected to see when Chris Lattner moved to Google and started working on the Swift for Tensorflow project. I am so grateful that someone is making it happen!
I remember being taught how to write Prolog in university, and then being shown how close the relationship was between building something that parses a grammar and building something that generates valid examples of that grammar. When I saw compiler/language-level support for differentiation, a spark went off in my brain the same way: "If you can build a program which follows a set of rules, and the rules for that language can be differentiated, could you not code a simulation in that differentiable language and then identify the optimal policy using its gradients?"
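To make that idea concrete, here is a toy sketch of what I have in mind, using PyTorch as the differentiable "runtime" (my own stand-in, not anything from Swift for TensorFlow): the simulation is a 1D point mass, the "policy" is just two gains, and since the whole rollout is differentiable, plain gradient descent can improve the policy.

```python
import torch

# Toy differentiable simulation: a unit point mass should end up at `target`
# (and at rest) after 50 timesteps. The policy u = k*(target - x) - d*v has
# two learnable gains; gradients flow back through the whole rollout.
target = torch.tensor(5.0)
k = torch.tensor(0.1, requires_grad=True)   # position gain
d = torch.tensor(0.1, requires_grad=True)   # velocity (damping) gain
opt = torch.optim.SGD([k, d], lr=0.01)

for step in range(300):
    x = torch.tensor(0.0)
    v = torch.tensor(0.0)
    for _ in range(50):                      # differentiable rollout, dt = 0.1
        u = k * (target - x) - d * v         # the "policy": a linear controller
        v = v + 0.1 * u
        x = x + 0.1 * v
    loss = (x - target) ** 2 + 0.1 * v ** 2  # end at the target, at rest
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_([k, d], 1.0)   # keep the toy stable
    opt.step()

print(f"learned gains k={k.item():.2f}, d={d.item():.2f}, final loss={loss.item():.4f}")
```

Whether gradient descent through a rollout like this finds a good policy on anything less toy-like is, as the replies point out, a different question.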
Thanks! You may find DeepProbLog by Manhaeve et al. interesting, which brings together logic programming, probabilistic programming and gradient descent/neural networks. Also, more generally, I believe in the field of program synthesis there is some research on deriving programs with gradient descent. However, as also pointed out in the comment below, gradient descent may not always be the best approach to such problems (e.g., https://arxiv.org/abs/1608.04428).
>> "If you can build a program which follows a set of rules, and the rules for that language can be differentiated, could you not code a simulation in that differentiable language and then identify the optimal policy using it's gradients?"
What's a "policy" here? In optimal control (and reinforcement learning) a policy is a function from a set of states to a set of actions, each action a transition between states. In a program synthesis context I guess that translates to a function from a set of _program_ states to a set of operations?
What is an "optimal" policy then? One that transitions between an initial state and a goal state in the least number of operations?
With those assumptions in place, I don't think you want to do that with gradient descent: it will get stuck in local minima and fail at both optimality and generalisation.
Generalisation is easier to explain. Consider a program that has to traverse a graph. We can visualise it as solving a maze. Suppose we have two mazes, A and B, as below:
Black squares are walls. Note that the two mazes are identical, but the exit ("E") is in a different place. An optimal policy that solves maze A will fail on maze B and vice versa. Meaning that for some classes of problem there is no policy that is optimal for every instance in the class, and finding an optimal solution requires computation. You can't just set some weights in a function and call it a day.
It's also easy to see which classes of problems are not amenable to this kind of solution: any decision problem that cannot be solved by a regular automaton (i.e. one that is no more than regular). Where there's branching structure that introduces ambiguity (think of two different parses for one string in a language) you need a context-free grammar or above.
That's a known problem in Reinforcement Learning, where "agents" (i.e. policies) can learn to solve any given instance of a complex environment class perfectly, but fail when tested on a different instance [1].
You'll get the same problem with program synthesis.
___________
[1] "Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability". This paper makes the point with what felt like a very convoluted example about a robotic zoo keeper looking for the otter habitat in a new zoo etc. I think it's much more obvious what's going on when we study the problem in a grid like a maze: there are ambiguities and a solution cannot be left to a policy that acts like a regular automaton.
Thanks for taking the time to explain such a worked-out example. I was indeed picturing something along the lines of, "If you could write a program equivalent to a game where you solve a maze, could you produce a maze-solver program if the game were made in this runtime?"
Not really. The world of Bayesian modelling has much fancier tools: Hamiltonian MC. See MC Stan. There have also been Gibbs samplers and other techniques which support discrete decisions for donkey's years.
You can write down just about anything as a BUGS model, for example, but "identifying the model" (finding the uniquely best parameters, even though it's a global optimisation) is often very difficult.
Gradient descent is significantly more limiting than that. Worth understanding MC. The old school is a high bar to jump.
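For a flavour of what I mean, here is a minimal sketch (in PyMC rather than Stan or BUGS, purely to keep it in Python): writing the model down is the trivial part; NUTS, a flavour of HMC, is what does the heavy lifting, and for less well-behaved models even that struggles to identify them.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)       # synthetic observations

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)           # weakly informative priors
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    idata = pm.sample(1000, tune=1000)                 # NUTS (HMC) by default

print(idata.posterior["mu"].mean().item(),
      idata.posterior["sigma"].mean().item())
```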
I wrote a Gibbs Sampler to try and fit a Latent Dirichlet Allocation model on arXiv abstracts many moons ago! I'd probably have to start from primitive stuff if I were to give it another go today.
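The core of that collapsed Gibbs sampler is still small enough to sketch from memory, though; a toy version in plain numpy (docs as lists of word ids, no convergence checks at all):

```python
import numpy as np

def lda_gibbs(docs, vocab_size, n_topics=10, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))          # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))         # topic-word counts
    nk = np.zeros(n_topics)                        # topic totals
    z = []                                         # topic assignment per token

    # Random initialisation of topic assignments and counts
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1    # remove this token
                # Collapsed Gibbs conditional: p(z = k | everything else)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1    # add it back
    return ndk, nkw
```

Vectorising the inner loop (or just reaching for an off-the-shelf LDA implementation) is where it stops being fun.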
I agree with everything you've said so far: getting to the point where you can use gradient descent to solve your problem often requires simplifying your model down to the point where you're not sure how well it represents reality.
My lived experience (and perhaps this is just showing my ignorance): I've had a much harder time getting anything Bayesian to scale up to larger datasets, and every time I've worked with graphical models it's just such a PITA compared to what we're seeing now, where we can slap a Transformer layer on some embeddings and get a decent baseline. The Bitter Lesson has empowered the lazy, proverbially speaking.
TensorFlow has a GPU-accelerated implementation of Black Box Variational Inference, and I've been meaning to revisit that project for some time. No clue about their MC sampler implementations. Then I stumbled across https://www.connectedpapers.com/ and Twitter locked up its API, so admittedly both of those took a lot of the wind out of my sails.
Currently saving up my money so that I can buy Kevin Murphy's (I think he's on here as murphyk) two new books that were released not too long ago: https://probml.github.io/pml-book/. The draft PDFs are on the website, but unfortunately I'm one of those people who can't push themselves to actually read a text if it's not something I can hold in my hands.
I have been planning to work on something like this. I think that eventually, someone will crack the "binary in -> good source code out of LLM" pipeline but we are probably a few years away from that still. I say a few years because I don't think there's a huge pile of money sitting at the end of this problem, but maybe I'm wrong.
A really good "stop-gap" approach would be to build a decompilation pipeline using Ghidra in headless mode and then combine the strict syntax correctness of a decompiler with the "intuition/system 1 skills" of an LLM. My inspiration for this setup comes from two recent advancements, both shared here on HN:
2. AICI: We need a better way of "hacking" on top of these models, and being able to use something like AICI as the "glue" to coordinate the generation of C source. I don't really want the weights of my LLM to be used to generate syntactically correct C source, I want the LLM to think in terms of variable names, "snippet patterns" and architectural choices while other tools (Ghidra, LLVM) worry about the rest. https://github.com/microsoft/aici
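For the Ghidra half, the headless analyzer already does most of the work. A rough sketch of driving it from Python: the analyzeHeadless flags are real, but the paths are assumptions about a local install and ExportPseudoC.py is a hypothetical post-script that would dump the decompiler output per function.

```python
import subprocess
from pathlib import Path

# Paths are assumptions about a local install; adjust to taste.
GHIDRA = Path("/opt/ghidra/support/analyzeHeadless")
PROJECT_DIR, PROJECT_NAME = Path("/tmp/ghidra_projects"), "decomp_corpus"

def decompile(binary: Path, out_dir: Path) -> None:
    """Import one binary into a throwaway Ghidra project and run a post-script.
    ExportPseudoC.py is a hypothetical script that would write the decompiler's
    C output for every function into out_dir."""
    PROJECT_DIR.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        str(GHIDRA), str(PROJECT_DIR), PROJECT_NAME,
        "-import", str(binary),
        "-scriptPath", "./ghidra_scripts",
        "-postScript", "ExportPseudoC.py", str(out_dir),
        "-deleteProject",            # keep each run independent
    ], check=True)

decompile(Path("./a.out"), Path("./decompiled/a.out"))
```

The LLM/AICI side then only ever sees the pseudo-C plus whatever symbols and types Ghidra recovered, which keeps the syntax burden off the model's weights.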
Obviously this is all hand-wavey armchair commentary from a former grad student who just thinks this stuff is cool. Huge props to these researchers for diving into this. I know the authors already mentioned incorporating Ghidra into their future work, so I know they're on the right track.
Bio: I am a C++ Engineer who is passionate about Audio/Video technology and machine learning. At my previous role, I wrote a real-time smart recording app in C++17 that would record H.264/H.265 video from security cameras by detecting changes in the foreground using real-time background subtraction, and then would re-encode the video and upload it to the cloud over an HTTPS connection. In a contractor role before that, I built a topic classification service that would scale to hundreds of thousands of inputs per second on a single GCP e2.mini instance, since the client was very price-sensitive.
My first instinct was to ask "Does this play well with CIRCT?" And thankfully they answer that right away in the README.
I'm personally of the opinion that there is a LOT of room for improvement in the hardware design tooling space, but a combination of market consolidation, huge pressure to meet deadlines, and an existing functional pipeline of Verilog/VHDL talent is preventing changes.
That's not to say "Verilog/VHDL are bad", because clearly they've been good enough to support nearly all of the wonderful designs powering today's devices. But it is to say, "the startup scene for hardware will continue to look anemic compared to the SaaS scene until someone gives me all of the niceties I have for building SaaS tools in software."
A huge amount of ideas (and entire designs) start off as software sims, which enables kernel/compiler engineers to start building out support for new hardware before it's manufactured.
There is some interesting work going on at SiFive building hardware with Chisel[1], as well as some interesting work led by a professor at William and Mary to improve simulations[2].
Tangentially related: I am currently scoping out an idea for how language models could be used to augment decompilers like Ghidra.
At a surface level, this is partially an intellectually interesting project because it is similar to a language translation project; however, instead of parallel sentence pairs, I will probably be creating a parallel corpus of "decompiled" C code which will have to be aligned to the original source C code that produced the binary/object file.
Then I realized the only way I could reasonably build this corpus would be by having some sort of automated flow for building arbitrary open source C projects...
Perhaps I will attempt this project with a Go corpus instead.
An interesting project. Go binaries contain many source artifacts which make decompilation a bit more straightforward as well. I haven't seen anyone really attempt this for Go, but it would be notable research.
If it turns out that it's easier for a language model to translate "Ghidra C" into readable Go code than to deal with CMake/Bazel/GNU Autoconf/Ninja/Meson/etc., I wonder if that says more about the language model or the state of C/C++ toolchains...
I didn't experience this, but I was at dinner with someone who had recently emigrated from Russia, so I decided to ask, "What is it about the education systems in formerly Soviet domains that created such a strong passion for computing?" He answered in two parts:
1. Scholars who might've considered studying literature and philosophy would have had a hard time competing on the global stage, as the Soviet state didn't take kindly to the idea of promoting anything that could be perceived as anti-Soviet ideals, even if it was for the sake of an academic exercise. Not that the Soviet Union was alone in this practice, but this practice in particular affected their academic community to the extent that many who might've considered literature or philosophy changed their minds.
2. Trade restrictions between the 50s and 60s with large portions of the West created a large demand for semiconductor products on behalf of the state, as the USSR understood the strategic importance of this technology early on. While trade restrictions were gradually relaxed in the decades leading up to Perestroika, the domestic industry for computer products had been established, similar to China's own semiconductor industry and Deng Xiaoping's economic reforms which opened the country to global trade.
This is mostly just the verbal account of one person followed by my own personal research, so this is by no means an authoritative take. If there are others with more knowledge (acquired through research or lived experience) I'd love to hear it, as my knowledge of the history of computing has a Soviet-sized hole in it.
Eastern Europe in general has a strong math & science culture. It operates on a lot of levels and is hard to summarize, but it predates the 50’s for sure.
How so? There were no Eulers in Eastern Europe except for Russia, which had a prestigious science academy. No Poland, Hungary, Bulgaria, Romania, etc. had that.
There were no Eulers anywhere except Switzerland (later Russia), though I guess there was a Gauss in Germany, so I think that's a slightly poor example. A pretty random list of Polish mathematicians you may have heard of:
- Copernicus
- Marie Curie
- Banach and Tarski
I think it's worth noting that the history of the modern borders and what happened inside them is more complicated than that of (say) Switzerland, so, for example, someone may have come from what is now considered Poland but be seen as a Prussian or Russian scientist based on the political entity that was around when they were working.
I think it’s reasonable to say that the math & science culture goes back at least as far as when socialist-style education systems were being set up.
> If there are others with more knowledge (acquired through research or lived experience) I'd love to hear it, as my knowledge of the history of computing has a Soviet-sized hole in it.
The famous book Mathematics: Its Content, Methods, and Meaning by Alexandrov, Kolmogorov, et al. has two chapters on computing, which is interesting both to take a sneak peek into Soviet-era techniques and to understand the importance Kolmogorov and friends gave to the topic.
I guess my language in the original comment didn't really convey the full context of the discussion. My use of the word "passion" was referring to what appeared to be the observed success of these countries in the fields of mathematics, statistics, and computer science, which I thought was disproportionate to other measures of human development. For example, OECD data indicates that Hungary invests a smaller proportion of its budget in education compared to other countries mentioned in the article, or even other OECD countries[1]. However, Hungary has clearly established itself as a producer of serious mathematical talent based on the IMO stats shared by the author of the post (even though the math skills used to win gold at the IMO are very different from what's needed to advance the field as a whole[2]).
At a very superficial level, the question is akin to, "Why does Argentina have such a successful national soccer team, when there are other countries with a strong cultural link to soccer that are much more capable of pouring money into building a good team?" There is clearly some nuance I was missing when diving into this question, but I don't know what I don't know. I started by asking the question to learn more.