Reminds me of the Debian Deep Learning Team and their recent thread about ML being incompatible with Free Software values and principles.
They classify models by reproducibility and openness criteria. So a freely accessible model lacking proper documentation/reproducibility is essentially seen as non-free: "toxic candy" projects follow all Free Software good practices, yet the models they depend on suffer from such issues. That would mean if there were e.g. a particularly sneaky input data bias, it could cause issues down the line where it would be hard to determine whether they were caused by a poorly calibrated parameter or by missing training examples.
Yep, a similar area. I don't think they state that machine learning is incompatible, though?
From your first link: the (experimental, work-in-progress) policy document is a worthwhile read:
See the second link: the main Debian DL/ML team member has concluded that open ML is not feasible. Firstly, because of hardware: everyone uses the proprietary NVIDIA CUDA stack, AMD has open drivers but is playing catch-up, and both require proprietary firmware. Secondly, the space is changing faster than Debian release cycles. Then there is the lack of libre training data, the lack of training openness, and the cost of retraining.
Another thing that was mentioned in the thread is non-copyright issues with training data, for example laws around consent for use of personal data.
State-of-the-art running code for federated learning on mobiles: https://github.com/jverbraeken/trustchain-superapp/commits/m... (Publication: https://arxiv.org/abs/2110.11006)
People like to code their own work instead of contributing to others'. Incremental work does not count. This is more of a flaw in how scientific output is counted, I believe. Also, our technology stack is completely "zero-server", which makes everything on Android much harder and more complex. Disclaimer: I'm the thesis adviser of this student.
"People like to code their own work, instead of contributing to others. Incremental work does not count. This is more of a flaw in scientific output counting I believe"
Lots of truth here. Actually, maybe the OP could change the title to make it clear you're talking about machine learning models? When I first saw the title I thought it would be about the deeply problematic state of scientific coding and "scientific" modelling in general. The ML world isn't so bad regardless of how it may seem, because it's got deep roots in computer science and industry. The moment you start looking at the code for things like epidemiological models, climate models etc it becomes clear that the incentive structures in science are completely broken.
I've not only heard from others but also seen with my own eyes that people who call themselves "scientists" will happily write models in C that e.g. use pointer values in an equation instead of the dereferenced values, leading to incorrect outputs, and when this is pointed out they claim there are no problems, or that they checked the results and the bugs make no difference (i.e. they lie). Or they'll have race conditions that cause unstable outputs and pretend it's because of a (pre-seeded) PRNG they used, apparently in the hope that other scientists don't understand pseudo-randomness properly. Or they'll mix up their variables because everything is a single letter and there are thousands of lines of code, or they'll do out-of-bounds reads in a sorting algorithm because they don't know about standard libraries, etc.
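The PRNG excuse is easy to check. Here is a minimal sketch, assuming a single-threaded simulation that draws all of its randomness from one seeded generator; the `simulate` function and its parameters are hypothetical and not taken from any real model:

```python
import random


def simulate(seed, steps=1000):
    """Toy stand-in for a stochastic model: sum of Gaussian draws."""
    rng = random.Random(seed)  # private generator with a fixed seed
    total = 0.0
    for _ in range(steps):
        total += rng.gauss(0.0, 1.0)
    return total


# Two runs with the same seed follow the same pseudo-random sequence,
# so they must produce bit-identical results.
run_a = simulate(42)
run_b = simulate(42)
print(run_a == run_b)
```

If two runs with the same seed disagree, the variation is coming from somewhere other than the PRNG, for example a race condition.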
Nobody ever seems to retract papers because of bugs like these, and when the results of their model don't match reality they'll make arguments like "the model was validated against other models, which is a reasonable way to prove validity" or "we don't make predictions we make scenarios".
If researchers were more willing to collaborate on shared codebases they'd start to learn programming better, be able to recruit wider and more diverse teams, they'd share infrastructure and best practices and generally things might stand a chance of improving. But, as you say, there are major flaws in how the output of scientists is evaluated (by governments/non-profits), and this has a nasty habit of converting well-meaning scientists into what are effectively pseudo-scientists.
As far as I understand, the real problems are (1) how do you gather data to train on and (2) you need a lot of expensive GPUs.
TensorFlow and many more are already open source, so it looks like the problem is not on the software side... It's the same as how we have OpenGL but don't have the 3D models of games...
So you need some "Open-Source-Data", which we don't have but still need, otherwise the TensorFlow model will be useless...
It's true that TensorFlow is open source, but it's far from trivial to make it work, so I don't think it's correct to say that there are no software-side problems.
I have an AMD GPU which means I need to use TensorFlow-ROCm, but I've tried a few times and never succeeded in making it work.
(1) is “trivially” solved using something like dolthub.com. (2), I think, is more a problem of FOSS economics. If a maintainer only wants to ship reproduced copies of a model, who should pay for the very expensive GPU compute to do so? Today most CI platforms offer free CPU compute to turn source code into binaries, but I’m not aware of a platform that offers enough free GPU compute to make a similar approach feasible for models.
It seems to me you should be able to take a model, and insert or modify layers without having to retrain everything in the model. I believe this is called transfer learning.
A new layer could be initialized to act as a 1:1 passthrough, then trained independently of all the existing layers, in situ. Once any gains were optimized, you could then train the whole model to realize any improvements.
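To make the passthrough idea concrete, here is a minimal sketch in plain Python. It assumes a toy network of two frozen linear layers with a ReLU between them, and an inserted linear layer with no activation of its own. All names and weights (`W1`, `W2`, `forward`, `train_new_layer`) are hypothetical, and this says nothing about whether the approach pays off on real models:

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def identity(n):
    """n x n identity matrix: the 1:1 passthrough initialization."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical "pretrained" weights; they are never updated below.
W1 = [[0.5, -1.0], [1.5, 2.0], [-0.3, 0.8]]  # 2 -> 3
W2 = [[1.0, -2.0, 0.5]]                      # 3 -> 1


def forward(x, W_new=None):
    h = relu(matvec(W1, x))
    if W_new is not None:
        h = matvec(W_new, h)  # inserted linear layer
    return matvec(W2, h)[0]


def train_new_layer(W_new, data, lr=0.001, steps=200):
    """Gradient descent on squared error, updating only W_new in place."""
    for _ in range(steps):
        for x, y in data:
            h = relu(matvec(W1, x))
            err = forward(x, W_new) - y
            # d(err^2)/dW_new[i][j] = 2 * err * W2[0][i] * h[j]
            for i in range(len(W_new)):
                for j in range(len(h)):
                    W_new[i][j] -= lr * 2 * err * W2[0][i] * h[j]
    return W_new


x = [1.0, 2.0]
# With identity weights the inserted layer leaves the output unchanged.
print(forward(x), forward(x, identity(3)))

# Made-up targets, used only to show that the new layer can be fitted alone.
data = [([1.0, 2.0], -9.0), ([2.0, -1.0], 0.5)]
W_new = train_new_layer(identity(3), data)
print(sum((forward(xi, W_new) - yi) ** 2 for xi, yi in data))
```

The first print shows the same value twice, since the identity layer changes nothing. Training then touches only `W_new`; the existing layers stay exactly as they were.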
It’s amazing to me when anyone who is even familiar with ML can entertain ideas like the above. Is it the result of “library glue” experience with little basic knowledge? I’m genuinely curious how this comes about. It seems lack of basic understanding is rampant in the field.
So you don't think it is feasible?
yeah! We should all get together and do a bunch of free work so the big companies can leverage it without having to pay. Who's with me!
hmm, maybe I'm a pessimistic misanthrope.
Good thing Linus didn't think the same way, we'd all use paid Windows today!
They were talking about making models in a more collaborative fashion, which is a non-obvious problem. Models are heavy, single-task oriented and expensive to train from scratch.
I thought they were talking about building aircraft (or other hobby) models :D Why do people love to kidnap widely used terms like that?!
You mean "statistical model" is kidnapping "model" from its true meaning?
I thought it was going to be about 3d printed models!
The big companies can leverage it, but so can the small startups, government, solo developers, non-profit orgs, charities... anyone in the world, actually.
That's the great thing about open source. It's for everyone without discrimination.
There are only so many groups that can leverage and profit from costly black boxes that predict people's behavior. Additionally, I view ML as a hyperscaler version of science: you're trading money (training hw) for expert time. Most of the time FLOSS and other non-capitalistic entities will have little of the former and (reasonably) much of the latter.
edit: I'm probably being a bit hyperbolic about the use cases for ML. But the hyperscaler thing still holds: most of the interesting use cases I know of could be done more low-tech with more expert knowledge and more parsimonious use of stochastic optimization.
Well if we could give away all the valuable things, companies couldn't profit from selling anything - the thing would already be free! So we should all do more work for free so as not to be taken advantage of by those companies. Now who's with _me_?
Ughh.. I think we need open-source models and open-source data, but I'm also pessimistic about how much the current market forces will try to either just appropriate the results or simply suppress the spread of those results. For all that FOSS has come to (I'm happy, but disappointed), I feel like the difficulty of building a business around FOSS is still a pretty big force keeping its growth down. Not sure there's a change coming anytime soon. I predict (open-source) model building and data collation and creation are gonna have these same problems.
Let dead neurons die