Hacker News
3 months ago by pabs3

Reminds me of the Debian Deep Learning Team and their recent thread about ML being incompatible with Free Software values and principles

https://salsa.debian.org/deeplearning-team/ml-policy https://lists.debian.org/msgid-search/8bc4a2fdb2a0619d1d8214...

3 months ago by woolion

They classify models according to reproducibility and openness criteria. So a freely accessible model lacking proper documentation/reproducibility is essentially considered non-free: "toxic candy" software follows all Free Software good practices, yet the models it depends on suffer from such issues. That would mean that if there were, e.g., a particularly sneaky input data bias, it could cause issues down the line, where it would be hard to determine whether they were caused by a poorly calibrated parameter or by missing training examples.

3 months ago by jka

Yep, a similar area. I don't think they state that machine learning is incompatible, though?

From your first link, the (experimental, work-in-progress) policy document is a worthwhile read:

https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/...

3 months ago by pabs3

See the second link: the main Debian DL/ML team member has concluded that open ML is not feasible. Firstly due to hardware: everyone uses the proprietary NVIDIA CUDA stack; AMD has open drivers but is playing catch-up, and both require proprietary firmware. Then there is the speed at which the space is changing, faster than Debian release cycles. And finally the lack of libre training data, the lack of training openness, and the costs of retraining.

3 months ago by pabs3

Another thing that was mentioned in the thread is non-copyright issues with training data, for example laws around consent for use of personal data.

3 months ago by synctext

Running code state-of-the-art for federated learning on mobiles: https://github.com/jverbraeken/trustchain-superapp/commits/m... (Publication: https://arxiv.org/abs/2110.11006)

People like to code their own work instead of contributing to others' projects. Incremental work does not count. This is more of a flaw in how scientific output is counted, I believe. Also, our technology stack is completely "zero-server", which makes everything on Android much harder and more complex. Disclaimer: I'm the thesis adviser of this student.

3 months ago by mike_hearn

"People like to code their own work, instead of contributing to others. Incremental work does not count. This is more of a flaw in scientific output counting I believe"

Lots of truth here. Actually, maybe the OP could change the title to make it clear you're talking about machine learning models? When I first saw the title I thought it would be about the deeply problematic state of scientific coding and "scientific" modelling in general. The ML world isn't so bad regardless of how it may seem, because it's got deep roots in computer science and industry. The moment you start looking at the code for things like epidemiological models, climate models etc it becomes clear that the incentive structures in science are completely broken.

I've not only heard from others but also seen with my own eyes that people who call themselves "scientists" will happily write models in C that, e.g., use pointer values in an equation instead of the dereferenced values, leading to incorrect outputs. When this is pointed out, they claim there are no problems, or that they checked the results and the bugs make no difference (i.e. they lie). Or they'll have race conditions that cause unstable outputs and pretend it's because of a (pre-seeded) PRNG they used, apparently in the hope that other scientists don't understand pseudo-randomness properly. Or they'll mix up their variables because everything is a single letter and there are thousands of lines of code, or they'll do out-of-bounds reads in a sorting algorithm because they don't know about standard libraries, etc.

Nobody ever seems to retract papers because of bugs like these, and when the results of their model don't match reality they'll make arguments like "the model was validated against other models, which is a reasonable way to prove validity" or "we don't make predictions we make scenarios".

If researchers were more willing to collaborate on shared codebases, they'd start to learn programming better, be able to recruit wider and more diverse teams, and they'd share infrastructure and best practices; generally, things might stand a chance of improving. But, as you say, there are major flaws in how the output of scientists is evaluated (by governments/non-profits), and this has a nasty habit of converting well-meaning scientists into what are effectively pseudo-scientists.

3 months ago by yonixw

As far as I understand, the real problems are (1) how do you gather data to train on, and (2) you need a lot of expensive GPU compute.

TensorFlow and many more are already open source, so it looks like the problem is not software-side... It's the same as how we have OpenGL but we don't have the 3D models of games...

So you need some "Open-Source-Data", which we don't have but still need; otherwise the TensorFlow model will be useless...

3 months ago by jstanley

It's true that TensorFlow is open source, but it's far from trivial to make it work, so I don't think it's correct to say that there are no software-side problems.

I have an AMD GPU which means I need to use TensorFlow-ROCm, but I've tried a few times and never succeeded in making it work.

3 months ago by edude03

(1) is “trivially” solved using something like dolthub.com. (2), I think, is more a problem of FOSS economics. If a maintainer only wants to ship reproduced copies of a model, who should pay for the very expensive GPU compute to do so? Today most CI platforms offer free CPU compute to turn source code into binaries, but I’m not aware of a platform that offers enough free GPU compute to make a similar model feasible.

3 months ago by mikewarot

It seems to me you should be able to take a model, and insert or modify layers without having to retrain everything in the model. I believe this is called transfer learning.

A new layer could be initialized to act as a 1:1 passthrough, then trained independently of all the existing layers, in situ. Once the gains from it plateaued, you could then fine-tune the whole model to realize any further improvements.
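A minimal NumPy sketch of the idea above (the tiny two-layer "pretrained" net, all shapes, and the target are made up for illustration): insert a layer initialized as an exact identity passthrough, verify the model's output is unchanged, then train only the new layer's parameters while the original weights stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "pretrained" network: y = W2 @ relu(W1 @ x). W1, W2 stay frozen.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

def forward(x, inserted=None):
    h = np.maximum(W1 @ x, 0.0)
    if inserted is not None:
        W_new, b_new = inserted
        h = W_new @ h + b_new  # the inserted layer sits between W1 and W2
    return W2 @ h

x = rng.normal(size=4)

# Initialize the new layer as a 1:1 passthrough: identity weights, zero bias.
W_new, b_new = np.eye(8), np.zeros(8)
y_before = forward(x)
y_after = forward(x, (W_new, b_new))
# Identity initialization leaves the model's behavior exactly unchanged.
assert np.allclose(y_before, y_after)

# Now train only the inserted layer "in situ" toward an arbitrary target t,
# minimizing ||forward(x) - t||^2 by gradient descent on (W_new, b_new) only.
t = rng.normal(size=2)
h = np.maximum(W1 @ x, 0.0)
# Conservative step size from a curvature bound, so the loop stays stable.
lr = 0.5 / (np.linalg.norm(W2, 2) ** 2 * (h @ h + 1.0))
loss_init = np.sum((forward(x, (W_new, b_new)) - t) ** 2)
for _ in range(500):
    y = W2 @ (W_new @ h + b_new)
    g = 2.0 * W2.T @ (y - t)      # gradient w.r.t. the inserted layer's output
    W_new -= lr * np.outer(g, h)  # only the new layer's parameters move
    b_new -= lr * g
loss_final = np.sum((forward(x, (W_new, b_new)) - t) ** 2)
assert loss_final < loss_init  # the new layer learned; W1 and W2 never changed
```

This only demonstrates the mechanics (identity init is loss-neutral, and the frozen surroundings make the new layer's subproblem a simple least-squares fit); real transfer learning would of course use a full framework and real data.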

3 months ago by throwaway1492

It’s amazing to me that anyone who is even familiar with ML can entertain ideas like the above. Is it the result of “library glue” experience with little basic knowledge? I’m genuinely curious how this comes about. It seems a lack of basic understanding is rampant in the field.

3 months ago by mikewarot

So you don't think it is feasible?

3 months ago by bryanrasmussen

yeah! We should all get together and do a bunch of free work so the big companies can leverage it without having to pay. Who's with me!

hmm, maybe I'm a pessimistic misanthrope.

3 months ago by visarga

Good thing Linus didn't think the same way, or we'd all be using paid Windows today!

They were talking about making models in a more collaborative fashion, which is a non-obvious problem. Models are heavy, single-task oriented and expensive to train from scratch.

3 months ago by brabel

I thought they were talking about building aircraft (or other hobby) models :D Why do people love to kidnap widely used terms like that?!

3 months ago by visarga

You mean "statistical model" is kidnapping "model" from its true meaning?

3 months ago by teh_infallible

I thought it was going to be about 3d printed models!

3 months ago by brabel

The big companies can leverage it, but so can the small startups, government, solo developers, non-profit orgs, charities... anyone in the world, actually.

That's the great thing about open source. It's for everyone without discrimination.

3 months ago by lapinot

There are only so many groups that can leverage and profit from costly black boxes that predict people's behavior. Additionally, I view ML as a hyperscaler version of science: you're trading money (training hardware) for expert time. Most of the time, FLOSS and other non-capitalistic entities will have little of the former and (reasonably) much of the latter.

edit: I'm probably being a bit hyperbolic about the use cases for ML. But the hyperscaler thing still holds: most of the interesting use cases I know of could be done more low-tech with more expert knowledge and more parsimonious use of stochastic optimization.

3 months ago by drewcoo

Well if we could give away all the valuable things, companies couldn't profit from selling anything - the thing would already be free! So we should all do more work for free so as not to be taken advantage of by those companies. Now who's with _me_?

3 months ago by nandhinianand

Ughh.. I think we need openly sourced models and open-sourced data, but I am also pessimistic about how much current market forces will try to either simply appropriate the results or plainly suppress their spread. For all that FOSS has achieved (I'm happy, but disappointed), I feel like the difficulty of building a business around FOSS is still a pretty big force keeping its growth down. Not sure there's a change anytime soon. I predict (open-source) model building, and data collation and creation, are gonna have these same problems.

3 months ago by Kalanos

Let dead neurons die
