I don’t know what, if any, CS background you have, but that is way off. The training dataset is used to generate the weights, or the trained model. In the context of building a trained LLM model, the input is the dataset and the output is the trained model, or weights.
It’s more appropriate to call deepseek “open-weight” rather than open-source.
I used the word “source” a couple times in that post… The first time was in a general sense, as an input to generate an output. The training data is the source, the model is the “function” (using the mathematics definition here, NOT the computer science definition!), and the weights are the output. The second use was “source code.”
Weights can be changed just like a compiled binary can be changed. Closed source software can be modified without having access to the source code.
The weights aren’t the source, they’re the output. Modifying the weights is analogous to editing a compiled binary, and the training dataset is analogous to source code.
Same girl?