The best advice I can give is that you can think of your app and its data store as either:
#1 code is of prime importance, data store is simply a "bucket" for its data
#2 data is of prime importance, code is simply the means to read/write/display it
In my personal experience, #2 is a way better way to work. Apps can come and go, but data can last a long time, and the better your database is modelled the better the outcomes you'll have long term.
Corollary - I have seen some abject disasters where #1 has been adopted. Not necessarily just because of #1 alone but it's certainly been a major factor.
"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious."
-- Fred Brooks, The Mythical Man Month (1975)
When reviewing patches, I always look at the changes to .h files first, to see how the data structures are modified. Frequently you can predict exactly what the code changes are going to look like as a result.
I imagine that this doesn’t work as well with PIMPL code.
While I am completely in agreement with the intent of this aphorism, and also what the article says about the relational model, I feel this is (as aphorisms tend to be) something of an exaggeration. I think, for example, that it would be difficult to deduce the rules of cricket from its scoresheets.
Tell me your tables' semantics, however, and I think one could do much better. ..
I suspect that, when Brooks made that comment, he was probably thinking mostly in terms of programs for entering and reporting on that data. We're doing a lot more now.
Well, the scoresheet is not enough of course, you need all the "datastructures" (field design, player number and layout, equipment, etc).
edit: but yes, it can be a bit of an exaggeration. Figuring out all the red-black tree rules from just the structure definitions, relations and and invariants is probably a bit of a stretch for most.
One awful thing I've seen a few times is people who put data into a database in a format meant to make it easier for working with a particular platform, such as serialized Java dates (or even dates as longs).
It completely ties you to one implementation and precludes using the data for other things such as reporting and alerts or even ad hoc queries.
You can always convert in the DB when selecting if you want the date. Dates/times in general are kind of horrible in all DBs and programming languages so I don't think using unix seconds (or milliseconds, etc) is obviously wrong. It's kind of like storing strings as plain bytes in the DB, it's not what you'd do by default but it's not completely wrong.
I tend to agree that dates and times are awful to work with but storing everything as a Unix timestamp still seems to be the easiest way to avoid most of the headaches, in my experience. there's nothing worse than trying to reconcile a database filled with naive dates with no associated time zone information. is it daylight savings? is it in this timezone or did they enter it while traveling and their computer automatically updated the time?
The simplest way to minimize these issues, in my experience, is to put the logic for converting local time into UTC in the program and only store the Unix stamps in the DB.
So you don't like serialized java dates, which presumably are some standard at least.
You don't like longs / epoch.
So..... what do you like?
Date formatting/conversion/serialization is not standardized across databases. And it's generally hair pulling (at least the Oracle version was last I checked which was a long time ago).
Longs or ISO-8601 strings? No DB-specific crud. Yes please.
At least you can compare longs and index them. That's not too bad. But I prefer to just use the native date type; it allows you to use the built-in date functions.
> In my personal experience, #2 is a way better way to work. Apps can come and go, but data can last a long time
Exactly. Data is far more valuable than code. We still work with data from literally centuries ago. Most programs written even 10 years ago are probably no longer in use. I would be surprised if the oldest code still doing something useful was more than 20 or so years old.
And that is just open-source.
How did they miss GNU Maxima?
> I would be surprised if the oldest code still doing something useful was more than 20 or so years old.
...you can't be serious? Every other company that has been writing code for 20+ years probably has code that is 20+ years old. Just look at banks.
I could see how someone would get that impression If they've only worked at companies who only exist solely to light VC cash on fire.
It's an incorrect impression, obviously.
For >20 year old code, Calc.exe is just one example off the top of my head - https://github.com/Microsoft/calculator
The majority of the products I've personally worked on over my 15 year career in software development are still doing useful things and contain code written more than 20 years ago.
You've just torn apart most NoSQL databases in a very short comment :-))
So this is just an observation and not a judgment.
With only 4 years of professional experience I have never worked on a MongoDB project in which MongoDB somehow wasn't an issue. The proposed solutions merely being "improve the indices" or "scale the cluster", often without clearly defining what's going on.
So granted I have a very limited experience, it's baffling to me mainly in comparison to the PostgreSQL projects I worked on. They definitely also had problems but these problems were clearly defined even if resolution wasn't quick or easy. It was usually an out-of-date or generally messy schema that was causing issues and folks usually were able to clearly define the schema problems.
I hear this about MongoDB (that it's a plague and you should stay far away) all the time but DynamoDB, which is also NoSQL? One of AWS' finest products; just launching an app on AWS? Use DynamoDB! So on and so forth ad nauseum.
Why is DynamoDB the bee's knees but MongoDB is a thing to be despised?
Don't worry; for every misuse of a NoSQL database when they should have used a relational DB there's a solution misusing a relational DB for a queue, event bus, etc. Nature likes a balance!
Would you remind a dumb-wit like myself why that's a problem, when the API of a relational db is virtually perfect for a queue or event bus? Push, pop, look-but-don't-touch, data available through the most scrutinized and battle-tested query language ever, what else do you need?
Why does PostgreSQL include LISTEN/NOTIFY if I'm not supposed to use it for events?
NoSQL databases are mostly useful for performance reasons. You can have the most beautifully modeled relational database schema but if it performs/scales outside your tolerance bounds, then it's useless.
IMO, this is the main reason we have so many database technologies: they make important performance tradeoffs.
It isn't that difficult to optimize relational SQL databases. I've dealt with databases with billions of rows in hundreds of tables and the company's initial thought was to flatten the schema due to perf. complaints.
We got queries down from 30+ seconds to 5 ms simply by properly indexing, defragmentation, analyzing query plans, SQL Stored Procedures, etc.
I see a lot of complaints from developers claiming they have to join a "million row table to million row table" and reports are slow, and this gets blamed on the DB. These should not be slow outside of how much bandwidth is being pushed over the wire, which is often exactly what the problem is. They just didn't see it.
True, but you really, really want to be sure that if you're going the NoSQL route, your manual schema management will be top notch or your data structures will be super simple and the query demands will also be super simple.
Most of these high performance use cases actually fit this model well (well, the schema management being awesome is the wildly varying part).
But for everything else, there's Mastercard and RDBMS.
You're familiar with every NoSQL database?
There you go, I've edited my previous comment to account for pedantry.
> All these database aspects remained virtually unchanged since 1970, which is absolutely remarkable. Thanks to the NoSQL movement, we know it's not because of a lack of trying.
Kind of an aside, but I find it odd that people treat "relational" and "SQL" as synonymous (as well as "non-relational" and "NoSQL"). You could make a relational database that was managed with a language other than SQL, right?
Yes. PostgreSQL is a reference to the Post Ingres database (POSTGRES) adopting SQL. Ingres used QUEL. The POSTGRES database used POSTQUEL which was similar to Ingres’ QUEL. This is why the Postgres C library is called libpq (library post quel)
True, but SQL is (for better or worse) the de-facto standard query language for relational databases. The alternatives are all very niche.
I would like a better alternative, but it would need some very significant benefits compared to SQL to gain any traction. SQL is just so entrenched at this point that even NoSql database engines are adding support for pseudo-SQL query languages (which is really the worst of both worlds - the clunkyness of SQL syntax without the power of the relational model).
I’m curious as to why you find sql to be clunky, I find it extremely on point in most cases. I mean, how would you write a SELECT that was better than:
SELECT whatever FROM thisplace?
I know you can make it clunky with parameters and crazy stored procedures, and I’ve myself been guilty of a few recursive SQL queries that most people who aren’t intimate with SQL struggle to understand quickly, but I consider those things to be bad practice that should only be done when everything else is unavoidable.
The fact that SQL is still the preferred standard sort of speaks volumes to me about how good it is. We’re frankly approaching something similar with C styles languages. I recently did a gig as an external examiner, and it took me a while to realise that some code I was reading by a student in their PDF report was Kotlon and not TypeScript, because they look so alike.
The problem with sql is what happens when you fall off the SELECT FROM JOIN WHERE GROUP BY HAVING ORDER BY LIMIT cliff. The simple stuff in sql reads like English, but for that case ORM would generate a pretty efficient query anyway. The complex stuff in sql looks terrible in my experience and ORM bail out quickly. Once you can’t get the result with a simple SELECT then sql stops being declarative. Instead of writing what you want to get, you write something like a postmodern poem while having a stroke, just to convince postgres benevolent spirits to give you something almost right. Complex UPDATEs and DELETEs with joins are even worse.
Also lack of syntax sugar doesn’t help. SELECT list could support something like “t1.* EXCEPT col1, col2”. Maybe JOIN ON foreign key would be nice. IS DISTINCT FROM for sane null comparisons looks terrible. Aliases for reusing complicated statements are really limited. Upsert syntax is painful. Window functions are so powerful that I can’t really complain about them though.
We use a lot of sql for business logic, but some code I have to reread from zero every time I need it. Maybe we modeled our data wrong or there is some inherent complexity you can’t avoid, but I mostly blame sql the language. Unfortunately I have no idea how it could be improved.
Anyway I think the sql cliff is real. Once you take a step outside the happy path prepare for a headache. For me sql definitely is in some local maxima, after all I use it every day at work.
One very very simple fix is to mention the table first:
FROM this SELECT whatever
This already allows autocomplete for the attributes to work, and has an easier mental model - you think about the tables, then you think about their attributes. It also matches relational algebra better, where you'd do the projection (picking the attributes you want) at the end.
But anyway, simple cases being simple doesn't mean the language isn't horrible for more complex ones.
One thing I always complain about is join clauses making it easy to do the wrong thing (NATURAL JOIN) and annoying to do the correct thing (joining on the defined foreign keys).
SQL syntax assumes queries have operations in a certain order - join, filter, group, filter again, project. What if you want to join after a grouping? What if you want to filter after a project? What if you want to group over a projection? You will have to use the clunky subquery syntax or WITH-clauses.
Compare to LINQ-syntax in C#, where you can just chain the operations however you want.
Another issue is that you can't reuse expressions. If you have an expression in a projection, you will have to repeat the same expression in filters and grouping. This leads to error-prone copy-pasting of expression or more convoluted syntax using subqueries.
That's the first / starter use case though, SQL can get a bit crazy once you get into enterprise spaces - stored procedures, funky datatypes, auditing & history features like temporal tables, hundreds, thousands of tables and a similar amount of columns, naming & organizing things, etc.
Thankfully, most people will never have to deal with any of that, myself included. The biggest databases I've had to deal with were very relatable - one about books & authors, another about football and historic results. The other biggest database is one I'm working with and building right now, it's a DB for an installation of an application managing tons of configurations, a lot of domain specific terms. The existing database is not normalized of course, and uses a column with semicolon-separated-values as an alternative to foreign keys. Sigh. Current challenge is to implement history, so that a user can revert to previous versions. I'll probably end up implementing temporal tables in sqlite.
> for better or worse
They work, they are good enough, everybody knows how to use them, gaining skills in those languages is valuable and timeless. No inane things like new languages du jour like golang that (fail at) reinventing the wheel appear every few years.
SQL remains beautifully boring & useful and is as close to program language perfection as we will ever get.
Is there something similar for SQL, where you allow an alternative syntax and maybe programming approach and then use SQL as connection to the DB ?
I'll take the clunkiness of SQL over the clunkiness of Mongo's json-based query language any day.
Try http://www.querymongo.com/ :)
Which in itself is confusing given the upside potential of having relational model without the baggage. Hadley Wickham has had huge success in the context of R re-implementing the relational model without SQL (see: dplyr). It is more user friendly and much more pleasant to use than the mess that database users have to put up with.
dplyr and friends are great! But probably difficult to implement the same idioms in languages less open to features like R’s non-standard evaluation.
But you could also use SQL for that new RDBMS, or support multiple query languages. It doesn't matter, all relational databases are very similar because they are relational: they do the same things, potentially in the same way, regardless of the query language front end.
Apart from being subjectively displeased at SQL syntax, most problems with SQL are actually query language independent issues in the RDBMS: maddening proprietary extensions, implementation limits, nonportable details, library and system issues (e.g. character encoding and default configurations).
The best alternative query languages can do is making certain queries (advanced, rare ones) easier to express.
I have grown to understand that the relational model is the answer for solving all hyper-complex problems. The Out of the Tar Pit paper was a revolution for my understanding of how to approach properly hard things:
The sacred artifact in this paper is Chapter 9: Functional Relational Programming.
Based upon inspiration in this paper, we have developed a hybrid FRP system where we map our live business state to a SQLite database (in memory) and then use queries defined by the business to determine logical outcomes or projections of state for presentation. Assuming you have all facts contained in appropriate tables, there is always some SQL query you could write to give the business what they want. An example:
> Give me a SQL rule that says the submit button is disabled if the email address or phone number are blank/null on their current order.
I hope the advantages of this are becoming clear - You can have non-developers (ideally domain experts with some SQL background) build most of your complex software for you. No code changes are required when SQL changes.
--Disable Order Submit Button Rule SELECT 1 -- 1 == true, 0 == false FROM Customer c, Order o WHERE o.CustomerId = c.Id AND c.IsActiveCustomer = 1 AND o.IsActiveOrder = 1 AND (IsNullOrEmpty(o.EmailAddress) OR IsNullOrEmpty(o.PhoneNumber))
The relational model in this context is powerful because it is something that most professionals can adopt and collaborate with over time. You don't have to be a level 40 code wizard to understand that a Customers table is very likely related to a ShoppingCarts table by way of some customer identity. If anyone starts to glaze over at your schema diagrams, just move everything into excel and hand the stakeholders some spreadsheets with example data.
Ahhh... a fellow traveller on the road.
I have been hacking on a Clojure/Script library (https://github.com/wotbrew/relic) to experiment with exactly this kind of thing. / PLUG
For most problems I encounter I want to focus on the data and its relationships.
I cannot help feeling like programming languages and idioms have me deal with too much 'mechanism', the machine like structures and forms that are mostly incidental to the data domains I work in, and are entirely about performance, layout, access pattern etc - when a machine could do an ok job of making those decisions for me until optimisation becomes important.
Glad to see there's more than one of us out there.
> I cannot help feeling like programming languages and idioms have me deal with too much 'mechanism'
Absolutely. Most "best practices" represent a lot of noise that gets in between me and solving my problem. My latest prototypes have been reduced to using static classes/methods in virtually all areas. Getting everything out of the way is so important. When data is king, your code can turn into this ultra-utilitarian apparatus.
I've had partial success writing code generators that can output supporting infrastructure models/code based upon my SQL table definitions. Assuming those generators are well-tested, high-quality code supporting a schema of arbitrary complexity could be produced instantly.
Do you have any additional resources about this model of thought? It's like Redux on steroids lol. I wonder if anybody has done a SQLite-as-the-Store pattern library for front end apps before. I'd use the hell out of that!
One of these days I am just going to have to write a book about it. There are so many layers and perspectives to consider. Maybe another small rant about our roadmap will help you see more clearly how it could work for you:
I am currently looking at an iteration that will use event sourcing at the core (i.e. append-only logs which record the side-effects of commands), with real-time replays of these events into SQLite databases (among other in-memory working sets). The SQLite databases would serve as the actual business customer front-end. We would also now have a very powerful audit log query capability that would go directly against this event source. I would just think about the database as the layer the business can communicate with. It is your internal/technical customer. As long as that database is proper, everything else downstream works without you thinking about it.
The biggest reason for pursing this is to decouple the schema from the logical reality as much as possible. The business likes to change their mind (sometimes for very good reason) and we have to keep them honest at some level. As proposed here, that source of truth will be the read-only event logs.
When you look at this on a whiteboard, you may recognize that it resembles a textbook definition of CQRS. Perhaps try reading up on: CQRS, event sourcing, database normalization, SQLite's application-defined function capability, and anything else that looks adjacent.
So you are talking about something like SQLite-as-the-API? I was a big fan of this idea for a while. Difficulties in streaming query updates from SQL dbs and difficulties in working with deeply nested, hierarchical data made me shy away from SQL though.
Having a client-side relational db to work with would remove huge amounts of complexity to most client apps though.
Client-side databases are relatively popular in ClojureScript land. DataScript and Fulcro both come to mind immediately. DataScript is an immutable in-memory database that uses Datalog to describe queries and Fulcro is a full-stack framework that automatically normalises your database.
> SQLite-as-the-store pattern
I remain consistently amazed at what one can do with lowly little SQLite. I can't count the number of times I struggled with some scripts performance issues when handling a large amount of data that came through as a CSV or text file, which immediately evaporated when i realized "why am I asking the program to handle a million tuples at a time via internal data structures when an indexed database would do it so much faster?"
An in-memory (or even SSD-resident) SQLite database can be ridiculously fast for handling in-process data and not just the stuff you want to keep when you're done.
I think your example is a solid illustration as to why you may not want non-developers writing SQL:
These queries aren't identical in their output, though they are logically the same. Is that a good thing? What are the implications of allowing such ad-hoc queries within a system?
SELECT CASE WHEN o.EmailAddress IS NULL OR o.PhoneNumber IS NULL THEN 1 ELSE 0 END AS [IsSubmitDisabled] FROM Customer c INNER JOIN Order o ON c.Id = o.CustomerId AND o.IsActiveOrder = 1 WHERE c.IsActiveCustomer = 1
The SQL was only intended for illustrative purposes against a contrived domain model. Actual correctness of the query in whatever hypothetical domain was imagined at that time is not something I would focus too much on.
My point exactly. "Not focusing too much" on the SQL is a road you likely don't want to go down, and is rather inevitable when you bestow that power unto non-developers! (You were too quick. I made an edit)
The promise of low code.
I’m sure there is a time and place for it (Power Apps is a billion dollar business), like time keeping frontend logic. Apps with significant logic and complexity will always need developers.
I’m not critical of your app as I have no real understanding of what it does and why the design choices were made. I do not think embedding logic in SQL would be an ideal way for any app I’ve worked on to work.
> The promise of low code.
Not exactly. We still have shitloads of code. It has just been modularized & standardized such that each component can now mostly be configured by SQL the business writes.
Our product is a B2B application, and each one of our customers has very precise & unique business rules they would like to see applied.
Every customer uses the same code pile, but with varying configuration (SQL).
In my limited experience, code built around relational databases is difficult to grok. The structure is inverted with respect to typical OOP. Rather than having a sane class hierarchy, where one can start at the top and navigate down to understand how objects are nested, the structure is inverted, piecing together class relationships requires looking at the tables and following foreign keys which effectively point to parent classes. It feels backwards.
> In my limited experience, code built around relational databases is difficult to grok. The structure is inverted with respect to typical OOP. Rather than having a sane class hierarchy, where one can start at the top and navigate down to understand how objects are nested, the structure is inverted, piecing together class relationships requires looking at the tables and following foreign keys which effectively point to parent classes. It feels backwards.
Other than your first sentence (which is subjective), you are correct. The only question is whether you are going to mangle the database structure to fit your OO hierarchy or design your program in a non-OO way to fit the relational structure.
Since the database will live on long past the program, and will have multiple programs talking to it, it makes sense to design your program around the data, not design the database structure around your program.
[EDIT: See https://blogs.tedneward.com/post/the-vietnam-of-computer-sci... for why OO is a terrible design for persistent data]
I think it’s worth noting that “database-centric” and “app-centric” notions of databases are both found in the wild. That’s how I remembered the difference between MySQL and PostgreSQL—MySQL was what happened when a bunch of app developers needed a database, and it was extremely popular e.g. with the PHP webdev crowd. If you needed to access the database, you went through the app. PostgreSQL is what happened when DBAs designed a database. If you needed access to the database, you connected to the database. A lot of other databases can be understood this way… like how MongoDB further shifts database concepts into the application.
(Honestly I’m definitely in your camp… if I need a database, design the database first, then write the code.)
You're right: object orientation and relational data don't mesh well together.
Where you're wrong, and it may take more experience to realize, is that the problems in the mismatch lie far more on the OOP side than on the relational side.
Object oriented designs - the navigable graphs of objects, leaving aside polymorphism - usually privilege a particular perspective of the data, a perspective which is suited to a specific application. For example, a store with items, customers and orders: the application is focused on creating transactions. Relational models, on the other hand, support multiple perspectives "out of the box" and are expected to support multiple applications; for example, reports that roll up by item, by category and by customer; reusing the same customers for a different application; and so on.
Different applications and perspectives usually require different object models. The alternate application reusing the same customer entities won't want to couple with a store application and decorate every customer with an irrelevant list of orders. A reporting app is better off with a visual query builder that creates SQL to hand off to the database than trying to iterate over object graphs performing aggregations and sorts. And so on.
For applications with ambitions for a longer lifespan, start with the database schema. Keep application models fairly thin; focus on verbs; try to ensure only one piece of code owns mutating a logical database entity with a set of verbs; and keep entanglements like UI out of those verb implementations so that you reuse the verbs, rather than being too tempted to reimplement direct database modifications or push too much logic into models.
Object oriented models come and go, applications are rewritten, but database migrations happen at a much slower pace, and data lasts for a very long time.
Class hierarchies are not "sane", they're inherently hard to extend in a way that preserves a consistent semantics. This is exactly what the relational model is intended to fix.
Foreign keys does not correspond to class hierarchies in OOP. If anything, they correspond to object composition.
E.g. you might have a Customer table and an Order table with FK from Order to Customer. This would correspond to a Customer object holding a collection of Order objects - it would not correspond to Order being a subclass of Customer.
In any case, the relational model is not supposed to map directly to any particular programming model or paradigm. You should be able to interact with relational data from OOP, imperative, functional and declarative languages.
> Foreign keys does not correspond to class hierarchies
If a foreign key is primary in two tables (1:1 relation), then it can correspond well to a hierarchical relation. In other cases, foreign keys can also correspond to composition as you say, and it depends on intention.
An example of inheritance could be the table Parties with primary key Party_ID. The table Customers has primary FK Party_ID (Parties.Party_ID). If you must use ORM, you can implement this is as Customer extends Party.
The class Customer then inherits whatever ability the class Party has to access further tables with FK Party_ID, ex something like PartyAddresses.
This might make more sense if you also have a table Providers with primary FK Party_ID (Parties.Party_ID), and the table Orders with FKs provider_Party_ID (Providers.Party_ID) and customer_Party_ID (Customers.Party_ID).
Normalizing like this allows you to have inheritance, polymorphism, and encapsulation in OOP.
"Access Path Dependence" continues to plague our computers because they are a built-in assumption of file systems.
This is really showing its age. I want my computer to think in terms of what data is, where it came from, any other metadata I received it with, and any metadata I've added, notably, but not primarily, the various "places" I've put it.
The fact that I can't retrieve the URL I downloaded anything from, years later, no mater how many times I've moved it, is just shameful. It's cheap information our tools could be preserving but aren't.
So if you ask me what I think of the relational model, I'll tell you: it's a good idea and we should try it.
Have opened this thread a couple of times today, hoping to see comments disagreeing with the post. Still no such comments, so I’ll take the duty… hopefully not causing a
I have been programming for ~15 years, the majority of my life. Web, mobile, HPC, CUDA, assembly for x86 and ARM, kernel modules, LLVM plugins, and databases… Not once in my career I found the concept of relational databases efficient or relevant. They are frustrating to use, exceptionally slow and generally provide SQLs, or other DSLs, which is the most archaic form of query representation I could think of - not binary and not general-purpose.
It reminds me of no-code development platforms. They may work (not very well) for some super simple tasks, but as soon as you want to do something at least remotely non-trivial, they put more barriers, then provide help. From personal experience, again, I have grown to hate frontend development so much, that have tried a dozen of website builders, before reverting to good old HTML/CSS (plus a bit of JS) every time I wanted to refresh my blog or companies website.
Plus, I wouldn’t immediately dismiss the concept of Graph Databases. If we want to be truly canonical, we wouldn’t create hundreds of columns in our tables, with just a few relations. The ideology is that every unique “type” (in any sense you prefer), should be in its own table, linked with the other “types” in other tables… Then theory ends and starts practice. Try implementing a fast JOIN in a relational database. Then increase the depth to 3, tracing the relations of relations of relations. Even in a non distributed case it is a horror. Both the nested SQL queries and the program that will be evaluating them. Graph DBs are designed to solve specifically that issue really well.
Another point: how “relational algebra” suddenly makes smth superior to anything else? It’s not a Grand Unified Theory of Physics, not rocket science and not even Graph Theory for that matter. The latter being the biggest and most studied branch of Theoretical Computer Science with brilliant theorems being published even today.
Not saying that todays popular Graph DBs are good (they are mostly disgusting), but I would still much rather think of my data as a graph, than a table with some foreign keys
My big realization was that the relational model is a constraint on your data so that relational algebra can optimize your queries for perf. But these constraints cause a whole range of issues, which always result in a poorly modeled domain as people try to work around the limits of the optimizer or the expressiveness of SQL.
I've always felt that what we need is a way to maintain a logical schema (E-R diagram / graph schema), and then the physical schema is automatically generated along with denormalizations for perf as necessary. A graph db is simply denormalizing its joins using index-free adjacency.
One problem I have with relational databases is I don't know of a good way to represent sum types - I remember seeing some possible solutions but they looked very complex and hard to understand (or dbms-specific)
can you elaborate a little? i've not heard of the term sum types before and when googling superficailly they dont seem that exciting, particularly for persistent data. When would they not just be a foreign key to table or one column for each allowed datatype (or a mix of the two)?
Sorry, I am assuming here its my ignorance thats teh issue not knowing any real word examples of why they are a big deal.
I don't know if this explanation is a good one, but I'll try using Haskell syntax. In Haskell, you can have product types like:
where an element of this type is expressed as `Coord x y` where x and y are both floats. Examples of elements of this type are `Coord 1.1 0.9` or `Coord -2.9 10.0`, etc. Product types are equivalent to structs in C, if you're familiar with C.
data CartesianCoordinate = Coord Float Float
But you can also have sum types. Instead of starting with the general idea, I'll point out that C enums are a special case of sum types:
and then point out linked lists are more representative of the general idea:
data Day = Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday
where an element of `ListOfFloats` is for instance `Node 1.0 (Node 2.0 End)`, or `Node 0.3 End`, or just `End`.
data ListOfFloats = Node Float ListOfFloats | End
The pipe symbol | is what makes a sum type a sum type. It means that an element of the type is either one possibility or another possibility or another possibility.
One final example is a type consisting of all possible mathematical expressions. This is also a sum type:
An element of this type can be something like `Add (Const 1.5) (Times (Const 0.2) (Const 2.8))`, which is supposed to represent the expression "1.5 + 0.2*0.8". Interestingly, you can't easily express this type in most OOP languages.
data Expr = Add Expr Expr | Times Expr Expr | Negate Expr | Inverse Expr | Const Float
In simple set theory parlance, product types refer to Cartesian products, and sum types are set-theoretic unions.
The relevance to relational databases is that each row of a table corresponds to an element of some product type. Each row of the same table has the same product type. But there is no means defining a "table" whose elements belong to a sum type as opposed to a product type. Why is that?
As the parent mentioned, you can encode your sum type by providing all the columns and constraining exactly one to be non-null.
Maybe the File System shouldn’t be hierarchical but rather relational as well.
Microsoft Sharepoint uses SQL Server for all user Item storage, so it's an example of an all relational file store. They did this around the time they were also experimenting with the Cairo a sibling comment mentions. Also as another commenter noted you query on the metadata. Folders/directories for instance are represented as metadata. Installations can also describe very detailed ontologies including ones provided by third parties. It's a lot of work.
Edit: I think I was actually thinking of WinFS which came out of Cairo later, around 2000.
That means onedrive for business uses sql server as it uses sharepoint online, interesting.
That's an interesting point. As I recall (unfortunately it's been a while). There was tension at first who would be top dog. SharePoint or OneDrive, and OneDrive won, so SharePoint Online is actually mapped over OneDrive. MS could do that because ironically even though SharePoint On-prem had been entirely SQL Server underneath, SharePoint On-prem devs generally never wrote a line of SQL. They interacted with the SharePoint Object model and that was basically mapped over OneDrive. What OneDrive is underneath I don't think MS had to be as forthcoming about because of it's cloud-based nature. I was on my way out by that point though.
Microsoft wondered that too and for a while developed "CairoOFS" as a possible replacement for NTFS. It was intended as a relational "object filesystem" https://betawiki.net/wiki/Microsoft_Cairo
And then we could have fun looking at the execution plan of our file access. :D
Meant in jest; though I think the idea of a "one true way" to access files is a pipe dream. The hierarchical works better on my computer than whatever scheme we've cooked up for our phones.
> The hierarchical works better on my computer than whatever scheme we've cooked up for our phones.
What do you mean? Hierarchical is how I access files on my phone. Is that an Apple thing?
I thought phones had moved to an "ownership" model where access to files goes through the applications. That said, trying right now, I see there is a files app that seems to be mainly types. I'm assuming that is actually by folder?
The only thing the relational models really gives you is the ability to join two relations, which lets data decompose for storage but re-compose for many different uses. I'm not immediately seeing how that would be useful in a file context, where generally you want to look up a specific blob of data (aka a file).
A tree-based file-system is optimised for doing a search from the users perspective, finding a file takes log(files) steps and finding related files is trivially cheap. It is likely hard to outdo that with a relational model.
"Finding files" has never been my problem. Organizing and categorizing them has. And the reason it's been a problem is because my files don't contort into a neat hierarchy.
> I'm not immediately seeing how that would be useful in a file context, where generally you want to look up a specific blob of data (aka a file).
I would really enjoy a filesystem that's a loose tag-based hierarchy rather than a strict single folder tree.
In SQL terms, something like:
So tags are organized in hierarchies, but there can be multiple parallel hierarchies, and a file can belong simultaneously to multiple hierarchies. Say one is a flat list of tags by user, another is a flat list of tags by apps, and another represents replication or backup strategies, plus the usual directory organization.
create table files ( file_id bigint primary key , file_name text ) create table tags ( tag_id bigint primary key , tag_name text not null , parent_tag bigint null ) create table file_tags ( file_id bigint , tag_id bigint )
(I'm not sure if a file should be allowed to belong to multiple tags within the same hierarchy; my gut says no.)
Let's say that as a convention we separate the root tag in each hierarchy with a colon, and other tags with a slash. Your file 'cool_code.py' may be found under the tags 'storage:sda', 'users:roenxi', 'apps:pycharm', 'pycharm:projects/cool_app/src/utils', and 'rclone:firstname.lastname@example.org/202201'.
> (I'm not sure if a file should be allowed to belong to multiple tags within the same hierarchy; my gut says no.)
Gmail is a good example. Has very flexible tagging schema. Every tag exists in a hierarchy. Emails can have multiple tags.
SQL is not good for hierarchies though. You need to use recursive CTEs or denormalize relations and things.
This looks really cool!!
I would see it more as a different and better way to store the same and additional information. In a RDBMS an index is already storing references in a tree format. If you map that to files and folders you now have basically the same thing as a file system.
Doing it more as a DB enables the OS to use the knowledge from RDBMS for efficiency, which I'm sure rivals the best file systems and it's possible to create multiple indexes and views for other use cases.
Our current view on file systems and the knowledge we have is heavily influenced by slow spinning disks, while RDBMS have leveraged RAM a lot more. With todays fast SSDs the file system operates in a reality that is more like RAM than a slow spinning disk.
> In a RDBMS an index is already storing references in a tree format. If you map that to files and folders you now have basically the same thing as a file system.
Yeah, a file system stores general data so it is very easy to map it to an RDBMs that also stores general data.
But what this is identifying that once the relational nature of the data isn't a factor, the best lookup structure is a tree.
> A tree-based file-system is optimised for doing a search from the users perspective, finding a file takes log(files) steps and finding related files is trivially cheap. It is likely hard to outdo that with a relational model.
A tree structured file system is effectively conflating an index with relations. Another way to look at it is that most FSes are like databases which only allow one index, and expose the index to users.
Get a daily email with the the top stories from Hacker News. No spam, unsubscribe at any time.