Is a DB that doesn't need indexes attractive to DBAs?
7 points by didgetmaster on June 7, 2022 | 48 comments
I have worked a few jobs that required some DB tuning in order to get queries to run faster. One of the first things to do is to create and maintain indexes on relational tables. You spend time analyzing queries to determine what columns need indexes and what kind of index each one needs. Once the queries get fast enough, someone comes up with a new query that is slow because it filters on a column without an index and your nose is right back at the grindstone.

I created a new DB architecture that is radically different from anything I have seen. It is a set of performant key-value stores that can be grouped together to form relational tables. The data within each store is optimized for search, so you have just one copy of each data point (no separate indexes). It can handle large tables (100 million+ rows and 2000+ columns) and has a flexible schema, so it can handle data like Json where any column can hold arrays. The system is in open beta at <www.Didgets.com> so anyone can download and try it.

Since I never liked the grunt work of creating and maintaining indexes (along with the performance hit they cause for inserts, updates, and deletes), I thought that other database experts would be really interested in a new system that does not require them. Queries on my system are faster than on other highly-tuned DBs. But so far I have gotten a collective yawn from many people I show it to.

Am I missing something? Do others really like creating and maintaining DB indexes, even though I did not?



See "triple store" and "columnar database" for similar products. Salesforce got a patent circa 2000 for a database that automatically builds indexes based on the queries people make, which left a huge gap in the database products available; one reason why people looked at graph databases and the semantic web and said "meh".


This is infuriating, as patents often are, because it's entirely obvious on its face (I did not read the patent). Query optimizers have been suggesting indexes for a long time.


Do you have any pointers or more info on that patent?


I think it's this one

https://patents.google.com/patent/US8423535B2/en

but they have numerous related patents.


one of many reasons ;)


It looks like you wrote more about the database's advantages here than on the main website https://didgets.com/ A feature matrix or direct comparison would be great. For any advantage in one area (e.g. query speed) I suspect there will be a downside in another area (e.g. slower insertion of new data).

When comparing databases I'm used to looking at feature lists, e.g. read vs write performance, ability to compress data, scaling among multiple servers, and such. For example https://en.wikipedia.org/wiki/ClickHouse#Features or https://en.wikipedia.org/wiki/Apache_Cassandra#Main_features Then later I look at the ecosystem, e.g. how to do backups, client libraries (linking to a shared library vs direct vs HTTP API), and ease of use.

I don't mind creating indexes. I maintain databases that have almost as much index data (measured by size) as record data.


The project is just getting going so there are still some features yet to be implemented before it will compete directly with other DBs feature for feature. (e.g. it only does a couple types of JOIN operations so far).

The key-value stores I created were actually invented to be used as tags for the object store I built to compete with conventional file systems. It wasn't until I discovered how fast the tags worked that I decided to benchmark them against popular DBs like PostgreSQL, Microsoft SQL Server, and MySQL. When I discovered that queries against the same data sets were about 4-10x faster on my system, I turned my focus to the DB features.

As you pointed out, there is still a lot of work to be done on the marketing and sales side if this is going to be a successful startup.

I haven't done much benchmarking on insertion speed. I can insert CSV or Json files at about 100K rows per second, but I have not done a direct comparison against row oriented systems for that.


I may be unconventional, but I prefer to think of a relational DB indexes-first; i.e. the indexes _are_ the database. The tables themselves are a data source, with the indexes (database layout) chosen carefully to match the functional requirements.

So naturally any claim of a generic data structure that out-performs tailored indexes in _all_ cases raises an eyebrow.

The only logical answer is that the range of queries you are measuring by is substantially limited, and I would pass on investigating further.


I don't claim to be able to outperform indexed tables in other databases in "all cases". After doing extensive testing, I have certainly found a few cases where a query in Postgres, MySQL or MS SQL Server will run a little bit faster than on my system. But in the vast majority of cases, they ran substantially faster on Didgets.

I disagree with your claim that "the range of queries you are measuring by is substantially limited". I have run a broad range of queries against hundreds of data sets of all sizes.

You are free to "pass on investigating further", but if you are interested in having an open mind, then feel free to try it (even if to just try and prove me wrong). If you have a good CSV or Json file with a few million records, it only takes about 10 minutes to download the Didgets software; load in your data; and run a handful of queries of your choosing. What do you have to lose (besides a few minutes of your time)?


I'm not here to disagree with you; I'm literally just answering your question -- of why people would pass on your offering.

> What do you have to lose (besides a few minutes of your time)?

Well, exactly this.

Sadly this is almost the definition of experience, or intuition -- not spending time on things which are unlikely to be successful.

Extraordinary claims require extraordinary evidence, and by now I would have liked some concise specifics. Why is your system faster? What exactly are the operations in the "broad range" of queries? Which operations are slower, and why?

It takes more than a few minutes to evaluate something like this.

I think you aren't doing any JOINs.


I certainly understand your skepticism. Everyone is busy and can't go chasing after every shiny new object. My system is faster for a number of reasons.

1) I have a unique way of storing the data in a compact format. Each value is de-duped and reference counted to save space and speed lookups.

2) I have a number of algorithms that use hashing, bloom filters, and other techniques in combination that I think are superior.

3) I utilize the multi-core features of modern processors for individual queries. With multi-threading I can run different parts of the same query in parallel. Other DBs like Postgres have made some progress in this area, but I think they have run into trouble trying to port their old architecture to take advantage of this.

So if you have a query such as:

"SELECT col5, col7, col10 FROM <table> WHERE (col5 ILIKE 'A%') AND (col7 < 10000) AND (col10 ILIKE '%Hello%');"

the system can find the matching keys for each column in parallel. Like I said, there might be instances where the same query on another system will be faster, but on Didgets a broad range of queries are about 4x - 10x faster in my tests.
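That parallel evaluation can be sketched roughly as follows. This is an illustrative toy, not the Didgets implementation: the column data, thread pool, and set intersection are all invented here to show the idea of matching each predicate against its own column store concurrently, then intersecting the resulting row keys.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "column" maps a row key to its value (made-up sample data).
col5 = {1: "Apple", 2: "Banana", 3: "Avocado"}
col7 = {1: 5000, 2: 20000, 3: 900}
col10 = {1: "Hello world", 2: "Hello there", 3: "Goodbye"}

def matching_keys(column, predicate):
    """Scan one column store; return the row keys whose value matches."""
    return {k for k, v in column.items() if predicate(v)}

predicates = [
    (col5, lambda v: v.upper().startswith("A")),  # col5 ILIKE 'A%'
    (col7, lambda v: v < 10000),                  # col7 < 10000
    (col10, lambda v: "hello" in v.lower()),      # col10 ILIKE '%Hello%'
]

# Evaluate every predicate against its column in parallel.
with ThreadPoolExecutor() as pool:
    key_sets = list(pool.map(lambda p: matching_keys(*p), predicates))

# Only rows matching all predicates survive the intersection.
rows = set.intersection(*key_sets)
print(sorted(rows))
```

Because each predicate touches a different store, the scans are independent and parallelize naturally; the final intersection is cheap relative to the scans.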

It will certainly take more than a few minutes to evaluate every feature, but it should only take a few minutes for someone to figure out that I am not just blowing smoke.


You are comparing performance on your system with an index on all three columns to a system with no indexes on the filter columns.


In every comparison with other DBs, I made sure the other system was as fast as I could make it with proper indexes on every column I was querying against.


What indexes did you add on Postgres for that query?


Not interested in video, text or it didn't happen.

I run transactional systems, concurrent oltp performance or it didn't happen.

We have hugely diverse sets of interrelated applications and systems built against either the primary data set or downstream datalake/BI/cube systems. Integration and compatibility or it didn't happen.

We run Cloud or Linux systems. Platform support or it didn't happen.

Some of us run databases that are decades old.. it needs to have been happening for at least 5 years and/or be shipped by a major vendor or it didn't happen.

See also: must have 10 years experience in technologies that have existed for 5.

New tech is hard, data is important and we're busy, in an industry with many many new shiny things.

Neither nice nor fair, sorry.


When I worked at Novell in the early 90s, one of the engineers ran a demo of some "new shiny thing" called an Internet Browser (I think it was the original Mozilla one).

Several of my co-workers made comments about it that reminded me of your post. The management of Novell basically ignored the whole new "Internet thing" until it put them out of business.

I am not saying that my data management system is going to put all those other DB systems out of business, but it is interesting how so many people on these tech forums seem to be against anything but the status quo.


Yeah, tbh, it sounded harsher than I intended, and you could actually plot a pretty successful path through life simply by doing the opposite of everything I say.

But there's lots of new shiny. Yes, some of it turns out to be the new big thing. Lots more of it turns out to be a neat feature that simply gets replicated / bought and integrated into current platforms. Lots more settles into specific niches or markets you'll almost never hear about on HN.

The majority disappears, wrong or right, and we all have time and attention budgets.

I'm not against it, I'm just not interested until it can tick more boxes. Or unless it happens to pique my interest for whatever personal quirk of mine - and in this instance all the information is stuck behind my least preferred delivery style so....?

Even then it's often moot.. the database I'm using now is dictated by application compatibility more than anything.. there's already "better" for almost any measure of the word, _except_ that this is the db our vendor app supports and have stored procedures developed for, and porting isn't a priority.

All that said, if and when you take over the db industry, you're welcome to hold me up as yet another grumpy old timer who couldn't keep up!


>We run Cloud or Linux systems. Platform support or it didn't happen.

This 100x

I find it very strange that the demo requires Windows


While the system is designed to run on any platform, I simply don't have the bandwidth to support every OS for every build at this stage of the startup. We got it running on one Linux distribution for a few builds, but until we get more resources I have to limit its scope for now.

We are looking for innovators and early adopters at this stage (for those familiar with Geoffrey Moore's book 'Crossing the Chasm') who are willing to evaluate the technology without concern that every OS, every language, every integration, and every feature isn't working perfectly yet.

I thought HN would be a great place to find such people, but based on several of the comments, there are a lot of late adopters and even laggards hanging out here.


I suspect the data science folks are more likely to be interested in what you've done than dba's, at the stage you're at.

They're more likely to be working with desktop tools, with their own copies of unstructured or semi-structured data, and they're far more likely to be doing ad-hoc or exploratory queries and manipulations, and they'll care far less about concurrent access.. which is absolutely critical in just about any environment that still thinks it's worthwhile to hire DBAs.

Faster queries for a single user on a desktop solves no problems I have - as nifty as some of the techniques you're starting to talk about are.


>there are a lot of late adopters and even laggards hanging out here

Databases, data stores, etc all have to work flawlessly for a long time

You're not going to find many (if any) "early adopters" in that field


Perhaps a better way to ask the question is:

If you had to choose between DB system A and DB system B which were functionally equivalent and the only difference between them is that system A required you to create and maintain indexes to get the same level of performance as system B does without needing this step; how important would that feature be to you when making the decision?

For me, I would choose system B over system A every day of the week and twice on Sunday. But if everyone else just says 'Meh' to that characteristic, then I don't want to spend much time promoting it. My system has a bunch of other features that I think are really good too, so if that is the case I would rather just focus on promoting them instead.


If the system can run without indexes while still being performant in other areas ( multiuser oltp for example ) then that's awesome, absolutely. Pretty much every other trick for faster multi-record queries sacrifices oltp performance in some way.

But to step back a bit. DBA's manage database engines as much as they manage the data, often more. Schema design, index maintenance etc is part of it, but on a stable system often only a small part.

And replacing those db engines is a big deal, as per my earlier comments about compatibility and integration - without big, strategic management support, it's often effectively impossible to change them. So you're offering a choice that, in my experience at least, DBAs actually rarely make - coders, architects, and history usually make, or made, those choices.

We also often don't care about the data "as information", only in the sense that it's been entrusted to our care. The people who use the data usually wear different hats (bi, ml, analyst, data science, sometimes managers and of course coders ).

But those people aren't adding / removing indexes in DBA managed environments.

Your parallel column scan sounds cool, and I can see how it falls naturally from your design, and it's often hard to retro fit onto an older engine.

My instances have 11000+ clients during peak, and we've already turned off almost all capabilities for parallel queries - because for 95% of the workload it can't help, and likewise an errant query that suddenly jumped up and started consuming multiple cores' worth of resources could be a disaster.

So it sounds like you've got some cool stuff, you're just talking to the wrong demographic!


Can you explain what you mean by "grouped together to form relational tables"? Is a row comprised of pointers into each key-value store? It sounds like you have removed the decision to add indexes by adding an index to every single column.


Each column is a set of key-value pairs. The values are stored such that searching for specific values or patterns is extremely fast, much like a regular index. You can drop a column from the table (or add a new one) without needing to export all the data, modify the schema, and re-import the data into the modified table. A table is just a 'group' of these key-value stores I call table columns.

For example, an address column (type=STRING) may have a value (e.g. '123 Main St.') mapped to key 234 (e.g. the row key). In fact it might have 2 or more values mapped to that same key (assuming the address column is not a primary key column). So unlike regular relational tables, you can have multiple values per column for any row. This is very similar to how NoSQL systems store Json documents where any value can be an array.

A query that wants to find all rows that start with a '1' or contain 'Main' or end in 'St.' will get key 234 (along with any other keys that match). Once the keys are found, it can retrieve the values for any other specified columns that are mapped to those keys and give you the whole row.

In essence, this columnar store DB either has no indexes or is nothing but indexes depending on how you look at it. The point is there is no duplication of data between a regular store and a separate index.
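A toy model of one such column, illustrative only and not Didgets' actual structure: each column maps both ways (row key to values, and de-duped value to row keys), so the store doubles as its own index and a row may hold several values in one column.

```python
from collections import defaultdict

class TableColumn:
    """Toy model of one column as a key-value store that is its own
    index (hypothetical sketch; names and layout are invented)."""

    def __init__(self):
        self.by_key = defaultdict(list)    # row key -> values (arrays allowed)
        self.by_value = defaultdict(set)   # distinct value -> row keys

    def put(self, key, value):
        self.by_key[key].append(value)     # a row may hold several values
        self.by_value[value].add(key)

    def keys_where(self, predicate):
        # Each distinct value is stored once, so a pattern search scans
        # the de-duped value set rather than every row.
        hits = set()
        for value, keys in self.by_value.items():
            if predicate(value):
                hits |= keys
        return hits

address = TableColumn()
address.put(234, "123 Main St.")
address.put(234, "456 Oak Ave.")   # second address mapped to the same row
address.put(235, "77 Main St.")

print(sorted(address.keys_where(lambda v: "Main" in v)))
```

A missing key-value pair is simply an absent entry, which is why NULLs in sparse tables cost nothing to store under this scheme.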


So you have a primary key ("row key") and each column C is essentially its own table containing column C and the primary key column, with the index being on column C. You can see how there is data duplication here, right? Each primary key value is duplicated #columns times.

So we have a mapping from column value to row index. Is there an inverse mapping stored if I want to find all column values contained in a given row?


Yes there is duplication of keys since each column has a copy of each key where a value in that column is mapped to that key.

The column (key-value store) maps both ways. It is really fast to find all the keys that are mapped to a given value (or pattern) as well as find all the values that are mapped to a key.

You could say the key is a primary key since each key is unique, but a table can also have a 'Primary Key' column where a unique value can only be mapped to a single key and every row must have a value in that column.

This system is very efficient for sparse tables. Null values are essentially free since the key-value pair is just missing if a column does not have a value for the row. It takes some processing to determine the key is not mapped anywhere in the column, but it is very fast.


It's really hard to grasp what's so great about your idea for an engineer who would love to use something new and cool, but doesn't know any DB details.

Can you give a clear "grandma" explanation of the benefits of your solution and why exactly it is better than Postgres from a user's perspective?


There are many sites on the Internet where you can download data sets in CSV or Json format. Some of these data sets are small and can be easily loaded into a spreadsheet. Others have millions of rows and require SQL skills, which many people don't have, to load the data into a database and analyze it.

Using Didgets, someone with very limited DB skills can download a data set from Kaggle, for example, drop the file on the Didget Browser database window, and with just a few clicks of the mouse begin analyzing the data.

Since the system can process large data sets so fast, it is easy to drill down to subsets of the data or build pivot tables even when there are many millions of rows in the table. It has some built-in charting for those who like a visual representation to see patterns.

There are videos on the Didgets website that show how to do this.


Amazing, thank you. Now it sounds to me that I can use your software to quickly import big files somewhere.

But then the new question arises — after I’ve imported, why should I stay there and not export data to Postgres that I’m familiar with (supposedly your db can export as fast as import)?


Since our system is still under development and is not yet ready to replace whatever mainstream DB you are using, this is precisely what most of our customers are using it for. It is a great tool for analyzing data quickly. You can export the data once you finish cleaning it, analyzing it, or narrowing it down to a subset.

Again, we are trying to find traction in the market by getting people to try it and find value in the features that we currently have working.


Got you. Thanks and good luck with promoting.


I seem to remember Snowflake was a relational db that claimed indexes weren't required in most circumstances, but I only used it briefly. I guess the indexes are implemented under the hood somewhere and it's not really a massive problem to create indexes if you have a technical team and can plan your queries to an extent. The way in which people used to use databases Oracle/SqlServer/Ingres etc has changed so much. When I started working with Oracle the whole system was centered around the database, but a lot has changed especially with cloud and (micro)service approach. I actually quite liked how things used to be, it worked well for the most part, but scaling was difficult. It's still difficult, but for a different reason.


> I created a new DB architecture that is radically different than anything I have seen. It is a set of performant key-value stores that can be grouped together to form relational tables. The data within each store is optimized for search so you have just one copy of each data point (no separate indexes). It can handle large tables (100 million+ rows and 2000+ columns) and has a flexible schema so it can handle data like Json that includes arrays for each column. The system is in open beta at <www.Didgets.com> so anyone can download and try it.

Honestly ... this sounds like Splunk (which leverages MongoDB, among other things)


Any new data technology that can manipulate relational data will 'sound like' a lot of things. A Tesla sounds a lot like a Honda if you never look under the hood.

All I am asking is for people to approach this technology with an open mind and take a few minutes to watch a couple demo videos or download the software and try it before they dismiss it out of hand because it seems at first glance to be similar to something else.


Explain how what you've built isn't one of the myriad options folks here have pointed-out that it "sounds like"

...so far you haven't :)


The Didgets system is a general-purpose data manager that can store unstructured data (file data); highly structured data (relational DB data); and semi-structured data (NoSQL data) all within a single system.

The same key-value stores that are used to form relational tables are also used to attach meta-data tags to file objects. So you can create many relational tables in one of our containers (a pod) along with importing a hundred million files.

Just like a relational query where you want to look at every row in a table where the firstName column value starts with the letter 'A', you can also query the system for every JPEG file where any folder in its path contains 'Microsoft', for example.

The system can return the rows in the table that match more quickly than other DB systems; but it can also show you all the files that match hundreds of times faster than a conventional file system.

I have loaded in DB tables that have hundreds of millions of rows. I have loaded in tables with thousands of columns. I have loaded in over 200 million files. The system can drill down to subsets of these objects faster and easier than any system I have used before.

I have worked with many file systems and databases. I have built tools (e.g. PartitionMagic and Drive Image) that manipulate data. I think I have built something that would appeal to other data enthusiasts like myself, but getting them to try it out for themselves is a real challenge.


>but it can also show you all the files that match hundreds of times faster than a conventional file system

I call baloney

You can't return data faster than the filesystem unless it's all in RAM


I didn't say that I could read the actual file contents faster than a FS. My system still takes about the same amount of time to read in a 1GB file off disk as the regular file API.

What I did say is that searching for all files that meet specific criteria (e.g. files with a .jpg extension, larger than 2GB, and whose parent folder contains the word 'Microsoft' in it) is much, much faster on my system.

If you go through the hassle of indexing all your files using one of the OS vendor's preferred indexing tools, the gap can be narrowed, but I think my system is still faster.


>What I did say is that searching for all files that meet specific criteria (e.g. files with a .jpg extension, larger than 2GB, and whose parent folder contains the word 'Microsoft' in it) is much, much faster on my system.

Baloney

Unless you preprocess your filesystem into some kind of flatfile or in-memory map, you cannot beat the filesystem for access speed


File systems perform 2 main functions. The first is to allocate blocks and maintain the integrity of the data stream (file contents) of each file. While Didgets has a few algorithms that I think are superior for allocating and tracking extents for file contents; the benefits in this area are just a few percentage points at best (nothing to write home about).

The second is to create and maintain metadata about each file. This includes file names, date and time stamps, access permissions, and extra metadata like tags. This is where Didgets greatly excels over conventional systems. We store file metadata completely differently than file systems do. Their designs are decades old and very inefficient in many cases.

Didgets uses the same key-value stores for attaching metadata tags to each file as we use to create columns in relational database tables. Finding files with certain tags is extremely fast as is just finding files of a certain type (e.g. documents or photos).
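A rough sketch of that kind of metadata query, with entirely invented file records and tag-column names: each attribute lives in its own small key-value store (file id to tag value), so a multi-criteria search never touches file contents, only the metadata stores.

```python
# Hypothetical sample data; not Didgets' actual format.
files = {
    1: {"name": "logo.jpg", "size": 3_000_000_000, "path": "/data/Microsoft/img"},
    2: {"name": "notes.txt", "size": 1_200, "path": "/home/docs"},
    3: {"name": "scan.jpg", "size": 2_500_000_000, "path": "/archive/Microsoft"},
}

# Build one "tag column" per attribute: file id -> tag value.
ext_col = {fid: meta["name"].rsplit(".", 1)[-1] for fid, meta in files.items()}
size_col = {fid: meta["size"] for fid, meta in files.items()}
path_col = {fid: meta["path"] for fid, meta in files.items()}

# Files with a .jpg extension, larger than ~2GB, under a 'Microsoft' folder.
matches = [
    fid for fid in files
    if ext_col[fid] == "jpg"
    and size_col[fid] > 2_000_000_000
    and "Microsoft" in path_col[fid]
]
print(matches)
```

The point is that the search cost scales with the size of the metadata columns, not with the number or size of the files themselves.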


Didgets is not some abstraction between a file system and applications where it just caches data in RAM. It is designed to replace file systems as an on-disk storage solution.

Yes, you import files into Didgets just like you would copy files from one file system to a new one; but the data is stored on disk in its new format. Just like other file systems do when you mount the volume from disk, Didgets will cache some system structures in memory to speed things up; but our on-disk structures are much smaller so we can read and cache what we need faster and using less memory to do it.

As I pointed out earlier, the system is designed to manage DB tables as well as unstructured file data so it is much more than just a different kind of file system.

It seems obvious from your comments that you have no intention of downloading and actually trying the software so it seems pointless to try and address all your objections. If on the other hand you try it and run into a real problem instead of just a perceived one...then I will be happy to help.


> Didgets is not some abstraction between a file system and applications where it just caches data in RAM. It is designed to replace file systems as an on-disk storage solution.

But that's not what you claimed until just now - and it flies in the face of other claims

Is it a "database"? Or a filesystem replacement? If you "import files into Didgets just like you would copy files from one file system to a new one", then it's sitting on top of a file system and acts as a database (whether it's actually new or not is up for debate, but so far...it doesn't sound "new" to me, based on everything you've said)

"data is stored on disk in its new format" - not from what you've described elsewhere, where you claim you're just putting metadata around file names for faster searching (eg of file name attributes of binary objects like images)

If you've built a file system, no one in their right mind is going to rely on it for a very long time - based on your demo being a Windows application where you need to import CSVs etc into it...it's pretty apparent it's not a file system (at least, it's not a native file system - it's still a file format hosted on the underlying OS' access to whatever file system(s) it's using)

If you've built a database or filesystem mapper, someone may (or may not) want to use it - but so far you've provided nothing compelling in any of your descriptions to sound actually new ...you've described core features of plenty of existing tools/concepts - columnar databases, slocate, kvstores, ELK, Splunk, BeFS, etc


>It seems obvious from your comments that you have no intention of downloading and actually trying the software so it seems pointless to try and address all your objections. If on the other hand you try it and run into a real problem instead of just a perceived one...then I will be happy to help.

My comments are aimed at your claims that either don't hold water, or where you try to defend your work as something brand new, when clearly other things already do what you describe :)


So ... you're preprocessing the filesystem's contents into another file and/or keeping it in memory

>Didgets uses the same key-value stores for attaching metadata tags to each file as we use to create columns in relational database tables. Finding files with certain tags is extremely fast as is just finding files of a certain type (e.g. documents or photos).

Just like I said

You are not exceeding the performance of the filesystem, you're hiding the filesystem behind your preprocessed abstraction

Like locate does on *nix (https://en.wikipedia.org/wiki/Locate_(Unix))


You just described Splunk :)


>I created a new DB architecture that is radically different than anything I have seen. It is a set of performant key-value stores that can be grouped together to form relational tables.

So ... MongoDB + Neo4j

Maybe one or the other without the combination?


Help some data scientist and get fame.



