From Data Purgatory to Machine Learning Success: Why Mining Companies Need Databases
Mining activity showing multiple benches in an open-pit operation (source: unsplash)
It starts with a handful of files. Then you derive some analysis from those files and create a few more. Later, you fold it all into a bigger desk study, and intermediate files get generated and further processed. You present the desk study, you’re asked to tweak some things, and now you have multiple versions of the same files, all with slightly different values.
Welcome to data purgatory: where you spend more time sifting through project folders looking for one particular dataset from that time you did that analysis, only to find three copies across three different project folders. Which one to use?
With geospatial data, it gets even more complicated. Do I have all the component parts of the Shapefile? Is it in the right projection?
Now you want to start doing machine learning on the data, and the fidelity of that data is as important as the model itself.
Data integrity, or lack thereof, is a major risk that is often overlooked in a company that relies on data to find, extract and sell materials from the earth. This is why mineral exploration companies need good data management.
There comes a point when “Just a Bunch of Shapefiles” is no longer a viable option.
Why Your Data Scientist Can’t Save You
If you’re using a data scientist to find your data, you’re wasting their talents.
As a project and a company grow, they naturally hoard data. Everything from recon geological mapping, stream-sediment geochemistry and soil samples over early targets, through to the historic data that led them to the tenement in the first place, and then on to the drilling data, the environmental baseline data, the mineral rights package (if you’re in the UK) and the constraints layers from local authorities.
These datasets stack up, and the usual storage format is the vintage ESRI Shapefile (.shp), which isn’t just one file but many - lose any of the constituent parts and it can be rendered useless.
There are alternatives such as GeoPackage or GeoJSON. However, they are not mainstream. The Shapefile is king, and it’s one of the biggest weaknesses in your Data Management Plan (do you have one of those?).
If you have a Shapefile but have lost the .dbf, then all the attributes of your data are gone. Lose the .shx file and that’s your index gone - not a complete disaster, but an important part of the dataset. Lose the .prj and your projection and coordinate system info are gone - that can be a real pain.
No data scientist can recreate attribute data if the .dbf is missing. And it’s less than ideal to guess at the projection and coordinate system when the .prj is missing, especially if the dataset is old or inherited from elsewhere.
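If you want a quick sanity check before anything else touches a Shapefile, a few lines of Python will do it. A minimal sketch, using only the standard library (the file path here is hypothetical):

```python
# Minimal pre-flight check for Shapefile sidecar files.
from pathlib import Path

REQUIRED = [".shp", ".shx", ".dbf"]   # the core trio
RECOMMENDED = [".prj"]                # projection info - easy to lose, hard to guess

def check_shapefile(shp_path: str) -> None:
    base = Path(shp_path).with_suffix("")
    for ext in REQUIRED:
        if not base.with_suffix(ext).exists():
            print(f"MISSING (required): {base.name}{ext}")
    for ext in RECOMMENDED:
        if not base.with_suffix(ext).exists():
            print(f"MISSING (recommended): {base.name}{ext}")

check_shapefile("data/soil_samples_v3.shp")  # hypothetical path
```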
All these data issues can be passed off as necessary data wrangling, but they’re not necessary. They waste time, compromise data integrity and risk the fidelity of your machine learning model… and they’re completely avoidable with good data management.
Enter the Matrix… I mean Database
When 80% of the work for a machine learning model lands on data preparation (wrangling), it is a no-brainer to have a strong base from which to work.
That base could take the form of a database.
This is not new technology, nor is it necessarily overly expensive. It is an investment that delivers trickle-down benefits and savings throughout the business.
A small (hundreds of gigabytes) database can cost as little as a few hundred dollars a month. When your geologist spends two hours finding the right file, that might be £100 wasted. Make the reasonable assumption that this happens once a week, across multiple people, and you have a costly problem.
Or worse, your geologist spends three hours finding the wrong file and takes that forward into an analysis that gets presented to the board. What is the cost then?
Good data management starts with creating a system that has a database at its core. The database allows you to store data in one place, without worrying about lots of little files, and to protect it as your single source of truth. It solves many of the issues for spatial data.
The spatial database becomes the go-to place for your people to find the data they need. They can extract it, analyse it, and create new data from it. And it guarantees that the data being used is consistent.
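As a rough illustration of how loose files become database tables, here is a minimal sketch using geopandas and SQLAlchemy (with GeoAlchemy2 installed) against a PostGIS database. The connection details, schema, table and column names are illustrative assumptions, not a prescription:

```python
# Load a loose Shapefile into PostGIS once, then treat the database copy
# as the single source of truth.
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://gis_user:secret@localhost:5432/exploration")

# Read the file once, reproject to the project's agreed CRS,
# and standardise the geometry column name.
soil = (
    gpd.read_file("data/soil_samples_v3.shp")
    .to_crs(epsg=27700)
    .rename_geometry("geom")
)

# Write it into the database, where everyone pulls it from now on.
soil.to_postgis("soil_samples", engine, schema="raw", if_exists="replace", index=False)
```

From that point on, the table in the database is the copy that matters, and the loose file can be archived.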
Now, of course, databases need maintenance: files need updating, or adding, or removing. It is never a finished article, but the great thing about a database is that you can set it up to log changes, meaning you have a record of what’s happened. Who deleted that? Who updated that? When?
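As a sketch of what that logging can look like, here is one way to do it in PostgreSQL with a simple audit table and trigger, run from Python via psycopg2. The database, table and trigger names are illustrative, and a production setup would record far more detail:

```python
# Minimal audit logging: every insert, update or delete on a table is recorded.
import psycopg2

AUDIT_SQL = """
CREATE TABLE IF NOT EXISTS audit_log (
    id          bigserial PRIMARY KEY,
    table_name  text NOT NULL,
    operation   text NOT NULL,                      -- INSERT / UPDATE / DELETE
    changed_by  text NOT NULL DEFAULT current_user,
    changed_at  timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log (table_name, operation)
    VALUES (TG_TABLE_NAME, TG_OP);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS soil_samples_audit ON raw.soil_samples;
CREATE TRIGGER soil_samples_audit
    AFTER INSERT OR UPDATE OR DELETE ON raw.soil_samples
    FOR EACH ROW EXECUTE FUNCTION log_change();
"""

with psycopg2.connect("dbname=exploration user=gis_admin") as conn:
    with conn.cursor() as cur:
        cur.execute(AUDIT_SQL)
```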
There are plenty of settings that let you control who can and can’t do things to the database - that’s beyond the scope of this article - and you can also grant minimal direct access and instead serve data out via a Web Map Service (WMS) or similar.
You can also set up checks that data maintains consistent metadata - that may include forcing a consistent projection or flagging missing values. What you can set up is almost limitless, all while operating within the database itself.
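For example, a couple of CHECK constraints will refuse data in the wrong projection, or with missing assay values, the moment someone tries to load it. A minimal sketch, assuming PostGIS and illustrative table and column names:

```python
# Enforce a consistent projection and a mandatory copper value at load time.
import psycopg2

CHECKS_SQL = """
ALTER TABLE raw.soil_samples
    ADD CONSTRAINT soil_srid_bng   CHECK (ST_SRID(geom) = 27700),
    ADD CONSTRAINT soil_cu_present CHECK (cu_ppm IS NOT NULL);
"""

with psycopg2.connect("dbname=exploration user=gis_admin") as conn:
    with conn.cursor() as cur:
        cur.execute(CHECKS_SQL)
```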
A spatial database should be the Fort Knox of your data management, where you guarantee data integrity and the fidelity of your analysis.
How does this impact my machine learning strategy?
Machine learning for mineral exploration, in most cases, is just fancy pattern recognition - it’s complex and requires a skilled operator, but at the end of the day it is just a really good game of spot the difference. Or perhaps, spot the same thing given a certain likelihood.
The key to machine learning, beyond understanding what the algorithm of choice is doing (no mean feat in itself for some tools), is understanding what your data is doing. What does this piece of data do? How is it used by the model to predict… what? Is that piece of data a valid proxy for this geological process?
A deep dive on machine learning applications for mineral exploration is another article, so let’s assume that the geologist and the data scientist have had a good long chat about the data and everything is hunky dory.
Assuming the general premise of the data model and the machine learning approach are agreed, someone then sends a zip file of all the required input datasets.
There are Shapefiles, there are CSVs, there’s even a GeoPackage and a GeoJSON. OK, this is just classic data wrangling. So the data scientist spends a day or two getting it all into the same format… and then realises it all has different metadata. Some data has no copper analysis, which was critical for the data model. They’re all in different projections. The Shapefile of the geology has serious topological errors, causing overlapping polygons and gaps in the data. A few have “_v3” or “_v4_FINAL” in the file name - does that mean v3 isn’t the final version for that dataset?
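Every one of those problems is detectable up front. A minimal sketch of the kind of checks involved, here run with geopandas over the loose files (file and column names are hypothetical):

```python
# Quick diagnostics on a pile of loose files: projections, topology, missing assays.
import geopandas as gpd

geology = gpd.read_file("zip_dump/geology_v4_FINAL.shp")
soils = gpd.read_file("zip_dump/soils.gpkg")

print("Geology CRS:", geology.crs, "| Soils CRS:", soils.crs)   # mismatched projections?
print("Invalid geology polygons:", (~geology.is_valid).sum())    # topological errors
if "cu_ppm" in soils.columns:
    print("Missing copper analyses:", soils["cu_ppm"].isna().sum())
else:
    print("No copper column at all - a critical input is missing")
```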
These little inconsistencies bloat the time taken for data preparation and, if not fixed, will diminish the fidelity of the modelling process, compromising results in the form of missed opportunities or, worse, false positives.
Every single problem above could be fixed with a database. The data scientist would be able to log in and pull the files directly from the database without requiring a data trawl, and the data would have been checked for errors as it was added to the database.
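In practice that pull can be a few lines: one query against the single source of truth, already in the agreed projection and already the agreed version. A minimal sketch, assuming geopandas and SQLAlchemy, with illustrative connection, table and column names:

```python
# Pull analysis-ready data straight from the spatial database.
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://data_sci:secret@localhost:5432/exploration")

# One query, one projection, one version - no trawl through project folders.
soil = gpd.read_postgis(
    "SELECT sample_id, cu_ppm, geom FROM raw.soil_samples",
    engine,
    geom_col="geom",
)
print(soil.crs, len(soil))
```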
If these problems aren’t addressed, or are unfixable, the machine learning approach will be less robust, perhaps even fundamentally flawed. All because of poor data management.
Databases are your friend
If you’ve read this far, you’re probably wondering how to start this journey out of data purgatory. Well, it’s all relatively straightforward with open-source solutions such as PostgreSQL/PostGIS and GeoServer.
Open-source software is free, but that doesn’t mean there’s no cost. Infrastructure costs money, and these software packages need to sit on a server. But the overall costs are minimal compared to mainstream, proprietary solutions.
It also needs bespoke skills in database management and SQL that are not common among geologists. If you are unsure, this is something Neo Geo Consulting has substantial experience with and, with a geoscience background, is happy to help.
It is worth noting that if you have designs on machine learning beyond mineral exploration, the PostgreSQL stack is still a solid choice: it forms the basis of many artificial intelligence data stacks, serving as a vector database for Large Language Models.
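For illustration, here is a minimal sketch of PostgreSQL acting as a vector store, assuming the pgvector extension is installed. The table name, content and the tiny three-dimension embeddings are purely illustrative; real embeddings typically have hundreds or thousands of dimensions:

```python
# PostgreSQL as a simple vector store via the pgvector extension.
import psycopg2

with psycopg2.connect("dbname=exploration user=gis_admin") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS report_chunks (
                id        bigserial PRIMARY KEY,
                content   text,
                embedding vector(3)   -- illustrative; real models use far more dimensions
            );
        """)
        cur.execute(
            "INSERT INTO report_chunks (content, embedding) VALUES (%s, %s::vector);",
            ("Illustrative snippet of report text.", "[0.12, -0.03, 0.88]"),
        )
        # Nearest-neighbour search: <=> is pgvector's cosine distance operator.
        cur.execute(
            "SELECT content FROM report_chunks ORDER BY embedding <=> %s::vector LIMIT 5;",
            ("[0.10, 0.00, 0.90]",),
        )
        print(cur.fetchall())
```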
Ultimately, if you want to create robust data analysis, you need to put some effort into your data management system, and you will be hard pressed to find a better companion than a spatial database.
Lead the way to better data in mineral exploration
As I mentioned in my previous post on UK mining policy, successful jurisdictions like Finland and Australia do this at scale, delivering accessible, standardised datasets that are freely available - and there’s a database system at their heart.
At a company scale, geologists are notoriously bad at standardising data and are often worse at managing that data. The first step is getting everything into a database and putting some careful thought into the metadata and standards that your project needs.
So, if you’re interested in reducing your risk of incomplete, out-of-date or missing data, think about a spatial database solution.
If you’d like to talk more about how to build a database for your company, or want to chat through data management plans in your business, reach out on LinkedIn or via our website at www.neogeo.uk/contact