February 25, 2024
 · 
4 min read

Dataplex: Google Cloud Platform’s Data Catalog


Dataplex is a comprehensive data platform designed to simplify and optimise the way organisations handle their data.

In the ever-evolving landscape of data management, the ability to harness information effectively is becoming more and more important for organisations aiming to stay ahead of the curve.

However, navigating the vast amount of data available can be a daunting task without the appropriate tools. This is where data management and data catalogs come into play.

In this blog, we will look at Dataplex and Data Catalog, Google Cloud Platform's (GCP) solutions for managing data sources, big and small.

What is Dataplex?

Dataplex is a service within GCP that unifies distributed data and automates data management and governance for that data. In simpler terms, it's like having a highly organised and efficient virtual space to store, access, and work with all your data from different sources. This is very useful for enterprises that already have large numbers of datasets and need one place to bring their data together.

Dataplex Functions:

  • Build a domain-specific data mesh across data that's stored in multiple Google Cloud projects, without any data movement
  • Consistently govern and monitor data with a single set of permissions
  • Explore and organise information about your data stored in different places using the ‘Data Catalog’
  • Run data quality tests and data lifecycle management tasks
  • Securely query metadata by using BigQuery and open source tools, such as SparkSQL, Presto, and HiveQL

This means that you can pull in data from various sources, offering flexibility and scalability. This includes but is not limited to:

  • Google Cloud Storage Buckets
  • Streaming Data Sources: Dataplex supports real-time streaming data, which is useful for applications where up-to-the-minute information is critical.
  • Data Warehouses: Whether you're using Amazon Redshift, Google BigQuery, or Snowflake, Dataplex can bring in data from your data warehouse. This is beneficial for organisations leveraging these powerful analytics and storage solutions.
  • On-Premises Databases: If you have an on-premises database, Dataplex can connect to it directly and pull its datasets in.
  • Data Lakes
  • Database Management Systems (e.g. MySQL, PostgreSQL)
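
To make the "without any data movement" idea a little more concrete, here is a minimal Python sketch of how Dataplex resources are named. A lake contains zones, a zone contains assets, and each asset simply points at a Cloud Storage bucket or BigQuery dataset that stays where it is. The project, lake, zone, bucket and dataset names below are all hypothetical, and the `Asset` class is only an illustration, not part of any Google library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """Illustrative stand-in for a Dataplex asset: a pointer to data, not a copy."""
    asset_id: str
    resource_type: str  # "STORAGE_BUCKET" or "BIGQUERY_DATASET"
    resource_name: str  # where the underlying data actually lives


def bucket_resource(project: str, bucket: str) -> str:
    """Resource name of a Cloud Storage bucket attached to a zone."""
    return f"projects/{project}/buckets/{bucket}"


def dataset_resource(project: str, dataset: str) -> str:
    """Resource name of a BigQuery dataset attached to a zone."""
    return f"projects/{project}/datasets/{dataset}"


def asset_path(project: str, location: str, lake: str, zone: str, asset_id: str) -> str:
    """Full Dataplex name of an asset inside a lake and zone."""
    return (
        f"projects/{project}/locations/{location}"
        f"/lakes/{lake}/zones/{zone}/assets/{asset_id}"
    )


# Hypothetical example: one "sales" lake in a governance project, whose
# assets live in two other projects. Nothing is copied between them.
sales_assets = [
    Asset("raw-orders", "STORAGE_BUCKET",
          bucket_resource("retail-prod", "orders-landing")),
    Asset("curated-orders", "BIGQUERY_DATASET",
          dataset_resource("analytics-prod", "orders")),
]

for asset in sales_assets:
    path = asset_path("governance-hub", "europe-west2", "sales", "curated", asset.asset_id)
    print(path, "->", asset.resource_name)
```

The point of the sketch is that the lake lives in one project while the data it governs stays in others, which is what lets a data mesh span multiple projects.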

Now that we know what Dataplex is and what it can be used for, let’s look at one of its most prominent features: the Data Catalog.

Dataplex Data Catalog

Data Catalog is a fully managed, scalable metadata management service within Dataplex.

A Data Catalog is a fundamental asset for organisations, especially those dealing with vast and diverse datasets coming from different sources like the ones above. 

As most organisations today are dealing with a large and growing number of data assets, the Data Catalog can help solve several challenges faced by data stakeholders (consumers, producers, and administrators):

  • Searching for insightful data
  • Understanding data
  • Making data useful

A Data Catalog serves as a centralised hub, empowering organisations to:

  • Attain a cohesive view of their data, reducing the effort of finding specific datasets
  • Facilitate data-driven decision-making, deriving insights quickly by enriching data with both technical and business metadata
  • Enhance data management practices, boosting operational efficiency and overall productivity
  • Establish control over data, fostering increased trust and confidence in its reliability

[Diagram: data sources in Dataplex]

An advantage of Data Catalog is that you can connect not only assets hosted on GCP but also non-GCP data assets such as Oracle, SQL databases, and Redshift.

Once connected, users can use the Data Catalog to search for datasets, run tasks on them, or manage them, all without any data movement.
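
As a rough illustration of what searching looks like in code, the sketch below builds a Data Catalog search query and runs it with the `google-cloud-datacatalog` Python client. It assumes that library is installed and that you are authenticated against a project; the helper names `build_search_query` and `search_catalog`, and the example keyword, are our own and purely illustrative.

```python
def build_search_query(keyword=None, entry_type=None, system=None, column=None):
    """Combine optional filters into one Data Catalog search string.

    Terms separated by spaces are combined with a logical AND.
    """
    parts = []
    if keyword:
        parts.append(keyword)
    if entry_type:
        parts.append(f"type={entry_type}")   # e.g. type=table
    if system:
        parts.append(f"system={system}")     # e.g. system=bigquery
    if column:
        parts.append(f"column:{column}")     # match on a column name
    return " ".join(parts)


def search_catalog(project_id, query):
    """Run the query against one project and return the matching entry names.

    Imported lazily so the query builder can be used without the client library.
    """
    from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(project_id)
    results = client.search_catalog(scope=scope, query=query)
    return [result.relative_resource_name for result in results]


# Hypothetical search: BigQuery tables whose metadata mentions "orders".
query = build_search_query(keyword="orders", entry_type="table", system="bigquery")
print(query)
```

Calling `search_catalog("my-project", query)` with a real project ID would return the names of matching entries; only metadata is read, so the underlying data is never moved or copied.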

The Data Catalog lets you tag data entries with metadata, both technical and business, so that people across the organisation can understand and trust what they find.
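
Here is a hedged sketch of what attaching a tag might look like with the `google-cloud-datacatalog` Python client. It assumes a tag template with matching fields already exists; the field names (`data_owner`, `contains_pii`, `row_count`) and the helpers `to_tag_fields` and `attach_tag` are hypothetical and only there for illustration.

```python
def to_tag_fields(values):
    """Map plain Python values onto the typed value slots a tag field offers."""
    fields = {}
    for field_id, value in values.items():
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            fields[field_id] = {"bool_value": value}
        elif isinstance(value, (int, float)):
            fields[field_id] = {"double_value": float(value)}
        else:
            fields[field_id] = {"string_value": str(value)}
    return fields


def attach_tag(entry_name, template_name, values):
    """Attach a tag to an existing catalog entry.

    Assumes the template named by template_name already defines these fields.
    Imported lazily so to_tag_fields can be used without the client library.
    """
    from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

    client = datacatalog_v1.DataCatalogClient()
    tag = datacatalog_v1.Tag()
    tag.template = template_name
    for field_id, kwargs in to_tag_fields(values).items():
        tag.fields[field_id] = datacatalog_v1.TagField(**kwargs)
    return client.create_tag(parent=entry_name, tag=tag)


# Hypothetical mix of business metadata (owner) and technical metadata (row count).
example = to_tag_fields({"data_owner": "finance", "contains_pii": True, "row_count": 1200})
print(example)
```

Once tags like these are in place, they become searchable, so a business user can look for everything owned by a given team without knowing where it is stored.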

Connecting datasets to Data Catalog in Dataplex

So, how do I connect my datasets to the Data Catalog?

If you already use Google Cloud to store your data, it’s simple! For a given project, Data Catalog automatically catalogs Google Cloud assets such as BigQuery datasets; Dataplex lakes, zones, and tables; Analytics Hub linked datasets; and Vertex AI models and datasets.
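
Because these assets are cataloged automatically, you can look up an existing entry rather than create one. The sketch below, which assumes the `google-cloud-datacatalog` Python client is installed and authenticated, builds the linked resource name for a BigQuery dataset or table and fetches its catalog entry. The project, dataset and table names are hypothetical, as are the helper names.

```python
def bigquery_linked_resource(project, dataset, table=None):
    """Build the linked resource name Data Catalog uses for a BigQuery asset."""
    name = f"//bigquery.googleapis.com/projects/{project}/datasets/{dataset}"
    if table:
        name += f"/tables/{table}"
    return name


def lookup_entry(linked_resource):
    """Fetch the automatically created catalog entry for a Google Cloud asset.

    Imported lazily so the name builder can be used without the client library.
    """
    from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

    client = datacatalog_v1.DataCatalogClient()
    return client.lookup_entry(request={"linked_resource": linked_resource})


# Hypothetical names, for illustration only.
resource = bigquery_linked_resource("analytics-prod", "orders", "daily_totals")
print(resource)
```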

You can access the Data Catalog directly from the Google Cloud Console.

However, if the datasets you want to catalog are not already on Google Cloud Platform, you can still use public community-contributed connectors or build on the Data Catalog API.
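
For the "build on the Data Catalog API" route, a minimal sketch might look like the following. It registers a custom entry for a non-GCP source using the `google-cloud-datacatalog` Python client, and assumes the library is installed, you are authenticated, and the entry group already exists. The system and type labels, the source name and the helper names are all hypothetical.

```python
import re


def to_entry_id(source_name):
    """Turn a source name into an ID made of letters, digits and underscores.

    Entry IDs must start with a letter or underscore and stay within 64 characters.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", source_name)
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned[:64]


def register_custom_entry(project_id, location, entry_group_id, source_name,
                          system="onprem_postgres", entry_type="onprem_table"):
    """Create a catalog entry describing a dataset that lives outside GCP.

    Only metadata is written to the catalog; the data itself stays in place.
    Imported lazily so to_entry_id can be used without the client library.
    """
    from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

    client = datacatalog_v1.DataCatalogClient()
    entry = datacatalog_v1.Entry()
    entry.display_name = source_name
    entry.user_specified_system = system    # hypothetical label for the source system
    entry.user_specified_type = entry_type  # hypothetical label for the asset type
    parent = f"projects/{project_id}/locations/{location}/entryGroups/{entry_group_id}"
    return client.create_entry(parent=parent, entry_id=to_entry_id(source_name), entry=entry)


print(to_entry_id("sales.orders-2024"))
```

Community connectors typically wrap this kind of logic for you, reading metadata from the source system and creating the entries in bulk.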

Whether you are already on GCP or not, our expert data team can help you get set up on Dataplex Data Catalog, so you and your team can start navigating your organisational data with ease.

© Cobry Ltd | 0333 789 0102
24 Sandyford Place, Glasgow, Scotland, UK, G3 7NG
167/169 Great Portland Street, 5th Floor, London, W1W 5PF