How to Manage Schema Validation in a Data Lake Using Data Version Control

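It is not uncommon for a data team to depend on many "third parties" that send it data. These parties often change the schema of that data without any communication, or tell the data team too late.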

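Whenever this happens, data pipelines break and the data team has to repair the data lake. This is a manual process full of heavy lifting. In the typical case, the teams involved end up passing the blame around, each trying to prove that the schema has changed.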

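As they mature, however, data teams realize that it is wiser to simply block schema changes altogether, in an automated continuous integration (CI)/continuous delivery (CD) fashion.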

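Schema changes and schema validation cause data teams a lot of pain, but there are solutions on the market that help with this problem, and fortunately some of them are open source.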

This is a step-by-step tutorial on solving the schema validation problem with lakeFS, an open-source data version control tool.

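Schema validation lets users create validation rules for the data lake, such as allowed data types and value ranges. It guarantees that the data saved in the lake follows an established schema that describes the structure, format, and constraints of the data.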
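To make "allowed data types and value ranges" concrete, here is a minimal sketch in plain Python. The column names, the rules, and the `validate_record` helper are hypothetical and exist only for illustration; they are not part of lakeFS.

```python
# Hypothetical rules: column name -> (expected type, optional (min, max) range).
RULES = {
    "customer_id": (int, (1, float("inf"))),
    "age": (int, (0, 150)),
    "country": (str, None),
}


def validate_record(record, rules=RULES):
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []
    for column, (expected_type, value_range) in rules.items():
        if column not in record:
            errors.append(f"missing column: {column}")
            continue
        value = record[column]
        if not isinstance(value, expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
            continue
        if value_range is not None:
            low, high = value_range
            if not low <= value <= high:
                errors.append(f"{column}: {value} outside [{low}, {high}]")
    # Columns that the rules do not know about are reported as well.
    for column in sorted(record.keys() - rules.keys()):
        errors.append(f"unexpected column: {column}")
    return errors


print(validate_record({"customer_id": 7, "age": 200, "country": "DE"}))
```

A real data lake would apply such rules to file schemas (for example Parquet metadata) rather than to individual records, but the principle is the same: data that violates the rules is rejected before it is accepted.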

This is a problem that has to be solved: without quick action, inconsistencies and errors will show up during data processing.

Why do you need to deal with schema validation?

It is worth taking the time to manage schemas properly.

Handling schemas in a data lake is not smooth sailing

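In a data warehouse, users work with a rigid data model and a strict schema. Data lakes are the opposite: in most cases they end up containing a wide variety of data sources.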

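Why does this matter? Because in a data lake the schema definition can differ from one data source to another, and it can change over time as new data is added. That makes enforcing a uniform schema across all the data in the lake a huge challenge. If this problem is left unsolved, data processing issues will follow.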
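The kind of drift described here can be illustrated with a small comparison of two schemas. This is a sketch in plain Python; the schemas are represented as hypothetical column-name-to-type dictionaries, and `diff_schemas` is an illustrative helper rather than a library function.

```python
def diff_schemas(expected, actual):
    """Compare two {column: type} mappings and report how they differ."""
    added = sorted(actual.keys() - expected.keys())
    removed = sorted(expected.keys() - actual.keys())
    changed = sorted(
        column
        for column in expected.keys() & actual.keys()
        if expected[column] != actual[column]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Two hypothetical sources that are supposed to describe the same entity.
source_a = {"customer_id": "int64", "name": "string", "signup_date": "date"}
source_b = {"customer_id": "string", "name": "string", "country": "string"}

drift = diff_schemas(source_a, source_b)
print(drift)
# {'added': ['country'], 'removed': ['signup_date'], 'changed': ['customer_id']}
```

Any non-empty entry in the result is a signal that downstream jobs reading both sources with one schema may fail or silently misread data.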

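But that is not all. The growing complexity of the data pipelines built on top of data lakes makes it hard to maintain one consistent schema. A pipeline can include multiple processes and transformations, each requiring its own schema definition.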

The schema may change as data is processed and modified, so it is hard to ensure schema validation across the entire pipeline.

This is where a version control system comes in handy.

Data version control for implementing schema validation in a data lake

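lakeFS is an open-source tool that turns a data lake into a Git-like repository, letting users manage it the way software engineers manage code. That is what data version control is all about.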

Like other source control systems, lakeFS has a feature called hooks: custom scripts or programs that the lakeFS platform can run in response to specified events or actions.

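These events include committing changes, merging branches, creating new branches, adding or deleting tags, and so on. For example, when a merge is triggered, a pre-merge hook runs on the source branch before the merge completes.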
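The hook mechanism can be pictured with a small in-memory model. The sketch below is plain Python and only illustrates the idea of "run registered checks before an operation, and abort if one fails"; `HookRegistry`, `HookRejected`, `merge`, and the example hook are hypothetical names and do not reflect how lakeFS is implemented.

```python
from collections import defaultdict


class HookRejected(Exception):
    """Raised by a hook to veto the operation that triggered it."""


class HookRegistry:
    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, event, hook):
        self._hooks[event].append(hook)

    def run(self, event, context):
        # Any exception raised by a hook propagates and aborts the caller.
        for hook in self._hooks[event]:
            hook(context)


def merge(registry, source, target, log):
    """Run the pre-merge hooks first; record the merge only if all pass."""
    registry.run("pre-merge", {"source": source, "target": target})
    log.append(f"merged {source} into {target}")


def reject_experimental_sources(context):
    # Illustrative policy: branches under experimental/ may not be merged.
    if context["source"].startswith("experimental/"):
        raise HookRejected(f"{context['source']} may not be merged")


registry = HookRegistry()
registry.register("pre-merge", reject_experimental_sources)
log = []
merge(registry, "ingestion_branch", "main", log)
print(log)
```

In the tutorial, the check that plays the role of `reject_experimental_sources` is a schema comparison, so a merge that would change the schema never reaches the production branch.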

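How does this apply to schema validation? Users can create a pre-merge hook that verifies that the schema of the Parquet files is the same as the current schema.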

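In this scenario, a Delta table is created in an ingestion branch and merged into production. The table's schema is then changed and the merge is attempted again, simulating the process of promoting data to production.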

First, set some global variables and install the packages used in this example. Everything runs in a Python notebook.

Once the lakeFS credentials are set, start by creating a few global variables that hold the repository and branch names:

    repo = "schema-validation-example-repo"
    mainBranch = "main"
    ingestionBranch = "ingestion_branch"

Each lakeFS repository needs its own storage namespace, so create one as well:

    storageNamespace = 's3://'  # e.g. "s3://username-lakefs-cloud/"

This example uses AWS S3 storage. For everything to work, your storage must be configured to work with lakeFS, which supports AWS, Azure, Google Cloud, and on-premises object stores such as MinIO.

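If you run lakeFS in the cloud, you can link it to your storage by copying the storage namespace of the sample repository and appending a string to it. Starting from the namespace that lakeFS Cloud provides for the sample repository, the configuration looks like this: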

    storageNamespace = 's3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:2ae87b7718e5bb16573c021e542dd0ec429b7ccc1a4f9d0e3f17d6ee99253655/my_random_string'
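Appending a unique string to a base namespace is easy to get wrong by hand (doubled or missing slashes), so it can be wrapped in a small helper. This is a sketch in plain Python; `build_storage_namespace` is a hypothetical helper and the bucket name in the example is a placeholder.

```python
import uuid
from typing import Optional


def build_storage_namespace(base_namespace: str, suffix: Optional[str] = None) -> str:
    """Append a suffix (random if omitted) to an s3:// base namespace."""
    scheme = "s3://"
    if not base_namespace.startswith(scheme):
        raise ValueError("expected a namespace starting with s3://")
    bucket_and_prefix = base_namespace[len(scheme):].strip("/")
    if not bucket_and_prefix:
        raise ValueError("the namespace must include a bucket name")
    suffix = suffix or uuid.uuid4().hex
    return f"{scheme}{bucket_and_prefix}/{suffix}"


# Placeholder bucket and prefix, for illustration only.
print(build_storage_namespace("s3://example-bucket/sample-prefix/", "my_random_string"))
# s3://example-bucket/sample-prefix/my_random_string
```

Using a random suffix by default keeps every new repository in its own prefix.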

The notebook uses Python code, so the lakeFS Python client package must be imported as well:

    import os

    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient
    from pyspark.sql.types import ByteType, IntegerType, LongType, StringType, StructType, StructField

    %xmode Minimal
    if 'client' not in locals():
        # lakeFS credentials and endpoint
        configuration = lakefs_client.Configuration()
        configuration.username = lakefsAccessKey
        configuration.password = lakefsSecretKey
        configuration.host = lakefsEndPoint

        client = LakeFSClient(configuration)
        print("Created lakeFS client.")
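The client configuration relies on `lakefsAccessKey`, `lakefsSecretKey`, and `lakefsEndPoint`, whose definitions are not shown in this tutorial. One possible way to supply them without hard-coding secrets in the notebook is to read them from environment variables. This is a sketch; the variable names `LAKEFS_ACCESS_KEY`, `LAKEFS_SECRET_KEY`, and `LAKEFS_ENDPOINT` are assumptions chosen for this example, not names that lakeFS requires.

```python
import os


def load_lakefs_credentials(env=os.environ):
    """Read (access key, secret key, endpoint) from a mapping of settings.

    The variable names below are hypothetical; adapt them to your setup.
    """
    names = ("LAKEFS_ACCESS_KEY", "LAKEFS_SECRET_KEY", "LAKEFS_ENDPOINT")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return tuple(env[name] for name in names)


# In the notebook, before creating the client:
# lakefsAccessKey, lakefsSecretKey, lakefsEndPoint = load_lakefs_credentials()
```

Failing early with a list of the missing settings is easier to debug than an authentication error raised later by the client.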

This example creates Delta tables, so the following packages need to be included:

    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--packages io.delta:delta-core_2.12:2.0.0 '
        '--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" '
        '--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" '
        'pyspark-shell'
    )

lakeFS exposes an S3 gateway, which lets applications talk to lakeFS the same way they talk to S3. To configure the gateway, do the following:

    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession

    sc = SparkContext.getOrCreate()
    spark = SparkSession(sc)

    # Point the S3A filesystem at the lakeFS S3 gateway
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", lakefsAccessKey)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", lakefsSecretKey)
    sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", lakefsEndPoint)
    sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
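Once Spark talks to the gateway, objects are addressed as if the repository were the bucket and the branch (or another ref) were the first path component. A small helper makes that convention explicit. This is a sketch; `lakefs_s3a_path` is a hypothetical helper name, and the repository, branch, and table location are the ones used elsewhere in this tutorial.

```python
def lakefs_s3a_path(repo: str, ref: str, path: str) -> str:
    """Build an s3a:// URI of the form s3a://<repository>/<ref>/<path>."""
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"


table_path = lakefs_s3a_path(
    "schema-validation-example-repo", "ingestion_branch", "tables/customers/"
)
print(table_path)
# s3a://schema-validation-example-repo/ingestion_branch/tables/customers/
```

A Spark read or write against such a path then operates on that branch only, which is what keeps the ingestion branch isolated from production until the merge.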

You are now ready to use lakeFS version control at scale in the notebook.

Next, create the repository with the Python client:

    client.repositories.create_repository(
        repository_creation=models.RepositoryCreation(
            name=repo,
            storage_namespace=storageNamespace,
            default_branch=mainBranch))

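In this case, a pre-merge hook is used to ensure that the schema has not changed. Action files must be committed to the lakeFS repository under the _lakefs_actions/ prefix. Failure to parse an action file will result in a failed run.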

Commit the following hook configuration action file, pre-merge-schema-validation.yaml:

    # Parquet schema Validator
    # Args:
    #  - locations (list of strings): locations to look for parquet files under
    #  - sample (boolean): whether reading one new/changed file per directory is enough, or go through all of them
    # Example hook declaration: (_lakefs_actions/pre-merge-schema-validation.yaml):
    name: pre merge checks on main branch
    on:
      pre-merge:
        branches:
          - main
    hooks:
      - id: check_schema_changes
        type: lua
        properties:
          script_path: scripts/parquet_schema_change.lua # location of this script in the repository
          args:
            sample: false
            locations:
              - tables/customers/

This file is located in the LuaHooks subfolder of the samples repository. It must be committed to the lakeFS repository under the _lakefs_actions folder:

    hooks_config_yaml = "pre-merge-schema-validation.yaml"
    hooks_prefix = "_lakefs_actions"

    with open(f'./LuaHooks/{hooks_config_yaml}', 'rb') as f:
        client.objects.upload_object(repository=repo,
                                     branch=mainBranch,
                                     path=f'{hooks_prefix}/{hooks_config_yaml}',
                                     content=f)

This just sets up an action that runs scripts/parquet_schema_change.lua before any merge into main.

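Next, create the script itself (parquet_schema_change.lua) and upload it to the scripts directory. As you can see, the hook runs on an embedded Lua VM and does not depend on any other component.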

This file is also located in the LuaHooks subfolder of the samples repository:

    --[[
    Parquet schema validator

    Args:
     - locations (list of strings): locations to look for parquet files under
     - sample (boolean): whether reading one new/changed file per directory is enough, or go through all of them
    ]]

    lakefs = require("lakefs")
    strings = require("strings")
    parquet = require("encoding/parquet")
    regexp = require("regexp")
    path = require("path")

    visited_directories = {}

    for _, location in ipairs(args.locations) do
        after = ""
        has_more = true
        need_more = true
        print("checking location: " .. location)
        while has_more do
            print("running diff, location = " .. location .. " after = " .. after)
            local code, resp = lakefs.diff_refs(action.repository_id, action.branch_id, action.source_ref, after, location)
            if code ~= 200 then
                error("could not diff: " .. resp.message)
            end
            for _, result in pairs(resp.results) do
                p = path.parse(result.path)
                print("checking: '" .. result.path .. "'")
                if not args.sample or (p.parent and not visited_directories[p.parent]) then
                    if result.path_type == "object" and result.type ~= "removed" then
                        if strings.has_suffix(p.base_name, ".parquet") then
                            -- check it!
                            code, content = lakefs.get_object(action.repository_id, action.source_ref, result.path)
                            if code ~= 200 then
                                error("could not fetch …
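The role of the `sample` argument in the script can be illustrated in plain Python. The sketch below mirrors the visible filtering condition: with `sample` enabled, only one Parquet file per directory is selected for checking; otherwise every changed Parquet file is selected. Because the script is cut off above, the point at which a directory is marked as visited is an assumption, and `select_files_to_check` is a hypothetical name.

```python
import posixpath


def select_files_to_check(changed_paths, sample):
    """Pick the changed Parquet files whose schema should be inspected."""
    visited_directories = set()
    selected = []
    for changed_path in changed_paths:
        if not changed_path.endswith(".parquet"):
            continue  # only Parquet files carry a schema worth checking here
        parent = posixpath.dirname(changed_path)
        if sample and parent in visited_directories:
            continue  # one file per directory is enough in sample mode
        # Assumption: a directory counts as visited once one file is selected.
        visited_directories.add(parent)
        selected.append(changed_path)
    return selected


changed = [
    "tables/customers/part-0.parquet",
    "tables/customers/part-1.parquet",
    "tables/customers/_delta_log/0.json",
    "tables/orders/part-0.parquet",
]
print(select_files_to_check(changed, sample=True))
print(select_files_to_check(changed, sample=False))
```

Sampling trades thoroughness for speed: it assumes that all files written to one directory share a schema, which is usually true for a single table but is not guaranteed.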

