How the GitLab Duo Agent Platform Is Revolutionizing DataOps
Manually creating dbt models is a time-consuming process that can eat up hours of a data engineer's day. Especially when there are no (significant) business transformations involved, it is hardly the most engaging part of an engineer's data work.
But what if you could automate the entire process? In this tutorial, I'll show how the GitLab Duo Agent Platform can generate complete dbt models in minutes, including proper structure, tests, and documentation.
What we're building
Our marketing team wants to manage and optimize its advertising investments effectively. One of those ad platforms is Reddit, so we extract data from the Reddit Ads API into Snowflake, our enterprise data platform. At GitLab, we use three storage layers:
- Raw layer - the first landing place for unprocessed data from external sources; not yet ready for business use
- Prep layer - the first transformation layer, containing the source models; still not ready for general business use
- Prod layer - the final transformed data, ready for business use and Tableau reporting
In this tutorial, the data has already landed in the raw layer via Fivetran, our extraction solution, and we will generate the dbt models that move it through the prep layer into the prod layer.
Without writing a single line of dbt code ourselves, by the end of this tutorial we will have:
- Source models in the prep layer
- Workspace models in the prod layer
- A complete dbt configuration for all 13 tables (112 columns) in the Reddit Ads dataset
- Test queries to validate the results
The whole process takes less than 10 minutes, where doing it manually typically takes hours. Here are the steps to follow:
1. Prepare the data structure
Before GitLab Duo can generate our models, it needs to understand the complete table structure. The key is running a query against Snowflake's information schema:
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable,
    CASE WHEN is_nullable = 'NO' THEN 'PRIMARY_KEY' ELSE NULL END AS key_type
FROM raw.information_schema.columns
WHERE table_schema = 'REDDIT_ADS'
ORDER BY table_name, ordinal_position;
This query captures:
- All table and column names
- Data types for proper model structure
- Nullability constraints
- Primary key identification (the non-null columns in this dataset)
Pro tip: In the Reddit Ads dataset, all non-null columns serve as primary keys; it is a consistent pattern. I verified this by checking tables like ad_group, which has two non-null columns (account_id and id) that together act as the primary key. Running the query returned 112 rows of metadata, which I exported to a CSV file for model generation.
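If you want to repeat that spot check yourself, a minimal sketch (the information schema query above, narrowed to a single table) looks like this:

-- Spot check: list the non-null (primary key) columns for one table
SELECT column_name, data_type
FROM raw.information_schema.columns
WHERE table_schema = 'REDDIT_ADS'
  AND table_name = 'AD_GROUP'
  AND is_nullable = 'NO'
ORDER BY ordinal_position;

For ad_group, this should return exactly the two composite key columns, account_id and id.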
2. Set up GitLab Duo
There are two ways to interact with GitLab Duo:
- The chat feature in the web UI
- The Visual Studio Code extension
I chose the VS Code extension, because it lets me run the dbt models locally to test them.
3. Enter the "magic" prompt
Here is the exact prompt I used to generate all of the dbt code:
Create dbt models for all the tables in the file structure.csv. I want to have the source models created, with a filter that dedupes the data based on the primary key. Create these in a new folder reddit_ads.
I want to have workspace models created and store these in the workspace_marketing schema. Take this MR as example: [I've referenced a previous source implementation]. Here is the same done for Source A, but now it needs to be done for Reddit Ads. Please check the dbt style guide when creating the code: https://handbook.gitlab.com/handbook/enterprise-data/platform/dbt-guide/
The key elements that make this prompt effective:
- A clear specification of the source and workspace models
- A reference to a previous, similar merge request as an example
- A style guide reference to ensure code quality and consistency
- Specific schema targeting for proper organization
4. How GitLab Duo processes it
After I submitted the prompt, GitLab Duo got to work. The entire generation process took a few minutes, during which GitLab Duo:
- Read and analyzed the CSV input file
- Examined the table structures in the metadata
- Referenced our dbt style guide for coding standards
- Considered the similar merge request to structure everything correctly
- Generated source models for all 13 tables
- Created workspace models for all 13 tables
- Generated the supporting dbt files:
  - The sources.yml configuration
  - schema.yml files with tests and documentation
  - An updated dbt_project.yml with the schema references
The results
The output was impressive:
- 1 modified file: dbt_project.yml (with the reddit_ads schema configuration added)
- 29 new files:
  - 26 dbt models (13 source + 13 workspace)
  - 3 YAML files
- Nearly 900 lines of code generated automatically
- Built-in data tests, including uniqueness constraints on the primary key columns
- Generic descriptions for all models and columns
- Proper deduplication logic in the source models
- Clean, consistent code structure that follows the GitLab dbt style guide
transform/snowflake-dbt/
├── dbt_project.yml [MODIFIED]
└── models/
    ├── sources/
    │   └── reddit_ads/
    │       ├── reddit_ads_ad_group_source.sql [NEW]
    │       ├── reddit_ads_ad_source.sql [NEW]
    │       ├── reddit_ads_business_account_source.sql [NEW]
    │       ├── reddit_ads_campaign_source.sql [NEW]
    │       ├── reddit_ads_custom_audience_history_source.sql [NEW]
    │       ├── reddit_ads_geolocation_source.sql [NEW]
    │       ├── reddit_ads_interest_source.sql [NEW]
    │       ├── reddit_ads_targeting_community_source.sql [NEW]
    │       ├── reddit_ads_targeting_custom_audience_source.sql [NEW]
    │       ├── reddit_ads_targeting_device_source.sql [NEW]
    │       ├── reddit_ads_targeting_geolocation_source.sql [NEW]
    │       ├── reddit_ads_targeting_interest_source.sql [NEW]
    │       ├── reddit_ads_time_zone_source.sql [NEW]
    │       ├── schema.yml [NEW]
    │       └── sources.yml [NEW]
    └── workspaces/
        └── workspace_marketing/
            └── reddit_ads/
                ├── schema.yml [NEW]
                ├── wk_reddit_ads_ad.sql [NEW]
                ├── wk_reddit_ads_ad_group.sql [NEW]
                ├── wk_reddit_ads_business_account.sql [NEW]
                ├── wk_reddit_ads_campaign.sql [NEW]
                ├── wk_reddit_ads_custom_audience_history.sql [NEW]
                ├── wk_reddit_ads_geolocation.sql [NEW]
                ├── wk_reddit_ads_interest.sql [NEW]
                ├── wk_reddit_ads_targeting_community.sql [NEW]
                ├── wk_reddit_ads_targeting_custom_audience.sql [NEW]
                ├── wk_reddit_ads_targeting_device.sql [NEW]
                ├── wk_reddit_ads_targeting_geolocation.sql [NEW]
                ├── wk_reddit_ads_targeting_interest.sql [NEW]
                └── wk_reddit_ads_time_zone.sql [NEW]
Generated code examples
Here is an example of the quality of the generated code. For the time_zone table, GitLab Duo created:
Prep layer source model
WITH source AS (

    SELECT *
    FROM {{ source('reddit_ads', 'time_zone') }}
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _fivetran_synced DESC) = 1

), renamed AS (

    SELECT
        id::VARCHAR                 AS time_zone_id,
        code::VARCHAR               AS time_zone_code,
        dst_offset::NUMBER          AS time_zone_dst_offset,
        is_dst_active::BOOLEAN      AS is_time_zone_dst_active,
        name::VARCHAR               AS time_zone_name,
        offset::NUMBER              AS time_zone_offset,
        _fivetran_synced::TIMESTAMP AS fivetran_synced_at
    FROM source

)

SELECT * FROM renamed
schema.yml
models:
  - name: reddit_ads_time_zone_source
    description: Time zone data from Reddit Ads system
    columns:
      - name: time_zone_id
        description: Unique identifier for time zone records
        data_tests:
          - unique
          - not_null
      - name: time_zone_code
        description: Code for the time zone
      - name: time_zone_dst_offset
        description: Daylight saving time offset for the time zone
      - name: is_time_zone_dst_active
        description: Flag indicating if daylight saving time is active
      - name: time_zone_name
        description: Name of the time zone
      - name: time_zone_offset
        description: Offset for the time zone
      - name: fivetran_synced_at
        description: Timestamp when the record was last synced by Fivetran
sources.yml
sources:
  - name: reddit_ads
    database: RAW
    schema: reddit_ads
    loaded_at_field: _fivetran_synced
    loader: fivetran
    description: Reddit Ads data
    quoting:
      database: true
      schema: false
      identifier: false
    tables:
      - name: time_zone
Workspace model
WITH source AS (

    SELECT *
    FROM {{ ref('reddit_ads_time_zone_source') }}

)

SELECT * FROM source
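The time_zone example covers a single-column key. For composite-key tables such as ad_group, the generated dedup filter would presumably partition on both key columns. The post doesn't show that variant, so here is a hedged sketch that follows the same pattern:

-- Hypothetical composite-key variant of the generated dedup filter,
-- using ad_group's two key columns (account_id and id)
SELECT *
FROM {{ source('reddit_ads', 'ad_group') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY account_id, id
    ORDER BY _fivetran_synced DESC
) = 1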
5. Validate the quality
With the code looking good, I pushed it to an MR and ran the CI test pipeline to test the code and validate the results. I then asked GitLab Duo to create a validation query:
Create a test query to test the row counts between the raw layer and the workspace layer. Keep in mind that we do deduplication, so we can compare both using distinct on the primary keys.
The AI generated a comprehensive validation query that:
- Compares row counts between the raw and workspace layers
- Accounts for the deduplication logic
- Tests all 13 tables
- Calculates a data retention percentage
The generated SQL test query
-- Reddit Ads Row Count Validation Test
-- Compares distinct counts between RAW layer and WORKSPACE_MARKETING layer
-- Accounts for deduplication logic in source models

WITH raw_counts AS (

    -- Single primary key tables
    SELECT 'ad' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.AD

    UNION ALL

    SELECT 'business_account' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.BUSINESS_ACCOUNT

    UNION ALL

    SELECT 'campaign' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.CAMPAIGN

    UNION ALL

    SELECT 'custom_audience_history' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.CUSTOM_AUDIENCE_HISTORY

    UNION ALL

    SELECT 'geolocation' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.GEOLOCATION

    UNION ALL

    SELECT 'interest' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.INTEREST

    UNION ALL

    SELECT 'time_zone' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.TIME_ZONE

    -- Composite primary key tables
    UNION ALL

    SELECT 'ad_group' AS table_name, COUNT(DISTINCT CONCAT(account_id, '|', id)) AS raw_count
    FROM RAW.REDDIT_ADS.AD_GROUP

    UNION ALL

    SELECT 'targeting_community' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', community_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_COMMUNITY

    UNION ALL

    SELECT 'targeting_custom_audience' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', custom_audience_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_CUSTOM_AUDIENCE

    UNION ALL

    SELECT 'targeting_device' AS table_name, COUNT(DISTINCT _fivetran_id) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_DEVICE

    UNION ALL

    SELECT 'targeting_geolocation' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', geolocation_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_GEOLOCATION

    UNION ALL

    SELECT 'targeting_interest' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', interest_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_INTEREST

),

workspace_counts AS (

    -- Workspace layer counts using primary keys from schema.yml
    SELECT 'ad' AS table_name, COUNT(DISTINCT ad_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_AD

    UNION ALL

    SELECT 'business_account' AS table_name, COUNT(DISTINCT business_account_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_BUSINESS_ACCOUNT

    UNION ALL

    SELECT 'campaign' AS table_name, COUNT(DISTINCT campaign_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_CAMPAIGN

    UNION ALL

    SELECT 'custom_audience_history' AS table_name, COUNT(DISTINCT custom_audience_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_CUSTOM_AUDIENCE_HISTORY

    UNION ALL

    SELECT 'geolocation' AS table_name, COUNT(DISTINCT geolocation_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_GEOLOCATION

    UNION ALL

    SELECT 'interest' AS table_name, COUNT(DISTINCT interest_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_INTEREST

    UNION ALL

    SELECT 'time_zone' AS table_name, COUNT(DISTINCT time_zone_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TIME_ZONE

    -- Composite primary key tables
    UNION ALL

    SELECT 'ad_group' AS table_name, COUNT(DISTINCT CONCAT(ad_group_account_id, '|', ad_group_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_AD_GROUP

    UNION ALL

    SELECT 'targeting_community' AS table_name, COUNT(DISTINCT CONCAT(targeting_community_ad_group_id, '|', targeting_community_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_COMMUNITY

    UNION ALL

    SELECT 'targeting_custom_audience' AS table_name, COUNT(DISTINCT CONCAT(targeting_custom_audience_ad_group_id, '|', targeting_custom_audience_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_CUSTOM_AUDIENCE

    UNION ALL

    SELECT 'targeting_device' AS table_name, COUNT(DISTINCT targeting_device_fivetran_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_DEVICE

    UNION ALL

    SELECT 'targeting_geolocation' AS table_name, COUNT(DISTINCT CONCAT(targeting_geolocation_ad_group_id, '|', targeting_geolocation_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_GEOLOCATION

    UNION ALL

    SELECT 'targeting_interest' AS table_name, COUNT(DISTINCT CONCAT(targeting_interest_ad_group_id, '|', targeting_interest_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_INTEREST

)

-- Final comparison with validation results
SELECT
    r.table_name,
    r.raw_count,
    w.workspace_count,
    r.raw_count - w.workspace_count AS count_difference,
    CASE
        WHEN r.raw_count = w.workspace_count THEN '✅ PASS'
        WHEN r.raw_count > w.workspace_count THEN '⚠️ RAW > WORKSPACE (Expected due to deduplication)'
        ELSE '❌ FAIL - WORKSPACE > RAW (Unexpected)'
    END AS validation_status,
    ROUND((w.workspace_count::FLOAT / r.raw_count::FLOAT) * 100, 2) AS data_retention_percentage
FROM raw_counts r
JOIN workspace_counts w ON r.table_name = w.table_name
ORDER BY r.table_name;
Running this query showed:
- Zero difference in row counts after deduplication
- 100% data retention across all tables
- All tests passing successfully
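As a complementary check, a set difference can surface content drift that aggregate counts alone would miss. Here is a sketch under the same naming assumptions as the query above, with the raw key cast to VARCHAR to match the source model's rename:

-- Hypothetical spot check: raw keys that never made it into the workspace model
SELECT DISTINCT id::VARCHAR AS time_zone_id
FROM RAW.REDDIT_ADS.TIME_ZONE
EXCEPT
SELECT time_zone_id
FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TIME_ZONE;

An empty result means every deduplicated raw key is represented downstream.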
The bottom line: massive time savings
Traditional approach: 6-8 hours of manual coding, testing, and debugging
GitLab Duo approach: 6-8 minutes of generation, plus review time
That is roughly a 60x improvement in development efficiency (from 6-8 hours down to 6-8 minutes) while maintaining high code quality.
Best practices for success
Based on this experience, here are my key recommendations:
Prepare your metadata
- Extract the complete table structures, including data types and constraints
- Identify primary keys and relationships up front
- Export a clean, well-formatted CSV input file
Note: If you connect GitLab Duo to your (meta)data via MCP, you can skip this manual step entirely.
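For the export itself, one possible approach (a sketch that assumes a Snowflake user stage; adjust the stage and path to your environment) is to unload the metadata query straight to CSV:

-- Unload the information schema query to a single CSV file on the user stage
COPY INTO @~/structure.csv
FROM (
    SELECT
        table_name,
        column_name,
        data_type,
        is_nullable,
        CASE WHEN is_nullable = 'NO' THEN 'PRIMARY_KEY' END AS key_type
    FROM raw.information_schema.columns
    WHERE table_schema = 'REDDIT_ADS'
    ORDER BY table_name, ordinal_position
)
FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE FIELD_OPTIONALLY_ENCLOSED_BY = '"')
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE;
-- Then download locally, e.g. with SnowSQL: GET @~/structure.csv file:///<local-dir>/;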
Provide clear context
- Reference an existing example MR whenever possible
- Specify your coding standards and style guides
- Be explicit about folder structure and naming conventions
Validate thoroughly
- Always create validation queries for data integrity
- Test locally before merging
- Run the CI/CD pipeline to catch any issues
Use AI for follow-up tasks
- Generate test queries automatically
- Create documentation templates
- Build validation scripts
What's next
This demonstration shows how AI-powered development tools like GitLab Duo are transforming data engineering workflows as well. The ability to generate hundreds of lines of production-ready code in minutes, complete with tests, documentation, and proper structure, represents a fundamental shift in how we approach repetitive development tasks.
By letting AI handle the repetitive aspects of dbt model creation, data engineers can focus on higher-value activities such as data modeling strategy, performance optimization, and business logic implementation.
Ready to try it yourself? Start with a small dataset, prepare your metadata carefully, and watch GitLab Duo turn hours of work into minutes of automated generation.