How the GitLab Duo Agent Platform Is Revolutionizing DataOps
Manually creating dbt models is a time-consuming process that can eat up hours of a data engineer's day. Especially when there are no (significant) business transformations involved, it is hardly the most engaging part of an engineer's data work.
But what if you could automate the entire process? In this tutorial, I'll show how the GitLab Duo Agent Platform can generate complete dbt models in minutes, including proper structure, tests, and documentation.
What we're building
Our marketing team wants to manage and optimize its advertising investments effectively. One of those ad platforms is Reddit, so we extract data from the Reddit Ads API into Snowflake, our enterprise data platform. At GitLab, we use three storage layers:
- Raw layer - the first landing place for unprocessed data from external sources; not yet ready for business use
- Prep layer - the first transformation layer, containing the source models; still not ready for general business use
- Prod layer - the final transformed data, ready for business use and Tableau reporting
In this tutorial, the data has already landed in the raw layer via Fivetran, our extraction solution, and we will generate the dbt models that move it through the prep layer into the prod layer.
Without writing a single line of dbt code ourselves, by the end of this tutorial we will have:
- Source models in the prep layer
- Workspace models in the prod layer
- A complete dbt configuration for all 13 tables (112 columns) in the Reddit Ads dataset
- Test queries to validate the results
The whole process takes less than 10 minutes, where doing it manually typically takes hours. Here are the steps to follow:
1. Prepare the data structure
Before GitLab Duo can generate our models, it needs to understand the complete table structure. The key is running a query against Snowflake's information schema:
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable,
    CASE WHEN is_nullable = 'NO' THEN 'PRIMARY_KEY' ELSE NULL END AS key_type
FROM raw.information_schema.columns
WHERE table_schema = 'REDDIT_ADS'
ORDER BY table_name, ordinal_position;
This query captures:
- All table and column names
- Data types for proper model structure
- Nullability constraints
- Primary key identification (the non-null columns in this dataset)
Pro tip: In the Reddit Ads dataset, all non-null columns serve as primary keys; it is a consistent pattern. I verified this by checking tables like ad_group, which has two non-null columns (account_id and id) that together act as the primary key. Running the query returned 112 rows of metadata, which I exported to a CSV file for model generation.
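If you want to repeat that spot check yourself, a minimal sketch (the information schema query above, narrowed to a single table) looks like this:

-- Spot check: list the non-null (primary key) columns for one table
SELECT column_name, data_type
FROM raw.information_schema.columns
WHERE table_schema = 'REDDIT_ADS'
  AND table_name = 'AD_GROUP'
  AND is_nullable = 'NO'
ORDER BY ordinal_position;

For ad_group, this should return exactly the two composite key columns, account_id and id.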
2. Set up GitLab Duo
There are two ways to interact with GitLab Duo:
- The chat feature in the web UI
- The Visual Studio Code extension
I chose the VS Code extension, because it lets me run the dbt models locally to test them.
3. Enter the "magic" prompt
Here is the exact prompt I used to generate all of the dbt code:
Create dbt models for all the tables in the file structure.csv. I want to have the source models created, with a filter that dedupes the data based on the primary key. Create these in a new folder reddit_ads.
I want to have workspace models created and store these in the workspace_marketing schema. Take this MR as example: [I've referenced a previous source implementation]. Here is the same done for Source A, but now it needs to be done for Reddit Ads. Please check the dbt style guide when creating the code: https://handbook.gitlab.com/handbook/enterprise-data/platform/dbt-guide/
The key elements that make this prompt effective:
- A clear specification of the source and workspace models
- A reference to a previous, similar merge request as an example
- A style guide reference to ensure code quality and consistency
- Specific schema targeting for proper organization
4. How GitLab Duo processes it
After I submitted the prompt, GitLab Duo got to work. The entire generation process took a few minutes, during which GitLab Duo:
- Read and analyzed the CSV input file
- Examined the table structures in the metadata
- Referenced our dbt style guide for coding standards
- Considered the similar merge request to structure everything correctly
- Generated source models for all 13 tables
- Created workspace models for all 13 tables
- Generated the supporting dbt files:
  - The sources.yml configuration
  - schema.yml files with tests and documentation
  - An updated dbt_project.yml with the schema references
The results
The output was impressive:
- 1 modified file: dbt_project.yml (with the reddit_ads schema configuration added)
- 29 new files:
  - 26 dbt models (13 source + 13 workspace)
  - 3 YAML files
- Nearly 900 lines of code generated automatically
- Built-in data tests, including uniqueness constraints on the primary key columns
- Generic descriptions for all models and columns
- Proper deduplication logic in the source models
- Clean, consistent code structure that follows the GitLab dbt style guide
transform/snowflake-dbt/
├── dbt_project.yml [MODIFIED]
└── models/
    ├── sources/
    │   └── reddit_ads/
    │       ├── reddit_ads_ad_group_source.sql [NEW]
    │       ├── reddit_ads_ad_source.sql [NEW]
    │       ├── reddit_ads_business_account_source.sql [NEW]
    │       ├── reddit_ads_campaign_source.sql [NEW]
    │       ├── reddit_ads_custom_audience_history_source.sql [NEW]
    │       ├── reddit_ads_geolocation_source.sql [NEW]
    │       ├── reddit_ads_interest_source.sql [NEW]
    │       ├── reddit_ads_targeting_community_source.sql [NEW]
    │       ├── reddit_ads_targeting_custom_audience_source.sql [NEW]
    │       ├── reddit_ads_targeting_device_source.sql [NEW]
    │       ├── reddit_ads_targeting_geolocation_source.sql [NEW]
    │       ├── reddit_ads_targeting_interest_source.sql [NEW]
    │       ├── reddit_ads_time_zone_source.sql [NEW]
    │       ├── schema.yml [NEW]
    │       └── sources.yml [NEW]
    └── workspaces/
        └── workspace_marketing/
            └── reddit_ads/
                ├── schema.yml [NEW]
                ├── wk_reddit_ads_ad.sql [NEW]
                ├── wk_reddit_ads_ad_group.sql [NEW]
                ├── wk_reddit_ads_business_account.sql [NEW]
                ├── wk_reddit_ads_campaign.sql [NEW]
                ├── wk_reddit_ads_custom_audience_history.sql [NEW]
                ├── wk_reddit_ads_geolocation.sql [NEW]
                ├── wk_reddit_ads_interest.sql [NEW]
                ├── wk_reddit_ads_targeting_community.sql [NEW]
                ├── wk_reddit_ads_targeting_custom_audience.sql [NEW]
                ├── wk_reddit_ads_targeting_device.sql [NEW]
                ├── wk_reddit_ads_targeting_geolocation.sql [NEW]
                ├── wk_reddit_ads_targeting_interest.sql [NEW]
                └── wk_reddit_ads_time_zone.sql [NEW]
Generated code examples
Here is an example of the quality of the generated code. For the time_zone table, GitLab Duo created:
Prep layer source model
WITH source AS (

    SELECT *
    FROM {{ source('reddit_ads', 'time_zone') }}
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _fivetran_synced DESC) = 1

), renamed AS (

    SELECT
        id::VARCHAR                 AS time_zone_id,
        code::VARCHAR               AS time_zone_code,
        dst_offset::NUMBER          AS time_zone_dst_offset,
        is_dst_active::BOOLEAN      AS is_time_zone_dst_active,
        name::VARCHAR               AS time_zone_name,
        offset::NUMBER              AS time_zone_offset,
        _fivetran_synced::TIMESTAMP AS fivetran_synced_at
    FROM source

)

SELECT * FROM renamed
schema.yml
models:
  - name: reddit_ads_time_zone_source
    description: Time zone data from Reddit Ads system
    columns:
      - name: time_zone_id
        description: Unique identifier for time zone records
        data_tests:
          - unique
          - not_null
      - name: time_zone_code
        description: Code for the time zone
      - name: time_zone_dst_offset
        description: Daylight saving time offset for the time zone
      - name: is_time_zone_dst_active
        description: Flag indicating if daylight saving time is active
      - name: time_zone_name
        description: Name of the time zone
      - name: time_zone_offset
        description: Offset for the time zone
      - name: fivetran_synced_at
        description: Timestamp when the record was last synced by Fivetran
sources.yml
sources:
  - name: reddit_ads
    database: RAW
    schema: reddit_ads
    loaded_at_field: _fivetran_synced
    loader: fivetran
    description: Reddit Ads data
    quoting:
      database: true
      schema: false
      identifier: false
    tables:
      - name: time_zone
Workspace model
WITH source AS (

    SELECT *
    FROM {{ ref('reddit_ads_time_zone_source') }}

)

SELECT * FROM source
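The time_zone example covers a single-column key. For composite-key tables such as ad_group, the generated dedup filter would presumably partition on both key columns. The post doesn't show that variant, so here is a hedged sketch that follows the same pattern:

-- Hypothetical composite-key variant of the generated dedup filter,
-- using ad_group's two key columns (account_id and id)
SELECT *
FROM {{ source('reddit_ads', 'ad_group') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY account_id, id
    ORDER BY _fivetran_synced DESC
) = 1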
5. Validate the quality
With the code looking good, I pushed it to an MR and ran the CI test pipeline to test the code and validate the results. I then asked GitLab Duo to create a validation query:
Create a test query to test the row counts between the raw layer and the workspace layer. Keep in mind that we do deduplication, so we can compare both using distinct on the primary keys.
The AI generated a comprehensive validation query that:
- Compares row counts between the raw and workspace layers
- Accounts for the deduplication logic
- Tests all 13 tables
- Calculates a data retention percentage
The generated SQL test query
-- Reddit Ads Row Count Validation Test
-- Compares distinct counts between RAW layer and WORKSPACE_MARKETING layer
-- Accounts for deduplication logic in source models

WITH raw_counts AS (

    -- Single primary key tables
    SELECT 'ad' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.AD

    UNION ALL

    SELECT 'business_account' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.BUSINESS_ACCOUNT

    UNION ALL

    SELECT 'campaign' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.CAMPAIGN

    UNION ALL

    SELECT 'custom_audience_history' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.CUSTOM_AUDIENCE_HISTORY

    UNION ALL

    SELECT 'geolocation' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.GEOLOCATION

    UNION ALL

    SELECT 'interest' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.INTEREST

    UNION ALL

    SELECT 'time_zone' AS table_name, COUNT(DISTINCT id) AS raw_count
    FROM RAW.REDDIT_ADS.TIME_ZONE

    -- Composite primary key tables
    UNION ALL

    SELECT 'ad_group' AS table_name, COUNT(DISTINCT CONCAT(account_id, '|', id)) AS raw_count
    FROM RAW.REDDIT_ADS.AD_GROUP

    UNION ALL

    SELECT 'targeting_community' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', community_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_COMMUNITY

    UNION ALL

    SELECT 'targeting_custom_audience' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', custom_audience_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_CUSTOM_AUDIENCE

    UNION ALL

    SELECT 'targeting_device' AS table_name, COUNT(DISTINCT _fivetran_id) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_DEVICE

    UNION ALL

    SELECT 'targeting_geolocation' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', geolocation_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_GEOLOCATION

    UNION ALL

    SELECT 'targeting_interest' AS table_name, COUNT(DISTINCT CONCAT(ad_group_id, '|', interest_id)) AS raw_count
    FROM RAW.REDDIT_ADS.TARGETING_INTEREST

),

workspace_counts AS (

    -- Workspace layer counts using primary keys from schema.yml
    SELECT 'ad' AS table_name, COUNT(DISTINCT ad_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_AD

    UNION ALL

    SELECT 'business_account' AS table_name, COUNT(DISTINCT business_account_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_BUSINESS_ACCOUNT

    UNION ALL

    SELECT 'campaign' AS table_name, COUNT(DISTINCT campaign_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_CAMPAIGN

    UNION ALL

    SELECT 'custom_audience_history' AS table_name, COUNT(DISTINCT custom_audience_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_CUSTOM_AUDIENCE_HISTORY

    UNION ALL

    SELECT 'geolocation' AS table_name, COUNT(DISTINCT geolocation_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_GEOLOCATION

    UNION ALL

    SELECT 'interest' AS table_name, COUNT(DISTINCT interest_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_INTEREST

    UNION ALL

    SELECT 'time_zone' AS table_name, COUNT(DISTINCT time_zone_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TIME_ZONE

    -- Composite primary key tables
    UNION ALL

    SELECT 'ad_group' AS table_name, COUNT(DISTINCT CONCAT(ad_group_account_id, '|', ad_group_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_AD_GROUP

    UNION ALL

    SELECT 'targeting_community' AS table_name, COUNT(DISTINCT CONCAT(targeting_community_ad_group_id, '|', targeting_community_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_COMMUNITY

    UNION ALL

    SELECT 'targeting_custom_audience' AS table_name, COUNT(DISTINCT CONCAT(targeting_custom_audience_ad_group_id, '|', targeting_custom_audience_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_CUSTOM_AUDIENCE

    UNION ALL

    SELECT 'targeting_device' AS table_name, COUNT(DISTINCT targeting_device_fivetran_id) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_DEVICE

    UNION ALL

    SELECT 'targeting_geolocation' AS table_name, COUNT(DISTINCT CONCAT(targeting_geolocation_ad_group_id, '|', targeting_geolocation_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_GEOLOCATION

    UNION ALL

    SELECT 'targeting_interest' AS table_name, COUNT(DISTINCT CONCAT(targeting_interest_ad_group_id, '|', targeting_interest_id)) AS workspace_count
    FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TARGETING_INTEREST

)

-- Final comparison with validation results
SELECT
    r.table_name,
    r.raw_count,
    w.workspace_count,
    r.raw_count - w.workspace_count AS count_difference,
    CASE
        WHEN r.raw_count = w.workspace_count THEN '✅ PASS'
        WHEN r.raw_count > w.workspace_count THEN '⚠️ RAW > WORKSPACE (Expected due to deduplication)'
        ELSE '❌ FAIL - WORKSPACE > RAW (Unexpected)'
    END AS validation_status,
    ROUND((w.workspace_count::FLOAT / r.raw_count::FLOAT) * 100, 2) AS data_retention_percentage
FROM raw_counts r
JOIN workspace_counts w ON r.table_name = w.table_name
ORDER BY r.table_name;
Running this query showed:
- Zero difference in row counts after deduplication
- 100% data retention across all tables
- All tests passing successfully
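As a complementary check, a set difference can surface content drift that aggregate counts alone would miss. Here is a sketch under the same naming assumptions as the query above, with the raw key cast to VARCHAR to match the source model's rename:

-- Hypothetical spot check: raw keys that never made it into the workspace model
SELECT DISTINCT id::VARCHAR AS time_zone_id
FROM RAW.REDDIT_ADS.TIME_ZONE
EXCEPT
SELECT time_zone_id
FROM REDDIT_DBT_MODEL_GENERATION_PROD.WORKSPACE_MARKETING.WK_REDDIT_ADS_TIME_ZONE;

An empty result means every deduplicated raw key is represented downstream.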
The bottom line: massive time savings
Traditional approach: 6-8 hours of manual coding, testing, and debugging
GitLab Duo approach: 6-8 minutes of generation, plus review time
That is roughly a 60x improvement in development efficiency (from 6-8 hours down to 6-8 minutes) while maintaining high code quality.
Best practices for success
Based on this experience, here are my key recommendations:
Prepare your metadata
- Extract the complete table structures, including data types and constraints
- Identify primary keys and relationships up front
- Export a clean, well-formatted CSV input file
Note: If you connect GitLab Duo to your (meta)data via MCP, you can skip this manual step entirely.
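For the export itself, one possible approach (a sketch that assumes a Snowflake user stage; adjust the stage and path to your environment) is to unload the metadata query straight to CSV:

-- Unload the information schema query to a single CSV file on the user stage
COPY INTO @~/structure.csv
FROM (
    SELECT
        table_name,
        column_name,
        data_type,
        is_nullable,
        CASE WHEN is_nullable = 'NO' THEN 'PRIMARY_KEY' END AS key_type
    FROM raw.information_schema.columns
    WHERE table_schema = 'REDDIT_ADS'
    ORDER BY table_name, ordinal_position
)
FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE FIELD_OPTIONALLY_ENCLOSED_BY = '"')
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE;
-- Then download locally, e.g. with SnowSQL: GET @~/structure.csv file:///<local-dir>/;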
Provide clear context
- Reference an existing example MR whenever possible
- Specify your coding standards and style guides
- Be explicit about folder structure and naming conventions
Validate thoroughly
- Always create validation queries for data integrity
- Test locally before merging
- Run the CI/CD pipeline to catch any issues
Use AI for follow-up tasks
- Generate test queries automatically
- Create documentation templates
- Build validation scripts
What's next
This demonstration shows how AI-powered development tools like GitLab Duo are transforming data engineering workflows as well. The ability to generate hundreds of lines of production-ready code in minutes, complete with tests, documentation, and proper structure, represents a fundamental shift in how we approach repetitive development tasks.
By letting AI handle the repetitive aspects of dbt model creation, data engineers can focus on higher-value activities such as data modeling strategy, performance optimization, and business logic implementation.
Ready to try it yourself? Start with a small dataset, prepare your metadata carefully, and watch GitLab Duo turn hours of work into minutes of automated generation.