
Software Engineering Assignment 2: Individual Project

news 2025/9/21 23:20:58

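Course this assignment belongs to	https://edu.cnblogs.com/campus/gdgy/Class12Grade23ComputerScience
Assignment requirements	https://edu.cnblogs.com/campus/gdgy/Class12Grade23ComputerScience/homework/13468
Goal of this assignment	Implement a paper plagiarism checker, follow a standard software development process, and practice individual project development
GitHub link	https://github.com/jslisten/3123004378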

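1. PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning	10	20
Estimate	Estimate how long the task will take	60	90
Development	Development	260	360
Analysis	Requirements analysis (including learning new technologies)	30	40
Design Spec	Write the design document	40	45
Design Review	Design review	20	20
Coding Standard	Coding standard (define suitable conventions for this project)	10	15
Design	Detailed design	30	40
Coding	Implementation	180	300
Code Review	Code review	30	25
Test	Testing (self-test, fix code, commit changes)	20	30
Reporting	Reporting	80	110
Test Report	Test report	20	25
Size Measurement	Effort measurement	60	60
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	40	30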

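2. Design and Implementation of the Computation Module Interface

Module	Function(s)	Purpose
Core	main()	Controls the overall flow
Reading	extract_file_info(file_path), load_document(file_path)	Reads the information and content of the specified file
Preprocessing	string preprocessText(const string& text)	Removes punctuation and correctly handles Chinese characters, letters, and digits
Similarity	compute_lcs_length(str1, str2), calculate_similarity(lcs_len, base_str_len)	Computes the LCS length, then derives the final similarity from it
Output	generate_report(orig_info, plag_info, similarity), save_report(report_content, output_path)	Builds the report and writes it to the specified file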

3. Class and Function Relationship Diagram

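2. Key algorithm implementation:
Core algorithm: the longest common subsequence (LCS) algorithm is used to compute text similarity.
Basic principle: dynamic programming finds the longest subsequence common to the two texts, and its length measures how similar they are. With dp[i][0] = dp[0][j] = 0:
$$dp[i][j] = \begin{cases} dp[i-1][j-1] + 1 & \text{if } a[i-1] = b[j-1] \\ \max(dp[i-1][j],\ dp[i][j-1]) & \text{otherwise} \end{cases}$$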
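Similarity formula: similarity = LCS length / length of the plagiarized text × 100%. The source code below passes len(plag_content) as the denominator, which is why a text produced only by deleting characters from the original scores 100%.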
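As a small worked instance of the recurrence and the ratio, the sketch below fills the table for two short strings. The names `lcs_table` and `similarity_percent` are illustrative and not taken from the project; using the plagiarized text length as the denominator is an assumption based on how main.py calls calculate_similarity.

```python
def lcs_table(a, b):
    """Return the full (len(a)+1) x (len(b)+1) LCS table for strings a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp


def similarity_percent(original, plagiarized):
    """LCS length divided by the plagiarized text length, as a percentage."""
    if not plagiarized:
        return 0.0
    return lcs_table(original, plagiarized)[-1][-1] / len(plagiarized) * 100


# The bottom-right cell of the table is the LCS length.
print(lcs_table("ABCBDAB", "BDCABA")[-1][-1])
print(similarity_percent("深度学习需要大量计算资源", "机器学习需要大量 GPU 资源。"))
```

For the second pair the common subsequence is 学习需要大量资源 (8 characters) and the plagiarized text has 16 characters, so the ratio is 8 / 16.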
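Encoding handling:
Wide characters (wchar_t) are used to process multilingual text, which supports Chinese and other Unicode characters.
MultiByteToWideChar converts UTF-8 encoded text to wide characters.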
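For readers following the Python source later in this post, the analogous step in Python is decoding bytes into `str`, which stores Unicode code points. The sketch below only illustrates that idea (the helper name `char_and_byte_lengths` is hypothetical); it shows why the LCS should run on decoded text rather than on raw bytes.

```python
def char_and_byte_lengths(text, encoding="utf-8"):
    """Return (number of characters, number of encoded bytes) for text."""
    raw = text.encode(encoding)
    decoded = raw.decode(encoding)  # bytes -> str, one element per code point
    return len(decoded), len(raw)


# The same six characters occupy a different number of bytes per encoding.
print(char_and_byte_lengths("今天天气晴朗"))         # UTF-8: 3 bytes per character
print(char_and_byte_lengths("今天天气晴朗", "gbk"))  # GBK: 2 bytes per character
```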

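3. Distinctive Design Features

1. Encoding adaptation: Chinese documents from different sources are handled automatically, with no need to specify the encoding by hand.
Typical code hard-codes UTF-8 and produces garbled text or read failures on documents in GBK (the default on Chinese Windows), UTF-16, and other encodings. This design:
Tries multiple encodings automatically: "UTF-8 → GBK → UTF-16 → ISO-8859-1" are tried in that order, which covers the encodings most commonly seen in Chinese text files (for example TXT files saved by older versions of Word, or documents exported by third-party tools);
Tolerates faults: when an encoding does not match, it is skipped; when reading fails, empty content is returned with a message instead of crashing, which helps tell a damaged file apart from an encoding problem.
2. Iterative dynamic programming: a double loop fills the (m+1)×(n+1) DP table without relying on the function call stack, so long texts (such as a full academic paper) cannot hit the recursion depth limit, and the time complexity remains O(mn). The full table still needs O(mn) memory, which is what the rolling-array option in section 4 addresses;
3. Each function has a single responsibility and clear logical boundaries: the work is split into several functions, each doing one thing.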
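The encoding fallback in item 1 can be sketched as follows. This is a minimal illustration and not the project's load_document: the name `read_with_fallback` and the choice to read raw bytes and decode them strictly are assumptions made here so that a mismatched encoding actually raises and the next one is tried.

```python
import os
import tempfile

ENCODINGS = ("utf-8", "gbk", "utf-16", "iso-8859-1")


def read_with_fallback(path, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes strictly.

    Returns ("", None) if the file cannot be opened or nothing decodes.
    """
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
        return "", None
    for enc in encodings:
        try:
            return raw.decode(enc), enc  # strict decoding: mismatches raise
        except UnicodeDecodeError:
            continue                     # skip this encoding, try the next
    return "", None


# Usage: a GBK-encoded file is rejected by UTF-8 and accepted by GBK.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "gbk.txt")
    with open(sample, "wb") as f:
        f.write("今天天气晴朗".encode("gbk"))
    print(read_with_fallback(sample))
```

Because ISO-8859-1 maps every byte to a character, it never fails; placing it last makes it a catch-all that may return garbled text rather than an error.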

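4. Performance Improvements to the Computation Module Interface

get_lcs_space_strategy optimization:
Depending on the input length, it automatically switches between a "full 2D DP table" and a "rolling array" (space complexity O(min(m, n))), balancing computation speed against memory use for inputs ranging from small to very large texts.
An LCSCache class encapsulates the caching logic: the hash of "original + plagiarized text" is the cache key and the computed result is the cache value, so a repeated query returns the cached result without recomputation.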
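The rolling-array variant is not shown in the post, so the following is only a sketch of the technique under that description. The name `lcs_length_rolling` is hypothetical; it keeps two rows sized by the shorter string, which gives the O(min(m, n)) space mentioned above.

```python
def lcs_length_rolling(a, b):
    """LCS length using two rows of min(len(a), len(b)) + 1 cells each."""
    if len(a) < len(b):
        a, b = b, a  # LCS is symmetric, so make b the shorter string
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr  # the finished row becomes the previous row
    return prev[-1]


print(lcs_length_rolling("ABCBDAB", "BDCABA"))
```

Only the LCS length survives in this form; recovering the subsequence itself would need the full table or a divide-and-conquer scheme, which a similarity score does not require.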
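Key optimization code example:
def compute_lcs_length(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] holds the LCS length of str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]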

Time before improvement	Time after improvement	Improvement
64ms	46ms	28.13%
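The LCSCache class is described above but its code is not shown, so the following is a hypothetical sketch of such a cache rather than the project's implementation. The method name `get_or_compute`, the `hits` counter, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class LCSCache:
    """In-memory cache keyed by a hash of the (original, plagiarized) pair."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(original, plagiarized):
        h = hashlib.sha256()
        # Hash each text separately so ("ab", "c") and ("a", "bc") differ
        h.update(hashlib.sha256(original.encode("utf-8")).digest())
        h.update(hashlib.sha256(plagiarized.encode("utf-8")).digest())
        return h.hexdigest()

    def get_or_compute(self, original, plagiarized, compute):
        """Return the cached result, calling compute(original, plagiarized) on a miss."""
        key = self._key(original, plagiarized)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = compute(original, plagiarized)
        self._store[key] = result
        return result


calls = []


def fake_lcs(a, b):
    calls.append((a, b))  # stand-in for the real LCS computation
    return 0


cache = LCSCache()
cache.get_or_compute("abc", "abd", fake_lcs)
cache.get_or_compute("abc", "abd", fake_lcs)  # served from the cache
print(len(calls), cache.hits)
```

A cache like this only pays off when the same pair of texts is checked more than once in one process; a single command-line run compares one pair and would not benefit.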

Source code:
main.py:

import sys
import os

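def validate_arguments(args):
    """Check that the command-line arguments are valid."""
    if len(args) != 4:
        return False, "Wrong number of arguments. Usage: python main.py <original file> <plagiarized file> <report file>"

    # Both input files must exist
    for path in args[1:3]:
        if not os.path.exists(path):
            return False, f"Target file does not exist: {path} (please check the path)"

    # Warn, but do not fail, when the report file will be overwritten
    output_path = args[3]
    if os.path.exists(output_path):
        print(f"Warning: result file {output_path} already exists and will be overwritten!")
    return True, "Arguments validated"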

def extract_file_info(file_path):
    """Return the file name and the file size in bytes."""
    file_name = os.path.basename(file_path)
    file_size = os.path.getsize(file_path)
    return file_name, file_size

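def load_document(file_path):
    """Read a text file, trying several encodings in order; return "" on failure."""
    encodings = ['utf-8', 'gbk', 'utf-16', 'iso-8859-1']

    for encoding in encodings:
        try:
            # Decode strictly: with errors='ignore' the first encoding never
            # raises UnicodeDecodeError, so the fallbacks would never be tried
            with open(file_path, 'r', encoding=encoding) as file:
                return file.read()
        except UnicodeDecodeError:
            continue
        except Exception as e:
            print(f"Failed to read {file_path}: {e}")
            return ""
    return ""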

def compute_lcs_length(str1, str2):
    """Return the length of the longest common subsequence of str1 and str2."""
    m, n = len(str1), len(str2)
    if m == 0 or n == 0:
        return 0

    # dp[i][j] holds the LCS length of str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

def calculate_similarity(lcs_len, base_str_len):
    """Return the similarity as a percentage of the base string length."""
    if base_str_len == 0:
        return 0.0
    return (lcs_len / base_str_len) * 100

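def generate_report(orig_info, plag_info, similarity):
    """Build the text of the plagiarism report."""
    orig_name, orig_size = orig_info
    plag_name, plag_size = plag_info
    # Use true division so the KB value keeps its fractional part
    report_lines = [
        "[Original file]",
        f"  File name: {orig_name}",
        f"  File size: {orig_size} bytes (about {orig_size / 1024:.1f} KB)",
        "",
        "[Plagiarized file]",
        f"  File name: {plag_name}",
        f"  File size: {plag_size} bytes (about {plag_size / 1024:.1f} KB)",
        "",
        "[Result]",
        f"  Similarity of the plagiarized file: {similarity:.2f}%",
    ]
    return "\n".join(report_lines)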

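def save_report(report_content, output_path):
    """Write the report to output_path; return (success, message)."""
    try:
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write(report_content)
        return True, f"Report saved to: {os.path.abspath(output_path)}"
    except Exception as e:
        return False, f"Failed to save report: {str(e)} (please check permissions on the output path)"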

def main():
    valid, msg = validate_arguments(sys.argv)
    if not valid:
        print(f"Argument error: {msg}")
        sys.exit(1)

    orig_path, plag_path, output_path = sys.argv[1], sys.argv[2], sys.argv[3]
    orig_info = extract_file_info(orig_path)
    plag_info = extract_file_info(plag_path)

    orig_content = load_document(orig_path)
    plag_content = load_document(plag_path)
    if not orig_content:
        print(f"Failed to read original file {orig_info[0]}")
        sys.exit(1)
    if not plag_content:
        print(f"Failed to read plagiarized file {plag_info[0]}")
        sys.exit(1)

    lcs_len = compute_lcs_length(orig_content, plag_content)
    # The plagiarized text length is the denominator of the similarity
    similarity = calculate_similarity(lcs_len, len(plag_content))

    report = generate_report(orig_info, plag_info, similarity)
    save_success, save_msg = save_report(report, output_path)
    print(save_msg)  # success or failure message from save_report

    print("Report content:")
    print("-" * 50)
    print(report)
    print("-" * 50)
    print("Plagiarism check finished")


if __name__ == "__main__":
    main()

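5. Computation Module Tests

Identical texts
Original: 今天天气晴朗，适合户外运动。

Plagiarized version: 今天天气晴朗，适合户外运动。

Expected similarity: 100%
Partially modified text
Original: 深度学习需要大量计算资源

Plagiarized version: 机器学习需要大量 GPU 资源。

Expected similarity: about 50%
Provided test text:
Original file name: orig

Plagiarized file name: orig_0.8_del

Similarity computed by the program: 100.00%
Text with an empty side
Original:

Plagiarized version: 今天阳光明媚

Similarity reported by the program: 0.00%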
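The cases above can be written as executable checks. The sketch below re-implements the LCS and the ratio locally so that it runs on its own; `lcs_length` and `similarity` are stand-ins for compute_lcs_length and calculate_similarity, and the deletion case uses a made-up short string because the orig and orig_0.8_del files are not reproduced here.

```python
def lcs_length(a, b):
    """LCS length via the standard dynamic-programming table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def similarity(original, plagiarized):
    """LCS length over the plagiarized text length, as a percentage."""
    if not plagiarized:
        return 0.0
    return lcs_length(original, plagiarized) / len(plagiarized) * 100


cases = [
    # (original, plagiarized, expected similarity in percent)
    ("今天天气晴朗，适合户外运动。", "今天天气晴朗，适合户外运动。", 100.0),
    ("深度学习需要大量计算资源", "机器学习需要大量 GPU 资源。", 50.0),
    ("abcdefghij", "acegi", 100.0),  # deletions only: every kept character matches
    ("", "今天阳光明媚", 0.0),
]
for original, plagiarized, expected in cases:
    got = similarity(original, plagiarized)
    assert abs(got - expected) < 1e-9, (original, plagiarized, got)
print("all cases passed")
```

The deletion-only case shows why a file derived from the original purely by removing characters is reported as 100% similar under this denominator.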

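6. Exception Handling

1. Invalid input arguments
Scenario: the arguments supplied by the user do not meet the requirements.
def validate_arguments(args):
    if len(args) != 4:
        return False, "Wrong number of arguments. Usage: python main.py <original file> <plagiarized file> <report file>"

    for path in args[1:3]:
        if not os.path.exists(path):
            return False, f"Target file does not exist: {path} (please check the path)"

    output_path = args[3]
    if os.path.exists(output_path):
        print(f"Warning: result file {output_path} already exists and will be overwritten!")
    return True, "Arguments validated"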

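2. File existence checks
Input files that do not exist are rejected by validate_arguments above; a result file that already exists only triggers a warning:
import os
# output_path is the report path taken from sys.argv[3]
if os.path.exists(output_path):
    print(f"Warning: result file {output_path} already exists and will be overwritten!")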
3. When saving the report fails:
def save_report(report_content, output_path):
    try:
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write(report_content)
        return True, f"Report saved to: {os.path.abspath(output_path)}"
    except Exception as e:
        return False, f"Failed to save report: {str(e)} (please check permissions on the output path)"
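4. File open errors
Design goal: when load_document cannot open or decode a file, it returns an empty string so that the caller can immediately tell the user the file could not be read, which helps check whether the file exists and whether its permissions are correct.
Code:
def load_document(file_path):
    encodings = ['utf-8', 'gbk', 'utf-16', 'iso-8859-1']

    for encoding in encodings:
        try:
            # Strict decoding so a mismatched encoding raises and the next one is tried
            with open(file_path, 'r', encoding=encoding) as file:
                return file.read()
        except UnicodeDecodeError:
            continue
        except Exception as e:
            print(f"Failed to read {file_path}: {e}")
            return ""
    return ""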
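Because an empty string alone cannot say why a read failed, one possible refinement is to return a reason alongside the text. The sketch below is an assumption-laden illustration, not part of the project: the name `load_with_reason`, the reason strings, and the shortened encoding list (ISO-8859-1 is left out because it accepts any byte sequence) are all hypothetical.

```python
import os


def load_with_reason(path, encodings=("utf-8", "gbk", "utf-16")):
    """Return (text, reason); reason is "ok", "missing", "unreadable" or "undecodable"."""
    if not os.path.exists(path):
        return "", "missing"
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except OSError:
        return "", "unreadable"  # e.g. permission denied, or path is a directory
    for enc in encodings:
        try:
            return raw.decode(enc), "ok"
        except UnicodeDecodeError:
            continue
    return "", "undecodable"


text, reason = load_with_reason("no_such_file.txt")
print(repr(text), reason)
```

With a reason code the caller can print a specific message for each case, so an empty but valid file is no longer confused with a failed read.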