Academic paper

      一种基于样本的模拟口令集生成算法

      An Efficient Algorithm to Generate Password Sets Based on Samples

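      Abstract (translated from the Chinese original):
      Large-scale user password sets are highly valued in system security research because they can be used to evaluate the efficiency of password-guessing algorithms and to detect flaws in existing password protection mechanisms. However, although some channels exist, such as password leaks from websites, voluntary contributions solicited from users, or sharing by individual websites for research purposes, obtaining real large-scale plaintext user passwords remains very difficult for researchers. To address this problem, this paper proposes a sample-based algorithm for generating simulated password sets (Sample Perturbation Based Password Generation, SPPG). The algorithm uses a small-scale sample of real passwords, which is comparatively easy to obtain, to learn a probabilistic model and then produces large-scale user password sets. To evaluate the algorithm, the paper proposes a set of metrics for the quality of simulated password sets, including real-password coverage and goodness of fit to the Zipf distribution. Finally, the paper compares SPPG with probabilistic models commonly used for password guessing, including probabilistic context-free grammars and several Markov models, in terms of their effectiveness at generating user password sets. The results show that the simulated password sets produced by SPPG perform better on every metric. On average, real-password coverage is 9.58% and 72.79% higher than that of the context-free grammar and the fourth-order Markov model, respectively, and is improved by 10.34 times and 13.41 times relative to the third-order and first-order Markov models, respectively, while the goodness of fit to the Zipf distribution stays at 0.9 or above. In addition, the distribution of password structures and the use of special patterns are closer to those of passwords created by real users.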
      Abstract:
      Large-scale sets of real user passwords are widely regarded as important in system security research, because they are used to evaluate the efficacy of password-guessing algorithms and to detect defects in existing password protection mechanisms. At present, researchers can obtain real passwords in several ways, such as accidental or malicious password disclosure, voluntary user contributions, or sharing by websites for research purposes. However, collecting user password sets in these ways has serious limitations. For example, password sets captured from disclosures may have been tampered with, so their quality cannot be guaranteed. Moreover, the types of such password sets are limited. As a result, it is still difficult for researchers to access large-scale clear-text user passwords in a systematic manner. To resolve this issue, this paper presents a sample perturbation based password generation algorithm (SPPG for short). The algorithm uses a given small-scale sample of real user passwords, which is relatively easy to obtain, as a training set to build a probability model that can then generate large-scale password sets. To improve the authenticity of the simulated password sets, the SPPG algorithm is designed around the idea of sample perturbation. On the one hand, the algorithm uses the Probabilistic Context-Free Grammar to parse the sample and then generates passwords that have the same structures as passwords in the sample. On the other hand, it applies rules that users frequently use to transform their passwords and then generates passwords that are similar to passwords in the sample. To evaluate the efficacy of the SPPG algorithm, this paper presents a set of criteria for the quality of simulated password sets. These criteria include the coverage rate of the real passwords, the goodness of fit to the Zipf distribution, the similarity of password structure distributions, and the proportion of special patterns. Finally, this paper compares the efficacy of the SPPG algorithm with popular probability models for password guessing, including the Probabilistic Context-Free Grammar and several variants of the Markov model. In the experiment, small-scale samples are randomly selected from real user password sets and are then used by the different models to generate simulated password sets. The experimental results show that the SPPG algorithm performs better. On average, the coverage of the real passwords is improved by 9.58% and 72.79% compared with the Probabilistic Context-Free Grammar and the fourth-order Markov model, respectively. The coverage of the real passwords is also 10.34 times more than that of the third-order Markov model and 13.41 times more than that of the first-order Markov model. In addition, the goodness of fit to the Zipf distribution remains at a high level of no less than 0.9. As for the password structure distribution and the proportion of special patterns, the simulated password sets generated by the SPPG algorithm are also more similar to the real password sets than those generated by the other models.
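The evaluation criteria named in the abstract can be illustrated with a short sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption made for illustration: `structure` uses the common PCFG convention of letter (L), digit (D), and symbol (S) segments, `coverage_rate` is taken as the share of distinct real passwords that also appear in the simulated set, and `zipf_fit_r2` is taken as the R² of a straight-line fit in log-log rank/frequency space. The function names are hypothetical, and the sketch does not implement SPPG itself.

```python
import math
import re
from collections import Counter

# Assumed PCFG-style segmentation: runs of ASCII letters, digits, and anything else.
_SEGMENT = re.compile(r"(?P<L>[A-Za-z]+)|(?P<D>[0-9]+)|(?P<S>[^A-Za-z0-9]+)")


def structure(password):
    """Return a base structure string, e.g. 'abc123!' -> 'L3D3S1'."""
    return "".join(
        f"{m.lastgroup}{len(m.group())}" for m in _SEGMENT.finditer(password)
    )


def coverage_rate(simulated, real):
    """Fraction of distinct real passwords that occur in the simulated set.

    This definition is an assumption; the paper may define coverage differently.
    """
    real_set = set(real)
    if not real_set:
        return 0.0
    return len(real_set & set(simulated)) / len(real_set)


def zipf_fit_r2(passwords):
    """R^2 of a least-squares line through (log rank, log frequency).

    Assumed stand-in for the 'goodness of fit to the Zipf distribution'.
    """
    freqs = sorted(Counter(passwords).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct passwords")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        raise ValueError("all frequencies are equal; R^2 is undefined")
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)
```

Under these assumptions, a simulated set could be scored by calling `coverage_rate(simulated, held_out_real)` and `zipf_fit_r2(simulated)`, and structure distributions could be compared by counting `structure(p)` over each set with `Counter`.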
      Authors: 韩伟力 [1] 袁琅 [2] 李思斯 [3] 王晓阳 [1]
      Authors (romanized): HAN Wei-Li [2] YUAN Lang [3] LI Si-Si WANG Xiao-Yang
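      Affiliations: Software School, Fudan University, Shanghai 201203; Shanghai Key Laboratory of Data Science, Shanghai 201203; China Computer Federation (CCF)
      Journal: 计算机学报 (Chinese Journal of Computers) ISTIC EI PKU
      Year, volume (issue): 2017, 40(5)
      Classification number: TP391
      Online publication date: July 4, 2017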
      Funding: Supported by the "Innovation Action Plan Project" of the Science and Technology Commission of Shanghai Municipality and by the National Natural Science Foundation of China (61572136, 61370080). This paper is supported by the Shanghai Innovation Action Project under Grant No. 16DZ1100200, which focuses on constructing an infrastructure for big-data-oriented computing, and by the National Natural Science Foundation of China under Grant No. 61572136, which studies the security issues caused by the sensing abilities of mobile devices and proposes a defense mechanism. The method proposed in this paper will help the big data testbed (Grant No. 16DZ110020) create simulated data from a small, real sample when a test of big data requires mass