DeepSeek-affiliated Hangzhou DeepSeek AI Fundamental Technology Research Co., Ltd. today filed a patent for a new web data collection system designed to improve efficiency and data quality. The patent outlines a method for discovering more webpage links while minimizing the traffic load placed on websites. It assesses content that has already been downloaded to predict the quality of links not yet visited, prioritizing high-value data and reducing redundant downloads. Efficient web data collection is crucial for training large language models (LLMs), which power AI systems like ChatGPT. Existing techniques struggle with incomplete link retrieval, excessive downloads that can crash websites, and poor filtering of low-quality data. DeepSeek’s proposed system aims to solve these issues by optimizing data allocation and maintaining metadata accuracy. [iThome, in Chinese]
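The filing itself is not quoted here, so the following is only a minimal Python sketch of the general idea described above: score pages that have been downloaded, pass that score on to the links they reveal, and fetch the most promising links first while never queuing the same URL twice. The names `LinkFrontier`, `page_quality`, and `record_page` are hypothetical, and the word-diversity score is a stand-in assumption, not the method in the patent.

```python
import heapq


class LinkFrontier:
    """Priority queue of not-yet-downloaded URLs, highest predicted quality first."""

    def __init__(self):
        self._heap = []      # entries: (-score, insertion_order, url)
        self._seen = set()   # URLs already queued, so they are never queued twice
        self._counter = 0    # tie-breaker: equal scores pop in insertion order

    def add(self, url, predicted_quality):
        # Skip URLs we already know about, to avoid redundant downloads.
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-predicted_quality, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        # Return the queued URL with the highest predicted quality.
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def page_quality(text):
    # Hypothetical stand-in for a quality model: share of distinct words.
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def record_page(frontier, text, outlinks):
    # Assumption: a link inherits the quality score of the page that revealed it.
    score = page_quality(text)
    for link in outlinks:
        frontier.add(link, score)
    return score
```

In this sketch, a link found on a repetitive, low-scoring page waits behind links found on higher-scoring pages, and a URL that turns up on several pages is queued only once.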