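Similarity API is a high-speed fuzzy matching and deduplication API built for messy, real-world data. It helps you identify near-duplicate records and reconcile entities whose values do not match exactly, for example because of typos, case differences, missing punctuation, whitespace issues, abbreviations, and minor changes in word order.
Instead of building and tuning your own fuzzy matching pipeline, you send your strings (or records) to the API and get back matches with similarity scores you can trust. Typical output includes matched pairs (e.g., “Apple” ⇔ “apple inc.”), similarity scores, and structured results that are easy to integrate into data cleaning workflows, CRMs, ETL jobs, and analytics pipelines.
Common use cases:
Deduplicate a list: find duplicates within a dataset (all-to-all matching) and return likely duplicate pairs.
Reconcile against a master list: match an incoming list to a canonical set (list-to-master).
CRM and customer data hygiene: clean up leads/accounts/companies so duplicates do not distort reporting and outreach.
Entity resolution and record linkage: connect references to the same real-world entity across sources.
Why teams use it:
Handles messy text out of the box (no hand-written rules for every edge case)
Similarity scores for triage and thresholds (you decide how strict to be)
Built for scale and automation (designed for pipelines, not just one-off scripts)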
Dedupe is an all-to-all fuzzy matching endpoint for finding duplicates within a single list of strings. Instead of comparing only two inputs per API call, you send a dataset and it returns similar pairs and/or deduplicated groups across the entire set.
Why you’d use it
Massive speedup: typically ~300× to 1,000× faster than “regular” approaches people try first (pairwise comparisons, looping fuzzy scorers, etc.) once you go beyond tiny lists.
Optional cleanup built-in: you can enable common text cleanup (lowercasing, punctuation removal, token sorting). This saves hours (or days) of development + ongoing maintenance.
Company suffixes handled automatically: common endings like “Inc”, “LLC”, “Ltd”, etc. are stripped so you match the real name.
Benchmarks: similarity-api/blog/speed-benchmarks (1M records in ~7 minutes; faster than common Python fuzzy matching libraries).
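For context on the pairwise baseline mentioned above, here is a minimal sketch of the naive all-pairs approach using Python's standard-library `difflib`. It only illustrates the "loop over every pair" pattern; it is not how the API works internally, and `naive_dedupe` and `comparison_count` are hypothetical helper names. The number of comparisons grows as n(n-1)/2, which is why this pattern slows down quickly beyond small lists.

```python
from difflib import SequenceMatcher
from itertools import combinations


def naive_dedupe(strings, threshold=0.75):
    """Compare every pair of strings; return (a, b, score) for pairs at or above threshold."""
    pairs = []
    for a, b in combinations(strings, 2):
        # Case-insensitive character-level similarity from the standard library.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs


def comparison_count(n):
    """Number of pairwise comparisons needed for a list of n strings."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    names = ["Microsoft", "Micsrosoft", "Apple Inc", "Apple"]
    print(naive_dedupe(names))
    print(comparison_count(1000))       # 499500
    print(comparison_count(1_000_000))  # 499999500000
```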
Hard limits on Zyla
Max 1,000 strings per request (enforced).
Need bigger / unlimited?
Parameters (POST request)
data (required)
A string containing a JSON array of strings.
Example value for data:
["Acme Inc","ACME LLC","Globex GmbH"]
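similarity_threshold (optional, 0 to 1, default 0.75)
Higher = stricter matching (fewer pairs). Typical: 0.80–0.90 for company dedupe.
remove_punctuation (optional, true/false, default true)
Removes punctuation differences (e.g., “A.C.M.E.” vs “ACME”).
to_lowercase (optional, true/false, default true)
Makes matching case-insensitive.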
use_token_sort (optional, true/false, default false)
Helps when word order changes (e.g., “Bank of America” vs “America Bank of”).
output_format (optional, default string_pairs)
This endpoint can return data in multiple formats. Please select one of the following:
string_pairs:
[string_A, string_B, similarity]
index_pairs:
Same as string_pairs, but returns positions in your input list instead of the strings.
[index_A, index_B, similarity]
deduped_strings:
Returns the list of strings with duplicates removed (one representative per group).
deduped_indices:
Same as deduped_strings, but returns the indices of the kept items.
membership_map:
Returns an array in which entry i is the representative index for row i. For example, [0,0,0,3,3] means rows 0/1/2 are one group (rep=0) and rows 3/4 are another (rep=3).
row_annotations:
Returns one object per input row with an explanation of what it belongs to (rep row + similarity).
Use when: you want a human-readable, per-row result for debugging or UI display.
top_k (optional, integer or "all", default "all")
all = find all matches above threshold.
Or an integer (e.g., 50) to limit matches per row (faster, fewer results).
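To make the three cleanup switches concrete, the sketch below approximates what `to_lowercase`, `remove_punctuation`, and `use_token_sort` would each do to a string before comparison. This is a local illustration under assumptions, not the API's actual implementation; `normalize` is a hypothetical helper, and the API's exact punctuation and tokenization rules may differ.

```python
import string


def normalize(text, to_lowercase=True, remove_punctuation=True, use_token_sort=False):
    """Approximate the cleanup options locally (assumed behavior, not the API's code)."""
    if to_lowercase:
        text = text.lower()
    if remove_punctuation:
        # Drop ASCII punctuation characters such as "." and "!".
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Splitting on whitespace also collapses repeated spaces.
    tokens = text.split()
    if use_token_sort:
        # Sorting tokens makes word order irrelevant.
        tokens = sorted(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("A.C.M.E."))                              # acme
    print(normalize("Bank of America", use_token_sort=True))  # america bank of
    print(normalize("America Bank of", use_token_sort=True))  # america bank of
```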
Sample request in Python
import json

import requests

API_KEY = "YOUR_ZYLA_KEY"
URL = "API_URL/dedupe"  # replace API_URL with the endpoint base URL

data_list = ["Microsoft", "Micsrosoft", "Apple Inc", "Apple", "Google LLC", "9oogle"]

# All options are sent as query parameters; data is a JSON-encoded array of strings.
params = {
    "data": json.dumps(data_list),
    "similarity_threshold": "0.75",
    "remove_punctuation": "true",
    "to_lowercase": "true",
    "use_token_sort": "false",
    "output_format": "string_pairs",
    "top_k": "all",
}
headers = {"Authorization": f"Bearer {API_KEY}"}

r = requests.post(URL, headers=headers, params=params, timeout=60)
print(r.status_code)
print(r.json())
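Since requests are capped at 1,000 strings, it can help to validate the list before sending it. The sketch below builds the same kind of query parameters as the sample request and fails early when the list is too long. `build_dedupe_params` and `MAX_STRINGS` are hypothetical names introduced here; the parameter names and the 1,000-string limit come from this page.

```python
import json

MAX_STRINGS = 1000  # per-request limit stated on this page


def build_dedupe_params(strings, similarity_threshold=0.75,
                        output_format="string_pairs", top_k="all"):
    """Return query parameters for the dedupe endpoint, or raise if the input is invalid."""
    if len(strings) > MAX_STRINGS:
        raise ValueError(
            f"{len(strings)} strings given; the limit is {MAX_STRINGS} per request"
        )
    if not 0 <= similarity_threshold <= 1:
        raise ValueError("similarity_threshold must be between 0 and 1")
    return {
        "data": json.dumps(strings),  # JSON array encoded as a single string
        "similarity_threshold": str(similarity_threshold),
        "output_format": output_format,
        "top_k": str(top_k),
    }


if __name__ == "__main__":
    params = build_dedupe_params(["Acme Inc", "ACME LLC", "Globex GmbH"])
    print(params["data"])  # ["Acme Inc", "ACME LLC", "Globex GmbH"]
```

The returned dictionary can be passed as `params=` to `requests.post`, as in the sample request.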
Dedupe - Endpoint Features
| Parameter | Description |
|---|---|
| data | [Required] JSON array of strings to deduplicate (max 1000). Example: ["a","b","c"] |
| similarity_threshold | [Optional] Similarity cutoff from 0 to 1. Higher values are stricter (fewer matches). Default is 0.75. |
| remove_punctuation | [Optional] If true, punctuation is removed before matching. Default is true. |
| to_lowercase | [Optional] If true, strings are lowercased before matching. Default is true. |
| use_token_sort | [Optional] If true, tokens in each string are sorted before matching. Useful when word order varies. Default is false. |
| output_format | [Optional] Default: string_pairs. Allowed values: index_pairs (list of matches as [i, j, score] where i and j are indices in the input list); string_pairs (list of matches as [string_i, string_j, score] using original strings); deduped_strings (list of strings with duplicates removed, one representative per group); deduped_indices (list of indices representing the deduplicated set, one representative per group); membership_map (array of length N where entry i is the representative index for the group of data[i]); row_annotations (array of objects, one per input row, with fields index, original_string, rep_index, rep_string, similarity_to_rep). |
| top_k | [Optional] Limits how many neighbors are returned per input string. Use all for full dedupe, or a positive integer for top matches per row. |
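As an illustration of the `membership_map` format, the following sketch groups row indices by their representative index. It assumes the response has already been parsed into a Python list; `groups_from_membership_map` is a hypothetical helper, not part of the API.

```python
from collections import defaultdict


def groups_from_membership_map(membership_map):
    """Map each representative index to the list of row indices in its group."""
    groups = defaultdict(list)
    for row_index, rep_index in enumerate(membership_map):
        groups[rep_index].append(row_index)
    return dict(groups)


if __name__ == "__main__":
    print(groups_from_membership_map([0, 0, 0, 3, 3]))  # {0: [0, 1, 2], 3: [3, 4]}
```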
{"status":"success","response_data":[["Apple","appl!e",1.0]]}
curl --location --request POST --get 'https://zylalabs.com/api/11895/similarity+api/22607/dedupe' --data-urlencode 'data=["Apple", "appl!e"]' --header 'Authorization: Bearer YOUR_API_KEY'
| Header | Description |
|---|---|
| Authorization | [Required] Should be Bearer access_key. See "Your API Access Key" above once you are subscribed. |
No long-term commitment. Upgrade, downgrade, or cancel anytime. The free trial includes up to 50 requests.
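The Dedupe endpoint returns a JSON object containing matched string pairs, similarity scores, and optionally deduplicated results. Depending on the configuration you specify, the output can be formatted as string pairs, index pairs, or deduplicated strings.
The key fields in the response are "status" (indicating success or error) and "response_data", which contains the results formatted according to the request, such as matched pairs or deduplicated strings.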
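A minimal sketch of reading those two fields, using the sample response shown on this page as input. It assumes that any `status` other than "success" should be treated as an error and that `string_pairs` entries are three-element arrays; `parse_string_pairs` is a hypothetical helper. The JSON keys `status` and `response_data` are taken from the sample response.

```python
import json


def parse_string_pairs(body):
    """Parse a dedupe response body and return (string_a, string_b, score) tuples."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        # Assumption: anything other than "success" signals a failed request.
        raise RuntimeError(f"dedupe request failed: {payload}")
    return [(a, b, float(score)) for a, b, score in payload.get("response_data", [])]


if __name__ == "__main__":
    sample = '{"status":"success","response_data":[["Apple","appl!e",1.0]]}'
    print(parse_string_pairs(sample))  # [('Apple', 'appl!e', 1.0)]
```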
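Requests can be customized through the request parameters, such as "similarity_threshold" for matching strictness, "remove_punctuation" for preprocessing, and "output_format" to choose the structure of the results.
The response data is organized as an array of results, where each entry corresponds to a match or a deduplicated string. Depending on the output format, entries can include original strings, indices, and similarity scores, which makes the results easy to integrate into workflows.
Typical use cases include removing duplicates from customer lists, reconciling records against a master list, cleaning CRM data, and resolving entities across data sources to keep data complete and accurate.
Data accuracy is maintained through advanced fuzzy matching algorithms that account for common data problems such as typos and case differences. The API is designed to handle messy data effectively and produce reliable matches.
Accepted parameter values include "similarity_threshold" (0 to 1), "remove_punctuation" (boolean), "to_lowercase" (boolean), "use_token_sort" (boolean), and "top_k" (integer or "all"). These parameters let users tailor the matching process to their needs.
If the Dedupe endpoint returns partial or empty results, check the input data for quality problems, such as excessive duplicates, and check whether the similarity threshold suits the data (too high a value can filter out every pair). Adjusting "similarity_threshold" or reviewing the input list can help improve the results.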
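One way to explore threshold choices without re-sending the request is to fetch pairs once at a permissive threshold and then bucket them locally by score. The sketch below assumes `string_pairs`-style results already parsed into Python tuples; the bucket boundaries (0.90 and 0.75) and the name `triage_pairs` are illustrative assumptions, not API behavior, and the example scores are made up for demonstration.

```python
def triage_pairs(pairs, auto_merge_at=0.90, review_at=0.75):
    """Split (a, b, score) pairs into auto-merge, manual-review, and ignore buckets."""
    buckets = {"auto_merge": [], "review": [], "ignore": []}
    for a, b, score in pairs:
        if score >= auto_merge_at:
            buckets["auto_merge"].append((a, b, score))
        elif score >= review_at:
            buckets["review"].append((a, b, score))
        else:
            buckets["ignore"].append((a, b, score))
    return buckets


if __name__ == "__main__":
    # Illustrative scores only, not real API output.
    example = [("Apple", "appl!e", 1.0), ("Acme Inc", "Acme Co", 0.8), ("Acme", "Globex", 0.2)]
    result = triage_pairs(example)
    print({name: len(items) for name, items in result.items()})
    # {'auto_merge': 1, 'review': 1, 'ignore': 1}
```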