The Structured Text Extractor API provides a simple way to extract and transform web page content. Its clean text endpoint delivers unformatted, readable content, ideal for text analysis or simplified presentation. The Markdown endpoint goes a step further, producing structured Markdown that is ideal for integration with Markdown-aware tools and systems. The API supports a wide range of web page types, ensuring reliable performance and adaptability across applications, making it indispensable for content parsing and transformation.
To use this endpoint, send a request containing the URL of a web page and receive the clean text extracted from that page's content.
Markdownify API - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] JSON |
{"response":"Spark Basics\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. Distributed Computing removes this limitation of vertical scaling by distributing the processing across cluster of machines. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\nSpark Basics\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\nSpark timeline\nGoogle was first to introduce large scale distributed computing solution with MapReduce and its own distributed file system i.e., Google File System(GFS). GFS provided a blueprint for the Hadoop File System (HDFS), including the MapReduce implementation as a framework for distributed computing. Apache Hadoop framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the Spark. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. It incorporates libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\nSpark Application\nSpark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster. The executors are responsible for actually carrying out the work that the driver assigns them. 
The driver and executors are simply processes, which means that they can live on the same machine or different machines.\nThere is a SparkSession object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\nSpark’s language APIs make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the low-level “unstructured” APIs (RDDs), and the higher-level structured APIs (Dataframes, Datasets).\nSpark Toolsets\nA DataFrame is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster.\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a transformation and if it doesn’t return anything then it’s an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\nTransformation are of types narrow and wide. Narrow transformations are those for which each input partition will contribute to only one output partition. Wide transformation will have input partitions contributing to many output partitions.\nSparks performs a lazy evaluation which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\nSpark-submit\nReferences\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
```bash
curl --location --request POST 'https://zylalabs.com/api/5662/structured+text+extractor+api/7373/markdownify+api' \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
  }'
```
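The same request can be made from code. Below is a minimal Python sketch using the requests library, assuming the endpoint URL and Bearer token shown in the curl example above (YOUR_API_KEY and the target URL are placeholders to replace with your own values):

```python
import requests

# Endpoint and key taken from the curl example above; YOUR_API_KEY is a placeholder.
API_URL = "https://zylalabs.com/api/5662/structured+text+extractor+api/7373/markdownify+api"
API_KEY = "YOUR_API_KEY"

payload = {"url": "https://techtalkverse.com/post/software-development/spark-basics/"}
headers = {"Authorization": f"Bearer {API_KEY}"}

resp = requests.post(API_URL, json=payload, headers=headers, timeout=30)
resp.raise_for_status()

# Per the sample response above, the extracted clean text is returned under the "response" key.
clean_text = resp.json()["response"]
print(clean_text[:500])
```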
To use this endpoint, send a request containing the URL of a web page and receive that page's content converted to Markdown format.
Clean Page Converter - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] JSON |
{"response":"---\ntitle: Spark Basics\nurl: https://techtalkverse.com/post/software-development/spark-basics/\nhostname: techtalkverse.com\ndescription: Suppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nsitename: techtalkverse.com\ndate: 2023-05-01\ncategories: ['post']\n---\n# Spark Basics\n\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\n\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. **Distributed Computing** removes this limitation of vertical scaling by distributing the processing across cluster of machines.\nNow, a group of machines alone is not powerful, you need a framework to\ncoordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\n\n## Spark Basics\n\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\n\n```\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\n```\n\n\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\n\n## Spark timeline\n\nGoogle was first to introduce large scale distributed computing solution with **MapReduce** and its own distributed file system i.e., **Google File System(GFS)**. GFS provided a blueprint for the **Hadoop File System (HDFS)**, including the MapReduce implementation as a framework for distributed computing. **Apache Hadoop** framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the **Spark**. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. 
It incorporates libraries with composable APIs for\nmachine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\n\n## Spark Application\n\n**Spark Applications** consist of a driver process and a set of executor processes. The **driver** process runs your main() function, sits on a node in the cluster. The **executors** are responsible for actually carrying out the work that the driver assigns them. The driver and executors are simply processes, which means that they can live on the same machine or different machines.\n\nThere is a **SparkSession** object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\n**Spark’s language APIs** make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the **low-level “unstructured” APIs** (RDDs), and the **higher-level structured APIs** (Dataframes, Datasets).\n\n## Spark Toolsets\n\nA **DataFrame** is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A **partition** is a collection of rows that sit on one physical machine in your cluster.\n\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a **transformation** and if it doesn’t return anything then it’s an **action**. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\n\nTransformation are of types narrow and wide. **Narrow transformations** are those for which each input partition will contribute to only one output partition. **Wide transformation** will have input partitions contributing to many output partitions.\n\nSparks performs a **lazy evaluation** which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\n\n## Spark-submit\n\n## References\n\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
```bash
curl --location --request POST 'https://zylalabs.com/api/5662/structured+text+extractor+api/7374/clean+page+converter' \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
  }'
```
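A corresponding Python sketch for this endpoint, assuming the same request shape as above, that saves the returned Markdown (front matter included) to a local file for use in Markdown-aware tools:

```python
import requests

# Endpoint and key taken from the curl example above; YOUR_API_KEY is a placeholder.
API_URL = "https://zylalabs.com/api/5662/structured+text+extractor+api/7374/clean+page+converter"
API_KEY = "YOUR_API_KEY"

payload = {"url": "https://techtalkverse.com/post/software-development/spark-basics/"}
headers = {"Authorization": f"Bearer {API_KEY}"}

resp = requests.post(API_URL, json=payload, headers=headers, timeout=30)
resp.raise_for_status()

# The "response" field holds the page rendered as Markdown, front matter included.
markdown = resp.json()["response"]

# Save it for a static site generator, wiki, or other Markdown-compatible system.
with open("spark-basics.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```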
| Header | Description |
|---|---|
| Authorization | [Required] Should be Bearer access_key. After subscribing, see "Your API Access Key" above. |
No long-term commitment. Upgrade, downgrade, or cancel at any time. The free trial includes up to 50 requests.
The Structured Text Extractor API is a tool designed to extract clean text or Markdown from any web page, simplifying content processing for a wide range of applications.
The API can extract both unformatted, readable content and structured Markdown, making it suitable for text analysis and for integration with Markdown-compatible tools.
The clean text endpoint delivers unformatted content from a web page, ensuring the extracted text is readable and suitable for analysis or presentation, free of any HTML tags.
The Markdown endpoint produces structured Markdown, aimed at users who need to integrate extracted content with Markdown-compatible systems, improving usability and formatting.
Yes, the Structured Text Extractor API supports a wide variety of web page types, ensuring reliable performance and adaptability across content parsing and conversion needs.
The clean text endpoint returns unformatted, readable text extracted from the web page, while the Markdown endpoint returns structured Markdown that includes metadata such as the title, URL, and description, along with the formatted content.
The clean text response contains only the extracted text, while the Markdown response includes the title, URL, hostname, description, date, and categories, followed by the main content organized in Markdown format.
As the sample responses above show, both endpoints return a JSON object with a single "response" field: the clean text endpoint's value is a plain string, while the Markdown endpoint's value begins with front-matter metadata followed by the Markdown body, making specific metadata and content easy to access.
The clean text endpoint provides plain text for analysis, while the Markdown endpoint delivers richer content with metadata, making it well suited to documentation, blogs, and content management systems.
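For example, the metadata can be separated from the Markdown body by splitting on the "---" front-matter delimiters visible in the sample response above. A minimal sketch (the split_front_matter helper is illustrative, not part of the API):

```python
def split_front_matter(markdown: str):
    """Split a '---'-fenced front-matter block from the Markdown body."""
    if not markdown.startswith("---\n"):
        return {}, markdown
    _, front, body = markdown.split("---\n", 2)
    meta = {}
    for line in front.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, body

# With the sample response above:
# meta, body = split_front_matter(markdown)
# meta["title"] == "Spark Basics", meta["date"] == "2023-05-01"
```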
The main parameter for both endpoints is the URL of the web page to extract content from. Users can customize requests to target specific content by supplying different URLs.
Users can analyze the clean text for insights, or use the Markdown output to create formatted documents, integrate with content management systems, or enhance Markdown-aware web applications.
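As one illustration of the analysis side, a minimal sketch that counts the most frequent words in the extracted text (clean_text stands for the string returned in the "response" field of the clean text endpoint):

```python
import re
from collections import Counter

# Placeholder: in practice, clean_text is the "response" value returned by the API.
clean_text = "Spark is distributed data processing engine ..."

# Tokenize, drop very short words, and report the ten most common terms.
words = re.findall(r"[a-z']+", clean_text.lower())
common = Counter(w for w in words if len(w) > 3).most_common(10)
print(common)
```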
The API uses robust parsing algorithms to extract content from web pages accurately, ensuring that the returned text and Markdown reflect the original content as closely as possible.
Common use cases include content aggregation, text analysis, blog post creation, and converting web page content to Markdown for documentation or for publishing on Markdown-compatible platforms.