
API For Web Scraping / Processing #182

Open
solaris007 opened this issue Feb 29, 2024 · 2 comments
Comments

@solaris007
Member

To integrate the PoC-style Content Scraper and Content Processor, an HTTP API is needed that provides the following features:

  • trigger an async scraping -> processing task, which has the content-scraper scrape content from the input URL, store the results, and forward the task to the content-processor
  • check the status of a triggered task and eventually retrieve the results of the processor stages/handlers
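The intended client flow can be sketched as follows. This is a minimal sketch, not part of the proposal: the transport is injected as plain callables (stand-ins for `POST /scrape` and `GET /scrape/{taskId}`) so the flow can be exercised without a live server, and all names here are illustrative.

```python
import time

def run_scrape(post_scrape, get_status, url, poll_interval=0.0, max_polls=10):
    """Trigger a scraping task for `url` and poll until a terminal status.

    `post_scrape(url)` stands in for POST /scrape (returns the 202 body),
    `get_status(task_id)` stands in for GET /scrape/{taskId}.
    """
    task_id = post_scrape(url)["taskId"]   # POST /scrape -> 202 + taskId
    for _ in range(max_polls):
        body = get_status(task_id)         # GET /scrape/{taskId}
        if body["status"] in ("completed", "failed"):
            return body
        time.sleep(poll_interval)          # back off between polls
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")

# Example with stubbed transport callables:
_responses = iter([{"status": "pending"},
                   {"status": "completed", "results": {}}])
result = run_scrape(lambda url: {"taskId": "12345"},
                    lambda tid: next(_responses),
                    "https://example.com")
print(result["status"])  # prints "completed"
```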

Here is a proposal for amending the HTTP API spec:

openapi: 3.0.0
info:
  title: Web Scraping and Processing API
  version: 1.0.0
paths:
  /scrape:
    post:
      summary: Initiates a web scraping job.
      description: Triggers a new scraping job for the given URL and returns a task ID for status polling.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                url:
                  type: string
                  format: uri
                  description: The URL to be scraped.
              required:
                - url
            examples:
              example-1:
                value: { "url": "https://example.com" }
      responses:
        '202':
          description: Accepted. The scraping job is initiated, and a task ID is returned.
          content:
            application/json:
              schema:
                type: object
                properties:
                  taskId:
                    type: string
                    description: The unique identifier for the scraping task.
              examples:
                example-1:
                  value: { "taskId": "12345" }
        '400':
          description: Bad Request. The URL is invalid or missing.
        '429':
          description: Too Many Requests. Rate limit exceeded.
        '500':
          description: Internal Server Error.

  /scrape/{taskId}:
    get:
      summary: Polls the status and results of a scraping job.
      description: Retrieves the status and, if available, the results of a scraping job by task ID.
      parameters:
        - in: path
          name: taskId
          required: true
          schema:
            type: string
          description: The unique identifier for the scraping task.
      responses:
        '200':
          description: OK. Returns the status of the scraping job and results if completed.
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
                    description: The current status of the job ('pending', 'in_progress', 'completed', 'failed').
                  results:
                    type: object
                    properties:
                      translation:
                        type: string
                        description: URL or location of the translation result.
                      seoKeywords:
                        type: string
                        description: URL or location of the SEO keyword extraction result.
                      sentimentAnalysis:
                        type: string
                        description: URL or location of the sentiment analysis result.
              examples:
                pending:
                  value:
                    status: "pending"
                completed:
                  value:
                    status: "completed"
                    results:
                      translation: "https://results.example.com/translation/12345"
                      seoKeywords: "https://results.example.com/seo/12345"
                      sentimentAnalysis: "https://results.example.com/sentiment/12345"
        '404':
          description: Not Found. The task ID does not exist.
        '429':
          description: Too Many Requests. Rate limit exceeded.
        '500':
          description: Internal Server Error.
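One way the server side might track these tasks is an in-memory lifecycle matching the status values in the spec (pending -> in_progress -> completed | failed). This is a minimal sketch under assumed semantics, not a definitive implementation; the real work would be delegated to the content-scraper and content-processor.

```python
import uuid

# In-memory task store keyed by taskId; a real service would persist this.
TASKS = {}
VALID_TRANSITIONS = {
    "pending": {"in_progress", "failed"},
    "in_progress": {"completed", "failed"},
}

def create_task(url):
    """Handle POST /scrape: register a pending task and return its ID."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"status": "pending", "url": url, "results": {}}
    return task_id

def advance_task(task_id, status, results=None):
    """Move a task to a new status, enforcing the allowed transitions."""
    task = TASKS[task_id]
    if status not in VALID_TRANSITIONS.get(task["status"], set()):
        raise ValueError(f"illegal transition {task['status']} -> {status}")
    task["status"] = status
    if results:
        task["results"].update(results)

def get_task(task_id):
    """Handle GET /scrape/{taskId}: return the status body, or None for 404."""
    return TASKS.get(task_id)
```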
@solaris007 solaris007 added the enhancement New feature or request label Feb 29, 2024
@solaris007 solaris007 self-assigned this Feb 29, 2024
@solaris007
Member Author

@iuliag @ekremney @dzehnder @AndreiAlexandruParaschiv @alinarublea please review / provide input

@iuliag
Contributor

iuliag commented Mar 1, 2024

For my understanding: the url is a page URL that has nothing to do with the sites we have in StarCatalogue?

The Location header in the POST response could contain the URL to poll for the status of the task.

The current status of the job ('pending', 'in_progress', 'completed', 'failed').

If the possible status values are known, we should use an enum.
What's the difference between 'pending' and 'in_progress'?
When is the task completed, after all the subtasks are completed? Would you show partial results as the subtasks complete, or only final results?
It would be good to have an example of the response body for failed as well.
Generally, I think you'd have a different schema for the response body depending on status (or state), with different required properties.

For 429, it should include the Retry-After header.
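A client honoring these suggestions (an enum of status values, Retry-After on 429) could be sketched as below. The transport is again a hypothetical stub returning `(http_status, headers, body)` tuples in place of a real `GET /scrape/{taskId}` call.

```python
import time

# Proposed status enum from the spec discussion above.
STATUSES = {"pending", "in_progress", "completed", "failed"}

def poll_with_backoff(get_status, task_id, max_attempts=10):
    """Poll a task, honoring Retry-After on 429 and validating the status enum."""
    for _ in range(max_attempts):
        code, headers, body = get_status(task_id)
        if code == 429:
            # Rate limited: wait as instructed by the Retry-After header.
            time.sleep(float(headers.get("Retry-After", 1)))
            continue
        if code == 404:
            raise KeyError(f"unknown task {task_id}")
        status = body["status"]
        if status not in STATUSES:
            raise ValueError(f"unexpected status {status!r}")
        if status in ("completed", "failed"):
            return body
    raise TimeoutError("polling gave up")
```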
