[BUG] SearchTool Provides Irrelevant Results in Production (HK) vs. Development (CN) #1369

@iziang

Description

A critical issue has been identified where the SearchTool (which appears to use Bing search under the hood) behaves inconsistently between our local development environment (Hangzhou, Mainland China) and our production deployment (Hong Kong).

For a given query, the development environment retrieves highly relevant search results, which leads to correct RAG outputs. The exact same query in the production environment retrieves completely irrelevant results, which severely degrades the quality and accuracy of the final generated answer. This makes the RAG pipeline unreliable in production.

This is a high-priority bug as it fundamentally breaks the retrieval mechanism of ApeRAG in certain common deployment regions.

Environment Discrepancy

| Environment | Location | Observed Behavior | Result Quality |
| --- | --- | --- | --- |
| Development | Hangzhou, Mainland China | The underlying search request to bing.com is redirected to cn.bing.com, which serves correct, localized results. | Excellent |
| Production | Hong Kong | The search request hits bing.com's global endpoint directly, which returns completely irrelevant results (e.g., Japanese financial data for a weather query). | Critical Failure |

Steps to Reproduce

The underlying network behavior can be replicated without running the full ApeRAG stack, using cURL to simulate the HTTP requests from the two locations.

  1. Simulate Production (from a Hong Kong server):

    # Query for "拉斯维加斯天气" (Las Vegas weather)
    curl -vL "https://www.bing.com/search?q=%E6%8B%89%E6%96%AF%E7%BB%B4%E5%8A%A0%E6%96%AF%E5%A4%A9%E6%B0%94"

    Result: The HTML returned is for unrelated Japanese financial products.

  2. Simulate Development (from a Mainland China server):

    # Same query
    curl -vL "https://www.bing.com/search?q=%E6%8B%89%E6%96%AF%E7%BB%B4%E5%8A%A0%E6%96%AF%E5%A4%A9%E6%B0%94"

    Result: The request is redirected to cn.bing.com, and the HTML contains relevant results from sites like zhihu.com.

Root Cause Analysis

The issue stems from how Bing's servers treat programmatic, non-browser requests from different geographic locations:

  1. Geographic Routing: Bing correctly routes traffic to different edge nodes based on IP. The Hangzhou IP is routed to a Mainland China-specific infrastructure, while the Hong Kong IP is routed to a global/HK node.
  2. Client-Side Identity: The HTTP client used by ApeRAG's SearchTool (and cURL) is likely being identified as a "bot" or non-standard client by Bing's global endpoint in Hong Kong. This seems to trigger a fallback or anti-scraping mechanism that serves junk data.
  3. Redirection Difference: The Mainland China infrastructure appears to redirect all traffic to cn.bing.com (see the 302 in the Hangzhou log), and cn.bing.com returns relevant results even to non-browser clients. The global infrastructure does not redirect, which exposes its different treatment of non-browser user agents.

Impact on the ApeRAG Project

  • Unreliable Production Deployments: Any ApeRAG application deployed in Hong Kong (or potentially other regions outside Mainland China) will have a non-functional search/retrieval step.
  • "It Works On My Machine" Problem: This creates a severe discrepancy between development and production, making it difficult to debug and trust local testing.
  • Poor RAG Quality: The core promise of RAG is to provide accurate, context-aware answers. With a faulty retrieval step, the generator produces nonsensical or incorrect information ("garbage in, garbage out").

Proposed Solution / Next Steps

The most direct solution is to make the HTTP requests from ApeRAG's SearchTool appear as if they are coming from a standard web browser.

Recommendation:
Modify the HTTP client within the SearchTool to include a standard set of browser headers. At a minimum, this should include:

  • User-Agent: e.g., Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
  • Accept-Language: e.g., en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7
  • Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7

This change should make Bing's global servers treat the request as a legitimate user interaction, returning relevant results and resolving the environment inconsistency.
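As a rough illustration of the recommendation above, here is a minimal Python sketch using only the standard library. The helper name `build_search_request` and the `BROWSER_HEADERS` constant are hypothetical and are not taken from ApeRAG's code; the header values are the examples listed above. Whether these headers actually change what Bing's Hong Kong endpoint returns still needs to be verified from a Hong Kong host.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Example browser-like headers, copied from the recommendation above.
# Assumption: these are illustrative values, not a verified fix.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,image/apng,*/*;q=0.8,"
        "application/signed-exchange;v=b3;q=0.7"
    ),
}


def build_search_request(
    query: str, base_url: str = "https://www.bing.com/search"
) -> Request:
    """Hypothetical helper: build (but do not send) a search request
    that carries browser-like headers."""
    # urlencode percent-encodes the UTF-8 bytes of the query string.
    url = f"{base_url}?{urlencode({'q': query})}"
    return Request(url, headers=BROWSER_HEADERS)
```

The request can then be sent with `urllib.request.urlopen(build_search_request("拉斯维加斯天气"), timeout=10)`, which follows redirects by default, so the same code covers both the redirected (cn.bing.com) and non-redirected cases.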

A longer-term solution might involve integrating with an official, paid search API (like the Bing Search API). Such APIs are designed for programmatic access and should return more consistent results across regions than scraping the public search page.

Supporting Logs

<details>
<summary><b>Full cURL Log from Hong Kong (Production Simulation)</b></summary>

hk.txt log content here...

< HTTP/2 200
...
< x-msedge-ref: Ref A: 1B95296EBD2143C0BFF4AA1CAA2697DB Ref B: HKBEDGE0908 Ref C: 2025-09-19T02:05:53Z
...


</details>

<details>
<summary><b>Full cURL Log from Hangzhou (Development Simulation)</b></summary>

hz.txt log content here...

< HTTP/2 302
< location: https://cn.bing.com/search?q=%E6%8B%89%E6%96%AF%E7%BB%B4%E5%8A%A0%E6%96%AF%E5%A4%A9%E6%B0%94
...
< HTTP/2 200
...
< x-msedge-ref: Ref A: 766A2412CFA14F638355C0212634C9D0 Ref B: BJ1EDGE0719 Ref C: 2025-09-19T02:05:32Z
...


</details>
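
To compare the two logs above more easily, the following Python sketch extracts the response status codes, any redirect `location`, and the `Ref B` field of `x-msedge-ref` from `curl -v` output. The names `CurlSummary` and `summarize_curl_log` are hypothetical. Reading `Ref B` as an edge-node identifier is an assumption based on the values seen in the logs (`HKBEDGE0908`, `BJ1EDGE0719`), not on documented Bing behavior.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CurlSummary:
    statuses: list = field(default_factory=list)    # e.g. [302, 200]
    locations: list = field(default_factory=list)   # redirect targets
    edge_nodes: list = field(default_factory=list)  # "Ref B" values (assumed edge IDs)


def summarize_curl_log(log: str) -> CurlSummary:
    """Summarize the response lines ("< ...") of `curl -v` output."""
    summary = CurlSummary()
    for line in log.splitlines():
        # curl -v prefixes response header lines with "< ".
        if not line.startswith("< "):
            continue
        header = line[2:].strip()

        # Status line, e.g. "HTTP/2 302".
        status = re.match(r"HTTP/\S+\s+(\d{3})", header)
        if status:
            summary.statuses.append(int(status.group(1)))
            continue

        name, sep, value = header.partition(":")
        if not sep:
            continue
        name, value = name.strip().lower(), value.strip()

        if name == "location":
            summary.locations.append(value)
        elif name == "x-msedge-ref":
            ref_b = re.search(r"Ref B:\s*(\S+)", value)
            if ref_b:
                summary.edge_nodes.append(ref_b.group(1))
    return summary
```

Running `summarize_curl_log(open("hz.txt").read())` and the same for `hk.txt` gives a compact view of the difference: the Hangzhou log contains a 302 to cn.bing.com before the 200, while the Hong Kong log contains only the 200.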
