How to Use "Correlation" to Infer Truth When Data Is Imperfect

When the seasons change, especially as winter approaches, many people have the habit of pulling back the curtains in the morning to see how people outside are dressed, and using that to judge whether the day will be cold or warm.

This is actually a simple form of "data analysis": inferring an overall trend from the samples around you.

The "Street Sweeping List" recently launched by Amap essentially follows a similar logic, using data on navigation destinations and travel behaviors to determine whether a shop is worth visiting.

Limitations of Observational Indicators

The problem with this judgment method is obvious.

Winter in Guangdong is a typically chaotic sample: some people wear short sleeves, some wear down jackets, and others pair windbreakers with shorts or scarves with short sleeves.

In such an environment, even if you see most people wearing short sleeves within a minute, you cannot conclude that "it's very hot today."

From a statistical perspective, this is a combination of "small sample bias" and "high variance noise."

The people outside the window are just a small, haphazard sample of the urban population (whoever happens to pass by, not a truly random draw), and each person's clothing choice is influenced by many incidental factors (commuting method, physical differences, psychological expectations, and so on).

As a result, the signal you obtain carries strong volatility and randomness.
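To make the "small sample" point concrete, here is a minimal simulation sketch. It assumes a made-up true share of short-sleeve wearers (40%) and idealized random sampling, so it shows only the variance part of the problem; the bias from watching a single street is not modeled.

```python
import random
import statistics


def sample_share(true_share: float, n: int, rng: random.Random) -> float:
    """Share of short-sleeve wearers among n randomly observed passers-by."""
    hits = sum(1 for _ in range(n) if rng.random() < true_share)
    return hits / n


def spread_of_estimates(true_share: float, n: int,
                        trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of the observed share across many repeated glances."""
    rng = random.Random(seed)
    estimates = [sample_share(true_share, n, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)


# 0.4 is an assumed "true" share, chosen only for illustration.
small = spread_of_estimates(0.4, n=10)
large = spread_of_estimates(0.4, n=1000)
print(f"spread when glancing at 10 people:   {small:.3f}")
print(f"spread when glancing at 1000 people: {large:.3f}")
```

With only ten people in view, the observed share swings widely from one glance to the next, which is why a one-minute look out the window is such an unstable signal.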

When such judgments go wrong, the attribution of errors also becomes biased.

If you dress too lightly and freeze because you saw people in short sleeves, you might curse those wearing short sleeves as crazy; similarly, if you fall into a trap due to a recommendation from the Street Sweeping List, you might blame Amap.

If Amap relies solely on navigation data for its algorithmic analysis, it will struggle to give users recommendations that meet their expectations (or match its own marketing claims). To avoid that failure, it has to bring in data beyond navigation to cross-validate the results.

Transaction Data Is the Most Indicative Metric

What truly reflects a shop's operational condition is "transaction data."

When a shop's operating area and product prices are known, shops with "high transaction frequency, high transaction amounts, high repurchase rates, and a wide customer base" are the more stable and attractive choices.

These are "high signal strength" metrics, which better reveal a shop's true commercial quality.

However, the issue is that transaction data is more sensitive than navigation data, as it involves merchant privacy and platform compliance. Even though Amap is a wholly-owned subsidiary of Alibaba, it cannot openly use such data.

Therefore, if a product like the Street Sweeping List can reflect real popularity to some extent, it is likely achieved by analyzing "proxy variables highly correlated with transaction behaviors," such as the frequency of navigation to the shop, users' repeat visit rates, and the proportion of long-distance navigation.

At the same time, judging from the types of lists currently on public display, it can be inferred that Amap also draws on data beyond navigation: data that is highly correlated with transactions (possibly location data at the time of payment) and user location data collected outside of navigation (such as real-time device location).

Although these behavioral indicators are not transaction data themselves, they are often highly correlated with consumption behavior statistically.

This approach is called "correlation inference" or "proxy modeling": when target data is unavailable, indirectly inferring the target value through observable correlated variables.
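A minimal sketch of proxy modeling follows. The numbers are invented for illustration and have nothing to do with Amap's actual data or method: for a few shops where both the proxy (weekly navigations) and the target (weekly orders) are assumed known, it measures the Pearson correlation, fits a least-squares line, and uses that line to estimate the target for a shop where only the proxy is observable.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Hypothetical calibration shops: proxy and target are both known.
weekly_navigations = [120, 300, 80, 450, 200]
weekly_orders = [55, 160, 35, 230, 95]

r = pearson(weekly_navigations, weekly_orders)
slope, intercept = fit_line(weekly_navigations, weekly_orders)

# A shop where only the proxy is observable (again, a made-up value).
estimated_orders = slope * 250 + intercept
print(f"correlation: {r:.3f}, estimated weekly orders: {estimated_orders:.0f}")
```

The estimate is only as good as the correlation it rests on, and a relationship calibrated on one group of shops may not hold for a very different kind of shop.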

Thus, Amap's Street Sweeping List is likely built on correlation analysis between navigation behavior and consumption trends, rather than on transaction amounts directly.

This is also a typical "weak signal amplification" strategy: using enough indirect indicators to construct judgments close to reality.
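Why stacking indirect indicators helps can be shown with a small simulation. This is a sketch under a strong assumption: each indicator equals the true value plus independent Gaussian noise. Real indicators derived from the same navigation logs share much of their noise, so the actual gain would be smaller.

```python
import random
import statistics


def rmse_of_average(k: int, noise_sd: float = 1.0,
                    trials: int = 2000, seed: int = 0) -> float:
    """Root-mean-square error when averaging k independent noisy indicators."""
    rng = random.Random(seed)
    true_quality = 5.0  # hypothetical "real" quality of a shop
    squared_errors = []
    for _ in range(trials):
        readings = [true_quality + rng.gauss(0.0, noise_sd) for _ in range(k)]
        estimate = statistics.fmean(readings)
        squared_errors.append((estimate - true_quality) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


one = rmse_of_average(1)
nine = rmse_of_average(9)
print(f"error with 1 indicator:  {one:.2f}")
print(f"error with 9 indicators: {nine:.2f}")
```

Under the independence assumption, the error shrinks roughly with the square root of the number of indicators, so nine weak signals behave like one signal about three times as precise.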

The List's Limit: It Finds the Common Denominator, Not the Optimal Solution

However, even the most advanced algorithms cannot eliminate the bias brought by "group averages."

Lists like the Street Sweeping List essentially reflect only the shops that "most people find good." Statistically, this is close to a "majority consensus solution," like the McDonald's near your home.

It can significantly reduce the probability of falling into a trap but cannot guarantee that you will choose the "optimal" option.

Like the result of a public vote, it is usually the option that most people can accept, or at least do not dislike, rather than the most efficient one.

Moreover, navigation data itself carries significant noise, and many "check-in-style" trips do not represent actual satisfaction. Who hasn't fallen into the trap of an internet-famous shop (a wanghong shop)?

Nevertheless, this methodology is still worth learning from.

When we cannot obtain firsthand data, we can construct a reasonable judgment system through a series of "indirect indicators."

For example:

To judge the passenger flow in a certain area, you can look at the density of surrounding residential areas, the distribution of urban villages, and the number of subway entrances and exits.
To judge whether a road section is prone to congestion, you can examine the distribution of nearby schools, office buildings, and the intersections of main roads.

This is essentially the thinking behind "feature engineering": when core variables are missing, approximate the operational laws of the real world by constructing reasonable combinations of proxy variables.
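The passenger-flow example above can be sketched as a simple proxy score. Everything here is an assumption for illustration: the three features come from the text, but the area data and the weights are made up, and in practice the weights would be fitted against areas whose real foot traffic is known.

```python
from dataclasses import dataclass


@dataclass
class Area:
    name: str
    residential_density: float  # residents per square km (made-up numbers)
    urban_villages: int         # number of urban villages nearby
    subway_exits: int           # number of subway entrances and exits


# Hypothetical weights, not derived from any real data.
WEIGHTS = {"residential_density": 0.5, "urban_villages": 0.2, "subway_exits": 0.3}


def min_max(values):
    """Rescale values to the range [0, 1]; a constant feature maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def foot_traffic_scores(areas):
    """Weighted sum of min-max-scaled proxy features for each area."""
    scores = {a.name: 0.0 for a in areas}
    for feature, weight in WEIGHTS.items():
        scaled = min_max([getattr(a, feature) for a in areas])
        for area, value in zip(areas, scaled):
            scores[area.name] += weight * value
    return scores


areas = [
    Area("A", 12000, 3, 8),
    Area("B", 4000, 0, 2),
    Area("C", 9000, 1, 5),
]
scores = foot_traffic_scores(areas)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Scaling each feature before weighting keeps a large-valued feature such as density from drowning out small counts such as subway exits. The resulting score is a relative ranking among the areas compared, not an estimate of actual headcount.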

Speaking Nonsense with Reason and Evidence

Data in the real world is never perfect.

The people you see outside the window, the rankings on the Street Sweeping List, and the heat maps on navigation apps are all signals with noise.

However, as long as you can understand this data with "logic and probabilistic thinking," you can extract relatively reliable judgments from it.

In statistics, this is called "Bayesian inference": in an uncertain environment, combining limited information with existing experience to gradually revise beliefs and approach the truth.
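As a minimal sketch of that updating process, the window example can be written with Bayes' rule. The prior and the two likelihoods are invented numbers, and the code assumes each passer-by is an independent observation, which real streets do not guarantee.

```python
def update_belief(prior_cold: float, saw_short_sleeves: bool,
                  p_short_given_cold: float = 0.2,
                  p_short_given_warm: float = 0.7) -> float:
    """Posterior probability that the day is cold after one observed passer-by."""
    if saw_short_sleeves:
        like_cold, like_warm = p_short_given_cold, p_short_given_warm
    else:
        like_cold, like_warm = 1 - p_short_given_cold, 1 - p_short_given_warm
    evidence = like_cold * prior_cold + like_warm * (1 - prior_cold)
    return like_cold * prior_cold / evidence


belief = 0.5  # assumed prior: no idea whether today is cold
observations = [True, False, False]  # short sleeves, then two people in jackets
for saw_short_sleeves in observations:
    belief = update_belief(belief, saw_short_sleeves)
    print(f"P(cold) is now {belief:.2f}")
```

Each new observation shifts the belief rather than replacing it, so a single person in short sleeves lowers the estimated chance of a cold day without settling the question.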

Therefore, when ground-truth data is unavailable, there is nothing wrong with judging the weather by "looking at people outside the window" or judging a commercial district by the "Street Sweeping List."

The key lies in whether you know its limitations and can identify signals amidst the noise.

After all, compared to being completely in the dark, speaking nonsense with reason and evidence often comes closer to the truth.