
Pandas is a popular open-source Python library designed for data manipulation and analysis. It provides fast, flexible, and user-friendly tools, such as DataFrames and Series, for working with structured data (like tables) and performing operations like filtering, merging, reshaping, and aggregating efficiently. It's widely used in data science and machine learning workflows.
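As a minimal sketch of what that looks like in practice, here is a small, invented table being filtered and aggregated (the column names and values are made up purely for illustration):

import pandas as pd

# A small, invented table of sales records
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 250, 175, 90],
})

# Filtering: keep only rows for the North region
north = sales[sales["region"] == "North"]

# Aggregating: total revenue per product
totals = sales.groupby("product")["revenue"].sum()

print(north)
print(totals)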
While it is undoubtedly a mature and extremely useful tool, Pandas is also reflective of entrenched data analysis culture. In this context, "data analysis culture" refers to the collection of norms, tools, methods, and approaches that a particular field, discipline, or organization typically adopts—often shaped by historical trends, shared tools, common educational practices, and specific subject needs.
For example:
- Economists might use econometrics with a heavy emphasis on regression analysis over machine learning methods because their approach has long revolved around causal inference.
- Machine learning practitioners may favour predictive accuracy over understanding causal relationships, reflecting the culture of applied AI and technology.
- Biologists performing experiments may emphasize statistical significance testing over exploratory data techniques since hypothesis testing has dominated the field historically.
In some cases, analysts may continue to follow these cultural norms even when newer or potentially better approaches exist, simply because those practices are deeply ingrained.
The culture of data analysis in different fields is influenced by several factors, often shaped by historical trends, educational practices, and the unique goals of each discipline. One major factor is tradition and legacy. Many disciplines persist with established methodologies because they are familiar and widely accepted within the community. These methods often become the “default approach” simply because they have been used for decades. For instance, classical hypothesis testing, such as the use of p-values, remains dominant in fields like biology and psychology, even though newer methods like Bayesian analysis could sometimes provide richer insights. Similarly, the long-standing tradition of econometrics in economics can sometimes lead to resistance against adopting machine learning techniques, which, while powerful in prediction, often lack the kind of causal interpretability that economists prioritize.
Another influential factor is field-specific objectives. Different disciplines have different goals, which naturally shape the tools and techniques used. For example, some fields focus heavily on causation—understanding why something happens—while others prioritize prediction. Economists, psychologists, and social scientists tend to focus on causal relationships, as understanding the underlying mechanisms is crucial in their work. In contrast, fields like machine learning are more concerned with creating highly accurate predictions, even if the underlying relationships are not easily interpretable. Similarly, in industries such as healthcare and finance, interpretability is often prioritized, which leads to a preference for simpler, more transparent models like linear regression over more complex “black-box” models such as neural networks, even if the latter provide more accurate results.
The availability of data also plays a significant role in shaping analytical culture. In fields like the social sciences, where data often come from surveys or small sample sizes, simpler statistical techniques like linear regression and t-tests are predominant because they are well-suited to the limitations of the data. On the other hand, tech sectors often deal with massive datasets and thus lean heavily on machine learning and deep learning methods, which scale well with large quantities of data.
Additionally, the educational background and training of analysts influence their choices. Analysts naturally gravitate toward methods they learned during their formal education. For instance, psychologists and biologists are often trained rigorously in classical hypothesis testing, ANOVA, and the use of p-values, leaving them less exposed to methods like machine learning. In contrast, engineers or computer scientists, who are usually trained in programming and data science, often focus on predictive methodologies and may not emphasize classical statistical inference as much. This divergence in educational exposure reinforces the cultural preferences for certain tools and techniques within different disciplines.
Finally, there is the influence of peer and institutional pressure. The methods analysts use are often driven by the expectations of their peers, organizations, or even academic journals. For example, in academia, publishing often hinges on the use of familiar and widely accepted tools like statistical significance testing, even if newer, unconventional approaches might offer better insights. This pressure to conform to established norms can discourage experimentation and adoption of newer methods.
In essence, the culture of data analysis in any given field is shaped by a combination of tradition, objectives, data constraints, education, and external pressures. While these influences help provide a shared foundation within disciplines, they can occasionally hinder innovation or the adoption of more effective techniques.
Real-world Examples of Culture-Driven Analysis
P-values in Hypothesis Testing
Fields like medicine and psychology have long relied on p-values (e.g., the arbitrary 0.05 significance level) to validate hypotheses. While useful, this approach has received significant criticism for being used rigidly, even in scenarios where other approaches like Bayesian analysis may provide richer insights. The culture persists due to education, shared norms, and familiarity with statistical significance as a "gold standard."
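As a rough sketch of that default workflow, the following compares a two-sample t-test's p-value with the conventional 0.05 cutoff (the measurements below are invented for illustration):

import numpy as np
from scipy import stats

# Invented measurements for a control group and a treatment group
control = np.array([5.1, 4.9, 5.3, 5.0, 4.8, 5.2])
treatment = np.array([5.6, 5.4, 5.8, 5.5, 5.3, 5.7])

# Two-sample t-test: the culturally dominant approach
result = stats.ttest_ind(treatment, control)

# The "gold standard" decision rule at the 0.05 threshold
if result.pvalue < 0.05:
    print(f"p = {result.pvalue:.4f}: reject the null hypothesis")
else:
    print(f"p = {result.pvalue:.4f}: fail to reject the null hypothesis")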
Econometrics vs. Machine Learning
Economists tend to prefer econometric models (e.g., linear models or instrumental variables) that focus on causal inference with interpretable relationships. In contrast, tech companies increasingly rely on black-box machine learning models to optimize predictions, even when causality isn't well understood. Economists might sometimes reject machine learning because it doesn't satisfy their need for causal explanations, despite its potential predictive power.
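The contrast can be sketched roughly as follows, using invented data and hypothetical variable names: an econometric-style linear model whose coefficients are read directly as effect sizes, next to a black-box model tuned purely for prediction.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

# Invented data: how price and advertising spend relate to sales
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "price": rng.uniform(1, 10, 200),
    "advertising": rng.uniform(0, 5, 200),
})
data["sales"] = 50 - 3 * data["price"] + 4 * data["advertising"] + rng.normal(0, 2, 200)

# Econometric-style approach: an interpretable linear model
X = sm.add_constant(data[["price", "advertising"]])
ols = sm.OLS(data["sales"], X).fit()
print(ols.params)  # Coefficients read as effect sizes

# Machine-learning-style approach: a black-box model optimized for prediction
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(data[["price", "advertising"]], data["sales"])
print(rf.predict(data[["price", "advertising"]][:5]))  # Predictions, but no interpretable coefficients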
Academic vs. Practical Approaches
In academic settings, researchers often focus on model interpretability, reproducibility, and statistical rigor. In industry, data practitioners may prefer methods that are faster or more computationally efficient, even if they sacrifice interpretability.
Visualizations
Fields like journalism may emphasize storytelling-based visualizations such as infographics, whereas scientific communities tend to value precise but minimalistic visualizations such as simple scatterplots or bar charts.
When Pandas isn't Best - A Simple Example
Let's say we have a dataset of friendships and need to answer the following questions:
- Who are the "friends of friends" of a given user?
- What is the shortest path between two users?
These questions involve traversing a network of relationships, which is what graph databases excel at. While Pandas can indeed handle such data, working with these kinds of questions in Pandas is far more complex and less efficient.
In our example, the nodes are users with unique IDs and the edges are friendships between them. Let's take a look at equivalent solutions using Neo4j and Pandas.
Solution using a Graph Database (Neo4j)
- Install the Python Neo4j driver:
pip install neo4j
- Python code to analyse the data:
from neo4j import GraphDatabase

# Connect to the Neo4j database
uri = "bolt://localhost:7687"
username = "neo4j"
password = "password"  # Replace with your own password

driver = GraphDatabase.driver(uri, auth=(username, password))

# Create relationships in the database
def create_relationships(session):
    # Sample friendship data
    friendships = [
        (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)
    ]
    for user_a, user_b in friendships:
        session.run(
            "MERGE (a:User {id: $user_a}) "
            "MERGE (b:User {id: $user_b}) "
            "MERGE (a)-[:FRIEND]->(b)",
            user_a=user_a, user_b=user_b
        )

# Find friends of friends
def find_friends_of_friends(session, user_id):
    query = (
        "MATCH (user:User {id: $user_id})-[:FRIEND]->(friend)-[:FRIEND]->(fof) "
        "WHERE NOT (user)-[:FRIEND]->(fof) AND user <> fof "
        "RETURN DISTINCT fof.id AS friend_of_friend"
    )
    result = session.run(query, user_id=user_id)
    return [record['friend_of_friend'] for record in result]

# Find shortest path
def shortest_path(session, user_a, user_b):
    query = (
        "MATCH p=shortestPath((a:User {id: $user_a})-[:FRIEND*]-(b:User {id: $user_b})) "
        "RETURN [node IN nodes(p) | node.id] AS path"
    )
    result = session.run(query, user_a=user_a, user_b=user_b)
    return result.single()["path"]

with driver.session() as session:
    create_relationships(session)
    # Example: Find friends of friends for user 1
    print("Friends of Friends (User 1):", find_friends_of_friends(session, 1))
    # Example: Find shortest path between users 1 and 5
    print("Shortest Path (1 -> 5):", shortest_path(session, 1, 5))

driver.close()
Result
- Friends of Friends (User 1): [4]
- Shortest Path (1 -> 5): [1,3,4,5]
Partial Solution Using Pandas and NetworkX
One step towards a pure Pandas solution is to use NetworkX to help with the graph of relationships. This is a bit of a cheat since we're comparing Pandas with Neo4j, not NetworkX, but I wanted to show a more sensible way to do it with Pandas first. It is worth mentioning that the performance of Neo4j is far superior to that of NetworkX over large datasets.
- Install Pandas and NetworkX:
pip install pandas networkx
- Python code to analyse the data using Pandas with NetworkX:
import pandas as pd
import networkx as nx

# Create the friendship dataset
friendships = pd.DataFrame({
    "User A": [1, 1, 2, 3, 4],
    "User B": [2, 3, 3, 4, 5]
})

# Create a NetworkX graph from the Pandas DataFrame
graph = nx.from_pandas_edgelist(friendships, "User A", "User B", create_using=nx.Graph())

# Find friends of friends
def find_friends_of_friends(graph, user_id):
    direct_friends = set(graph.neighbors(user_id))
    friends_of_friends = set()
    for friend in direct_friends:
        friends_of_friends.update(graph.neighbors(friend))
    friends_of_friends -= direct_friends  # Exclude direct friends
    friends_of_friends.discard(user_id)  # Exclude the user itself
    return list(friends_of_friends)

# Find the shortest path
def shortest_path(graph, user_a, user_b):
    try:
        return nx.shortest_path(graph, source=user_a, target=user_b)
    except nx.NetworkXNoPath:
        return None

# Example: Find friends of friends for user 1
print("Friends of Friends (User 1):", find_friends_of_friends(graph, 1))

# Example: Find shortest path between users 1 and 5
print("Shortest Path (1 -> 5):", shortest_path(graph, 1, 5))
Result
- Friends of Friends (User 1): [4]
- Shortest Path (1 -> 5): [1,3,4,5]
The Pure Pandas Solution
The following approach uses Pandas without additional dependencies. It is suitable for small datasets where the relationships are simple enough to compute in memory.
import pandas as pd
# Friendship data
friendships = pd.DataFrame({
    "User A": [1, 1, 2, 3, 4],
    "User B": [2, 3, 3, 4, 5]
})

def find_friends_of_friends_pandas(df, user_id):
    # Find all direct friends
    direct_friends_a = df[df["User A"] == user_id]["User B"]
    direct_friends_b = df[df["User B"] == user_id]["User A"]
    direct_friends = pd.concat([direct_friends_a, direct_friends_b]).unique()
    # Find friends of those friends
    all_friends_of_friends = df[
        df["User A"].isin(direct_friends) | df["User B"].isin(direct_friends)
    ]
    # Extract all connections from the above DataFrame
    fof_a = all_friends_of_friends["User A"]
    fof_b = all_friends_of_friends["User B"]
    all_fofs = pd.concat([fof_a, fof_b]).unique()
    # Exclude the user themselves and their direct friends
    all_fofs = set(all_fofs) - set(direct_friends) - {user_id}
    return list(all_fofs)

def shortest_path_pandas(df, user_a, user_b):
    # Initialize the BFS queue with the start user
    queue = [(user_a, [user_a])]  # Each element is a tuple (current_user, path_so_far)
    visited = set()  # Track visited users to prevent infinite loops
    while queue:
        current_user, path = queue.pop(0)
        # If we've already visited this user, skip
        if current_user in visited:
            continue
        # Mark the current user as visited
        visited.add(current_user)
        # Find all direct friends of the current user
        direct_friends_a = df[df["User A"] == current_user]["User B"]
        direct_friends_b = df[df["User B"] == current_user]["User A"]
        direct_friends = pd.concat([direct_friends_a, direct_friends_b]).unique()
        # Check if the target user is among the direct friends
        for friend in direct_friends:
            if friend == user_b:
                return path + [user_b]  # Return the full path if found
            # Otherwise, add this friend to the BFS queue for further exploration
            if friend not in visited:
                queue.append((friend, path + [friend]))
    # If no path is found, return None
    return None
# Example: Find friends of friends for user 1
print("Friends of Friends (User 1):", find_friends_of_friends_pandas(friendships, 1))
# Example: Find the shortest path between users 1 and 5
print("Shortest Path (1 -> 5):", shortest_path_pandas(friendships, 1, 5))
Result
- Friends of Friends (User 1): [4]
- Shortest Path (1 -> 5): [1,3,4,5]
Conclusion
In the pure Pandas solution, the shortest path implementation required manual BFS (breadth-first search) traversal, which would have been trivial in a graph-specific library like NetworkX or Neo4j. As the number of users and relationships grows, the BFS and the computation of friends of friends will become increasingly slow. This happens because Pandas does not internally optimize for graph traversal.
While it's possible to analyse graph-like data in Pandas, it becomes cumbersome and inefficient as the complexity of the problem (e.g., more relationships, larger networks) increases. A graph database or a specialized library like NetworkX directly supports graph structures and traversal algorithms, which makes it far easier to handle these types of problems. This is not a problem in working environments with a "horses for courses" culture, but if the culture is more "the answer is Pandas, what's the question?", issues are likely to arise.
But this misses the main point: if the technology you use influences how you frame analysis questions, the value of the whole analysis comes into question. Are we asking high-value questions, or are we simply asking the questions that we always ask?