





Technology review

Anti-vaxxers are weaponizing Yelp to punish bars that require vaccine proof

On the first hot weekend of the summer, Richard Knapp put up a sign outside Mother’s Ruin, a bar tucked in Manhattan’s SoHo neighborhood. It had two arrows: one pointing vaccinated people indoors, another pointing unvaccinated people outdoors.


The Instagram post showing the sign quickly went viral among European anti-vaxxers on Reddit. “We started receiving hate mail through the Google portal,” Knapp says, estimating he’d received about a “few dozen” emails: “I’ve been called a Nazi and a communist in the same sentence. People hope that our bar burns down. It’s a name and shame campaign.” It wasn’t just the emails. Soon, his bar started receiving multiple one-star reviews on Yelp and Google Reviews from accounts as far away as Europe.

Spamming review portals with negative ratings is not a new phenomenon. Throughout the pandemic, the tactic has also been deployed to attack bars and restaurants that enforced mask-wearing for safety. As pandemic restrictions have lifted, businesses like Mother’s Ruin have sought to ensure that safety by requiring proof of vaccination using state-sponsored apps like New York’s Excelsior Pass, vaccine passports, or simply flashing vaccine cards at the door — practices that have instigated a second surge of spam reviews.

These spam one-star reviews can be extremely damaging. By default, reviews are displayed in chronological order, from newest to oldest, so a concerted spam campaign pushes fake reviews to the top of the page, where the most recent reviews carry the most influence.

Some companies have gotten around this issue on their own sites by verifying that reviewers are actual customers: they reach out to reviewers via email and match them against the records they have on file. Industry-leading platforms like Yelp and Google, by contrast, let anyone rate and review a business.

In April, Marshall Smith instituted what may have been the US’s first policy requiring patrons to prove they were fully vaccinated against coronavirus at Bar Max in Denver. He didn’t think it would be a big deal to ask customers to show their vaccination cards at the door. “I didn’t consider the politics, and perhaps that was naive on my part,” he says.

Within days, his bar was slammed with one-star reviews on Google that took his average rating from 4.6 out of 5 stars to 4.
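The arithmetic behind a drop like this is easy to sketch. The snippet below assumes a hypothetical listing with 500 existing reviews (the article does not give Bar Max's review count) and asks how many one-star reviews it would take to pull a 4.6 average down to 4.0. The function name and parameters are illustrative only.

```python
import math


def one_star_reviews_needed(n_existing: int, current_avg: float, target_avg: float) -> int:
    """Smallest number of one-star reviews that pulls the mean down to target_avg."""
    if not 1.0 < target_avg <= current_avg:
        raise ValueError("target must be above 1 star and no higher than the current average")
    # Solve (current_avg * n + 1 * k) / (n + k) = target_avg for k.
    k = n_existing * (current_avg - target_avg) / (target_avg - 1.0)
    # Round first so floating-point noise does not push the ceiling up by one.
    return math.ceil(round(k, 9))


# Hypothetical review count; only the 4.6 and 4.0 averages come from the article.
needed = one_star_reviews_needed(500, 4.6, 4.0)
print(needed)
```

Under these assumptions, roughly one new one-star review for every five existing reviews is enough, whatever the size of the listing.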

“We were in the top 10 best reviewed craft cocktail bars in Denver [pre-pandemic],” he says. “It might not sound significant but if you drop out of the first page of results, it’s a big deal: you’re out of top 10 lists, listicle mentions. We don’t do a lot of advertising because people look at our reviews. We’ve built six years of good reviews that’s been chiseled away over a matter of months.”

These reviews don’t stay permanently in a business’s history. Yelp roots out spam, though the company “does not tell anybody [how its spam detection works],” says Bing Liu, a professor of computer science at the University of Illinois at Chicago. In 2013, Liu co-authored a paper that attempted to replicate Yelp’s methods, finding that the company most likely used keywords to root out possible spammers.

Reviews on Smith’s Yelp page were shut down after the sudden flurry of activity. Yelp labels such shutdowns “unusual activity alerts,” a stopgap measure that gives both the business and Yelp time to filter through a flood of reviews and pick out which are spam and which aren’t. Noorie Malik, Yelp’s vice president of user operations, said Yelp has a “team of moderators” that investigates pages that get an unusual amount of traffic. “After we’ve seen activity dramatically decrease or stop, we will then clean up the page so that only firsthand consumer experiences are reflected,” she said in a statement.
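Yelp does not say what triggers an alert, so the following is only a generic sketch of how a sudden flurry of reviews might be detected, not Yelp's method. It compares each day's review count with the median of the preceding week. The window, multiplier, and minimum count are all assumed values.

```python
from statistics import median


def unusual_activity_days(daily_counts, window=7, factor=5.0, min_reviews=5):
    """Return indexes of days whose review count far exceeds the trailing baseline."""
    flagged = []
    for day in range(window, len(daily_counts)):
        baseline = median(daily_counts[day - window:day])
        # Require both a large multiple of the baseline and an absolute minimum,
        # so a quiet page going from 0 to 2 reviews is not flagged.
        threshold = max(min_reviews, factor * max(baseline, 1))
        if daily_counts[day] >= threshold:
            flagged.append(day)
    return flagged


# Hypothetical daily review counts for one business page.
counts = [1, 0, 2, 1, 1, 0, 1, 1, 30, 25, 2]
print(unusual_activity_days(counts))
```

Using the median rather than the mean keeps one day of spam from inflating the baseline for the days that follow.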

It’s a practice that Yelp has had to deploy more often over the course of the pandemic: According to Yelp’s 2020 Trust & Safety Report, the company saw a 206% increase over 2019 levels in unusual activity alerts. “Since January 2021, we’ve placed more than 15 unusual activity alerts on business pages related to a business’s stance on covid-19 vaccinations,” said Malik.

The majority of those cases have been since May, like the gay bar C.C. Attles in Seattle, which got an alert from Yelp after it made patrons show proof of vaccination at the door. Earlier this month, Moe’s Cantina in Chicago’s River North neighborhood got spammed after it attempted to isolate vaccinated customers from unvaccinated ones.

Spamming a business with one-star reviews is not a new tactic. In fact, perhaps the best-known case is Colorado’s Masterpiece bakery, which won a 2018 Supreme Court battle for refusing to make a wedding cake for a same-sex couple, after which it got pummeled by one-star reviews. “People are still writing fake reviews. People will always write fake reviews,” Liu says.

But he adds that today’s online audience knows that platforms use algorithms to detect and flag problematic words, so bad actors can mask their grievances by complaining about poor service, as a more typical negative review would, to ensure the rating stays up and counts.

That seems to have been the case with Knapp’s bar. His Yelp reviews included comments like “There was hair in my food” and alleged cockroach sightings. “Really ridiculous, fantastic shit,” Knapp says. “If you looked at previous reviews, you would understand immediately that this doesn’t make sense.”
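A toy example shows why this masking works against a purely keyword-based filter. The word list and function below are hypothetical, and Yelp's actual system is undisclosed; this only illustrates the general weakness Liu describes.

```python
import re

# Hypothetical watch list; real platforms do not publish theirs.
FLAGGED_TERMS = {"vaccine", "vaccinated", "passport", "mandate", "nazi"}


def looks_off_topic(review_text: str) -> bool:
    """Flag a review if it contains any word from the watch list."""
    words = set(re.findall(r"[a-z']+", review_text.lower()))
    return bool(words & FLAGGED_TERMS)


honest_grievance = "They demand a vaccine passport at the door. One star."
disguised_attack = "There was hair in my food. One star."

print(looks_off_topic(honest_grievance))   # caught by the keyword list
print(looks_off_topic(disguised_attack))   # slips through
```

The disguised review contains nothing a word list can catch. Spotting it requires context, such as the reviewer's location or how the complaint compares with the page's earlier reviews.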

Liu also says there is a limit to how much Yelp can improve their spam detection, since natural language — or the way we speak, read, and write — “is very tough for computer systems to detect.” 

But Liu doesn’t think putting a human being in charge of figuring out which reviews are spam or not will solve the problem. “Human beings can’t do it,” he says. “Some people might get it right, some people might get it wrong. I have fake reviews on my webpage and even I can’t tell which are real or not.”

You might notice that I’ve only mentioned Yelp reviews thus far, even though Google reviews, which appear in the business description box on the right side of the Google search results page under “reviews,” are arguably more influential. That’s because Google’s review operations are, frankly, even more mysterious.

While businesses I spoke to said Yelp worked with them on identifying spam reviews, none of them had any luck with contacting Google’s team. “You would think Google would say, ‘Something is fucked up here,’” Knapp says. “These are IP addresses from overseas. It really undermines the review platform when things like this are allowed to happen.”

Google did not respond to multiple requests for comment; however, within a few hours of our call, Knapp said some problematic reviews on Google had cleared up for him. Smith said he had not yet gotten any response from Google about reviews, save for automated responses saying that multiple reviews he had flagged did not qualify to be taken down because “the reviews in question don’t fall under any of the violation categories, according to our policies.”

Spam reviews aren’t going anywhere and will continue to be a problem for years to come. And the fact remains that online communities — like the European anti-vaxxers that descended upon Mother’s Ruin’s reviews — can destroy faraway livelihoods with the click of a star rating.

Those ratings haunt business owners like Smith. “I still have folks putting one-star reviews on our Google listing,” he says. “Outliers pull down averages, that’s math. It’s a pretty effective means of attack for the folks who do this.”

Knapp feels equally frustrated and helpless. “We’re just trying to survive through the most traumatic experience that’s ever hit the hospitality industry,” he says. “The idea that we are under attack by this community and there is no real vehicle to combat it, that’s frustrating.”

Anti-vaxxers are weaponizing Yelp to punish bars that require vaccine proof 2021/06/12 12:00

These creepy fake humans herald a new age in AI

You can see the faint stubble coming in on his upper lip, the wrinkles on his forehead, the blemishes on his skin. He isn’t a real person, but he’s meant to mimic one—as are the hundreds of thousands of others made by Datagen, a company that sells fake, simulated humans.

These humans are not gaming avatars or animated characters for movies. They are synthetic data designed to feed the growing appetite of deep-learning algorithms. Firms like Datagen offer a compelling alternative to the expensive and time-consuming process of gathering real-world data. They will make it for you: how you want it, when you want—and relatively cheaply.

To generate its synthetic humans, Datagen first scans actual humans. It partners with vendors who pay people to step inside giant full-body scanners that capture every detail from their irises to their skin texture to the curvature of their fingers. The startup then takes the raw data and pumps it through a series of algorithms, which develop 3D representations of a person’s body, face, eyes, and hands.

The company, which is based in Israel, says it’s already working with four major US tech giants, though it won’t disclose which ones on the record. Its closest competitor, Synthesis AI, also offers on-demand digital humans. Other companies generate data to be used in finance, insurance, and health care. There are about as many synthetic-data companies as there are types of data.

Once viewed as less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it hard to collect. By contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces, say, of different ages, shapes, and ethnicities to build a face-detection system that works across populations.

But synthetic data has its limitations. If it fails to reflect reality, it could end up producing even worse AI than messy, biased real-world data—or it could simply inherit the same problems. “What I don’t want to do is give the thumbs up to this paradigm and say, ‘Oh, this will solve so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it will also ignore a lot of things.”

Realistic, not real

Deep learning has always been about data. But in the last few years, the AI community has learned that good data is more important than big data. Even small amounts of the right, cleanly labeled data can do more to improve an AI system’s performance than 10 times the amount of uncurated data, or even a more advanced algorithm.

That changes the way companies should approach developing their AI models, says Datagen’s CEO and cofounder, Ofir Chakon. Today, they start by acquiring as much data as possible and then tweak and tune their algorithms for better performance. Instead, they should be doing the opposite: use the same algorithm while improving on the composition of their data.

Datagen also generates fake furniture and indoor environments to put its fake humans in context.

But collecting real-world data to perform this kind of iterative experimentation is too costly and time intensive. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to identify which one maximizes a model’s performance.

To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many individuals to scan in each age bracket, BMI range, and ethnicity, as well as a set list of actions for them to perform, like walking around a room or drinking a soda. The vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthesized data is sometimes then checked again. Fake faces are plotted against real faces, for example, to see if they seem realistic.
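The jump from a limited number of scans to hundreds of thousands of samples is, at its simplest, combinatorial. The sketch below is not Datagen's pipeline, which the article does not describe in detail. The attribute names and values are hypothetical, and it only shows how quickly independent axes of variation multiply.

```python
from itertools import product

# Hypothetical axes of variation; the real ones are not public.
scanned_identities = ["id_001", "id_002", "id_003"]
actions = ["walking", "drinking_soda"]
lighting = ["daylight", "lamp", "dim"]
camera_angles = [0, 45, 90, 135]


def expand(*axes):
    """Yield every combination of one value from each axis."""
    return product(*axes)


variants = list(expand(scanned_identities, actions, lighting, camera_angles))
print(len(variants))  # 3 * 2 * 3 * 4 combinations
```

With a few hundred identities and a few more axes, such as background, clothing, or expression, the count quickly reaches the hundreds of thousands.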

Datagen is now generating facial expressions to monitor driver alertness in smart cars, body motions to track customers in cashier-free stores, and irises and hand motions to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data has already been used to develop computer-vision systems serving tens of millions of users.

It’s not just synthetic humans that are being mass-manufactured. Click-Ins is a startup that uses synthetic data to train AI that performs automated vehicle inspections. Using design software, it re-creates all car makes and models that its AI needs to recognize and then renders them with different colors, damages, and deformations under different lighting conditions, against different backgrounds. This lets the company update its AI when automakers put out new models, and helps it avoid data privacy violations in countries where license plates are considered private information and thus cannot be present in photos used to train AI.

Click-Ins renders cars of different makes and models against various backgrounds.

Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake client data that let companies share their customer database with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. But synthetic data can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data that the company doesn’t yet have, including a more diverse client population or scenarios like fraudulent activity.
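As a deliberately crude illustration of "same statistical properties," the sketch below fits a mean and standard deviation to one real column and samples fake values from a normal distribution. This is an assumption-laden stand-in, not Mostly.ai's method: commercial generators model many columns jointly, while this preserves only two summary statistics of a single column. The balances are made-up numbers.

```python
import random
from statistics import mean, stdev


def synthesize_column(real_values, n, seed=0):
    """Sample n fake values from a normal fit to the real column."""
    rng = random.Random(seed)
    mu, sigma = mean(real_values), stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical account balances standing in for a real customer table.
real_balances = [1200.0, 950.0, 3100.0, 780.0, 2250.0, 1640.0, 1980.0, 870.0]
fake_balances = synthesize_column(real_balances, n=10_000)

print(round(mean(real_balances)), round(mean(fake_balances)))
print(round(stdev(real_balances)), round(stdev(fake_balances)))
```

No fake value corresponds to a particular customer, yet aggregate questions such as the average balance give nearly the same answer on both tables.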

Proponents of synthetic data say that it can help evaluate AI as well. In a recent paper published at an AI conference, Suchi Saria, an associate professor of machine learning and health care at Johns Hopkins University, and her coauthors demonstrated how data-generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York City’s more youthful population but wanted to understand how its AI performs on an aging population with higher prevalence of diabetes. She’s now starting her own company, Bayesian Health, which will use this technique to help test medical AI systems.

The limits of faking it

But is synthetic data overhyped?

When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them fully regurgitate that data.

This might be fine for a firm like Datagen, whose synthetic data isn’t meant to conceal the identity of the individuals who consented to be scanned. But it would be bad news for companies that offer their solution as a way to protect sensitive financial or patient information.

Research suggests that the combination of two synthetic-data techniques in particular—differential privacy and generative adversarial networks—can produce the strongest privacy protections, says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance can be lost in the marketing lingo of synthetic-data vendors, which won’t always be forthcoming about what techniques they are using.
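For readers unfamiliar with differential privacy, the sketch below shows its simplest building block, the Laplace mechanism applied to a count. It covers only the differential-privacy half of the combination Herman describes and is not a generative adversarial network or any vendor's implementation. The count and the epsilon value are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with noise scaled to sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(1)
# Hypothetical query: how many patients in a data set have a given diagnosis?
print(private_count(120, epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and stronger privacy. Adding or removing one person changes the true count by at most one, which is why the sensitivity is set to 1 here.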

Meanwhile, little evidence suggests that synthetic data can effectively mitigate the bias of AI systems. For one thing, extrapolating new data from an existing data set that is skewed doesn’t necessarily produce data that’s more representative. Datagen’s raw data, for example, contains proportionally fewer ethnic minorities, which means it uses fewer real data points to generate fake humans from those groups. While the generation process isn’t entirely guesswork, those fake humans might still be more likely to diverge from reality. “If your darker-skin-tone faces aren’t particularly good approximations of faces, then you’re not actually solving the problem,” says O’Neil.

For another, perfectly balanced data sets don’t automatically translate into perfectly fair AI systems, says Christo Wilson, an associate professor of computer science at Northeastern University. If a credit card lender were trying to develop an AI algorithm for scoring potential borrowers, it would not eliminate all possible discrimination by simply representing white people as well as Black people in its data. Discrimination could still creep in through differences between white and Black applicants.

To complicate matters further, early research shows that in some cases, it may not even be possible to achieve both private and fair AI with synthetic data. In a recent paper published at an AI conference, researchers from the University of Toronto and the Vector Institute tried to do so with chest x-rays. They found they were unable to create an accurate medical AI system when they tried to make a diverse synthetic data set through the combination of differential privacy and generative adversarial networks.

None of this means that synthetic data shouldn’t be used. In fact, it may well become a necessity. As regulators confront the need to test AI systems for legal compliance, it could be the only approach that gives them the flexibility they need to generate on-demand, targeted testing data, O’Neil says. But that makes questions about its limitations even more important to study and answer now.

“Synthetic data is likely to get better over time,” she says, “but not by accident.”

These creepy fake humans herald a new age in AI 2021/06/11 11:00

Transforming health care at the edge

Transforming health care at the edge 2021/06/10 19:22

Clinical trials are better, faster, cheaper with big data

Clinical trials have never been more in the public eye than in the past year, as the world watched the development of vaccines against covid-19, the disease at the center of the 2020 coronavirus pandemic. Discussions of study phases, efficacy, and side effects dominated the news. The most distinctive feature of the vaccine trials was their speed. Because the vaccines are meant for universal distribution, the study population is, basically, everyone. That unique feature means that recruiting enough people for the trials has not been the obstacle that it commonly is.

“One of the most difficult parts of my job is enrolling patients into studies,” says Nicholas Borys, chief medical officer for Lawrenceville, N.J., biotechnology company Celsion, which develops next-generation chemotherapy and immunotherapy agents for liver and ovarian cancers and certain types of brain tumors. Borys estimates that fewer than 10% of cancer patients are enrolled in clinical trials. “If we could get that up to 20% or 30%, we probably could have had several cancers conquered by now.”

Clinical trials test new drugs, devices, and procedures to determine whether they’re safe and effective before they’re approved for general use. But the path from study design to approval is long, winding, and expensive. Today, researchers are using artificial intelligence and advanced data analytics to speed up the process, reduce costs, and get effective treatments more swiftly to those who need them. And they’re tapping into an underused but rapidly growing resource: data on patients from past trials.

Building external controls

Clinical trials usually involve at least two groups, or “arms”: a test or experimental arm that receives the treatment under investigation, and a control arm that doesn’t. A control arm may receive no treatment at all, a placebo or the current standard of care for the disease being treated, depending on what type of treatment is being studied and what it’s being compared with under the study protocol. It’s easy to see the recruitment problem for investigators studying therapies for cancer and other deadly diseases: patients with a life-threatening condition need help now. While they might be willing to take a risk on a new treatment, “the last thing they want is to be randomized to a control arm,” Borys says. Combine that reluctance with the need to recruit patients who have relatively rare diseases—for example, a form of breast cancer characterized by a specific genetic marker—and the time to recruit enough people can stretch out for months, or even years. Nine out of 10 clinical trials worldwide—not just for cancer but for all types of conditions—can’t recruit enough people within their target timeframes. Some trials fail altogether for lack of enough participants.

What if researchers didn’t need to recruit a control group at all and could offer the experimental treatment to everyone who agreed to be in the study? Celsion is exploring such an approach with New York-headquartered Medidata, which provides management software and electronic data capture for more than half of the world’s clinical trials, serving most major pharmaceutical and medical device companies, as well as academic medical centers. Acquired by French software company Dassault Systèmes in 2019, Medidata has compiled an enormous “big data” resource: detailed information from more than 23,000 trials and nearly 7 million patients going back about 10 years.

The idea is to reuse data from patients in past trials to create “external control arms.” These groups serve the same function as traditional control arms, but they can be used in settings where a control group is difficult to recruit: for extremely rare diseases, for example, or conditions such as cancer, which are imminently life-threatening. They can also be used effectively for “single-arm” trials, which make a control group impractical: for example, to measure the effectiveness of an implanted device or a surgical procedure. Perhaps their most valuable immediate use is for doing rapid preliminary trials, to evaluate whether a treatment is worth pursuing to the point of a full clinical trial.

Medidata uses artificial intelligence to plumb its database and find patients who served as controls in past trials of treatments for a certain condition to create its proprietary version of external control arms. “We can carefully select these historical patients and match the current-day experimental arm with the historical trial data,” says Arnaub Chatterjee, senior vice president for products, Acorn AI at Medidata. (Acorn AI is Medidata’s data and analytics division.) The trials and the patients are matched for the objectives of the study—the so-called endpoints, such as reduced mortality or how long patients remain cancer-free—and for other aspects of the study designs, such as the type of data collected at the beginning of the study and along the way.

When creating an external control arm, “We do everything we can to mimic an ideal randomized controlled trial,” says Ruthie Davi, vice president of data science, Acorn AI at Medidata. The first step is to search the database for possible control arm candidates using the key eligibility criteria from the investigational trial: for example, the type of cancer, the key features of the disease and how advanced it is, and whether it’s the patient’s first time being treated. It’s essentially the same process used to select control patients in a standard clinical trial—except data recorded at the beginning of the past trial, rather than the current one, is used to determine eligibility, Davi says. “We are finding historical patients who would qualify for the trial if they existed today.”
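The first step Davi describes, screening historical patients against the new trial's eligibility criteria, can be sketched as a simple filter. Everything below is hypothetical: the record fields, the criteria, and the patients are invented for illustration, and Medidata's proprietary process involves further matching that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalPatient:
    patient_id: str
    cancer_type: str
    stage: int                  # as recorded at the start of the past trial
    first_line_treatment: bool  # True if this was the patient's first treatment
    was_in_control_arm: bool


@dataclass(frozen=True)
class Eligibility:
    cancer_type: str
    min_stage: int
    max_stage: int
    first_line_only: bool


def external_control_candidates(patients, criteria):
    """Keep past control-arm patients who would qualify for the new trial."""
    return [
        p for p in patients
        if p.was_in_control_arm
        and p.cancer_type == criteria.cancer_type
        and criteria.min_stage <= p.stage <= criteria.max_stage
        and (p.first_line_treatment or not criteria.first_line_only)
    ]


history = [
    HistoricalPatient("H-001", "ovarian", 3, True, False),  # experimental arm
    HistoricalPatient("H-002", "ovarian", 3, True, True),   # qualifies
    HistoricalPatient("H-003", "liver", 2, True, True),     # different cancer
    HistoricalPatient("H-004", "ovarian", 4, False, True),  # previously treated
    HistoricalPatient("H-005", "ovarian", 1, True, True),   # stage too early
]
criteria = Eligibility("ovarian", min_stage=2, max_stage=4, first_line_only=True)
matches = external_control_candidates(history, criteria)
print([p.patient_id for p in matches])
```

Eligibility is judged from data recorded at the start of the past trial, which is why the stage field is defined that way.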


This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

Clinical trials are better, faster, cheaper with big data 2021/06/10 16:00

What makes the Delta covid-19 variant more infectious?

Covid cases are on the rise in England, and a fast-spreading variant may be to blame. B.1.617.2, which now goes by the name Delta, first emerged in India, but has since spread to 62 countries, according to the World Health Organization.

Delta is still rare in the US. At a press conference on Tuesday, the White House’s chief medical advisor, Anthony Fauci, said that it accounts for just 6% of cases. But in the UK it has quickly overtaken B.1.1.7—also known as Alpha—to become the dominant strain, which could derail the country’s plans to ease restrictions on June 21.

The total number of cases is still small, but public health officials are watching the variant closely. On Monday, UK Secretary of State for Health and Social Care Matt Hancock reported that Delta appears to be about 40% more transmissible than Alpha, but scientists are still trying to pin down the exact number—estimates range from 30% to 100%. They are also working to understand what makes it more infectious. They don’t yet have many answers, but they do have hypotheses.
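To see why the spread of estimates matters, consider a simplified model in which "X% more transmissible" means each Delta case produces (1 + X) times as many new cases per generation as an Alpha case. The sketch below tracks Delta's share of cases under that assumption. The model, the function name, and the 10-generation horizon are illustrative assumptions, not a forecast. Only the 6% starting share and the 30% to 100% range come from the article.

```python
def delta_share(initial_share, advantage, generations):
    """Delta's share of cases after some generations, given a constant advantage."""
    if not 0.0 < initial_share < 1.0:
        raise ValueError("initial_share must be strictly between 0 and 1")
    # Work with odds (Delta cases per Alpha case); the advantage multiplies
    # the odds once per generation of infection.
    odds = initial_share / (1.0 - initial_share)
    odds *= (1.0 + advantage) ** generations
    return odds / (1.0 + odds)


for advantage in (0.3, 0.4, 1.0):
    share = delta_share(0.06, advantage, generations=10)
    print(f"{advantage:.0%} advantage: {share:.0%} of cases after 10 generations")
```

Even at the low end of the range, a variant that starts as a small minority gains ground every generation, and the higher the advantage, the faster it takes over.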

All viruses acquire mutations in their genetic code as they replicate, and SARS-CoV-2 is no exception. Many of these mutations have no impact at all. But some change the virus’s structure or function. Identifying changes in the genetic sequence of a virus is simple. Figuring out how those changes impact the way a virus spreads is trickier. The spike protein, which helps the virus gain entry to cells, is a good place to start. 

How Delta enters cells

To infect cells, SARS-CoV-2 must enter the body and bind to receptors on the surface of cells. The virus is studded with mushroom-shaped spike proteins that latch onto a receptor called ACE2 on human cells. This receptor is found on many cell types, including those that line the lungs. Think of it like a key fitting into a lock.

Mutations that help the virus bind more tightly can make transmission from one person to another easier. Imagine you breathe in a droplet that contains SARS-CoV-2. If that droplet contains viruses with better binding capabilities, they “will be more efficient at finding and infecting one of your cells,” says Nathaniel Landau, a microbiologist at NYU Grossman School of Medicine.

Scientists don’t yet know how many particles of SARS-CoV-2 you have to inhale to become infected, but the threshold would likely be lower for a virus that is better at grabbing onto ACE2. 

Landau and his colleagues study binding in the lab by creating pseudoviruses. These lab-engineered viruses can’t replicate, but researchers can tweak them to express the spike protein on their surface. That allows them to easily test binding without needing to use a high-security laboratory. The researchers mix these pseudoviruses with plastic beads covered with ACE2 and then work out how much virus sticks to the beads. The greater the quantity of virus, the better the virus is at binding. In a preprint posted in May, Landau and colleagues show that some of the mutations present in Delta do enhance binding.

How it infects once inside

But better binding does more than lower the threshold for infection. Because the virus is better at grabbing ACE2, it also will infect more cells inside the body. “The infected person will have more virus in them, because the virus is replicating more efficiently,” Landau says.

After the virus binds to ACE2, the next step is to fuse with the cell. Fusion begins when enzymes from the host cell cut the spike at two different sites, a process known as cleavage. This kick-starts the fusion machinery. If binding is like the key fitting in the lock, cleavage is like the key turning the deadbolt. “Without cuts at both sites, the virus can’t get into cells,” says Vineet Menachery, a virologist at The University of Texas Medical Branch.

One of the mutations present in Delta actually occurs in one of these cleavage sites, and a new study that has not yet been peer reviewed shows that this mutation does enhance cleavage. And Menachery, who was not involved in the study, says he has replicated those results in his lab. “So it’s a little bit easier for the virus to be activated,” he says.

Whether that improves transmissibility isn’t yet known, but it could. When scientists delete these cleavage sites, the virus becomes less transmissible and less pathogenic, Menachery says. So it stands to reason that changes that facilitate cleavage would increase transmissibility. 

It’s also possible that Delta’s ability to evade the body’s immune response helps fuel transmission. Immune evasion means more cells become infected and produce more virus, which then potentially makes it easier for a person carrying that virus to infect someone else.

But vaccines still work

The good news is that vaccination provides strong protection against Delta. A new study from Public Health England shows that the Pfizer-BioNTech vaccine was 88% effective in preventing symptomatic disease due to Delta in fully vaccinated people. The AstraZeneca vaccine provided less protection: two shots were 60% effective against the variant. The effectiveness of one dose of either vaccine, however, was much lower, at just 33%.
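A figure like 88% is a relative reduction in risk, not the share of vaccinated people who are protected. The sketch below uses the textbook definition, one minus the ratio of attack rates in vaccinated and unvaccinated groups. The counts are hypothetical and chosen to reproduce 88%. They are not taken from the Public Health England study, whose design the article does not describe.

```python
def vaccine_effectiveness(cases_vax, n_vax, cases_unvax, n_unvax):
    """Textbook definition: 1 minus the ratio of attack rates."""
    if n_vax <= 0 or n_unvax <= 0 or cases_unvax <= 0:
        raise ValueError("group sizes and unvaccinated case count must be positive")
    risk_vax = cases_vax / n_vax
    risk_unvax = cases_unvax / n_unvax
    return 1.0 - risk_vax / risk_unvax


# Hypothetical groups of 1,000 people each.
ve = vaccine_effectiveness(cases_vax=12, n_vax=1000, cases_unvax=100, n_unvax=1000)
print(f"{ve:.0%}")
```

In this made-up example, vaccinated people fall ill at 12% of the rate seen among the unvaccinated, which is what 88% effectiveness means.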

In the US and UK, however, only about 42% of the population is fully vaccinated. In India, where the virus surged, fueled in part by the rapid spread of Delta, just 3.3% of the population has achieved full vaccination.

At the press briefing, Fauci urged those who have not been vaccinated to get their first shot and reminded those who are partially vaccinated not to skip their second dose. The Biden Administration hopes to have 70% of the population at least partially vaccinated by the Fourth of July. In the UK, Delta quickly replaced Alpha to become the dominant strain, and cases are now on the rise. “We cannot let that happen in the United States,” Fauci said. 

What makes the Delta covid-19 variant more infectious? 2021/06/10 15:00
