“There is more to be discovered, if you develop the right tools”
Scywalker: Enabling long-read single-cell transcriptome sequencing and analysis
January 8, 2025
Long-read single-cell sequencing combines the best of two worlds: the detailed, full-length transcript information from long reads and the granular, cell-specific resolution of single-cell approaches. Making sense of the data, though, is easier said than done. Existing nanopore single-cell data analysis tools simply cannot handle the massive data sets now emerging from long-read sequencing.
But where others might see problems, VIB’s experts see solutions. In a truly cross-disciplinary, multi-center effort, they developed scywalker: a scalable end-to-end data analysis workflow for long-read single-cell transcriptome sequencing. We spoke to Mojca Stražišar, Niels Vandamme, Peter De Rijk, and Frank Van Breusegem about the technical hurdles that had to be overcome and how bringing together different sets of expertise allowed them to do so.
Mojca Stražišar and her team at the Neuromics Support Facility in Antwerp have deep expertise in genomics, transcriptomics, and bioinformatics analysis of brain samples, supporting the neurogenetics research at the VIB-UAntwerp Center for Molecular Neurology. Yet this story didn’t start with brain samples, but with something else entirely: poplars.
“We have a long-standing collaboration with the VIB-UGent Center for Plant Systems Biology and the senior bioinformaticians and technicians working there,” explains Stražišar. “A few years ago, we started a collaboration on a large project, where we sequenced the DNA of more than 700 poplar trees.”
But next to sequencing genomes, several research groups in VIB were interested in the added value of single-cell or single-nuclei sequencing using long-read sequencing platforms. “We realized we were leaving a significant chunk of data unexamined, simply because of the limitations of existing technologies.”
Stražišar recounts how that got the ball rolling: “We started talking about how revolutionary it would be if we could go beyond what 10x Genomics could do, and look at single-cell data not only with short reads, but with long reads as well. For gene expression quantification and differential expression, short reads are fine. But if you want to find new isoforms, for example, long read sequencing would be crucially important.”
The problem
Long-read sequencing technologies, such as those developed by Oxford Nanopore Technologies and Pacific Biosciences, have opened new possibilities for transcriptomics by enabling the sequencing of full-length transcripts and providing detailed insights into isoform variations. While bulk long-read transcriptome sequencing has consistently uncovered novel isoforms, it falls short in capturing cell-type-specific differences in these isoforms. This limitation highlights the need to optimize long-read sequencing methods for single-cell applications.
Current analysis tools that tackle the lower accuracy and specific read structure of long-read sequencing are not yet mature and struggle to scale to the larger datasets now common in single-cell studies, which often exceed 10,000 cells per sample.
A seed was planted, and a few brainstorm sessions later, this untapped potential set the team on a new mission: could they push beyond the current boundaries of single-cell sequencing and make use of all the information found in long-read data?
Stražišar: “My thinking has always been: if it is DNA or RNA, we can sequence it. That was the starting point and the driver of the entire project.”
Niels Vandamme and other experts across VIB’s various cores and centers were brought in to explore how to create a robust system that is capable of handling the complexities of long-read sequencing and single cell transcriptomics. Stražišar recalls: “We often exchanged ideas with Niels and his colleagues about what was and wasn’t doable. These discussions really helped us push the boundaries of what is possible.”
A rich pool of samples and data
The project began in the wet lab, testing and optimizing library preparation techniques for long-read sequencing. Next, the aim was to develop a tool that could extract meaningful insights from previously discarded data, while maintaining scalability for large single cell studies.
“We wanted to make sure that, if our idea worked, it would also be useful,” says Stražišar. She approached potential users, including researchers already doing single-cell sequencing. Thomas Eekhout at the VIB Single Cell Core and Frank Van Breusegem at the VIB-UGent Center for Plant Systems Biology were brought into the loop.
“After discussing with the relevant parties, we started to test and play a bit. Through our collaborations, we had quite some samples ready to go.”
With a heavily experienced team, this wet-lab stage was relatively straightforward, she says: “The adaptations required were within reach, though very time-intensive. What we needed to do was take our existing protocols one step further.”
Massive parallelization
The real challenge came in the dry lab, however, where the team needed to develop entirely new pipelines for de-multiplexing and analyzing the data. This is where the expertise of Wouter De Coster and Peter De Rijk proved crucial.
“We realized that the existing software couldn’t handle the scale of new long-read datasets for single-cell experiments,” explain De Rijk, senior bioinformatician at the Neuromics Support Facility, and De Coster, senior postdoctoral researcher in the Rademakers lab at the Center for Molecular Neurology.
De Rijk had previously developed pipelines for long-read sequencing that could serve as a starting point. He also had a few ideas on how to approach the issue at hand. “Parallelization is a key element here. Luckily, that is something we have some experience with.”
The first challenge was to parallelize finding the barcodes. “Potential barcode sequences are extracted from the data, and are sorted by abundance,” explains De Rijk. “The problem is to find which of these sequences are real barcodes, and which are actually a different barcode with sequencing errors. The novel method we implemented considers the most abundant barcode real, searches for sequences with slight differences to this barcode and assigns them to this barcode. It then continues with the second most abundant barcode. This top-down approach allows going deep into low-abundant barcodes, but it is difficult to parallelize.” Difficult, but not impossible. De Rijk succeeded in splitting the most compute-heavy parts across different nodes in a cluster and integrating the results again later.
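The top-down idea De Rijk describes can be illustrated with a small Python sketch. It is not scywalker's actual implementation: the function names, the use of Hamming distance, and the one-mismatch threshold (`max_dist=1`) are assumptions made purely for illustration.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcodes(observed, max_dist=1):
    """Greedy top-down grouping (illustrative, not scywalker's code).

    The most abundant unassigned sequence is treated as a real barcode;
    unassigned sequences within max_dist mismatches are assigned to it.
    The loop then moves on to the next most abundant unassigned sequence.
    """
    counts = Counter(observed)
    assignment = {}  # raw sequence -> barcode it is assigned to
    for seq, _ in counts.most_common():
        if seq in assignment:
            continue  # already absorbed by a more abundant barcode
        assignment[seq] = seq
        for other in counts:
            if (other not in assignment
                    and len(other) == len(seq)
                    and hamming(seq, other) <= max_dist):
                assignment[other] = seq
    return assignment

# Toy data: two real barcodes, each with one error-containing variant.
observed = ["AAAA"] * 5 + ["AAAT"] + ["CCCC"] * 3 + ["CCCG"]
assignment = assign_barcodes(observed)
```

Because each step depends on which sequences earlier, more abundant barcodes have already claimed, the loop is inherently sequential, which hints at why this approach is hard to parallelize.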
Secondly, other parts of the pipeline—alignment, transcript analysis, counting, etc.—involve a range of different software applications which use a lot more memory when datasets become larger. “Most systems load everything in memory but there is a point at which this approach fails. Existing tools sometimes divide the work by chromosome, but we took it one step further and divided everything up into smaller genomic regions instead. With these smaller subsets of data, we could distribute the computing and significantly reduce the local resource demand.”
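As a rough illustration of dividing the work into regions smaller than a chromosome, the Python sketch below cuts each chromosome into fixed-size windows and processes each window independently. The fixed window size, the function names, and the toy read positions are assumptions; how scywalker actually chooses its region boundaries is not described here.

```python
def split_into_regions(chrom_sizes, region_size):
    """Cut every chromosome into half-open windows of at most region_size."""
    regions = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, region_size):
            regions.append((chrom, start, min(start + region_size, size)))
    return regions

def count_reads_in_region(reads, region):
    """Stand-in for a per-region analysis step: count reads starting in it."""
    chrom, start, end = region
    return sum(1 for c, pos in reads if c == chrom and start <= pos < end)

# Hypothetical chromosome sizes and read start positions.
chrom_sizes = {"chr1": 250, "chr2": 100}
reads = [("chr1", 5), ("chr1", 150), ("chr1", 240), ("chr2", 99)]

regions = split_into_regions(chrom_sizes, region_size=100)
# Each region could be handed to a separate job; here they run in sequence.
per_region = {region: count_reads_in_region(reads, region) for region in regions}
total_reads = sum(per_region.values())
```

Since every region only needs the reads that fall inside it, each job holds a small slice of the data in memory, and the per-region results can be merged afterwards.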
Thirdly, the team treated specific organellar DNA differently. “We know that with bulk sequencing, there are typically a few pitfalls. Mitochondrial DNA, for example, is covered much more heavily and has a lot of reads. Its structure is very different, because of its different origins, and this is something a lot of software, even bulk software, struggles with.” De Rijk and De Coster avoided these issues by analyzing mitochondria and plastids in a different way. “Since you're not trying to find new transcripts in those regions, you can do it in a much simpler way that uses almost no memory. You can even process it using a ‘streaming’ approach, where you analyze the data read by read. The downside is that you can’t identify new reads this way, but that’s rarely expected in this case anyway, so it’s not a significant drawback.”
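A read-by-read “streaming” pass of the kind mentioned here might look like the following Python sketch. The input format (one `barcode position` pair per line), the gene coordinates, and the function names are hypothetical and only serve to show the principle: reads are matched against known annotation and counted, without ever holding all reads in memory.

```python
from collections import Counter

# Hypothetical annotation of known genes: (start, end, name), half-open.
KNOWN_GENES = [(0, 100, "geneA"), (100, 250, "geneB")]

def gene_at(pos):
    """Return the known gene covering pos, or None if there is none."""
    for start, end, gene in KNOWN_GENES:
        if start <= pos < end:
            return gene
    return None

def stream_counts(lines):
    """Consume reads one at a time; only the running counts stay in memory."""
    counts = Counter()
    for line in lines:
        barcode, pos = line.split()
        gene = gene_at(int(pos))
        if gene is not None:  # reads outside known genes are skipped
            counts[(barcode, gene)] += 1
    return counts

# Any iterable of lines works, including an open file handle.
counts = stream_counts(["AAAA 10", "AAAA 50", "CCCC 120", "CCCC 999"])
```

Memory use grows with the number of distinct barcode and gene combinations rather than with the number of reads, which is what makes heavily covered regions manageable. The trade-off is the one noted in the text: nothing outside the known annotation can be discovered this way.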
Finally, the initial identification of transcripts is performed in bulk, and the conversion to single-cell data happens only afterward. “This includes the transcript counts,” explains De Rijk. “Since we have assignments per read, we already know which barcode is associated with each read. Afterwards, we use that information to integrate the data and perform the counting, ultimately obtaining single-cell-level counts. This is the part that makes the process scalable.”
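The step from bulk-level read assignments to single-cell counts can be pictured as a simple join on read identifiers. The Python sketch below assumes two lookup tables, one from read to transcript (from the bulk analysis) and one from read to barcode (from demultiplexing); the names and data are illustrative, not taken from scywalker.

```python
from collections import Counter, defaultdict

def single_cell_counts(read_to_transcript, read_to_barcode):
    """Combine per-read transcript and barcode assignments into per-cell counts."""
    counts = defaultdict(Counter)  # barcode -> transcript -> number of reads
    for read_id, transcript in read_to_transcript.items():
        barcode = read_to_barcode.get(read_id)
        if barcode is not None:  # reads without a valid barcode are dropped
            counts[barcode][transcript] += 1
    return counts

# Hypothetical assignments for four reads; r4 has no barcode.
read_to_transcript = {"r1": "tx1", "r2": "tx1", "r3": "tx2", "r4": "tx2"}
read_to_barcode = {"r1": "AAAA", "r2": "AAAA", "r3": "CCCC"}

cell_counts = single_cell_counts(read_to_transcript, read_to_barcode)
```

The expensive transcript discovery runs once on all reads together, and the per-cell view is derived afterwards with a cheap lookup per read.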
All of these innovations made the long-read single-cell transcriptomics analysis feasible, but the team wanted to take things a few steps further and added additional features to make the pipeline more useful.
“We’re particularly interested in cell-type-specific transcripts and finding differences between different cell types. To achieve this, we added a module to automatically determine cell types based on single-cell data, calculate counts per cell type in pseudo-bulk, and then analyze these counts. This allows us to identify isoform switches, for example, or isoforms that are more prevalent in specific cell types.” In combination with additional information—such as the potential disease relevance of certain isoforms—the team is able to significantly enhance the output of the analysis through this feature.
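To make the pseudo-bulk idea concrete, the Python sketch below sums per-cell transcript counts within each cell type and compares the relative usage of two isoforms. The cell-type labels are supplied by hand here, whereas the pipeline determines them automatically; all names and numbers are made up for illustration.

```python
from collections import Counter, defaultdict

def pseudobulk(cell_counts, cell_types):
    """Sum per-cell transcript counts within each cell type."""
    totals = defaultdict(Counter)
    for barcode, counts in cell_counts.items():
        cell_type = cell_types.get(barcode)
        if cell_type is not None:
            totals[cell_type].update(counts)
    return totals

def isoform_fractions(counts, isoforms):
    """Share of each isoform among the reads of the listed isoforms."""
    total = sum(counts[i] for i in isoforms)
    return {i: counts[i] / total for i in isoforms} if total else {}

# Hypothetical counts for two isoforms of one gene in three cells.
cell_counts = {
    "c1": {"tx1": 3, "tx2": 1},
    "c2": {"tx1": 1, "tx2": 3},
    "c3": {"tx1": 1, "tx2": 7},
}
cell_types = {"c1": "neuron", "c2": "neuron", "c3": "glia"}

totals = pseudobulk(cell_counts, cell_types)
neuron = isoform_fractions(totals["neuron"], ["tx1", "tx2"])
glia = isoform_fractions(totals["glia"], ["tx1", "tx2"])
```

In this toy example the two isoforms are used equally in neurons, while `tx2` dominates in glia. A shift of that kind between cell types is what an isoform-switch analysis would look for, with appropriate statistics on top.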
The resulting software was named ‘scywalker’, an idea of De Coster, who enjoys weaving Star Wars references into the tools he develops. With single cell’s sc prefix, he landed on this alternative spelling of the famous protagonist’s last name.
From brains to plants: splicing under stress
Stražišar, De Rijk, and De Coster are based in Antwerp, and part of the VIB-UAntwerp Center for Molecular Neurology. A lot of their work builds on the needs of the neuroscientific teams around them. With scywalker, they wanted to make sure the tool would be broadly applicable. It was important to put its performance to the test not only on human brain tissue, but also on samples from a very different organism.
“We were looking for potential users, people who were already doing single-cell sequencing,” says Stražišar. “We asked ourselves: what types of tissue can we use that is as different from the brain as it gets?”
They found the answer once again in plants. Stephane Rombouts, who was previously involved in the poplar project, pointed Stražišar to another colleague at the VIB-UGent Center for Plant Systems Biology: Frank Van Breusegem. His team could use the pipeline to study light-stress responses in Arabidopsis.
Van Breusegem recounts: “We had single-cell sequencing samples for Arabidopsis, but these were of course short reads. Mojca approached us to explore if we could expand this approach to long reads and apply the platform they were developing in plant data.”
It was a no-brainer for Van Breusegem. Not only did his team have all the samples available, they also had a specific research interest in looking into the single-cell long-read data. “We know that stress caused by light can trigger alternative splicing, but hadn't been able to capture this with single cell sequencing so far. With the long-reads, we believed this would be feasible.”
Van Breusegem’s postdoc Patrick Willems (now in Francis Impens’ team at the VIB-UGent Center for Medical Biotechnology) worked closely with De Rijk to implement the workflow in plants. “The first step in the collaboration was to do a short versus long-read benchmark. That looked quite good. Next, we were able to successfully capture the alternative splicing events triggered by light stress at the single-cell level.” This provided the team with a much clearer picture of how individual plant cells respond to stress.
“Scywalker allowed us to unravel processes we could only speculate about before. It’s an exciting step forward for understanding plant stress biology at single-cell resolution.”
Culture, not coincidence
“We’re the first ones offering long read single-cell transcriptome sequencing and analysis as a service,” Stražišar notes. Yet, the road to innovation was far from straightforward. It required years of testing, collaboration, and creative problem-solving. As Stražišar says, “Developing the pipeline took far longer than we anticipated. But that’s the nature of research: progress is rarely linear.”
Extending the single-cell sequencing workflow from short reads to long reads was only possible thanks to the close-knit collaboration between teams across VIB. From the single-cell core to bioinformatics experts and neuroscience and plant systems biology groups, everyone contributed their expertise. Stražišar credits this synergy:
“We started from our own needs, but we relied on others to move beyond this narrow scope. VIB’s unique ecosystem makes it easy to plug into different areas of research. Everyone brought something unique to the table, and this has been a vital ingredient to this success.”
Stražišar realizes that projects like these wouldn't stand a chance if the researchers she worked with weren’t interested in technology. “It is really tech-driven research: realizing there is more to be discovered, if you develop the right tools.”
Vandamme concurs: “Beyond VIB’s Core Facilities, there is a lot of specific know-how across centers, the expertise in long-read sequencing in the teams of Mojca Stražišar, Rosa Rademakers and Kristel Sleegers being a prime example. What’s important is that there is the willingness to collaborate and build further on this complementary expertise.”
VIB Technologies has now set up a Tech Satellite in the VIB center in Antwerp, further enabling collaboration. The satellite facility is run by Robin Boiy, says Vandamme. “While we trained him initially, he has in the meantime also trained us and shared a lot of the knowledge he gathered from working with the experts in Antwerp.”
Van Breusegem agrees that the open collaborative culture at VIB is what made the difference here: “Our team’s involvement was coincidental in a way, but coincidences like these of course arise thanks to the right circumstances.”
“Everyone was on the same page,” he added, “and the team in Antwerp really brought the necessary drive to see the project through.”
Spreading the word
Since the paper on scywalker came out, interest in the service has been growing: collaborations are in the works with researchers in the Netherlands and other non-VIB research groups, as well as internally at our institute.
Stražišar: “It is not an application that is always usable or even useful. If you are only interested in relative quantification, there is no need to go into long-read transcriptomics. But if you want to move beyond the scope of which of the genes are significantly over- and underexpressed, and definitely if you want to dive into isoform differences, single cell long read, coupled with scywalker, can be the solution.”
The ability to detect isoform-level changes could prove very helpful for studying complex diseases such as cancer and neurological disorders. Stražišar adds: “With scywalker, we open up new possibilities for understanding the role of transcriptome variants on the single cell level in disease, something we couldn’t easily study before.”
Vandamme is enthusiastic about being able to cover so many specialties across the different VIB cores and facilities and says the Single Cell core has already referred several clients to the Antwerp Neuromics Support Facility for follow-up projects.
“Just this week,” he says, “we finalized a project for a UGent research lab. As always, we helped this client to extract meaningful data from their samples, handling the entire logistics from sample preparation to data generation. The results primarily consist of short reads, but we are also keen to incorporate long reads using Oxford Nanopore Technologies. To streamline this process, we have adjusted our workflow to ensure an automated handoff to Mojca and her team to generate long-read data that can be integrated and projected onto the single-cell data for deeper insights.”
Looking forward, Stražišar’s team plans to enhance the pipeline’s capabilities, particularly in detecting RNA modifications and integrating multi-omics data. They also aim to broaden its adoption through setting up new collaborations and organizing workshops. “We recently hosted a Nanopore Day in Antwerp, which helped generate interest,” Stražišar says. “But it can still be difficult to penetrate certain research communities.”
Another challenge that will undoubtedly be solved, one step at a time. The methods paper on scywalker is just the first of a series. Multiple papers, including the work by Van Breusegem and his team, are currently underway showcasing results generated thanks to scywalker. These will help spread the word on the versatile approach with applications across species and research disciplines.
As Stražišar aptly puts it: “It’s cool because it works.”
Want to know more about scywalker? Read the publication in Bioinformatics.
Interested in long-read sequencing services? Learn more at the Neuromics Support Facility’s website.
Interested in more sequencing highlights and developments? Toon Swings of the VIB Tech Watch Core, Wouter De Coster, and Mojca Stražišar are part of the organizing committee of the upcoming VIB Conference on Revolutionizing Next-Generation Sequencing (6th edition).
Do you want to upgrade your single cell RNA seq data analysis skills? Attend the ‘Analysis of single cell RNAseq data’ training (March 4-11, ‘25; Leuven or 13-16 May, Ghent).