The Wayback Machine - https://web.archive.org/web/20201003114104/https://github.com/epigen/open_pipelines/issues/23
Update to PEP stack 2.0 #23

Open
afrendeiro opened this issue Jul 21, 2020 · 11 comments · May be fixed by #25

Comments

Member

@afrendeiro afrendeiro commented Jul 21, 2020

Hi,

I'm no longer using this code, but I'm still collaborating with @sreichl on projects that use this.

I've heard there's some trouble upgrading this to work with the PEP stack>=2.0.

@fwzhao I believe you did some work on this on the project side to upgrade project configs, etc.
Do you want to share your progress, and any issues you might have so we can start upgrading the pipelines?

Anyone else interested, please pitch in.


@fwzhao fwzhao commented Jul 22, 2020

Sorry for the delay, I wanted to get myself up to speed again on this topic -

The last thing I managed to do when upgrading to PEP2.0 was to generate exactly the same shell submission script for each sample. However, running the submission script produced an error in the ATAC-seq pipeline that caused the run to fail (logs attached here).

Sample_submisson_script.log
Sample_log_with_error.log

We were just discussing this issue yesterday in person with @berguner and @sreichl, but I'm happy to continue it here or elsewhere. In the meantime, I'll email you examples of the pipeline configs, project configs, etc. to help with the transition, as attaching *.yaml, *.csv, and *.tsv files doesn't seem to be supported here.

Member Author

@afrendeiro afrendeiro commented Aug 23, 2020

Alright, I just had a go at it. Please see the dev branch and commits above.
It looks like a lot of changes but it's not. Sorry, I have an automatic formatter on and only realized later that it was changing the style.
For now I only updated the ATAC-seq pipeline and the Amplicon one (as a first attempt on a very simple pipeline).

Nonetheless, the changes really are in the configs, here's a little summary:

  • On the pipeline side (this is now done):

  • On the project side (this would need to be done for the specific projects):

  • On the software side (this would be installed in your pcs/cluster/environments if you haven't yet):

    • Versions of the pep stack that are using the PEP2.0 specification. I took the latest version of everything. One can install them with pip install "peppy>=0.30.2,<1.0.0" "looper>=1.2.0,<2.0.0" "piper>=0.12.1,<1.0.0". This ensures at least the versions I used are installed, but no user-breaking changes (if SemVer is used strictly).
  • Specific to the ATAC-seq pipeline (just FYI):

    • Since looper now adds an attribute to the samples called sample_output_folder, the pipeline doesn't need a reference to the project to know the output dir. This is now passed through the command line interface. This way I simplified a few things inside the ATACSeqSample object in the pipeline.
    • I did not test the pipeline completely since I don't have the software (bowtie, skewer, etc), but I did run it past the fastqc step.
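One rough way to confirm that an environment meets the version bounds from the pip command above is sketched here. The helper names (`parse`, `within_bounds`, `check_environment`) are made up for illustration, and the comparison assumes plain dotted numeric versions; anything else (for example a dev release) is reported as `None` rather than guessed at.

```python
from importlib import metadata

# Bounds copied from the pip command in this thread:
# (minimum, inclusive) and (upper limit, exclusive).
BOUNDS = {
    "peppy": ("0.30.2", "1.0.0"),
    "looper": ("1.2.0", "2.0.0"),
    "piper": ("0.12.1", "1.0.0"),
}


def parse(version):
    """Turn '1.2.0' into (1, 2, 0); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def within_bounds(version, minimum, upper):
    """Return True if minimum <= version < upper."""
    return parse(minimum) <= parse(version) < parse(upper)


def check_environment(bounds=BOUNDS):
    """Map each package to True/False, or None if missing or unparseable."""
    report = {}
    for name, (minimum, upper) in bounds.items():
        try:
            installed = metadata.version(name)
            report[name] = within_bounds(installed, minimum, upper)
        except (metadata.PackageNotFoundError, ValueError):
            report[name] = None
    return report


for package, ok in check_environment().items():
    print(package, ok)
```

A `None` entry only means the check could not be made; it does not by itself indicate an incompatible version.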

I used the microtest repo to adapt and test the Amplicon and ATAC-seq pipelines.

@fwzhao, @sreichl are you willing to test this out?
If so, please start by running the little tests in a virtualized way: https://github.com/epigen/microtest/tree/dev#with-looperpypiper-in-a-virtualev

Report back and then we can merge this to master.


@fwzhao fwzhao commented Aug 24, 2020

I'm trying to understand the changes, and had a couple of questions about the ATAC pipeline... Why does the ATACSeqSample class no longer inherit from peppy.Sample? And does self.sample_root come from the series resulting from reading the sample yaml (since this is what's referenced to format many other sample attributes, e.g. mapped)?

I can test it out, give me a couple days :).

Member Author

@afrendeiro afrendeiro commented Aug 24, 2020

Well I tried to explain here:

> Since looper now adds an attribute to the samples called sample_output_folder, the pipeline doesn't need a reference to the project to know the output dir. This is now passed through the command line interface. This way I simplified a few things inside the ATACSeqSample object in the pipeline.

Basically, before, the object had knowledge of its output directory only through the project attribute (see https://github.com/epigen/open_pipelines/blob/master/pipelines/atacseq.py#L55). This was kind of a hack: the only reason ATACseqSample inherited from peppy.Sample was so that the yaml structure would be reconstructed into an object with the right hierarchy (i.e. including the prj object).
Right now, looper knows where the sample directory for its output is and I simply pass that over the cli. This way, it's also one less dependency for the pipeline.
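To illustrate the idea (not the actual code in the dev branch), here is a minimal sketch of a sample object that gets its output directory from the command line instead of from a project reference. The class name `SketchATACSample`, the `--sample-name` and `--output-dir` flags, and the `mapped` path layout are all hypothetical; only `sample_root`, `mapped`, and looper's `sample_output_folder` attribute are names mentioned in this thread.

```python
import argparse
import os


class SketchATACSample:
    """Plain sample object: no peppy.Sample parent and no project reference."""

    def __init__(self, name, output_dir):
        self.name = name
        # The sample root arrives from the command line, not from a project.
        self.sample_root = output_dir
        # Example of a derived path; the real directory layout may differ.
        self.mapped = os.path.join(self.sample_root, "mapped", name + ".bam")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="ATAC-seq pipeline (sketch)")
    parser.add_argument("--sample-name", required=True)
    # Hypothetical flag; looper would fill it in from sample_output_folder.
    parser.add_argument("--output-dir", required=True)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    return SketchATACSample(args.sample_name, args.output_dir)


# Demo call with an explicit argument list instead of the real command line.
sample = main(["--sample-name", "sample1", "--output-dir", "results/sample1"])
print(sample.mapped)
```

With this shape, the pipeline only needs what it is handed on the command line, which is why the peppy dependency can be dropped from the sample class.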

Let me know if it works.

Member Author

@afrendeiro afrendeiro commented Aug 25, 2020

Any updates? I'm afraid I won't have much time to spare after tomorrow.


@fwzhao fwzhao commented Aug 27, 2020

Yep - I'm done. The test ran successfully after a few tweaks to open_pipelines and microtest. I also got TSS enrichment to run again (it was missing the mapping to hg19 in the atac yaml).

Not closing issue just yet, but will run it on an actual sample now.

@afrendeiro afrendeiro linked a pull request that will close this issue Aug 27, 2020
3 of 6 tasks complete
Member Author

@afrendeiro afrendeiro commented Aug 27, 2020

Thanks. I created a PR already, but let's see if we want to update the other pipelines too first.

Member Author

@afrendeiro afrendeiro commented Aug 27, 2020

@sreichl do you want to test this too on a CLL sample? Let's see if the project config is good and the hg38 resources are well referenced.

Member Author

@afrendeiro afrendeiro commented Aug 27, 2020

@sreichl I see the pipeline_interface section needs to be updated to point to the specific pipeline dependent on protocol.

Also, do you really want to point to a directory called "project_folder" for the pipeline outputs? https://github.com/epigen/cll-progression/blob/master/metadata/project_config.yaml#L26


@fwzhao fwzhao commented Aug 28, 2020

I tested on a real sample, and it worked :)


@sreichl sreichl commented Aug 28, 2020

> @sreichl I see the pipeline_interface section needs to be updated to point to the specific pipeline dependent on protocol.
>
> Also, do you really want to point to a directory called "project_folder" for the pipeline outputs? https://github.com/epigen/cll-progression/blob/master/metadata/project_config.yaml#L26

okay will do.

project_folder: The project_folder is always a symlink pointing to the respective project_folder wherever this is (eg HPC). That way I can have repo clones in different locations (ie local, HPC, other) and only have to specify the symlink in each to be able to continue working. No more having an unsynced repo clone within the project folder on the HPC.
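As an illustration of that symlink setup, a small sketch follows. The function name `link_project_folder` and the directory names in the demo are hypothetical, and it assumes a system where creating symlinks is permitted (e.g. Linux or macOS).

```python
import os
import tempfile


def link_project_folder(repo_dir, target_dir, link_name="project_folder"):
    """Create or refresh repo_dir/link_name as a symlink to target_dir."""
    link_path = os.path.join(repo_dir, link_name)
    if os.path.islink(link_path):
        os.remove(link_path)  # drop a stale link, e.g. one from another machine
    os.symlink(target_dir, link_path, target_is_directory=True)
    return link_path


# Demo: temporary directories stand in for a repo clone and the HPC storage.
with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "repo_clone")
    storage = os.path.join(tmp, "hpc_project_data")
    os.makedirs(repo)
    os.makedirs(storage)
    link = link_project_folder(repo, storage)
    # Check that the link resolves to the storage directory.
    print(os.path.realpath(link) == os.path.realpath(storage))
```

Each clone would run this once with its own target, so the config can keep referring to "project_folder" everywhere.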
