Add integration with turnitin/plagiabot/EranBot #24
Merged
Commits
efe2300
[WIP] Basic working integration of turnitin
f0bbb29
[WIP] Improve style and turnitin report display
bf0aa22
[WIP] improve docstrings and naming, mark TODO
1ffa87d
Improve turnitin.py docstrings, fix bugs
8161bce
Fix CSS margin to match other boxes
6cafb14
Fix wrapping issue; start reworking report display
4e994f1
Refactor turnitin.py, incorporate diff link/timestamp
9a4dde1
Update Turnitin option label
@@ -0,0 +1,104 @@
# -*- coding: utf-8 -*-
from ast import literal_eval
import re

import requests

from .misc import parse_wiki_timestamp

__all__ = ['search_turnitin', 'TURNITIN_API_ENDPOINT']

TURNITIN_API_ENDPOINT = 'http://tools.wmflabs.org/eranbot/plagiabot/api.py'

def search_turnitin(page_title, lang):
    """ Search the Plagiabot database for Turnitin reports for a page.

    Keyword arguments:
    page_title -- string containing the page title
    lang -- string containing the page's project language code

    Return a TurnitinResult (contains a list of TurnitinReports).
    """
    return TurnitinResult(_make_api_request(page_title, lang))

def _make_api_request(page_title, lang):
    """ Query the plagiabot API for Turnitin reports for a given page.
    """
    stripped_page_title = page_title.replace(' ', '_')
    api_parameters = {'action': 'suspected_diffs',
                      'page_title': stripped_page_title,
                      'lang': lang,
                      'report': 1}

    result = requests.get(TURNITIN_API_ENDPOINT, params=api_parameters)
    # use literal_eval to *safely* parse the resulting dict-containing string
    parsed_api_result = literal_eval(result.text)
    return parsed_api_result

class TurnitinResult:
    """ Container class for TurnitinReports. Each page may have zero or
    more reports of plagiarism. The list will have multiple
    TurnitinReports if plagiarism has been detected for more than one
    revision.

    TurnitinResult.reports -- list containing >= 0 TurnitinReport items
    """
    def __init__(self, turnitin_data):
        """
        Keyword argument:
        turnitin_data -- plagiabot API result
        """
        self.reports = []
        for item in turnitin_data:
            report = TurnitinReport(
                item['diff_timestamp'], item['diff'], item['report'])
            self.reports.append(report)

    def __repr__(self):
        return str(self.__dict__)

class TurnitinReport:
    """ Contains data for each Turnitin report (one on each potentially
    plagiarized revision).

    TurnitinReport.reportid -- Turnitin report ID, taken from plagiabot
    TurnitinReport.diffid -- diff ID from Wikipedia database
    TurnitinReport.time_posted -- datetime of the time the diff posted
    TurnitinReport.sources -- list of dicts with information on:
        percent -- percent of revision found in source as well
        words -- number of words found in both source and revision
        url -- url for the possibly-plagiarized source
    """
    def __init__(self, timestamp, diffid, report):
        """
        Keyword argument:
        timestamp -- diff timestamp from Wikipedia database
        diffid -- diff ID from Wikipedia database
        report -- Turnitin report from the plagiabot database
        """
        self.report_data = self._parse_report(report)
        self.reportid = self.report_data[0]
        self.diffid = diffid
        self.time_posted = parse_wiki_timestamp(timestamp)

        self.sources = []
        for item in self.report_data[1]:
            source = {'percent': item[0],
                      'words': item[1],
                      'url': item[2]}
            self.sources.append(source)

    def __repr__(self):
        return str(self.__dict__)

    def _parse_report(self, report_text):
        # extract report ID
        report_id_pattern = re.compile(r'\?rid=(\d*)')
        report_id = report_id_pattern.search(report_text).groups()[0]

        # extract percent match, words, and URL for each source in the report
        extract_info_pattern = re.compile(
            r'\n\* \w\s+(\d*)\% (\d*) words at \[(.*?) ')
        results = extract_info_pattern.findall(report_text)

        return (report_id, results)
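To make the two regular expressions in `_parse_report` easier to review, here is a minimal standalone sketch that runs them against a made-up report string. The sample text is an assumption (the real plagiabot report format may differ), and the `parse_report` helper with its `None` fallback for a missing report ID is hypothetical, not part of this PR.

```python
import re

# Same patterns as in _parse_report in the diff.
REPORT_ID_PATTERN = re.compile(r'\?rid=(\d*)')
SOURCE_PATTERN = re.compile(r'\n\* \w\s+(\d*)\% (\d*) words at \[(.*?) ')

# Hypothetical report text, invented for illustration only.
SAMPLE_REPORT = (
    "[https://tools.wmflabs.org/eranbot/ithenticate.py?rid=12345 report]"
    "\n* W 42% 310 words at [http://example.com/a source]"
    "\n* W 7% 55 words at [http://example.org/b other]"
)


def parse_report(report_text):
    """Return (report_id, sources); report_id is None if no rid is found."""
    match = REPORT_ID_PATTERN.search(report_text)
    # Unlike the PR code, do not assume the search always succeeds.
    report_id = match.group(1) if match else None
    # findall returns one (percent, words, url) tuple of strings per source.
    sources = [
        {'percent': percent, 'words': words, 'url': url}
        for percent, words, url in SOURCE_PATTERN.findall(report_text)
    ]
    return report_id, sources


if __name__ == '__main__':
    report_id, sources = parse_report(SAMPLE_REPORT)
    print(report_id)
    for source in sources:
        print(source)
```

Note that `percent` and `words` stay strings here, as they do in the PR; converting them to `int` would be a separate design decision.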
@@ -113,6 +113,9 @@
<input class="cv-search" type="hidden" name="use_links" value="0" />
<input id="cv-cb-links" class="cv-search" type="checkbox" name="use_links" value="1" ${'checked="checked"' if (query.use_links != "0") else ""} />
<label for="cv-cb-links">Use links in page</label>
<input class="cv-search" type="hidden" name="turnitin" value="0" />
<span style="white-space:nowrap"><input id="cv-cb-turnitin" class="cv-search" type="checkbox" name="turnitin" value="1" ${'checked="checked"' if (query.turnitin != "0") else ""}/>
<label for="cv-cb-turnitin">Search Turnitin reports</label></span>
This wording seems a bit awkward to me since the reports are based on the search. Maybe "Use Turnitin" or "Use Turnitin database" instead?
</td>
</tr>
<tr>
@@ -146,6 +149,7 @@
</tr>
</table>
</form>

% if result:
<div id="generation-time">
Results
@@ -160,6 +164,29 @@
% endif
<a href="${request.script_root | h}?lang=${query.lang | h}&project=${query.project | h}&oldid=${query.oldid or query.page.lastrevid | h}&action=${query.action | h}&${"use_engine={0}&use_links={1}".format(int(query.use_engine not in ("0", "false")), int(query.use_links not in ("0", "false"))) if query.action == "search" else "" | h}${"url=" if query.action == "compare" else ""}${query.url if query.action == "compare" else "" | u}">Permalink.</a>
</div>

% if query.turnitin:
<div id="turnitin-container" class="${'red' if query.turnitin_result.reports else 'green'}-box">
    <div id="turnitin-title">Turnitin Results</div>
    % if query.turnitin_result.reports:
    <p>Turnitin (through <a href="https://en.wikipedia.org/wiki/User:EranBot">EranBot</a>) found revisions that may have been plagiarized. Please review them.</p>

    <table id="turnitin-table"><tbody>
        %for report in turnitin_result.reports:
        <tr><td id="turnitin-table-cell"><a href="https://tools.wmflabs.org/eranbot/ithenticate.py?rid=${report.reportid}">Turnitin report ${report.reportid}</a> for text added <a href="https://${query.lang}.wikipedia.org/w/index.php?title=${query.title}&diff=${report.diffid}"> at ${report.time_posted}</a>:
            <ul>
                % for source in report.sources:
                <li> ${source['percent']}% of revision text (${source['words']} words) found at <a href="${source['url']}">${source['url']}</a></li>
                % endfor
            </ul></td></tr>
        %endfor
    </tbody></table>
    % else:
    <p>Turnitin (through <a href="https://en.wikipedia.org/wiki/User:EranBot">EranBot</a>) found no matching sources.</p>
    % endif
</div>
% endif

<div id="cv-result" class="${'red' if result.confidence >= T_SUSPECT else 'yellow' if result.confidence >= T_POSSIBLE else 'green'}-box">
<table id="cv-result-head-table">
<colgroup>
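The template hunk interpolates `${source['url']}` and the other report fields without the `| h` filter that the permalink line uses. Whether this project configures a default escaping filter is not visible in the diff, so the following is only a sketch of what escaping those values would look like in plain Python. The `render_source_item` helper is hypothetical and uses the standard library's `html.escape`, not Mako.

```python
from html import escape


def render_source_item(source):
    """Build one <li> for a Turnitin source with all values HTML-escaped.

    `source` is a dict with 'percent', 'words' and 'url' keys, as produced
    by TurnitinReport.sources in the diff.
    """
    # quote=True also escapes double quotes, which matters inside href="...".
    url = escape(str(source['url']), quote=True)
    percent = escape(str(source['percent']))
    words = escape(str(source['words']))
    return (
        '<li>{percent}% of revision text ({words} words) found at '
        '<a href="{url}">{url}</a></li>'
    ).format(percent=percent, words=words, url=url)


if __name__ == '__main__':
    print(render_source_item(
        {'percent': '42', 'words': '310', 'url': 'http://example.com/?a=1&b=2'}
    ))
```

In the Mako template itself, the closest equivalent would presumably be adding `| h` to each expression, as the permalink line already does.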
Are you sure you don't need to escape the slash before the n? i.e. '\\n' instead of '\n'. Maybe Python's regex engine is smarter than PHP's :)
Python's raw string notation is pretty awesome.
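For readers less familiar with raw strings, a small sketch of why the pattern works without doubling the backslash. The sample text and variable names are made up for illustration.

```python
import re

# In a raw string the backslash is kept as-is, so the regex engine receives
# the two characters "\" and "n" and itself interprets them as a newline.
raw_pattern = r'\n\* '

# In a normal string Python converts "\n" to a newline character before the
# regex engine sees it, and the backslash before "*" has to be doubled.
plain_pattern = '\n\\* '

# Hypothetical text containing a newline followed by a bullet.
text = "header\n* item"

raw_matches = re.search(raw_pattern, text) is not None
plain_matches = re.search(plain_pattern, text) is not None

if __name__ == '__main__':
    print(len(r'\n'), len('\n'))  # raw form has two characters, plain has one
    print(raw_matches, plain_matches)
```

Both patterns match the same text; the raw form simply avoids a second layer of escaping.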